Preprint
Article

This version is not peer-reviewed.

Transforming Borrower Profiles into Risk Signals: A Hybrid Convolutional Framework for Peer-to-Peer Credit Default Prediction

Submitted:

10 June 2026

Posted:

11 June 2026


Abstract
Purpose: Accurate credit default prediction is essential for sustainable risk management in peer-to-peer (P2P) lending platforms. However, the presence of class imbalance, high-dimensional borrower information, and complex nonlinear relationships among financial variables continues to limit the effectiveness of conventional prediction models. This study proposes a hybrid deep learning framework that transforms borrower attributes into structured grayscale images, enabling convolutional feature extraction for enhanced credit risk assessment. Material and Methods: A LendingClub dataset containing 115,487 completed loan records was analyzed. Data preprocessing involved feature cleaning, transformation, and statistical feature selection, reducing 151 original variables to 64 informative predictors through a composite scoring framework integrating five complementary statistical tests. The selected features were encoded into 64×64 grayscale images and processed using a Convolutional Neural Network (CNN). Seven predictive models were evaluated under five-fold stratified cross-validation across three dataset configurations: the original imbalanced dataset, SMOTE-balanced training data, and randomly undersampled training data. The evaluated models included conventional machine learning classifiers (Support Vector Machine, Random Forest, and Decision Tree), a Deep Neural Network (DNN), a standalone CNN, and three hybrid architectures combining CNN-based feature extraction with downstream classifiers (CNN+SVM, CNN+RF, and CNN+DT). Statistical significance was assessed using the Friedman test and Wilcoxon signed-rank tests at α = 0.05. Results: The CNN+RF model consistently achieved the best performance across all experimental settings. On the original dataset, it attained an AUC-ROC of 1.000, Accuracy of 0.987, F1-Score of 0.991, and MCC of 0.977, significantly outperforming all competing approaches (p < 0.05). 
The standalone CNN also demonstrated strong predictive capability, supporting the effectiveness of the proposed feature-to-image representation strategy. Furthermore, SMOTE-based balancing yielded superior performance compared with random undersampling across all models. Among conventional classifiers, SVM exhibited the greatest sensitivity to class imbalance, whereas DNN produced comparatively balanced class-level predictions. Conclusion: The proposed CNN–Random Forest hybrid framework provides a highly effective and statistically validated solution for P2P credit default prediction. By converting borrower information into image-based representations and leveraging convolutional feature learning, the model achieves superior predictive accuracy, robustness, and stability compared with both traditional machine learning and standalone deep learning methods. The framework offers practical value for loan approval decisions, borrower screening, and portfolio risk management in digital lending environments.

1. Introduction

The rapid expansion of peer-to-peer (P2P) lending has fundamentally transformed the consumer credit landscape by enabling borrowers and investors to transact directly through digital platforms, bypassing traditional financial intermediaries [1]. From a global market valuation of approximately USD 209.4 billion in 2023, the P2P lending sector is projected to grow at a compound annual growth rate exceeding 25%, reaching an estimated USD 1,423 billion by 2033 [1,2], driven by increasing fintech adoption, broader digital financial inclusion, and sustained demand for alternative credit channels among underserved borrowers who lack access to conventional bank financing [2]. LendingClub, founded in 2006, established itself as the largest P2P lending marketplace in the United States, originating over USD 85 billion in loans since inception and providing one of the most extensively studied publicly available credit datasets in the financial machine learning literature [3]. Despite the operational efficiencies and accessibility advantages offered by P2P platforms, their business model introduces a structurally heightened credit risk environment relative to traditional banking: borrowers accepted onto P2P platforms frequently present non-standard credit profiles, the default risk is borne entirely by individual investors rather than institutional intermediaries, and the platform's primary incentive to maximize loan volume may create misalignment with lenders' interest in accurate default prediction [4,5]. Loan default, defined as the failure of a borrower to meet contractual repayment obligations, represents the central financial risk in P2P lending, directly eroding investor returns, undermining platform credibility, and in aggregate posing systemic risks to the alternative finance sector [6]. 
Accurate and timely identification of high-risk borrowers at the point of loan origination is therefore not merely an operational optimization objective but a fundamental requirement for the long-term financial viability and regulatory trustworthiness of P2P lending platforms [7].
The application of machine learning to credit default prediction has expanded substantially over the past decade, evolving from conventional statistical classifiers toward increasingly sophisticated ensemble and deep learning architectures. Early efforts established logistic regression and support vector machines as foundational baselines for binary credit classification. Malagon et al. [8] demonstrated that a linear SVM applied to the LendingClub dataset (2007–2017) achieved approximately 93% accuracy in identifying default outcomes, while a systematic review by Shi et al. [9] of 76 credit risk studies confirmed that ensemble methods, particularly bagging and Random Forest, achieve the highest AUC values across benchmark datasets, and that deep learning models universally outperform classical statistical methods in discriminative capacity. Serrano-Cinca et al. [6] provided an early empirical foundation for LendingClub-based modeling by identifying the loan grade assigned by the platform as the single strongest predictor of default, a finding corroborated by subsequent multi-feature analyses. Building on this, Zhu et al. [10] trained a Random Forest classifier on 15 LendingClub credit features using SMOTE to address class imbalance, demonstrating that RF outperforms SVM, Decision Tree, and logistic regression for binary default classification, while Núñez Mora et al. [11] applied RF with SMOTE to the complete LendingClub transaction history and reported an F1 Macro Score exceeding 90%, using only nine borrower-level predictors and confirming that a compact, well-selected feature set can yield competitive performance without exhaustive feature engineering. 
Turiel and Aste [3] evaluated logistic regression, SVM, and deep neural networks on approximately 800,000 LendingClub accepted loans and reported AUC values reaching 0.72, highlighting the challenge of discriminating default risk from pre-origination data alone and motivating the development of richer feature representations and nonlinear modeling strategies. Gradient boosting methods have since emerged as the strongest-performing tabular classifiers in this domain. Chang et al. [5] compared seven models (logistic regression, SVM, Decision Tree, Random Forest, XGBoost, LightGBM, and ANN) on multi-year LendingClub data and found XGBoost to be the best-performing model, with an AUC of approximately 0.70 under standard training conditions. Monje et al. [12] applied XGBoost to early default prediction on LendingClub, reporting high discriminative performance and demonstrating that linguistic interpretability via surrogate models can be preserved alongside predictive accuracy. Kara et al. [13] conducted a structured machine learning benchmark on loan default prediction data, reporting that Gradient Boosting achieved the highest overall classification performance with accuracy of 88.87%, F1-score of 80.84%, and recall of 80.21%, outperforming XGBoost, Random Forest, and LightGBM under the same SMOTE-corrected training conditions.
The limitations of single-model tabular approaches have motivated the development of ensemble stacking and hybrid deep learning architectures. Akinjole et al. [14] combined RF, XGBoost, AdaBoost, and MLP in a stacking ensemble trained on LendingClub data with SMOTE+ENN resampling, achieving accuracy of 93.7%, precision of 95.6%, and AUC of 97.8%, establishing stacking as a highly competitive strategy for credit risk classification. Suram [15] applied deep learning models including ANN and ResNet to the full LendingClub dataset with SMOTE preprocessing and reported that ANN achieved 98.92% accuracy, 98.08% precision, 99.2% recall, and AUC of 0.99, outperforming all compared baselines and underscoring the sensitivity of performance estimates to dataset filtering strategy and evaluation protocol. In the context of hybrid CNN architectures, Berhane et al. [16] proposed a CNN combined with logistic regression, gradient boosting, and k-nearest neighbor for P2P credit risk prediction on LendingClub data, demonstrating that hybrid convolutional models improve upon standalone tabular classifiers in recall and AUC. The broader validity of CNN-based feature extraction for structured financial data was further demonstrated by Kvamme et al. [17], who showed that deep convolutional networks applied to borrower transaction sequences outperform Random Forest on mortgage default prediction, achieving a meaningful improvement in AUC on a Norwegian lending dataset of approximately 21,000 records. Gür [18] introduced a novel 2D CNN approach to credit score classification by transforming sequential tabular data into grayscale images and applying pretrained architectures including DenseNet201, GoogLeNet, MobileNetV2, ResNet18, ShuffleNet, and SqueezeNet, finding that a hybrid CNN model augmented with a new fully connected layer produced the highest classification accuracy among all tested configurations, and establishing the conceptual precedent for image-encoded tabular credit data. 
Parallel developments in model interpretability have further shaped the state of the art. Sharma et al. [19] trained a LightGBM model on LendingClub data (2007–2020) with recursive feature elimination and reported accuracy of 0.87, enhanced by SHAP and LIME post-hoc explanations, reinforcing the value of gradient boosting for regulatory-compliant credit scoring. Collectively, these studies confirm that while traditional machine learning models provide robust baselines, hybrid and deep learning architectures that effectively capture nonlinear feature interactions offer measurable and statistically significant performance advantages. However, a systematic evaluation of image-encoded tabular features combined with convolutional feature extraction across multiple balancing strategies and classifier configurations on the LendingClub dataset has not previously been reported, constituting the primary methodological gap addressed by the present study.
Despite the notable advances documented above, several methodological gaps remain unaddressed in the existing literature. The majority of prior studies evaluate models within a single paradigm, either conventional tabular classifiers or deep learning architectures, without systematically benchmarking both under identical experimental conditions and across multiple class-balancing strategies, making direct performance comparisons across model families difficult to interpret. While ensemble methods such as Random Forest and gradient boosting have demonstrated strong and consistent results on LendingClub data, the potential of convolutional neural networks to extract spatially structured representations from tabular financial features through image-encoding transformations has received limited attention in the P2P credit risk domain. Hybrid architectures that couple a convolutional backbone with downstream classical classifiers, a strategy shown to be highly effective in biomedical and other structured-data domains, remain largely unexplored for credit default prediction. Furthermore, a substantial portion of published studies omit formal statistical significance testing across cross-validation folds, which limits the strength of conclusions drawn from observed performance differences between models. Collectively, these gaps motivate a more integrated and statistically rigorous evaluation framework that spans tabular, image-based, and hybrid modeling strategies simultaneously. To address these limitations, the present study designs and evaluates a unified comparative framework for credit default prediction on the LendingClub dataset, incorporating seven classification models across three distinct pipeline architectures, evaluated under three independently balanced dataset conditions, with all pairwise performance differences subjected to formal statistical testing. 
The overarching objective is to determine whether image-encoded tabular representations combined with convolutional feature extraction and hybrid classifier architectures offer measurable and statistically supported advantages over conventional tabular modeling approaches, and to identify the conditions under which each modeling strategy is best suited for deployment in real-world credit risk applications.

2. Material and Methods

2.1. Data Collection

The dataset employed in this study was sourced from LendingClub, one of the largest peer-to-peer (P2P) lending platforms in the United States. Founded in 2006, LendingClub operates as an online marketplace connecting borrowers seeking personal loans with investors willing to fund them. As part of its transparency commitment, LendingClub has historically made its loan data publicly available, making it a widely adopted benchmark in credit risk modeling and financial machine learning research.
The platform provides separate files for accepted and rejected loan applications. Only the accepted loans dataset was utilized in this study, as it contains the complete feature set required for default prediction, including FICO credit scores, detailed repayment history, and borrower financial attributes. FICO score fields are exclusively available when data is downloaded through an authenticated LendingClub account, further enhancing the reliability of the credit profile information. The rejected loans dataset, while available, contains only a limited subset of features without post-origination repayment records, rendering it unsuitable for the supervised learning tasks undertaken here.
The raw data distributed by LendingClub is fragmented across multiple time-partitioned files. To address this, all available files were consolidated into a single unified dataset, ensuring temporal completeness and eliminating inconsistencies that might arise from analyzing individual segments in isolation. The resulting dataset spans multiple years of lending activity, capturing the evolution of borrower profiles, credit grades, and repayment behaviors over time.
The consolidated dataset, referred to hereafter as ACCEPTED_LOANS, comprises over 2.26 million individual loan records across 151 distinct features. Each record represents a unique loan instance and captures information across the entire loan lifecycle, spanning the initial application and underwriting phase through to final repayment, default, charge-off, or settlement. This longitudinal coverage makes the dataset particularly well-suited for supervised learning tasks targeting loan default prediction, credit grade classification, and risk scoring. Table 1 summarizes the structural characteristics of the dataset alongside the eight thematic feature categories into which the 151 variables are organized [20].
Figure 1 provides a complete visual overview of all 151 feature names, arranged in a grid layout and color-coded by category. Each cell in the figure corresponds to one feature, and features belonging to the same thematic group share the same background color. This visualization allows the reader to appreciate both the breadth of the feature space and the relative size of each category at a glance. As is evident from Figure 1, the Advanced Credit Metrics group is by far the largest, accounting for approximately 40% of all features and encompassing granular credit bureau indicators such as trade line counts, utilization ratios, and delinquency metrics. The Hardship and Settlement group, visible in the lower portion of the figure, highlights the presence of late-stage loan distress features, which are often absent in other publicly available credit datasets. In contrast, demographic and loan-level attributes occupy a comparatively smaller portion of the feature space, reflecting the dataset's primary analytical depth in credit behavioral history.

2.2. Data Preprocessing

This section describes the complete preprocessing pipeline applied to the raw LendingClub dataset prior to model training. The pipeline consists of six sequential stages: label creation and record filtering, removal of data leakage features, feature engineering, missing value imputation, categorical encoding, and min-max normalization. A summary of the pipeline outcomes is provided in Table 3, and the distribution of the resulting binary target variable is illustrated in Figure 2.

2.2.1. Label Creation and Record Filtering

The raw dataset contains multiple distinct loan status categories, including records with ambiguous or unresolved outcomes such as "Current" and "In Grace Period." To construct a well-defined binary classification target, only records with conclusive repayment outcomes were retained [21]. Specifically, loans with a status of "Fully Paid" were assigned a non-default label ($y = 0$), while loans classified as "Charged Off" or "Default" were assigned a default label ($y = 1$).
$$y_i = \begin{cases} 0 & \text{if } \text{loan\_status}_i = \text{"Fully Paid"} \\ 1 & \text{if } \text{loan\_status}_i \in \{\text{"Charged Off"}, \text{"Default"}\} \end{cases}$$
where $y_i$ denotes the binary default label for the $i$-th loan record, and $\text{loan\_status}_i$ refers to its observed repayment outcome.
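As an illustration of the labeling rule, the following minimal Python sketch maps each status to the binary label and discards unresolved records. The helper name `label_and_filter` and the toy records are hypothetical and are not part of the original pipeline.

```python
# Statuses with a conclusive outcome, mapped to the binary label y.
LABEL_MAP = {"Fully Paid": 0, "Charged Off": 1, "Default": 1}


def label_and_filter(records):
    """Keep only loans with a conclusive status and attach the label y."""
    labeled = []
    for rec in records:
        status = rec.get("loan_status")
        if status in LABEL_MAP:  # drops "Current", "In Grace Period", etc.
            labeled.append({**rec, "y": LABEL_MAP[status]})
    return labeled


# Toy records (hypothetical values, not taken from the dataset).
loans = [
    {"id": 1, "loan_status": "Fully Paid"},
    {"id": 2, "loan_status": "Current"},
    {"id": 3, "loan_status": "Charged Off"},
    {"id": 4, "loan_status": "In Grace Period"},
    {"id": 5, "loan_status": "Default"},
]
labeled_loans = label_and_filter(loans)
```

In this toy input, the two unresolved loans are dropped and the remaining three receive labels 0, 1, and 1.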
This filtering step reduced the dataset from 260,701 raw records to 115,487 records with unambiguous labels. Of these, 88,158 records (76.34%) correspond to non-default loans and 27,329 records (23.66%) correspond to defaulted loans. The resulting class imbalance ratio is approximately 3.23:1.
$$\text{Imbalance Ratio} = \frac{N_{\text{non-default}}}{N_{\text{default}}} = \frac{88{,}158}{27{,}329} \approx 3.23$$
where $N_{\text{non-default}}$ and $N_{\text{default}}$ denote the number of records in the majority and minority classes, respectively.
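The reported class shares and imbalance ratio can be checked directly from the two class counts. The short sketch below only reproduces the arithmetic; the variable names are illustrative.

```python
# Class counts as reported in the text.
n_non_default = 88_158
n_default = 27_329
n_total = n_non_default + n_default  # 115,487 labeled records

# Majority-to-minority ratio and class shares in percent.
imbalance_ratio = n_non_default / n_default
share_non_default = 100 * n_non_default / n_total
share_default = 100 * n_default / n_total

print(round(imbalance_ratio, 2), round(share_non_default, 2), round(share_default, 2))
```

Rounded to two decimals, the values agree with the 3.23:1 ratio and the 76.34% / 23.66% split stated above.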
This moderate degree of class imbalance is characteristic of real-world credit datasets [22] and is addressed in subsequent modeling stages through appropriate sampling and evaluation strategies. Figure 2 provides a comprehensive overview of the dataset following filtering, including the loan status distribution across all raw records (panel a), the binary label distribution as a donut chart (panel b), the loan amount distribution stratified by class (panel c), and the risk band distribution (panel d). Panel (b) in particular highlights the class asymmetry between non-default and default instances that motivates the subsequent imbalance handling procedures.

2.2.2. Removal of Data Leakage Features

A critical step in constructing a realistic credit risk prediction system is the elimination of features that contain information unavailable at the time of loan origination. Post-origination variables, such as repayment amounts, recovery fees, and late payment records, constitute data leakage, as their values are only determined after the loan outcome has been realized. Including such features would result in artificially inflated model performance that does not generalize to prospective loan evaluation scenarios [23].
A total of 24 features were identified and removed as leakage variables, including out_prncp, total_pymnt, total_rec_prncp, total_rec_int, recoveries, collection_recovery_fee, last_pymnt_d, last_pymnt_amnt, next_pymnt_d, last_fico_range_high, and last_fico_range_low, among others. Identifier and administrative fields such as id, member_id, url, zip_code, emp_title, and desc were also removed as they carry no predictive signal. Following this step, the feature set was reduced from 151 to 127 columns.
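A minimal sketch of the column removal is given below. It lists only the fields named explicitly in the text (the full set of 24 removed variables is not enumerated here), and the helper `drop_columns` and the toy record are hypothetical.

```python
# Post-origination fields named in the text (partial list of the leakage set).
LEAKAGE_COLUMNS = {
    "out_prncp", "total_pymnt", "total_rec_prncp", "total_rec_int",
    "recoveries", "collection_recovery_fee", "last_pymnt_d",
    "last_pymnt_amnt", "next_pymnt_d", "last_fico_range_high",
    "last_fico_range_low",
}
# Identifier and administrative fields named in the text.
ID_COLUMNS = {"id", "member_id", "url", "zip_code", "emp_title", "desc"}


def drop_columns(record, to_drop):
    """Return a copy of the record without the listed columns."""
    return {key: value for key, value in record.items() if key not in to_drop}


# Toy record with hypothetical values.
raw = {"id": 7, "loan_amnt": 12000, "total_pymnt": 13150.2, "grade": "B"}
clean = drop_columns(raw, LEAKAGE_COLUMNS | ID_COLUMNS)
```

Only fields available at origination (here `loan_amnt` and `grade`) survive the filter.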

2.2.3. Feature Engineering

To enrich the predictive representation of each loan instance, 12 new features were derived from the existing variables. These engineered features capture domain-relevant financial ratios and binary risk indicators that are not directly observable in the raw data [5]. The complete set of engineered features is described in Table 2.
The loan-to-income ratio (LTI) provides a normalized measure of borrowing relative to the borrower's repayment capacity and is widely used in credit risk literature as a key predictor of default [24].
$$\mathrm{LTI}_i = \frac{\text{loan\_amnt}_i}{\text{annual\_inc}_i + 1}$$
where $\text{loan\_amnt}_i$ is the requested loan amount and $\text{annual\_inc}_i$ is the borrower's self-reported annual income for the $i$-th record. A constant of 1 is added to the denominator to prevent division by zero. Similarly, the installment-to-income ratio (ITI) quantifies the annualized payment burden as a fraction of gross income, serving as a proxy for cash flow stress [25].
$$\mathrm{ITI}_i = \frac{\text{installment}_i \times 12}{\text{annual\_inc}_i + 1}$$
where $\text{installment}_i$ is the monthly payment amount for the $i$-th loan, multiplied by 12 to express the burden on an annualized basis. The average FICO score consolidates the reported FICO score range into a single scalar value used in subsequent analyses.
$$\text{fico\_avg}_i = \frac{\text{fico\_range\_low}_i + \text{fico\_range\_high}_i}{2}$$
where $\text{fico\_range\_low}_i$ and $\text{fico\_range\_high}_i$ are the lower and upper bounds of the FICO score range reported for borrower $i$ at the time of loan origination.
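The three ratios defined above can be sketched as plain Python functions. The function names and the example row are illustrative assumptions; only the formulas come from the text.

```python
def loan_to_income(loan_amnt, annual_inc):
    """LTI: loan amount over annual income (+1 guards against division by zero)."""
    return loan_amnt / (annual_inc + 1)


def installment_to_income(installment, annual_inc):
    """ITI: annualized installment burden over annual income (+1 in the denominator)."""
    return installment * 12 / (annual_inc + 1)


def fico_average(fico_range_low, fico_range_high):
    """Midpoint of the reported FICO range."""
    return (fico_range_low + fico_range_high) / 2


# Hypothetical borrower, chosen so the denominators are round numbers.
row = {"loan_amnt": 10_000, "annual_inc": 49_999, "installment": 300,
       "fico_range_low": 700, "fico_range_high": 704}

lti = loan_to_income(row["loan_amnt"], row["annual_inc"])            # 10,000 / 50,000
iti = installment_to_income(row["installment"], row["annual_inc"])   # 3,600 / 50,000
fico_avg = fico_average(row["fico_range_low"], row["fico_range_high"])
```

For this hypothetical borrower the ratios are 0.2 and 0.072, and the averaged FICO score is 702.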

2.2.4. Missing Value Imputation

Following feature engineering, missing values in numerical features were imputed using the column median, which is robust to the skewed distributions commonly observed in financial data. Categorical features with missing values were assigned a dedicated "MISSING" category to preserve the absence of information as a potentially informative signal rather than introducing distributional bias through mode imputation [26]. The imputation strategy for numerical features is formally presented below.
$$\hat{x}_{ij} = \begin{cases} \operatorname{median}(x_j) & \text{if } x_{ij} \text{ is missing} \\ x_{ij} & \text{otherwise} \end{cases}$$
where $\hat{x}_{ij}$ is the imputed value for feature $j$ of record $i$, $x_{ij}$ is the original observed value, and $\operatorname{median}(x_j)$ is the median of feature $j$ computed over all non-missing records.
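The imputation rule can be sketched with the standard library as follows. Representing missing entries as `None` and the helper names are assumptions made for illustration.

```python
from statistics import median


def fit_median(values):
    """Median over the observed (non-missing) entries of one numerical feature."""
    return median(v for v in values if v is not None)


def impute_numeric(values, fill):
    """Replace missing entries (None) with the fitted median."""
    return [fill if v is None else v for v in values]


def impute_category(values, token="MISSING"):
    """Give missing categorical entries their own dedicated category."""
    return [token if v is None else v for v in values]


# Toy columns with hypothetical values; the outlier barely moves the median.
income = [10.0, None, 30.0, 1000.0]
income_imputed = impute_numeric(income, fit_median(income))
ownership_imputed = impute_category(["RENT", None, "OWN"])
```

The toy column shows why the median is preferred for skewed data: the extreme value 1000.0 does not pull the fill value away from the bulk of the distribution.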

2.2.5. Categorical Encoding

The preprocessing pipeline applied two distinct encoding strategies depending on the ordinality of each categorical feature. For grade (A through G) and sub_grade (A1 through G5), ordinal integer encoding was applied, as the natural ordering of these variables reflects increasing credit risk [10].
$$\text{grade\_encoded} = \{A \mapsto 0,\; B \mapsto 1,\; C \mapsto 2,\; D \mapsto 3,\; E \mapsto 4,\; F \mapsto 5,\; G \mapsto 6\}$$
where each letter grade is mapped to an integer in the range [0,6], with lower values corresponding to lower credit risk. An analogous mapping was applied to sub_grade, yielding integer codes in the range [0,34] across the 35 possible sub-grade levels.
For the remaining categorical features, including home_ownership, verification_status, purpose, application_type, and addr_state, one-hot encoding was applied using the top 15 most frequent categories per variable [27]. This approach limits feature dimensionality while retaining the most prevalent category signals. Following encoding, the total number of features expanded from 127 to 184 columns.
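Both encoding strategies can be sketched in a few lines. The treatment of categories outside the top k (encoded as an all-zero row) is an assumption, since the text does not state how they are handled; the helper name and toy values are likewise hypothetical, and k is reduced to 2 to keep the example small.

```python
from collections import Counter

# Ordinal codes: A..G -> 0..6 and A1..G5 -> 0..34.
GRADE_CODES = {grade: i for i, grade in enumerate("ABCDEFG")}
SUB_GRADE_CODES = {
    f"{grade}{level}": 5 * i + (level - 1)
    for i, grade in enumerate("ABCDEFG")
    for level in range(1, 6)
}


def top_k_one_hot(values, prefix, k=15):
    """One-hot encode the k most frequent categories of one variable.

    Assumption: categories outside the top k become an all-zero row.
    """
    top = [category for category, _ in Counter(values).most_common(k)]
    return [{f"{prefix}_{category}": int(value == category) for category in top}
            for value in values]


# Toy column with hypothetical values.
ownership = ["RENT", "OWN", "RENT", "MORTGAGE", "RENT", "OWN"]
encoded = top_k_one_hot(ownership, "home_ownership", k=2)
```

Capping the number of indicator columns per variable is what keeps high-cardinality fields such as addr_state from dominating the encoded feature space.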

2.2.6. Risk Band Construction

A composite risk score was constructed for each loan instance by linearly combining several key risk indicators. This score was subsequently discretized into three ordinal risk bands (Low, Medium, High) using tertile boundaries (quantile-based cutting), providing an additional multi-class target variable for stratified analyses [21].
$$\text{RiskScore}_i = 10\,g_i + 0.5\,\text{dti}_i + 0.1\,(850 - \text{fico\_avg}_i) + 5\,\text{delinq\_2yrs}_i + 0.3\,\text{int\_rate}_i$$
where $g_i$ is the ordinal-encoded loan grade, $\text{dti}_i$ is the debt-to-income ratio, $\text{fico\_avg}_i$ is the average FICO score, $\text{delinq\_2yrs}_i$ is the number of delinquencies in the past two years, and $\text{int\_rate}_i$ is the annual interest rate assigned to the loan. The scalar coefficients were set to reflect the relative importance of each indicator in standard credit risk assessment practice. The risk band assignment is subsequently determined based on tertile boundaries of the resulting score distribution.
$$\text{risk\_band}_i = \begin{cases} \text{Low Risk} & \text{if } \text{RiskScore}_i \le Q_{33} \\ \text{Medium Risk} & \text{if } Q_{33} < \text{RiskScore}_i \le Q_{67} \\ \text{High Risk} & \text{if } \text{RiskScore}_i > Q_{67} \end{cases}$$
where $Q_{33}$ and $Q_{67}$ denote the 33rd and 67th percentiles of the risk score distribution, respectively, and $\text{RiskScore}_i$ is the composite score computed for the $i$-th record.
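The score and banding steps can be sketched as follows. Two points are assumptions rather than statements from the text: the FICO term is read as 0.1 × (850 − fico_avg), so that lower scores add risk, and the tertile cut points are taken from `statistics.quantiles` with the inclusive method, because the quantile convention is not specified. The function names and toy inputs are hypothetical.

```python
from statistics import quantiles


def risk_score(grade_code, dti, fico_avg, delinq_2yrs, int_rate):
    """Linear composite of the five risk indicators.

    Assumption: the FICO term is 0.1 * (850 - fico_avg).
    """
    return (10 * grade_code + 0.5 * dti + 0.1 * (850 - fico_avg)
            + 5 * delinq_2yrs + 0.3 * int_rate)


def assign_bands(scores):
    """Discretize scores into three bands at the tertile cut points."""
    q33, q67 = quantiles(scores, n=3, method="inclusive")
    bands = []
    for score in scores:
        if score <= q33:
            bands.append("Low Risk")
        elif score <= q67:
            bands.append("Medium Risk")
        else:
            bands.append("High Risk")
    return bands


# Hypothetical borrower: grade C (code 2), DTI 20, FICO 700, one delinquency, 10% rate.
example_score = risk_score(2, 20, 700, 1, 10)  # 20 + 10 + 15 + 5 + 3
example_bands = assign_bands([1, 2, 3, 4, 5, 6, 7])
```

Because the cut points are estimated from the score distribution itself, the three bands are of roughly equal size by construction.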

2.2.7. Feature Selection

After encoding, features were subjected to a three-criterion selection filter to remove uninformative variables [28]. A feature was retained only if it satisfied all of the following conditions: (i) a missing value rate below 80%, (ii) more than one unique value (non-constant), and (iii) a non-zero variance. This procedure reduced the final feature set from 184 to 128 features.
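The three retention criteria can be expressed as a single predicate per feature. The sketch below assumes missing entries are represented as `None` and uses the population variance; the function name and toy columns are hypothetical.

```python
from statistics import pvariance


def keep_feature(values, max_missing_rate=0.80):
    """Apply the three retention criteria to one feature column."""
    observed = [v for v in values if v is not None]
    missing_rate = (len(values) - len(observed)) / len(values)
    if missing_rate >= max_missing_rate:  # (i) missing rate must be below 80%
        return False
    if len(set(observed)) <= 1:           # (ii) must not be constant
        return False
    return pvariance(observed) > 0        # (iii) must have non-zero variance


# Toy columns with hypothetical values.
informative = [1, 2, 3] + [None] * 7      # 70% missing, varies
too_sparse = [1, 2] + [None] * 8          # 80% missing
constant = [5, 5, 5, 5]
decisions = [keep_feature(col) for col in (informative, too_sparse, constant)]
```

For numerical columns, criteria (ii) and (iii) coincide; both are kept here to mirror the filter as described.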

2.2.8. Min-Max Normalization

All selected numerical features were rescaled to the interval [0,1] using min-max normalization. This normalization step ensures that features with disparate scales contribute equally to distance-based and gradient-based learning algorithms [29].
$$\tilde{x}_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)}$$
where $\tilde{x}_{ij}$ is the normalized value of feature $j$ for record $i$, $x_{ij}$ is the original value, and $\min(x_j)$ and $\max(x_j)$ are the minimum and maximum values of feature $j$ computed over the training set.
The normalization was fitted exclusively on the training data and subsequently applied to the validation and test sets to prevent data leakage from the scaling step [30].
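The fit-on-train, apply-everywhere discipline can be sketched as two separate functions. The names and toy values are illustrative, and mapping a constant feature to zero is an added guard rather than something stated in the text.

```python
def fit_min_max(train_values):
    """Learn the scaling bounds from the training partition only."""
    return min(train_values), max(train_values)


def transform_min_max(values, lo, hi):
    """Rescale with bounds learned on the training data."""
    span = hi - lo
    if span == 0:  # guard for a constant feature (assumption: map to 0.0)
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Toy feature with hypothetical values.
train = [10.0, 20.0, 30.0]
held_out = [15.0, 40.0]

lo, hi = fit_min_max(train)
train_scaled = transform_min_max(train, lo, hi)
held_out_scaled = transform_min_max(held_out, lo, hi)
```

Note that a held-out value beyond the training maximum (40.0 here) maps above 1, which is the expected consequence of not refitting the bounds on validation or test data.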

2.2.9. Pipeline Summary

Table 3 presents a complete quantitative summary of the preprocessing pipeline, documenting the transformation of the raw dataset through each stage to the final analysis-ready form. Figure 3 complements this summary by visualizing two key aspects of data quality: panel (a) displays the missing value rates for key features prior to imputation, and panel (b) tracks the evolution of the feature count across successive pipeline stages, from 151 raw features down to the 128 features retained for model training. The intermediate expansion of the feature space to 184 columns, visible in panel (b), is attributable to the one-hot encoding step, which is subsequently followed by contraction through the variance-based selection filter.

2.3. Feature Selection

2.3.1. Overview

Following preprocessing, the dataset comprised 128 numerical features. To reduce dimensionality, eliminate redundancy, and improve model generalizability [31], a multi-criteria composite scoring framework was applied to select the 64 most discriminative features with respect to the binary default label. Five complementary statistical tests were computed for each candidate feature, capturing both linear and nonlinear associations as well as distributional differences between the default and non-default groups [32].

2.3.2. Statistical Tests and Composite Score

The five statistical measures employed are described below.
ANOVA F-Test quantifies the ratio of between-group to within-group variance for each feature relative to the class label [33]. Mutual Information (MI) measures the reduction in uncertainty about the label $y$ given a feature $x_j$ [34]:
$$\mathrm{MI}(x_j; y) = \sum_{y} \sum_{x_j} p(x_j, y) \log \frac{p(x_j, y)}{p(x_j)\, p(y)}$$
where $p(x_j, y)$ is the joint distribution of $x_j$ and $y$, and $p(x_j)$, $p(y)$ are their marginal distributions. Mann-Whitney U Test is a nonparametric test of stochastic equality between groups [35], with effect size:
$$r = \frac{Z}{\sqrt{N}}$$
where $Z$ is the standardized U statistic and $N$ is the total sample size. Point-Biserial Correlation measures the linear association between a continuous feature and the binary label [36]:
$$r_{pb} = \frac{\bar{x}_1 - \bar{x}_0}{s_{x_j}} \sqrt{\frac{n_1 n_0}{N^2}}$$
where $\bar{x}_1$ and $\bar{x}_0$ are the group means for defaulted and non-defaulted loans, $s_{x_j}$ is the pooled standard deviation, and $n_1$, $n_0$ are the group sizes. Kolmogorov-Smirnov Test measures the maximum divergence between the empirical cumulative distributions of the two groups [37]:
$$D = \sup_x \left| F_0(x) - F_1(x) \right|$$
where $F_0(x)$ and $F_1(x)$ are the empirical CDFs for the non-default and default groups, respectively.
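To make two of these measures concrete, the sketch below computes the point-biserial correlation and the two-sample Kolmogorov-Smirnov statistic with the standard library. It assumes the standard deviation in the point-biserial formula is the population standard deviation over all records, in which case the result equals the Pearson correlation with the binary label. The function names and toy data are hypothetical.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, pstdev


def point_biserial(x, y):
    """Point-biserial correlation between a feature x and a 0/1 label y.

    Assumption: s is the population standard deviation of x over all records.
    """
    x1 = [v for v, label in zip(x, y) if label == 1]
    x0 = [v for v, label in zip(x, y) if label == 0]
    n = len(x)
    return (mean(x1) - mean(x0)) / pstdev(x) * sqrt(len(x1) * len(x0) / n ** 2)


def ks_statistic(group0, group1):
    """Largest absolute gap between the two empirical CDFs."""
    s0, s1 = sorted(group0), sorted(group1)
    d = 0.0
    for point in s0 + s1:  # the supremum is attained at an observed value
        f0 = bisect_right(s0, point) / len(s0)
        f1 = bisect_right(s1, point) / len(s1)
        d = max(d, abs(f0 - f1))
    return d


# Toy data with hypothetical values.
r_pb = point_biserial([1, 2, 3, 4], [0, 0, 1, 1])
d_separated = ks_statistic([1, 2], [3, 4])     # fully separated groups
d_interleaved = ks_statistic([1, 3], [2, 4])   # interleaved groups
```

Fully separated groups reach the maximum statistic of 1, while interleaved groups yield a smaller gap, which is the behavior the composite score rewards.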
Each raw score was normalized to [0,1] via min-max scaling, and a weighted composite score was computed as:
$$\mathrm{CS}_k = 0.25\,\hat{s}_k^{F} + 0.30\,\hat{s}_k^{MI} + 0.20\,\hat{s}_k^{MW} + 0.15\,\hat{s}_k^{PB} + 0.10\,\hat{s}_k^{KS}$$
where $\mathrm{CS}_k \in [0, 1]$ is the composite score for feature $k$, and $\hat{s}_k^{F}$, $\hat{s}_k^{MI}$, $\hat{s}_k^{MW}$, $\hat{s}_k^{PB}$, $\hat{s}_k^{KS}$ are the normalized scores from the F-test, Mutual Information, Mann-Whitney, Point-Biserial, and Kolmogorov-Smirnov tests, respectively. MI received the highest weight (0.30) due to its capacity to capture nonlinear dependencies. All 128 features were ranked by $\mathrm{CS}_k$ in descending order, and a significance threshold of p < 0.05 was applied across all four hypothesis tests.
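The weighting and ranking step can be sketched as follows. The weights are those given in the text; the raw scores, feature names, dictionary keys, and helper names are hypothetical, and the per-test statistics are assumed to have been computed beforehand.

```python
# Weights from the composite score definition (they sum to 1).
WEIGHTS = {"f": 0.25, "mi": 0.30, "mw": 0.20, "pb": 0.15, "ks": 0.10}


def min_max(scores):
    """Rescale one test's raw scores to [0, 1] across all features."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def composite_scores(raw):
    """raw maps each test name to a list of raw scores, one per feature."""
    normalized = {test: min_max(scores) for test, scores in raw.items()}
    n_features = len(next(iter(raw.values())))
    return [sum(WEIGHTS[test] * normalized[test][k] for test in WEIGHTS)
            for k in range(n_features)]


def top_k(names, scores, k):
    """Feature names ranked by composite score, best first, truncated to k."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Three hypothetical features with made-up raw scores.
raw_scores = {"f": [10, 0, 5], "mi": [0.2, 0.0, 0.1], "mw": [0.3, 0.0, 0.15],
              "pb": [0.4, 0.0, 0.2], "ks": [0.5, 0.0, 0.25]}
cs = composite_scores(raw_scores)
selected = top_k(["feat_a", "feat_b", "feat_c"], cs, k=2)
```

Because every test is normalized before weighting, no single statistic dominates the ranking merely through its scale.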

2.3.3. Selected Features

The top 64 features were retained based on the composite ranking. Figure 4 illustrates the composite score distribution across all 128 candidate features, with selected features highlighted in green and rejected features in gray. Panel (a) presents scores as a function of rank, with the dashed cutoff line at rank 64, and panel (b) shows the score density distributions, confirming a clear separation between selected and rejected subsets.
Figure 5 presents all 64 selected features organized by domain category, with each cell color-coded according to its thematic group: Loan Characteristics (blue, $n = 8$), Credit Profile (red, $n = 16$), Borrower Information (green, $n = 9$), Loan Status and Purpose (purple, $n = 1$), and Advanced Credit Metrics (brown, $n = 30$). The dominance of the Advanced Credit Metrics group is clearly visible, accounting for 47% of the selected features, reflecting the high discriminative value of granular credit bureau indicators in default prediction. The rank number displayed in each cell indicates the feature's position in the full composite score ordering.
Of the 64 selected features, 58 (90.6%) were statistically significant across all four hypothesis tests simultaneously, 4 features (6.3%) met significance in 3 out of 4 tests, and the remaining 2 features (3.1%) satisfied the threshold in 2 out of 4 tests. This procedure reduced the feature space by 50%, from 128 to 64 features, while maximizing statistical relevance to the prediction target and mitigating the risk of overfitting [38].

2.4. Data Balancing

2.4.1. Overview

The filtered dataset exhibits a moderate class imbalance, with non-default loans (Label = 0) comprising 76.34% of records and default loans (Label = 1) comprising 23.66%, yielding an imbalance ratio of approximately 3.23:1. Training classification models directly on imbalanced data tends to bias predictions toward the majority class, resulting in high overall accuracy at the expense of sensitivity to the minority (default) class. Since correctly identifying defaulted loans is the primary objective of credit risk prediction, this imbalance must be explicitly addressed prior to model training [39].
Two complementary balancing strategies were evaluated: SMOTE oversampling and random undersampling. Both strategies aim to produce a balanced training set with a 1:1 class ratio, but differ fundamentally in how they achieve this. The quantitative outcomes of each strategy are summarized in Table 4.

2.4.2. Strategy A: SMOTE Oversampling

The Synthetic Minority Oversampling Technique (SMOTE) addresses class imbalance by generating synthetic instances of the minority class rather than simply replicating existing samples [40]. For each minority-class instance $x_i$, SMOTE identifies its $k$ nearest neighbors in feature space and interpolates a new synthetic sample $x_{new}$ along the line segment connecting $x_i$ to a randomly selected neighbor $x_{nn}$:
$$x_{new} = x_i + \delta\,(x_{nn} - x_i)$$
where $x_i$ is the selected minority-class instance, $x_{nn}$ is one of its $k$ nearest neighbors within the minority class, and $\delta \sim U(0, 1)$ is a random interpolation coefficient drawn from a uniform distribution. This interpolation ensures that synthetic samples lie within the local feature manifold of the minority class rather than at arbitrary positions in the feature space.
SMOTE was applied with k = 5 nearest neighbors, oversampling the minority class until it matched the size of the majority class. This procedure generated 60,829 synthetic default instances, expanding the default class from 27,329 to 88,158 samples and increasing the total dataset from 115,487 to 176,316 records, achieving an exact 1:1 class ratio. All synthetic instances were generated exclusively within the training partition to prevent information leakage into the validation and test sets.
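The interpolation rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the study: the function name `smote_sample`, the toy minority set, and the choice of a random base instance for each synthetic sample are assumptions made for brevity.

```python
import math
import random


def smote_sample(minority, k=5, n_new=1, seed=0):
    """Create n_new synthetic points by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # k nearest minority-class neighbours of x_i (Euclidean), excluding x_i itself
        others = [x for j, x in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
        x_nn = rng.choice(neighbours)
        delta = rng.random()  # interpolation coefficient drawn from U(0, 1)
        # x_new = x_i + delta * (x_nn - x_i), applied coordinate-wise
        synthetic.append([a + delta * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic


# Toy minority class with two features (hypothetical values)
minority = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
            [0.30, 0.30], [0.25, 0.15], [0.05, 0.10]]
synthetic = smote_sample(minority, k=5, n_new=4, seed=0)
```

Because each synthetic point is a convex combination of two existing minority points, it always falls inside the bounding box of the minority class, which is the sense in which the samples stay close to the local minority-class region.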

2.4.3. Strategy B: Random Undersampling

Random undersampling reduces class imbalance by randomly removing instances from the majority class until a target ratio is achieved [41]. In contrast to SMOTE, this strategy does not introduce new data; instead, it reduces the majority class to match the size of the minority class. The retained majority-class instances are drawn without replacement using a fixed random seed (seed = 42) to ensure reproducibility.
In this study, 60,829 non-default records were randomly removed, reducing the majority class from 88,158 to 27,329 samples. The resulting balanced dataset contains 54,658 records in total with an exact 1:1 class ratio. While undersampling is computationally efficient and preserves the distributional properties of the retained samples, it inevitably discards potentially informative instances from the majority class, which may reduce the representativeness of the training set [42].
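The record counts above can be reproduced with a short standard-library sketch. The helper name `random_undersample` and the use of Python's `random.Random` generator are assumptions for illustration; the study does not specify its tooling, and a different generator would select different rows for the same seed.

```python
import random


def random_undersample(labels, seed=42):
    """Return sorted indices of a 1:1 balanced subset of `labels`."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        # Sampling without replacement; the minority class is kept in full
        kept.extend(rng.sample(idxs, n_min))
    return sorted(kept)


# Class counts reported in the text: 88,158 non-default and 27,329 default
labels = [0] * 88158 + [1] * 27329
kept = random_undersample(labels, seed=42)
print(len(kept), len(labels) - len(kept))
```

With these counts the sketch keeps 54,658 records and drops 60,829 majority-class records, matching the figures reported for Strategy B.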

2.4.4. Comparison of Strategies

The two strategies represent opposite approaches to addressing class imbalance and entail different trade-offs. SMOTE preserves all original data and augments the minority class with plausible synthetic observations, yielding a substantially larger training set (176,316 samples). However, synthetic instances generated by interpolation may not fully capture the true underlying distribution of defaulted loans, and may introduce noise if minority-class samples are poorly clustered in feature space [43]. Random undersampling, by contrast, produces a compact training set that strictly preserves the original data distribution, but sacrifices 60,829 majority-class records, which may increase the variance of model estimates.
Figure 6 illustrates the class distribution before and after each balancing procedure, showing the transition from the original 3.23:1 ratio to a balanced 1:1 ratio under both strategies.
Both balanced datasets were saved separately and used as independent training inputs in subsequent modeling experiments. The SMOTE fitting was performed strictly on the training fold within each cross-validation iteration to prevent data leakage [4].

2.5. Feature-to-Image Conversion

2.5.1. Motivation and Design

To leverage the representational capacity of convolutional neural networks (CNNs), the 64 selected tabular features must be transformed into a structured two-dimensional format. CNNs are designed to exploit spatial locality in grid-structured data, and their performance on tabular inputs can be substantially improved by encoding features as images that preserve domain-relevant spatial relationships [44]. Given that exactly 64 features were retained in the previous stage, a natural and lossless mapping exists: each feature vector is reshaped into a 64 × 64 pixel grayscale image by assigning feature i to row i of the image and replicating the scalar value across all 64 columns:
$$I_{i,j} = x_i, \qquad i \in \{0, \ldots, 63\}, \quad j \in \{0, \ldots, 63\}$$
where $I_{i,j}$ is the pixel intensity at row $i$ and column $j$, and $x_i \in [0, 1]$ is the min-max normalized value of the $i$-th feature. This encoding produces a horizontal-bar heatmap in which each row visually represents the magnitude of one feature, and the vertical arrangement of rows encodes the full feature profile of a loan instance as a single image [45].
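The row-replication mapping is simple enough to state directly in code. The following is a minimal sketch using nested Python lists; the function name `features_to_image` and the example vector are hypothetical.

```python
def features_to_image(x, size=64):
    """Map a length-`size` feature vector to a size x size image with I[i][j] = x[i]."""
    if len(x) != size:
        raise ValueError(f"expected {size} features, got {len(x)}")
    # Row i holds feature i, replicated across all columns
    return [[x[i]] * size for i in range(size)]


# Hypothetical normalised feature vector: values rising from 0 to 1
x = [i / 63 for i in range(64)]
image = features_to_image(x)
```

Every row of the resulting image is constant, so the original vector can be read back from any single column (for example `[row[0] for row in image]`), which is what makes the mapping lossless.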

2.5.2. Image Specification

Each image is saved as a 16-bit grayscale PNG file with pixel values in the range $[0, 65535]$, obtained by scaling the normalized feature values:
$$p_i = \lfloor x_i \times 65535 \rfloor$$
where $p_i \in \{0, \ldots, 65535\}$ is the integer pixel value corresponding to feature $i$, and $\lfloor \cdot \rfloor$ denotes the floor operation. The use of 16-bit encoding keeps the quantization error below $1/65535$ of the normalized range, so the numerical precision of the min-max normalized features is preserved to within that bound.
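A short sketch of the floor-based 16-bit scaling and its inverse shows how much precision survives the round trip. The helper names `to_uint16` and `from_uint16` are assumptions; writing the PNG file itself is omitted.

```python
import math

MAX_16BIT = 65535


def to_uint16(x):
    """Quantise a normalised value x in [0, 1] to an integer pixel in [0, 65535]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    return math.floor(x * MAX_16BIT)


def from_uint16(p):
    """Map a 16-bit pixel value back to the [0, 1] range."""
    return p / MAX_16BIT


# Round-trip error for a grid of hypothetical normalised feature values
errors = [x - from_uint16(to_uint16(x)) for x in (i / 1000 for i in range(1001))]
max_error = max(errors)
```

Under these assumptions the round-trip error stays below 1/65535 (about 1.5e-5), which is the resolution limit of the 16-bit representation.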

2.5.3. Encoding Illustration

Figure 7 illustrates the encoding process for four representative loan samples (two non-default and two default). For each sample, the left panel displays the raw feature vector as a bar chart, the center panel shows the resulting 64 × 64 grayscale image, and the right panel identifies the ten features with the highest values in that sample. Visually, the two classes are distinguishable by their overall intensity patterns: default instances tend to exhibit higher pixel intensities in rows corresponding to high-risk features such as int_rate, dti, and bc_util, while non-default instances display consistently lower intensities across the same rows.

2.5.4. Dataset Summary

This conversion was applied independently to all three dataset configurations: the original imbalanced dataset, the SMOTE-upsampled dataset, and the randomly downsampled dataset. Images were organized into class-specific subfolders (label_0 for non-default and label_1 for default) to facilitate direct use in standard image classification pipelines. Table 5 summarizes the total number of images generated for each dataset.

2.6. Classification Models

2.6.1. Overview

Seven classification models were trained and evaluated across three experimental pipelines. The first pipeline operates on the 64 selected tabular features directly and includes a Deep Neural Network (DNN), a Support Vector Machine (SVM), a Random Forest (RF), and a Decision Tree (DT). The second pipeline converts the 64-feature vectors into 64 × 64 grayscale images and trains a Convolutional Neural Network (CNN) end-to-end on these images. The third pipeline uses the trained CNN as a fixed feature extractor and feeds the resulting 128-dimensional feature vectors to three classical classifiers, yielding CNN+SVM, CNN+RF, and CNN+DT hybrid models. This seven-model experimental design enables a systematic comparison of tabular learning, image-based deep learning, and hybrid feature-extraction approaches to credit risk prediction [46]. All models were evaluated using 5-fold stratified cross-validation on three datasets: the original imbalanced dataset, the SMOTE-upsampled dataset, and the randomly downsampled dataset.

2.6.2. Deep Neural Network (DNN)

The DNN operates directly on the 64 selected tabular features. It comprises four fully connected hidden layers with progressively decreasing width (256 → 128 → 64 → 32 units), followed by a single sigmoid output neuron. Each hidden layer applies the ReLU activation function with $L_2$ weight regularization, succeeded by a Dropout layer:
$$h_l = \mathrm{Dropout}_{p_l}\bigl(\mathrm{ReLU}(W_l h_{l-1} + b_l)\bigr), \qquad l = 1, \ldots, 4, \quad h_0 = x \tag{1}$$
where $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ and $b_l \in \mathbb{R}^{d_l}$ are the weight matrix and bias vector of layer $l$, $\mathrm{ReLU}(z) = \max(0, z)$, and $p_l \in \{0.35, 0.30, 0.25, 0.20\}$ are the Dropout rates applied at successive layers. The output probability is:
$$\hat{y} = \sigma\left(w_{out}^{\top} h_4 + b_{out}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \tag{2}$$
where $\sigma$ is the sigmoid activation and $h_4 \in \mathbb{R}^{32}$ is the output of the final hidden layer. The DNN is trained by minimizing the weighted binary cross-entropy loss:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1\, y_i \log \hat{y}_i + w_0\, (1 - y_i) \log\left(1 - \hat{y}_i\right) \right] \tag{3}$$
where $y_i \in \{0, 1\}$ is the true label, $\hat{y}_i$ is the predicted probability, $w_1 = N_0 / N_1$ is the class weight for the minority class, $w_0 = 1$ for the majority class, and $N$ is the training set size. Optimization uses the Adam algorithm [47] with an initial learning rate of $\eta = 10^{-3}$, batch size 64, and a maximum of 100 epochs. Early stopping (patience = 15) monitors validation AUC, and ReduceLROnPlateau halves the learning rate after 7 epochs without improvement.
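To make the class weighting concrete, the weighted binary cross-entropy defined above can be evaluated by hand. This is a plain-Python sketch, not the training code of the study; the function name `weighted_bce` and the clipping constant `eps` (added to avoid taking the logarithm of zero) are assumptions.

```python
import math


def weighted_bce(y_true, y_prob, w1=None, w0=1.0, eps=1e-7):
    """Weighted binary cross-entropy with minority weight w1 = N0 / N1 by default."""
    n = len(y_true)
    n1 = sum(y_true)
    n0 = n - n1
    if w1 is None:
        w1 = n0 / n1  # up-weight the minority (default) class
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # assumed numerical safeguard
        total += w1 * y * math.log(p) + w0 * (1 - y) * math.log(1.0 - p)
    return -total / n


# Toy batch: three non-default loans and one default, all scored at 0.5
loss = weighted_bce([0, 0, 0, 1], [0.5, 0.5, 0.5, 0.5])

# Minority-class weight implied by the full-dataset class counts in the text
w1_dataset = 88158 / 27329  # approximately 3.23
```

In the toy batch the single default carries weight 3, so it contributes as much to the loss as the three non-default loans combined, giving a loss of 1.5 ln 2. In the study the weight is recomputed per training fold, so the full-dataset value shown here is only indicative.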

2.6.3. Support Vector Machine (SVM)

A Linear SVM with isotonically calibrated probability outputs was applied to the 64-dimensional tabular feature vectors. The classifier minimizes a regularized hinge loss:
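$$\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i^{*} (w^{\top} x_i + b)\bigr) \tag{4}$$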
where $w \in \mathbb{R}^{64}$ is the weight vector, $b$ is the bias, $C = 0.1$ controls the margin-error trade-off, and $y_i^{*} \in \{-1, +1\}$ is the signed label. A small value of $C$ promotes a wide-margin decision boundary and improves generalization on noisy financial data. Since LinearSVC does not natively produce calibrated probabilities, isotonic regression is applied via 5-fold internal cross-validation [48].

2.6.4. Random Forest (RF)

The Random Forest constructs an ensemble of T = 50 decision trees, each trained on a bootstrap sample. The predicted probability is averaged across all trees:
$$\hat{p}(y = 1 \mid x) = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t(y = 1 \mid x) \tag{5}$$
where $\hat{p}_t$ denotes the leaf-node class frequency of the $t$-th tree. At each split, $\sqrt{64} = 8$ candidate features are evaluated, decorrelating individual trees [49]. Tree depth is bounded at 8 levels with a minimum of 5 samples per leaf and class_weight='balanced' to up-weight the minority class.

2.6.5. Decision Tree (DT)

A single CART decision tree serves as the most interpretable baseline. At each internal node, the algorithm maximizes the Gini impurity reduction:
$$\Delta G = G_{parent} - \frac{n_L}{n} G_L - \frac{n_R}{n} G_R, \qquad G = 1 - \sum_{c \in \{0, 1\}} p_c^2 \tag{6}$$
where $n_L$, $n_R$, and $n$ are the sample counts in the left child, right child, and parent node, respectively, and $p_c$ is the proportion of class $c$ at a given node [50]. Maximum depth is set to 6, with minimum 20 samples per split and 10 per leaf.
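The split criterion can be checked on small examples. The sketch below computes the Gini impurity and the impurity reduction of a candidate split from per-class sample counts; the function names `gini` and `gini_gain` and the example counts are hypothetical.

```python
def gini(counts):
    """Gini impurity G = 1 - sum_c p_c^2 from per-class sample counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_gain(left_counts, right_counts):
    """Impurity reduction of a split, given class counts in each child node."""
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    n_l, n_r = sum(left_counts), sum(right_counts)
    n = n_l + n_r
    return gini(parent) - (n_l / n) * gini(left_counts) - (n_r / n) * gini(right_counts)


# Counts are [non-default, default] in each child (hypothetical values)
perfect_split = gini_gain([50, 0], [0, 50])        # children are pure
useless_split = gini_gain([25, 25], [25, 25])      # children mirror the parent
partial_split = gini_gain([40, 10], [10, 40])      # partly separates the classes
```

A balanced parent node has impurity 0.5, so a perfect split recovers the full 0.5, a split that leaves the class mix unchanged gains nothing, and the partial split falls in between.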

2.6.6. Convolutional Neural Network (CNN)

The CNN operates on 64 × 64 × 1 grayscale images constructed from the 64-dimensional feature vectors (Section 2.5). Pixel values are normalized per-image to [0,1] via min-max scaling prior to any model layer. The architecture consists of four convolutional blocks and two fully connected layers, without Batch Normalization; regularization is achieved through $L_2$ penalties and Dropout. The architecture is illustrated in Figure 8.
Each convolutional block applies a 3 × 3 Conv2D layer with use_bias=True, ReLU activation, $L_2$ regularization ($\lambda = 10^{-4}$), and 2 × 2 MaxPooling. The four blocks use 32, 64, 128, and 256 filters; the first three are followed by Dropout at rates of 0.10, 0.15, and 0.20, whereas the final block uses Global Average Pooling in place of Dropout. The activation of each hidden layer follows Equation (1). The convolutional output feeds into two dense layers of width 256 and 128, with Dropout rates of 0.40 and 0.30 respectively. The 128-dimensional output of the final dense layer is the feature layer $h_{feat} \in \mathbb{R}^{128}$. For the standalone CNN, the output probability is:
$$\hat{y} = \sigma\left(w_{out}^{\top} h_{feat} + b_{out}\right)$$
The CNN is trained by minimizing the weighted binary cross-entropy (Equation 3), with the class weight $w_1 = N_0 / N_1$ applied per fold. Optimization uses the Adam algorithm with $\eta = 10^{-3}$, batch size 32, and a maximum of 50 epochs. Early stopping (patience = 10) monitors validation AUC and restores the best weights; ReduceLROnPlateau halves the learning rate after 5 epochs without improvement, with a floor of $\eta_{min} = 10^{-6}$. Horizontal flip augmentation is applied with probability 0.5 during training.
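As a sanity check on the architecture described above, the tensor shapes and trainable parameter counts can be traced without a deep learning library. The sketch assumes one Conv2D layer per block, 'same' padding (so only pooling changes the spatial size), and Global Average Pooling applied after the fourth block; none of these details is confirmed by the text beyond the filter counts and layer widths, and the helper names are hypothetical.

```python
def conv2d_params(in_ch, out_ch, k=3):
    """Weights plus biases of a k x k convolution (use_bias=True)."""
    return (k * k * in_ch + 1) * out_ch


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return (n_in + 1) * n_out


def trace_cnn(input_size=64, in_ch=1, filters=(32, 64, 128, 256), dense=(256, 128)):
    """Return (layer name, output shape, parameter count) for each trainable stage."""
    rows = []
    size, ch = input_size, in_ch
    for n, f in enumerate(filters, start=1):
        params = conv2d_params(ch, f)
        size //= 2  # assumed 'same' padding keeps the size; 2 x 2 pooling halves it
        ch = f
        rows.append((f"conv_block_{n}", (size, size, ch), params))
    width = ch  # global average pooling leaves one value per channel
    for d in dense:
        rows.append((f"dense_{d}", (d,), dense_params(width, d)))
        width = d
    rows.append(("output", (1,), dense_params(width, 1)))
    return rows


layers = trace_cnn()
total_params = sum(p for _, _, p in layers)
for name, shape, params in layers:
    print(f"{name:14s} {str(shape):16s} {params:>8d}")
print("total", total_params)
```

Under these assumptions the spatial size shrinks from 64 to 4 across the four blocks and the network has roughly 0.49 million trainable parameters, most of them in the fourth convolutional block. Dropout, pooling, and the $L_2$ penalties add no parameters.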

2.6.7. Hybrid CNN Models (CNN+SVM, CNN+RF, CNN+DT)

For the three hybrid models, the trained CNN backbone is used as a fixed feature extractor. A sub-model is constructed by routing the input through the backbone up to the feature layer, producing $h_{feat} \in \mathbb{R}^{128}$ for each sample. Feature extraction is performed on CPU in mini-batches of size 32 to prevent GPU memory overflow. The resulting feature matrix is normalized using a Min-Max scaler fitted on the training fold only and applied to the validation fold, ensuring leakage-free evaluation. The normalized features $\tilde{f} \in [0, 1]^{128}$ are then passed to one of three classical classifiers:
CNN+SVM applies a Linear SVM (Equation 4) with C = 0.1 and isotonic calibration to the 128-dimensional CNN features, identical in formulation to the tabular SVM but operating on the learned representation rather than raw features.
CNN+RF applies a Random Forest of T = 50 trees (Equation 5) to the CNN features, with max depth 8, $\sqrt{128} \approx 11$ features per split, and class_weight='balanced'.
CNN+DT applies a single Decision Tree of max depth 6 (Equation 6) to the CNN features, with class_weight='balanced' and minimum 10 samples per leaf. This hybrid provides the most interpretable decomposition of the CNN feature space.
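The leakage-free scaling step used before the downstream classifiers can be sketched as follows. The helper names `fit_minmax` and `transform_minmax` are hypothetical, and clipping validation values to [0, 1] is an assumption: the text states that the scaled features lie in [0, 1] but does not say how validation values outside the training range are handled.

```python
def fit_minmax(train_rows):
    """Learn per-feature minima and maxima from the training fold only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_minmax(rows, lo, hi, clip=True):
    """Scale rows with statistics learned elsewhere; optionally clip to [0, 1]."""
    scaled_rows = []
    for row in rows:
        scaled = []
        for v, a, b in zip(row, lo, hi):
            s = (v - a) / (b - a) if b > a else 0.0  # constant feature maps to 0
            if clip:
                s = min(max(s, 0.0), 1.0)  # assumed handling of out-of-range values
            scaled.append(s)
        scaled_rows.append(scaled)
    return scaled_rows


# Hypothetical 2-dimensional CNN features for a training and a validation fold
train_features = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
val_features = [[1.0, 40.0]]

lo, hi = fit_minmax(train_features)                  # validation fold never seen here
train_scaled = transform_minmax(train_features, lo, hi)
val_scaled = transform_minmax(val_features, lo, hi)
```

The key point is that `fit_minmax` only ever receives the training fold, so the validation fold cannot influence the scaling constants. The downstream SVM, Random Forest, or Decision Tree would then be fitted on `train_scaled` and evaluated on `val_scaled`.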

2.6.8. Hyperparameter Summary

Table 6 provides a complete summary of all hyperparameters for the seven models across the two pipelines.
Figure 9 illustrates the end-to-end research pipeline adopted in this study. Starting from the raw LendingClub dataset, the flowchart traces the sequential stages of label creation, data preprocessing, multi-criteria feature selection, and class balancing. Following feature-to-image encoding, the pipeline splits into two parallel branches: a tabular branch (DNN, SVM, RF, DT) and an image-based branch comprising a standalone CNN and three hybrid models (CNN+SVM, CNN+RF, CNN+DT). All seven models are subsequently evaluated under 5-fold stratified cross-validation, with pairwise statistical significance testing applied to identify the best-performing model.

2.7. Evaluation Protocol

All seven models are evaluated under 5-fold stratified cross-validation. In each fold, the model is trained exclusively on the training split and evaluated on the held-out validation split. For hybrid models, the CNN is first trained on the training fold; features are then extracted and scaled (scaler fitted on training only) before the downstream classifier is trained. No information from the validation fold enters any fitting step.
The following metrics are computed per fold: Accuracy, Precision, Recall (Sensitivity), Specificity, F1-score, AUC-ROC, AUC-PR (Average Precision), Brier score, Matthews Correlation Coefficient (MCC), Cohen's Kappa, and G-mean. Table 7 provides their formal definitions.
Results are reported as mean ± standard deviation across the 5 folds. Pairwise statistical comparisons between models are conducted using the Wilcoxon signed-rank test [51] on the 5-fold AUC-ROC scores, and the Friedman test [52] is applied globally across all models per dataset. A significance level of α = 0.05 is used throughout.
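The stratified splitting that underlies this protocol can be illustrated with a small standard-library sketch. The function name `stratified_kfold`, the round-robin assignment, and the seed value are assumptions; the study's own splitting routine and seed for cross-validation are not specified.

```python
import random


def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs whose class ratios match the full set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold receives its share
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


# Class counts reported in the text: 88,158 non-default and 27,329 default
labels = [0] * 88158 + [1] * 27329
for train_idx, val_idx in stratified_kfold(labels, k=5):
    default_share = sum(labels[i] for i in val_idx) / len(val_idx)
    print(len(train_idx), len(val_idx), round(default_share, 4))
```

Each validation fold holds about one fifth of the records with a default share close to the overall 23.66%. Any resampling, scaling, or CNN training would then be fitted on `train_idx` only, in line with the protocol described above.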

3. Results

The experimental pipeline was applied to the LendingClub accepted loans dataset, comprising 2,260,701 loan records across 151 features. Following label assignment and filtering to loans with conclusive repayment outcomes, 115,487 records were retained (88,158 non-default and 27,329 default, ratio 3.23:1). After removing 24 data leakage features, engineering 12 domain-specific financial ratios, applying median imputation, ordinal and one-hot encoding, and min-max normalization, the feature space was reduced to 128 variables. A multi-criteria statistical ranking procedure integrating ANOVA F-test, Mutual Information, Mann-Whitney U test, Point-Biserial correlation, and Kolmogorov-Smirnov test further reduced the feature set to 64 features, of which 90.6% were significant across all four hypothesis tests. Two class-balancing strategies were independently applied to the selected dataset: SMOTE oversampling, which expanded the training set to 176,316 samples at a 1:1 ratio by synthesizing 60,829 minority-class instances, and random undersampling, which reduced it to 54,658 samples at a 1:1 ratio by removing 60,829 majority-class records. The 64 selected features were subsequently encoded as 64 × 64 grayscale images, yielding a total of 346,461 images across the three datasets. Seven classification models were trained and evaluated under 5-fold stratified cross-validation on each dataset: a Deep Neural Network (DNN), Support Vector Machine (SVM), Random Forest (RF), and Decision Tree (DT) operating on tabular features, a Convolutional Neural Network (CNN) operating on the image representations, and three hybrid models (CNN+SVM, CNN+RF, CNN+DT) that use the CNN as a fixed feature extractor. The following sections report the classification performance of each model across all three datasets in terms of Accuracy, Precision, Recall, Specificity, F1-score, AUC-ROC, AUC-PR, Brier score, MCC, Kappa, and G-mean.
Figure 10 presents box plots comparing the distributions of eight key financial features between non-default and default loan classes in the original dataset, providing an initial visual assessment of the discriminative capacity of individual features prior to formal statistical testing.
Table 8 summarizes the univariate statistical comparison between non-default and default loan groups for 14 key features, reporting group means and standard deviations, the mean difference, Mann-Whitney U statistic and p-value, and the Point-Biserial correlation coefficient, providing a quantitative basis for assessing the discriminative significance of each feature.
All 14 features examined exhibited statistically significant differences between non-default and default loan groups ($p < 0.001$ in all cases), as confirmed by the Mann-Whitney U test. The interest rate (int_rate) showed the largest absolute effect size, with default loans carrying a mean rate of 16.23% compared to 13.11% for non-default loans ($r_{pb} = 0.253$), underscoring its role as the single most discriminative univariate predictor in the dataset. Average FICO score (fico_avg) demonstrated the second strongest effect, in the negative direction ($r_{pb} = -0.134$), with defaulted borrowers showing a mean score approximately 10.6 points lower than non-defaulting peers (691.9 vs. 702.5). Revolving utilization rate, loan amount, and monthly installment also exhibited consistent positive associations with default, with mean differences of +4.9 percentage points, +USD 1,748, and +USD 57.7, respectively. Conversely, annual income ($r_{pb} = -0.042$) and credit history length ($r_{pb} = -0.024$) were modestly lower in the default group, suggesting that higher borrower income and longer credit tenure are protective factors against default. The engineered ratios (loan_to_income, installment_to_income) reached statistical significance despite negligible effect sizes ($|r_{pb}| < 0.003$), indicating that their discriminative contribution is primarily captured through nonlinear interactions rather than univariate mean shifts. Collectively, these findings confirm that the selected feature set contains strong class-discriminative signal, providing a statistically sound basis for the subsequent machine learning experiments.
Figure 11 presents the pairwise Spearman correlation structure among 14 key features in the dataset, revealing the degree of linear and monotonic inter-feature dependencies that may affect model training and feature redundancy.
The correlation analysis reveals several noteworthy inter-feature relationships. The strongest positive correlation is observed between installment and loan_amnt ($r_s = 0.97$), which is expected given that the monthly payment amount is directly derived from the loan principal. Similarly, loan_to_income and installment_to_income are highly correlated ($r_s = 0.97$), reflecting their shared dependence on loan size and income. The loan_amnt feature also shows a moderately strong positive correlation with annual_inc ($r_s = 0.44$) and loan_to_income ($r_s = 0.68$), indicating that larger loans are more commonly issued to higher-income borrowers, though borrowing relative to income rises in parallel. A notable negative correlation is observed between fico_avg and revol_util ($r_s = -0.44$), consistent with the established credit scoring principle that higher revolving utilization penalizes FICO scores. The int_rate feature exhibits a moderate negative association with fico_avg ($r_s = -0.38$), confirming that borrowers with lower credit scores are systematically assigned higher interest rates. The total_acc and open_acc features share a moderate positive correlation ($r_s = 0.71$), as borrowers with more total credit lines tend to maintain more open accounts. Overall, while several feature pairs exhibit moderate-to-strong correlations, particularly among loan size, installment, and income-ratio features, the majority of inter-feature correlations remain below $|r_s| = 0.35$, suggesting that the selected feature set retains sufficient diversity for effective multivariate modeling.
Figure 12 displays the normalized scores of the top 30 features ranked by the composite selection criterion, evaluated independently across four complementary statistical tests, providing a multi-dimensional view of the discriminative capacity of each feature prior to the final composite ranking.
The feature selection procedure applied five statistical tests to all 128 candidate features and retained the top 64 based on a weighted composite score. Among the 64 selected features, all achieved statistical significance ($p < 0.05$) in at least two of the four hypothesis tests, with 58 features (90.6%) significant in all four simultaneously. As shown in Figure 12, the three credit-grade related features, sub_grade_encoded (Rank 1), int_rate (Rank 2), and grade_encoded (Rank 3), achieved near-perfect normalized scores across all four criteria, with composite scores of 0.999, 0.984, and 0.972 respectively, reflecting their dominant role in separating default from non-default loans. FICO score features (fico_range_high, fico_avg, fico_range_low) ranked 5th through 7th and exhibited strong Mann-Whitney effect sizes ($r = 0.132$) and Point-Biserial correlations ($|r_{pb}| \approx 0.134$), consistent with their established role in credit risk assessment. The installment feature ranked 4th overall (composite score = 0.417) despite a relatively modest F-score ($F = 873.2$), attributable to its notably high Mutual Information normalized score (0.747), indicating a strong nonlinear association with the default label beyond what linear statistics capture. Among behavioral credit features, bc_open_to_buy (Rank 14), tot_hi_cred_lim (Rank 12), and avg_cur_bal (Rank 16) demonstrated consistent mid-range scores across all four tests, with Mann-Whitney effect sizes in the range of $r \approx 0.09$ to $0.10$. Verification status indicators (verification_status_Not Verified, verification_status_Verified) and home ownership categories (home_ownership_MORTGAGE, home_ownership_RENT) also ranked within the top 20, confirming the discriminative value of applicant-level categorical attributes.
The lowest-ranked selected feature, max_bal_bc (Rank 65, composite score = 0.057), retained eligibility through significance across all four tests despite low normalized scores, underscoring the advantage of multi-criteria selection over single-test thresholding. Collectively, the composite scoring framework captures both linear and nonlinear feature-label associations, ensuring that the retained 64-feature set provides a statistically robust and informationally diverse input representation for subsequent model training.
Figure 13 presents a comprehensive performance comparison of the four tabular classifiers, namely DNN, SVM, Random Forest, and Decision Tree, trained on the 64 selected features across three dataset configurations. Panel (A) displays heatmaps of eleven evaluation metrics for each model-dataset combination, while Panel (B) presents the corresponding aggregated confusion matrices, enabling a simultaneous assessment of discriminative capacity, class-level sensitivity, and prediction calibration across balancing strategies.
The four tabular classifiers were applied to the 64 selected features under three dataset conditions, yielding distinct performance profiles across balancing strategies. On the original imbalanced dataset, Random Forest and Decision Tree achieved near-perfect discriminative scores, with AUC-ROC values of 1.000 and F1-scores of 0.998 and 0.996, respectively. However, the confusion matrices expose a severe majority-class bias in the SVM, which correctly identified 87,239 non-default instances (75.5%) but misclassified 26,281 default loans as non-default (22.8%), retaining only 1,048 true default detections (0.9%), resulting in a Recall of 0.038 and an MCC of 0.092. In contrast, the DNN demonstrated substantially more balanced predictions on the same dataset, correctly classifying 84,969 non-default (73.6%) and 27,238 default instances (23.6%), with only 91 false negatives (0.1%), yielding a Recall of 0.996 and an AUC-ROC of 0.998. On the SMOTE-upsampled dataset, all models exhibited improved minority-class sensitivity. The DNN correctly identified 76,173 non-default (43.3%) and 82,520 default instances (46.8%), achieving an F1-score of 0.915 and AUC-ROC of 0.940, while the SVM remained the weakest performer, with 47,956 true non-default predictions (27.2%) and 63,375 true default detections (35.9%), corresponding to an AUC-ROC of 0.677 and MCC of 0.267. Random Forest on the SMOTE dataset produced the most balanced confusion matrix among tree-based models, with 86,160 correct non-default predictions (48.9%) and 87,247 correct default predictions (49.5%), yielding an AUC-ROC of 0.997 and F1-score of 0.984. 
Under random downsampling, Random Forest again demonstrated superior performance, correctly classifying 26,793 non-default (49.0%) and 26,191 default instances (47.9%), with an AUC-ROC of 0.994 and F1-score of 0.969, while the Decision Tree achieved 27,208 true non-default (50.0%) and 26,251 true default predictions (44.4%), corresponding to an AUC-ROC of 0.978 and F1-score of 0.934. Collectively, these results confirm that tree-based ensemble methods retain strong and consistent predictive performance across all dataset conditions, that the SVM exhibits pronounced sensitivity to class imbalance, and that balancing strategies, particularly SMOTE, play a critical role in recovering minority-class sensitivity without substantially sacrificing specificity.
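The reported SVM figures on the original dataset can be cross-checked from the confusion-matrix counts quoted above. In this sketch the function name `binary_metrics` is hypothetical, and the false-positive count is not stated in the text: it is derived here as 88,158 − 87,239 = 919. Because the study reports per-fold means, metrics computed from pooled counts are expected to agree only approximately.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Common binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy, "precision": precision, "recall": recall,
        "specificity": specificity, "f1": f1, "g_mean": g_mean, "mcc": mcc,
    }


# Tabular SVM, original dataset: TN, FN, and TP as reported; FP derived (see above)
svm = binary_metrics(tp=1048, fp=88158 - 87239, tn=87239, fn=26281)
for name, value in svm.items():
    print(f"{name:12s} {value:.3f}")
```

The pooled counts give a Recall of about 0.038 and an MCC of about 0.092, in line with the reported values, while accuracy remains near 0.76 because almost every loan is assigned to the majority class.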
Figure 14 presents the classification performance of the CNN and three hybrid models (CNN+SVM, CNN+RF, CNN+DT) trained on 64×64 grayscale images encoded from the 64 selected tabular features. Panel (A) displays grouped bar charts of nine evaluation metrics averaged over 5-fold stratified cross-validation for each model under three dataset configurations, with error bars representing the standard deviation across folds. Panel (B) presents the corresponding aggregated confusion matrices, reporting absolute prediction counts and class-relative percentages for each model-dataset combination. Table 9 summarizes the five-fold cross-validation performance of the CNN and three hybrid models across three dataset configurations, reporting mean ± standard deviation for four key metrics. Table 10 presents the pairwise Wilcoxon signed-rank test p-values and absolute AUC-ROC differences between all model pairs, along with the global Friedman test, across all three dataset conditions in a single consolidated table.
The results in Table 9 and Table 10 reveal consistent and statistically supported performance patterns across all dataset configurations. CNN+RF achieved the highest scores on all four metrics in every condition, with peak performance on the original imbalanced dataset (Accuracy = 0.9872 ± 0.0028, AUC-ROC = 1.0000 ± 0.0000, F1 = 0.9914 ± 0.0008, MCC = 0.9772 ± 0.0008), demonstrating that combining CNN-derived feature representations with a Random Forest ensemble yields superior discriminative capacity and cross-fold stability. Performance declined progressively from the original to the SMOTE and downsampled conditions for all models, consistent with the reduction in effective training set size under random downsampling. The standalone CNN remained competitive across all conditions, achieving AUC-ROC values of 1.0000, 0.9913, and 0.9840, respectively, confirming the validity of the feature-to-image encoding strategy as a representation for tabular credit risk data. CNN+SVM ranked third in all configurations, with the largest performance penalty observed under random downsampling (MCC = 0.8600 ± 0.0085). CNN+DT consistently produced the lowest scores, indicating that a single shallow tree is insufficient to exploit the CNN feature space effectively. The global Friedman test confirmed statistically significant rank differences among the four models in all dataset conditions (p ≤ 0.038). Pairwise Wilcoxon tests further revealed that CNN+RF significantly outperformed CNN+SVM and CNN+DT across all configurations (p < 0.05), while the differences between CNN and CNN+DT were non-significant under SMOTE (p = 0.317) and downsampled (p = 0.412) conditions, suggesting comparable discriminative capacity between these two models when training data is balanced or reduced.

4. Discussion

This study evaluated seven classification models for credit default prediction using a large-scale real-world dataset of 115,487 LendingClub loan records (88,158 non-default and 27,329 default; imbalance ratio 3.23:1), derived from an initial pool of over 2.26 million loan applications and preprocessed through a six-stage pipeline that reduced the feature space from 151 to 64 statistically selected variables. Across all experimental conditions and dataset configurations, tree-based ensemble methods, particularly Random Forest and its CNN-hybrid variant (CNN+RF), consistently achieved superior and stable performance, with CNN+RF attaining peak AUC-ROC of 1.000, F1-score of 0.991, and MCC of 0.977 on the original imbalanced dataset, confirming that pairing deep convolutional feature extraction with ensemble learning yields a highly discriminative classifier. The feature-to-image encoding strategy, in which each 64-dimensional feature vector was mapped to a 64×64 grayscale image, proved to be an effective representation for tabular credit risk data, as the standalone CNN achieved competitive AUC-ROC values of 1.000, 0.991, and 0.984 across the original, SMOTE-upsampled, and randomly downsampled datasets, respectively. Among the hybrid architectures, CNN+RF outperformed both CNN+SVM and CNN+DT across all metrics and dataset conditions, as confirmed by pairwise Wilcoxon signed-rank tests (p < 0.05) and global Friedman tests (p ≤ 0.038), while CNN+DT consistently ranked lowest, suggesting that shallow decision trees lack sufficient capacity to exploit high-dimensional CNN feature representations. The tabular SVM exhibited pronounced sensitivity to class imbalance, achieving a Recall of only 0.038 and MCC of 0.092 on the original dataset, whereas the DNN demonstrated more balanced class predictions under the same condition. 
Regarding balancing strategies, SMOTE oversampling consistently outperformed random downsampling across all models, with the latter associated with greater performance variability and reduced MCC scores, reinforcing the value of synthetic minority augmentation over majority-class reduction in imbalanced credit risk settings.
Table 11 provides a structured summary of representative prior studies on credit default prediction using the LendingClub dataset or methodologically comparable P2P lending datasets, reporting key study characteristics including dataset size, modeling approach, balancing strategy, and primary performance outcomes. This comparative overview contextualizes the contributions of the present study within the existing literature and highlights methodological advances introduced here.
The results of the present study are broadly consistent with the existing literature in confirming the superiority of ensemble-based methods over linear and single-tree classifiers for credit default prediction on LendingClub data. Prior studies employing Random Forest on the LendingClub dataset have consistently reported strong performance, with Zhu et al. [10] demonstrating that RF trained with SMOTE outperforms SVM and Decision Tree, and Monje et al. [11] achieving F1-macro scores exceeding 0.90 using nine carefully selected features from the full LendingClub history. Gradient boosting methods have also emerged as competitive alternatives, with Chang et al. [5] reporting XGBoost as the best-performing model among seven classifiers, and Akinjole et al. [14] showing that stacking ensembles combining RF, XGBoost, and MLP reach accuracy of 93.7% and AUC of 97.8% on a balanced LendingClub dataset. The present study extends these findings by demonstrating that CNN+RF achieves AUC-ROC of 1.000 and MCC of 0.977 on the original dataset, substantially surpassing the performance levels reported in prior tabular-only or single-model studies, and suggests that combining deep visual feature extraction from tabular-to-image representations with ensemble learning provides a meaningful and quantifiable performance advantage over conventional approaches. Regarding class imbalance handling, the literature consistently supports the use of synthetic oversampling. Souadda et al. [53] and Akinjole et al. [14] both found SMOTE and its variants to outperform random undersampling and no-balancing conditions, a finding directly corroborated here: models trained on SMOTE-upsampled data consistently outperformed those trained on randomly downsampled data across all seven classifiers.
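The difference between the two balancing strategies compared above can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than the implementation used in the study: the function names, the toy two-dimensional data, and the choice of k = 3 neighbours are assumptions, and a production pipeline would more likely rely on an established library implementation of SMOTE.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_undersample(majority, minority, rng):
    """Discard majority records at random until both classes have equal size."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_oversample(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolating toward neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data only (NOT LendingClub records): 30 majority and 10 minority points.
rng = random.Random(42)
majority = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(30)]
minority = [[rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)] for _ in range(10)]

under_major, under_minor = random_undersample(majority, minority, rng)
synthetic = smote_oversample(minority, len(majority) - len(minority), 3, rng)
balanced_minority = minority + synthetic
print(len(under_major), len(under_minor), len(majority), len(balanced_minority))
```

Undersampling balances the classes by discarding majority records, whereas the SMOTE-style routine keeps every original record and adds interpolated minority points. In either case, resampling is normally applied to the training folds only, as described for the SMOTE-balanced and undersampled training data in this study.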
The novelty of the present work lies primarily in the application of a feature-to-image encoding strategy to the LendingClub dataset, in which each 64-dimensional tabular feature vector is mapped to a 64×64 grayscale image, enabling CNN-based spatial feature extraction on non-image financial data. This approach is conceptually related to the IGTD method introduced by Zhu et al. [44] in a biomedical context, where tabular data transformed into structured images yielded superior CNN performance compared to models trained on the original tabular representations. In the credit domain, Gür et al. [18] similarly explored 2D CNN architectures applied to image-encoded credit score data, though using smaller, lower-complexity datasets and without systematic evaluation across multiple balancing strategies. Kim and Cho [55] applied 1D CNN architectures to LendingClub repayment prediction with 5-fold cross-validation, reporting competitive results relative to logistic regression baselines, but without the hybrid feature-extraction framework evaluated here. The present study advances this line of research by providing, to the best of our knowledge, the first systematic evaluation of CNN-based image encoding of LendingClub tabular features, combined with a comprehensive seven-model comparison spanning tabular, image, and hybrid CNN+classifier pipelines, evaluated across three independently balanced datasets and eleven performance metrics with statistical significance testing. The consistently high AUC-ROC and MCC values achieved by CNN+RF across all dataset conditions, validated by Wilcoxon signed-rank and Friedman tests, provide strong evidence that the proposed hybrid architecture constitutes a robust and generalizable credit risk modeling framework.
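As a rough illustration of the encoding idea, the sketch below maps a 64-dimensional feature vector to a 64×64 grayscale image with one image row per feature. Several details are assumptions made for the example and are not confirmed by the text: the features are taken to be min-max scaled to [0, 1] and already ordered by composite rank, each row is filled with a constant 8-bit intensity, and the function name is hypothetical.

```python
def encode_feature_vector(features, size=64):
    """Map a scaled feature vector to a size x size grayscale image.

    Assumption: row i carries feature i (ordered by composite rank) as a
    constant intensity in 0..255. The exact pixel layout used in the
    study may differ.
    """
    if len(features) != size:
        raise ValueError(f"expected {size} features, got {len(features)}")
    image = []
    for value in features:
        clipped = min(max(value, 0.0), 1.0)   # guard against out-of-range inputs
        pixel = round(clipped * 255)          # 8-bit grayscale intensity
        image.append([pixel] * size)          # one row per feature
    return image


# Toy input: 64 evenly spaced values standing in for a scaled borrower profile.
toy_vector = [i / 63 for i in range(64)]
toy_image = encode_feature_vector(toy_vector)
print(len(toy_image), len(toy_image[0]), toy_image[0][0], toy_image[63][63])
```

Because every row here depends only on its own feature, adjacent rows are related only through the ranking order, which is the property that structure-aware encodings such as IGTD aim to improve on.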
Despite the strong quantitative performance, several limitations of the present study merit acknowledgment. First, the feature-to-image encoding assigns features to image rows according to their composite statistical ranking, without optimizing the spatial arrangement to preserve feature similarity or neighborhood structure, as done in IGTD [44] or NCTD [57]; exploring such structure-aware encoding methods may yield further performance gains. Second, the dataset is restricted to LendingClub accepted loans from a single national market, which may limit the generalizability of the findings to other credit markets or institutional lending contexts with different borrower profiles and regulatory environments. Third, although the CNN+RF model achieves near-perfect discriminative scores, the high performance on the original imbalanced dataset, where post-origination data are excluded but the class distribution closely mirrors real loan portfolios, warrants further prospective validation on temporally held-out data to confirm practical deployment readiness. Fourth, the present study does not incorporate model interpretability analysis such as SHAP or LIME, which are increasingly required for regulatory compliance in credit scoring applications; integrating explainability methods with the CNN feature space represents a natural direction for future work.
The discriminative performance of the models reported in Table 9 and Table 10 is deeply rooted in the informational content of the 64 selected features, whose statistical differentiation between default and non-default groups is quantified in Table 8 and visualized in Figure 10 and Figure 12. The most decisive contributor to class separation was interest rate (int_rate), which recorded the highest point-biserial correlation (r_pb = 0.253) and the second-highest composite feature ranking (Rank 2, composite score = 0.984). This finding is consistent with established credit risk theory: interest rate on a P2P loan reflects the platform's ex ante assessment of borrower risk, such that higher-risk borrowers are assigned higher rates, and simultaneously imposes a heavier repayment burden that elevates the probability of cash flow shortfall [3,5]. The 3.12 percentage point gap in mean interest rate between defaulting (16.23%) and non-defaulting borrowers (13.11%) observed in the present dataset is consistent with findings from Albanesi and Vamossy [58], who showed that interest rate is among the strongest predictors of consumer default, and with the MDPI loan default benchmark study by Zhang et al. [13], which identified loan interest rate as a top-three feature importance driver across multiple tree-based models. Average FICO score (fico_avg, Rank 5–7, r_pb = −0.134) represented the second strongest univariate signal, with defaulted borrowers showing a mean score approximately 10.6 points lower (691.9 vs. 702.5). The FICO score synthesizes payment history (35%), amounts owed (30%), length of credit history (15%), new credit (10%), and credit mix (10%) into a single creditworthiness index [59], which is why its negative association with default is not merely statistical but reflects a comprehensive profile of the borrower's financial discipline. 
Bhardwaj and Sengupta [60] demonstrated that origination FICO score groups exhibit monotonically ordered survival probabilities over the loan lifecycle, confirming the longitudinal predictive validity of the measure. The revolving utilization rate (revol_util, mean difference +4.91 percentage points) and debt-to-income ratio (dti, mean difference +2.30) further differentiated defaulters from non-defaulters, consistent with theoretical arguments that high revolving utilization signals overextension of available credit, while elevated DTI directly constrains the borrower's capacity to absorb income shocks without triggering missed payments [61]. The engineered features loan_to_income (LTI) and installment_to_income (ITI), although exhibiting negligible point-biserial correlations (|r_pb| < 0.003), still reached statistical significance via Mann-Whitney U test (p < 10^{-300}), indicating that their discriminative contribution operates through distributional tails and nonlinear interactions that univariate linear statistics fail to capture. This nonlinear contribution is reflected in the high Mutual Information score of installment (normalized score = 0.747, Rank 4) shown in Figure 12, which provided the primary justification for its composite ranking despite a modest F-statistic. The fact that tree-ensemble models, particularly CNN+RF and tabular RF, most effectively exploited this mixed linear and nonlinear feature signal is consistent with the capacity of Random Forests to partition high-dimensional, heterogeneously distributed inputs without assuming feature-label linearity [62].
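The point-biserial correlation used throughout this discussion can be written out directly, which also shows why a feature may have r_pb close to zero while its class-conditional distributions still differ. The sketch below is a minimal standard-library implementation using the population standard deviation (equivalent to the Pearson correlation between the feature and the 0/1 label); the function name and both toy datasets are illustrative assumptions, not values from the LendingClub data.

```python
import math


def point_biserial(values, labels):
    """Point-biserial correlation between a numeric feature and 0/1 labels."""
    n = len(values)
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    n1, n0 = len(group1), len(group0)
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    mean_all = sum(values) / n
    # Population standard deviation of the pooled feature values.
    std_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (mean1 - mean0) / std_all * math.sqrt(n1 * n0 / n ** 2)


# Toy case 1: class means differ, so the linear association is strong.
shifted_values = [10, 12, 14, 16, 18, 20]
shifted_labels = [0, 0, 0, 1, 1, 1]
r_shifted = point_biserial(shifted_values, shifted_labels)

# Toy case 2: identical class means but heavier tails in class 1.
# The correlation vanishes although the two distributions clearly differ.
tail_values = [-1, 1, -1, 1, -3, 3, -3, 3]
tail_labels = [0, 0, 0, 0, 1, 1, 1, 1]
r_tails = point_biserial(tail_values, tail_labels)

print(f"r_shifted={r_shifted:.3f} r_tails={r_tails:.3f}")
```

The second toy case mirrors the situation described for LTI and ITI: a mean-based statistic reports no association, whereas distribution-sensitive measures such as the Kolmogorov-Smirnov statistic or Mutual Information are designed to register this kind of difference in spread.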
Beyond their statistical significance, the features identified in this study carry direct implications for customer relationship management and lending strategy. Interest rate and loan grade (sub_grade_encoded, Rank 1, composite score = 0.999) together represent the lender's risk pricing decision at origination, meaning that the predictive models developed here effectively re-learn the platform's own risk stratification logic and extend it by incorporating borrower behavioral signals that are not fully captured in grade assignment alone. From a customer relationship perspective, the negative association between annual income (r_pb = −0.042) and default indicates that higher-income borrowers not only present lower objective repayment risk but also constitute a strategically valuable customer segment for the lender in terms of lifetime value and cross-selling potential [12]. Conversely, the positive association of delinquency history (delinq_2yrs, mean difference +0.055, p < 10^{-18}) and revolving utilization with default flags a behaviorally at-risk segment whose early identification enables proactive intervention, such as personalized repayment counseling, hardship plan offers, or loan restructuring, before default materializes. Albanesi and Vamossy [58] demonstrated that machine learning models that incorporate behavioral credit features can track individual default trajectories with substantially lower divergence between predicted and realized default rates than conventional FICO-based scoring, with AUC improvements from approximately 0.82 to 0.93, suggesting that deploying the CNN+RF architecture developed here within a real-time customer monitoring pipeline could meaningfully enhance early warning capabilities. 
The behavioral features in the Advanced Credit Metrics category, which accounted for 47% of the selected feature set as shown in Figure 5, capture dynamics such as trade line counts, utilization trajectories, and delinquency recency that are invisible to static credit scoring but are precisely the signals associated with emerging financial distress [10,21]. A credit institution deploying the present model could thus use its output scores not only for initial loan approval but also for ongoing customer risk monitoring, triggering differentiated engagement strategies, such as proactive outreach, limit adjustments, or early restructuring offers, for borrowers whose feature profiles shift toward the default region over time, thereby converting predictive risk intelligence into actionable relationship management [19,63,64].
All experiments were implemented in Python 3.10 using TensorFlow 2.12, the Keras API, and scikit-learn 1.3, executed on a workstation equipped with an NVIDIA GeForce RTX 3050 Ti GPU (4 GB VRAM), an Intel Core i7 CPU, and 32 GB of system RAM. Across the five-fold cross-validation procedure, tabular classifiers completed training within approximately 10 to 18 minutes in total, with the Decision Tree being the fastest (approximately 648 s) and the DNN the most time-intensive among tabular models (approximately 1,080 s), reflecting its iterative gradient-based optimization with early stopping. CNN-based models required substantially longer training durations, ranging from approximately 1,920 s for the standalone CNN to 2,580 s for CNN+RF, the latter of which incurred additional overhead from feature extraction across the full training fold and bootstrap aggregation of 50 downstream trees applied to the 128-dimensional CNN representations. Although the hybrid architectures demanded greater computational resources than their tabular counterparts, all training times remained within operationally feasible bounds for offline credit risk modeling workflows, and the superior discriminative performance of CNN+RF documented in Table 9 and Table 10 provides clear empirical justification for the additional computational investment.
Despite the strong quantitative performance achieved across all experimental conditions, several limitations of the present study warrant consideration. The feature-to-image encoding assigns features to image rows according to their composite statistical ranking rather than optimizing the spatial arrangement to reflect inter-feature similarity or neighborhood structure, as implemented in structure-aware methods such as IGTD and NCTD; adopting topology-preserving encodings may further enhance the CNN's capacity to exploit spatially coherent feature relationships. The dataset is confined to LendingClub accepted loans originating within a single national lending market, and the filtering to conclusively resolved loan outcomes necessarily excludes a substantial portion of raw records, both of which may constrain the generalizability of the trained models to institutional lenders operating under different regulatory frameworks, borrower profiles, or macroeconomic cycles. Furthermore, the cross-validation folds were drawn from a single temporal window without enforcing chronological partitioning, meaning that the reported performance estimates do not fully account for concept drift or covariate shifts that may affect model calibration in prospective deployment settings [3]. Finally, the present study does not incorporate post-hoc interpretability analyses such as SHAP or LIME, which are increasingly mandated by financial regulators for credit scoring systems under frameworks such as the European Union's General Data Protection Regulation and the Equal Credit Opportunity Act [4]; future work should therefore integrate explainability pipelines directly into the CNN+RF architecture to bridge the gap between predictive accuracy and regulatory transparency. 
Extensions of the present framework could further explore gradient boosting hybrids such as CNN+XGBoost or CNN+LightGBM, attention-based transformer architectures for tabular data [5], structure-aware tabular-to-image encoding strategies, and multi-market validation using publicly available credit datasets from diverse geographic and institutional contexts.
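The chronological partitioning noted among the limitations above could take the form of a forward-chaining split, in which each validation block contains only loans issued after every loan in the corresponding training set. The sketch below is one possible formulation under stated assumptions: the function name is hypothetical, issue dates are assumed to be sortable values such as ISO-formatted strings, and the equal-sized block scheme is an illustrative choice rather than a procedure taken from the study.

```python
import random


def chronological_folds(issue_dates, n_folds):
    """Forward-chaining splits: train on earlier loans, validate on later ones.

    Returns a list of (train_indices, test_indices) pairs. Each test block
    follows its training block in time, so no future record leaks backward.
    """
    if len(issue_dates) < n_folds + 1:
        raise ValueError("need at least n_folds + 1 records")
    order = sorted(range(len(issue_dates)), key=lambda i: issue_dates[i])
    block = len(order) // (n_folds + 1)
    folds = []
    for k in range(1, n_folds + 1):
        train = order[: k * block]
        # The last fold absorbs any remainder left by integer division.
        end = (k + 1) * block if k < n_folds else len(order)
        test = order[k * block : end]
        folds.append((train, test))
    return folds


# Toy issue dates (12 months in shuffled order), NOT taken from the dataset.
toy_dates = [f"2015-{month:02d}" for month in range(1, 13)]
random.Random(0).shuffle(toy_dates)
toy_folds = chronological_folds(toy_dates, n_folds=3)
for train_idx, test_idx in toy_folds:
    print(len(train_idx), len(test_idx))
```

Unlike stratified k-fold cross-validation, this scheme does not preserve class proportions within each block, so default rates per fold would need to be checked separately.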
The practical deployment of the CNN+RF framework developed in this study is most naturally situated within the loan origination and ongoing portfolio monitoring workflows of peer-to-peer lending platforms and consumer credit institutions. At the point of loan application, the model can be integrated into the underwriting pipeline as an automated risk scoring engine: the applicant's credit bureau attributes, income and employment information, and platform-assigned loan characteristics are assembled into the 64-feature vector, encoded as a 64×64 grayscale image via the preprocessing pipeline described in Section 2.5, and passed to the trained CNN backbone to produce a calibrated default probability that informs loan approval decisions, interest rate pricing, and credit limit assignment. Beyond origination, the model's reliance exclusively on features available at or before loan issuance makes it directly applicable to ongoing account-level monitoring: as borrower credit bureau data are periodically refreshed, updated feature vectors can be re-encoded and re-scored to generate dynamic risk trajectories that flag early signs of financial deterioration before a missed payment is recorded, consistent with the emerging paradigm of behavioral credit surveillance advocated by Yıldırım and Çelik [4] and Addo et al. [6]. Credit institutions could embed this monitoring capability within their customer relationship management systems, automatically triggering differentiated intervention pathways, ranging from proactive outreach and personalized repayment plan offers to credit limit adjustments or hardship program referrals, for borrowers whose risk scores cross predefined escalation thresholds. 
This dual-use architecture, simultaneously serving origination decisioning and portfolio surveillance, positions the proposed CNN+RF framework as a practically actionable and relationship-oriented tool for reducing credit losses while supporting responsible lending practices aligned with contemporary regulatory and ethical standards in consumer finance.
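The escalation logic described above can be sketched as a simple mapping from a predicted default probability to an intervention pathway. Everything specific in this example is hypothetical: the tier names, the probability thresholds, and the minimum score increase are placeholders that a lender would need to calibrate against its own portfolio and policies, and none of them is derived from the results of this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    min_score: float  # inclusive lower bound on predicted default probability


# HYPOTHETICAL thresholds, ordered from most to least severe.
TIERS = (
    Tier("hardship_referral", 0.60),
    Tier("proactive_outreach", 0.30),
    Tier("routine_monitoring", 0.00),
)


def escalation_tier(default_probability):
    """Return the name of the first tier whose threshold the score reaches."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must lie in [0, 1]")
    for tier in TIERS:
        if default_probability >= tier.min_score:
            return tier.name
    return TIERS[-1].name  # unreachable while the last threshold is 0.0


def flag_deterioration(score_history, min_increase=0.10):
    """True when the latest score exceeds the earliest by min_increase or more."""
    if len(score_history) < 2:
        return False
    return score_history[-1] - score_history[0] >= min_increase


# Toy trajectory of periodically refreshed risk scores for one borrower.
toy_history = [0.10, 0.15, 0.35]
print(escalation_tier(toy_history[-1]), flag_deterioration(toy_history))
```

Separating the tier lookup from the trajectory check keeps the two monitoring signals independent: a borrower can be flagged for a rising score while still sitting in a low tier.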

5. Conclusion

This study presented a comprehensive machine learning framework for credit default prediction on the LendingClub peer-to-peer lending dataset, encompassing 115,487 loan records with conclusive repayment outcomes derived from an initial pool of over 2.26 million applications. Seven classification models were systematically evaluated across three experimental pipelines, tabular, image-based, and hybrid, under three independently balanced dataset conditions and eleven performance metrics, providing one of the most methodologically thorough comparative analyses of credit risk modeling architectures reported in the P2P lending literature to date.
The central contribution of this work is the demonstration that encoding tabular financial features as structured grayscale images and training a convolutional neural network on the resulting representations constitutes a viable and high-performing strategy for credit default classification. The CNN+RF hybrid model, which couples the spatial feature extraction capacity of the convolutional backbone with the ensemble generalization of a Random Forest classifier, achieved peak performance of AUC-ROC = 1.000, Accuracy = 0.987, F1-Score = 0.991, and MCC = 0.977 on the original imbalanced dataset, and maintained consistently superior scores across SMOTE-upsampled and randomly downsampled conditions, with all performance advantages over competing models confirmed by Wilcoxon signed-rank and Friedman statistical tests. These results establish the CNN+RF architecture as a robust, dataset-condition-invariant framework for binary credit risk classification.
Several secondary findings of practical and methodological significance emerged from the experimental analysis. The multi-criteria composite feature selection procedure, integrating ANOVA F-test, Mutual Information, Mann-Whitney U test, Point-Biserial correlation, and Kolmogorov-Smirnov test, successfully reduced the feature space by 50% while retaining 90.6% of features significant across all four hypothesis tests simultaneously, with credit grade, interest rate, and FICO score emerging as the dominant discriminative signals. SMOTE oversampling consistently outperformed random undersampling across all seven models, reinforcing its suitability as the preferred imbalance correction strategy in credit risk settings where minority-class detection is the primary objective. The tabular SVM exhibited the most pronounced sensitivity to class imbalance among all evaluated classifiers, underscoring the importance of selecting architectures with inherent or explicit mechanisms for handling distributional asymmetry.
From a practical standpoint, the proposed framework is directly deployable within the loan origination and portfolio monitoring workflows of lending institutions. Its exclusive reliance on pre-origination features ensures the complete absence of data leakage, while its dual applicability to initial underwriting and ongoing behavioral risk surveillance positions it as a versatile and relationship-oriented credit risk tool. Future research directions include the integration of structure-aware tabular-to-image encoding methods, the incorporation of explainability techniques such as SHAP to satisfy regulatory transparency requirements, the evaluation of transformer-based architectures for tabular data, and the prospective validation of the framework across multiple lending markets and temporal out-of-sample partitions to confirm its generalizability under real-world deployment conditions.

Author Contributions

A.S. conceptualized the study, designed the research framework, coordinated data collection, and served as the corresponding author. H.N. contributed to the methodological design, data analysis strategy, and machine learning implementation. E.S. assisted in literature review, theoretical development, and manuscript drafting. N.Ş. contributed to questionnaire design, validation procedures, and interpretation of organizational behavior findings. H.N. provided expertise in advanced data analysis, machine learning model evaluation, and critically reviewed the manuscript for methodological rigor. All authors reviewed, revised, and approved the final version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is publicly available from Kaggle. It can be accessed at the following link: https://www.kaggle.com/datasets/wordsforthewise/lending-club.

Conflict of Interest

The authors declare no conflict of interest.

References

  1. Milne; Parboteeah, P. The business models and economics of peer-to-peer lending. 2016.
  2. Global heterogeneous catalyst (metal, chemical, zeolites) market size, share & trends analysis report 2023–2030. Focus on Catalysts, 2024.
  3. Turiel, J. D.; Aste, T. P2P Loan Acceptance and Default Prediction with Artificial Intelligence. SSRN Electron. J. 2019.
  4. Giudici, P.; Hadji-Misheva, B.; Spelta, A. Network based credit risk models. Qual. Eng. 2019, vol. 32, 199–211.
  5. Chang, A.-H.; Yang, L.-K.; Tsaih, R.-H.; Lin, S.-K. Machine learning and artificial neural networks to construct P2P lending credit-scoring model: A case using Lending Club data. In Quantitative Finance and Economics; 2022.
  6. Serrano-Cinca; Gutiérrez-Nieto, B.; López-Palacios, L. Determinants of Default in P2P Lending. PLoS ONE 2015, vol. 10.
  7. Atef, M.; Ouf, S.; Seoud, W.; Gabr, M. I. A novel approach using explainable prediction of default risk in peer-to-peer lending based on machine learning models. Neural Comput. Appl. 2025, vol. 37, 21783–21803.
  8. Malagon, E.; Troncoso, D.; Rubio, A.; Ponce, H. Machine Learning Techniques in Credit Default Prediction. Mexican International Conference on Artificial Intelligence, 2022.
  9. Shi, S.; Tse, R.; Luo, W.; d’Addona, S.; Pau, G. Machine learning-driven credit risk: a systemic review. Neural Comput. Appl. 2022, vol. 34, 14327–14339.
  10. Zhu, L.; Qiu, D.; Ergu, D.; Ying, C.; Liu, K. A study on predicting loan default based on the random forest algorithm. International Conference on Information Technology and Quantitative Management, 2019.
  11. Núñez Mora, J. A.; Moncayo, P.; Franco, C.; Madrazo-Lemarroy, P.; Beltrán, J. Loan Default Prediction: A Complete Revision of LendingClub. Rev. Mex. De Econ. Y Finanz. 2023.
  12. Monje, L.; Carrasco, R. A.; Sánchez-Montañés, M. Machine Learning XAI for Early Loan Default Prediction. In Computational Economics; 2025.
  13. Zhang, X.; et al. Data-Driven Loan Default Prediction: A Machine Learning Approach for Enhancing Business Process Management. Syst. 2025, vol. 13, 581.
  14. Akinjole, A.; Shobayo, O.; Popoola, J.; Okoyeigbo, O.; Ogunleye, B. Ensemble-Based Machine Learning Algorithm for Loan Default Risk Prediction. In Mathematics; 2024.
  15. Suram, R. Efficient Deep Learning Models for Accurate Default Loan Prediction in Credit Risk Management. Int. J. Emerg. Res. Eng. Technol. 2026.
  16. Melese, T.; Berhane, T.; Mohammed, A.; Walelgn, A. Credit-Risk Prediction Model Using Hybrid Deep—Machine-Learning Based Algorithms. In Scientific Programming; 2023.
  17. Kvamme, H.; Sellereite, N.; Aas, K.; Sjursen, S. Predicting mortgage default using convolutional neural networks. Expert Syst. Appl. 2018, vol. 102, 207–217.
  18. Gür, Y. E.; Toğaçar, M.; Solak, B. Integration of CNN Models and Machine Learning Methods in Credit Score Classification: 2D Image Transformation and Feature Extraction. Comput. Econ. 2025, vol. 65, 2991–3035.
  19. Li, L.-H.; Sharma, A. K.; Cheng, S.-T. Explainable AI based LightGBM prediction model to predict default borrower in social lending platform. Intell. Syst. Appl. 2025, vol. 26, 200514.
  20. wordsforthewise. Lending Club Loan Dataset. 2017. Available online: https://www.kaggle.com/datasets/wordsforthewise/lending-club.
  21. Cai, X.; Dai, W.; Lu, J. Loan Default Prediction Based on Machine Learning Approaches. In Proceedings of the 2025 2nd International Conference on Generative Artificial Intelligence and Information Security, 2025.
  22. Haque, A.; Mahedi, M.; Lecturer, H. Bank Loan Prediction Using Machine Learning Techniques. ArXiv 2024, vol. abs/2410.08886.
  23. Alonso, M.; Carbo, J. Understanding the Performance of Machine Learning Models to Predict Credit Default: A Novel Approach for Supervisory Evaluation. SSRN Electron. J. 2021.
  24. Hancock, J. T.; Khoshgoftaar, T. M. Survey on categorical data for neural networks. J. Big Data 2020, vol. 7.
  25. Kriebel, J.; Stitz, L. Credit default prediction from user-generated text in peer-to-peer lending using deep learning. Eur. J. Oper. Res. 2021, vol. 302, 309–323.
  26. Alwateer, M. M.; Atlam, E.; El-Raouf, M. M. A.; Ghoneim, O. A.; Gad, I. Missing Data Imputation: A Comprehensive Review. J. Comput. Commun. 2024.
  27. Bolikulov, F.; Nasimov, R.; Rashidov, A.; Akhmedov, F.; Cho, Y.-I. Effective Methods of Categorical Data Encoding for Artificial Intelligence Algorithms. In Mathematics; 2024.
  28. Zhang, L. Using Explainable Machine Learning to Predict Loan Risk in Consumer Finance. In Proceedings of the 2025 4th International Conference on Cyber Security, Artificial Intelligence and the Digital Economy, 2025.
  29. Sinsomboonthong, S. Performance Comparison of New Adjusted Min-Max with Decimal Scaling and Statistical Column Normalization Methods for Artificial Neural Network Classification. Int. J. Math. Math. Sci. 2022, vol. 2022, 3584406:1–3584406:9.
  30. Amorim, L. B. V. d.; Cavalcanti, G. D. C.; Cruz, R. M. O. The choice of scaling technique matters for classification performance. Appl. Soft Comput. 2022, vol. 133, 109924.
  31. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, vol. 40, 16–28.
  32. Miao, J.; Niu, L. A Survey on Feature Selection. Procedia Comput. Sci. 2016, vol. 91, 919–926.
  33. Zhao, G.-G.; Yang, J.; Zhang, L.; Yang, H. ANOVA F Test of Non-Null Hypothesis. Eur. J. Stat. 2024.
  34. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 2003, vol. 69 6 Pt 2, 066138.
  35. Mann, H. B.; Whitney, D. R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Stat. 1947, vol. 18, 50–60.
  36. Glass, G. V.; Hopkins, K. D. Statistical methods in education and psychology, 3rd ed; 1996.
  37. Kolmogorov, A. N. Sulla determinazione empirica di una legge di distribuzione. 1933.
  38. Guyon, M.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, vol. 3, 1157–1182.
  39. Krawczyk, A. Learning from imbalanced data: open challenges and future directions. Prog. Artif. Intell. 2016, vol. 5, 221–232.
  40. Chawla, N.; Bowyer, K.; Hall, L. O.; Kegelmeyer, W. P. SMOTE: Synthetic Minority Over-sampling Technique. ArXiv 2002, vol. abs/1106.1813.
  41. Drummond; Holte, R. C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling; 2003.
  42. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, vol. 6, 429–449.
  43. Fernández, A.; García, S.; Herrera, F.; Chawla, N. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. J. Artif. Intell. Res. 2018, vol. 61, 863–905.
  44. Zhu, Y.; et al. Converting tabular data into images for deep learning with convolutional neural networks. Sci. Rep. 2021, vol. 11.
  45. Li, X. Financial Fraud Identification and Interpretability Study for Listed Companies Based on Convolutional Neural Network. ArXiv 2025, vol. abs/2512.06648.
  46. Barboza, F.; Kimura, H.; Altman, E. I. Machine learning models and bankruptcy prediction. Expert Syst. Appl. 2017, vol. 83, 405–417.
  47. Kingma, P.; Ba, J. Adam: A Method for Stochastic Optimization. CoRR 2014, vol. abs/1412.6980.
  48. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, 2005.
  49. Breiman, L. Random Forests. Mach. Learn. 2001, vol. 45, 5–32.
  50. Speybroeck, N. Classification and regression trees. Int. J. Public Health 2012, vol. 57, 243–246.
  51. Wilcoxon, F. Individual Comparisons by Ranking Methods. Biometrics 1945, vol. 1, 196–202.
  52. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, vol. 32, 675–701.
  53. Souadda, L. I.; Halitim, A. R.; Benilles, B.; Oliveira, J. M.; Ramos, P. Optimizing Credit Risk Prediction for Peer-to-Peer Lending Using Machine Learning. Forecasting, 2025.
  54. Chen, Y.-R.; Leu, J.-S.; Huang, S.-A.; Wang, J.-T.; Takada, J.-i. Predicting Default Risk on Peer-to-Peer Lending Imbalanced Datasets. IEEE Access 2021, vol. 9, 73103–73109.
  55. Kim, J.-Y.; Cho, S.-B. Towards Repayment Prediction in Peer-to-Peer Social Lending Using Deep Learning. In Mathematics; 2019.
  56. Yang, R. Machine Learning-Based Loan Default Prediction in Peer-to-Peer Lending. In Highlights in Science, Engineering and Technology; 2024.
  57. Alenizy, H. A.; Berri, J. Transforming tabular data into images via enhanced spatial relationships for CNN processing. Sci. Rep. 2025, vol. 15.
  58. Albanesi, S.; Vamossy, D. F. Predicting Consumer Default: A Deep Learning Approach. NBER Working Paper Series; 2019.
  59. myFico. What's in my FICO® Scores? 2020. Available online: https://www.myfico.com/credit-education/whats-in-your-credit-score.
  60. Bhardwaj, G.; Sengupta, R. Credit Scoring and Loan Default.
  61. Miller, S. Risk Factors for Consumer Loan Default: A Censored Quantile Regression Analysis. 2010.
  62. Gorishniy, Y. V.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting Deep Learning Models for Tabular Data. Neural Information Processing Systems, 2021.
  63. Addo, P. M.; Guégan, D.; Hassani, B. K. Credit Risk Analysis using Machine and Deep learning models. 2018.
  64. Ali Shahbazi, S. J.; Şirzad, Nefise; Najafzadeh, Hossein. A Multi-Method Examination of Transformational Leadership and Citizenship Behavior: Insights from Explainable Machine Learning. preprints 2026.
Figure 1. Visual overview of all 151 feature names in the LendingClub accepted loans dataset, color-coded by thematic category: Loan Characteristics (blue), Borrower Information (orange), Loan Status and Purpose (green), Credit Profile (red), Repayment History (purple), Advanced Credit Metrics (brown), Joint Application and Secondary Applicant (pink), and Hardship and Settlement (gray).
Figure 2. Dataset overview and class distribution following record filtering. Panel (a) shows the loan status distribution across all raw records; panel (b) presents the binary label distribution (Non-Default: 76.34%, Default: 23.66%); panel (c) displays the loan amount distribution by class; and panel (d) illustrates the risk band distribution across Low, Medium, and High risk categories.
Figure 3. Data quality assessment across the preprocessing pipeline. Panel (a) shows the missing value rates for key features prior to imputation, with horizontal reference lines at the 5% and 20% thresholds. Panel (b) illustrates the evolution of the feature count across five pipeline stages: Raw (151), After Leakage Removal (127), After Feature Engineering (139), After Encoding (184), and Final Selected (128).
Figure 4. Feature ranking by composite significance score. Panel (a) shows the score profile across all 128 candidate features; green bars indicate the 64 selected features and gray bars indicate rejected features, with the dashed line marking the selection cutoff at rank 64. Panel (b) presents the score density distributions for selected and rejected feature subsets.
Figure 5. The 64 features selected through composite statistical ranking, organized in a grid layout and color-coded by domain category: Loan Characteristics (blue, n = 8), Credit Profile (red, n = 16), Borrower Information (green, n = 9), Loan Status and Purpose (purple, n = 1), and Advanced Credit Metrics (brown, n = 30). Numbers in the top-left corner of each cell indicate the composite score rank of the corresponding feature.
Figure 6. Class distribution before and after the two balancing strategies. Panel (a) shows the original imbalanced class counts (ratio 3.23:1). Panel (b) shows the distribution after SMOTE oversampling, where the default class was augmented with 60,829 synthetic instances. Panel (c) shows the distribution after random undersampling, where 60,829 non-default records were removed. In both balanced datasets the resulting class ratio is exactly 1:1.
Figure 7. Feature-to-image encoding demonstration for four representative samples. Row i of each 64 × 64 grayscale image encodes the normalized value of feature i, replicated uniformly across all 64 columns. Top two rows: Non-Default samples (Label = 0); bottom two rows: Default samples (Label = 1). The left panel shows the feature vector as a bar chart, the center panel shows the resulting image, and the right panel lists the ten highest-valued features.
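The row-replication encoding described in the Figure 7 caption can be sketched in a few lines. This is a minimal illustration, assuming the 64 selected features are already min-max normalized to [0, 1]. The function name `encode_features_as_image` is hypothetical and is not taken from the authors' code, and pixel values are kept as floats rather than converted to 8-bit intensities.

```python
def encode_features_as_image(features, size=64):
    """Build a size x size grid where row i repeats features[i] in every column.

    Assumes `features` holds exactly `size` values already scaled to [0, 1].
    """
    if len(features) != size:
        raise ValueError(f"expected {size} features, got {len(features)}")
    for value in features:
        if not 0.0 <= value <= 1.0:
            raise ValueError("features must be min-max normalized to [0, 1]")
    # Each row is a constant horizontal stripe carrying one feature value.
    return [[value] * size for value in features]


# Illustrative input only: a synthetic ramp, not real borrower data.
example_features = [i / 63 for i in range(64)]
example_image = encode_features_as_image(example_features)
```

Because every row is constant, the image carries the same 64 numbers as the feature vector; any pattern a convolutional filter can detect arises from differences between vertically adjacent features.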
Figure 8. Schematic diagram of the CNN architecture. The shared backbone comprises four convolutional blocks followed by two fully connected layers. The 128-dimensional output of the penultimate dense layer is passed either to a sigmoid output neuron (CNN) or to a downstream classical classifier (CNN+SVM, CNN+RF, CNN+DT).
Figure 9. Research methodology flowchart for credit default prediction using a CNN hybrid classification framework.
Figure 10. Box plots of eight representative features stratified by loan outcome (Non-Default: blue; Default: red). Each panel displays the interquartile range (IQR), median, whiskers extending to 1.5× IQR, and individual outliers as dots. The eight features shown are: loan amount (loan_amnt), interest rate (int_rate), debt-to-income ratio (dti), average FICO score (fico_avg), annual income (annual_inc), revolving utilization rate (revol_util), credit history length (credit_history_years), and monthly installment (installment). Visible distributional shifts between classes, particularly for int_rate and fico_avg, suggest strong class-discriminative potential for these features.
Figure 11. Spearman correlation matrix of 14 key features. Cell color and intensity indicate the direction and magnitude of the correlation coefficient: red cells denote positive correlations and blue cells denote negative correlations, with color saturation proportional to absolute correlation strength. Only the lower triangle is displayed to avoid redundancy. Cells with $|r_s| < 0.10$ are shown without annotation.
Figure 12. Normalized scores of the top 30 features under four statistical criteria used in the composite feature selection procedure. Panel (a) shows the ANOVA F-test scores; panel (b) shows Mutual Information scores; panel (c) shows Mann-Whitney effect size (r); and panel (d) shows Point-Biserial correlation ($r_{pb}$). All scores are normalized to [0, 1] via min-max scaling. Features are sorted in descending order of score within each panel. The three credit-grade features (sub_grade_encoded, int_rate, grade_encoded) consistently achieve the highest scores across all four criteria.
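The min-max scaling and rank-based cutoff mentioned in the captions of Figures 4 and 12 can be illustrated as follows. This is only a sketch: the captions do not state how the individual criteria are merged into the composite score, so the equal-weight average used here is an assumption, and the function names `min_max` and `composite_ranking` are hypothetical.

```python
def min_max(values):
    """Scale a list of raw scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def composite_ranking(criteria, top_k):
    """Rank features by the mean of their min-max normalized criterion scores.

    `criteria` maps a criterion name to a list of raw scores, one per feature.
    ASSUMPTION: criteria are combined with equal weights.
    Returns (indices of the top_k features, composite score per feature).
    """
    normalized = [min_max(scores) for scores in criteria.values()]
    n_features = len(normalized[0])
    composite = [
        sum(column[i] for column in normalized) / len(normalized)
        for i in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda i: composite[i], reverse=True)
    return order[:top_k], composite


# Illustrative raw scores for three features under two criteria (made-up numbers).
selected, scores = composite_ranking(
    {"anova_f": [10.0, 5.0, 0.0], "mutual_info": [0.2, 0.3, 0.0]}, top_k=2
)
```

With 128 candidate features and `top_k=64`, the same call would reproduce the kind of rank-64 cutoff shown in Figure 4, under the stated averaging assumption.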
Figure 13. Performance heatmaps and confusion matrices for the four tabular classifiers trained on the 64 selected features across three dataset configurations. Panel (A) presents color-coded heatmaps of eleven evaluation metrics, including Accuracy, Precision, Recall, Specificity, F1-score, AUC-ROC, AUC-PR, Brier Score, MCC, Kappa, and G-Mean, for the DNN, SVM, Random Forest, and Decision Tree models evaluated on (a) the original imbalanced dataset, (b) the SMOTE-upsampled dataset, and (c) the randomly downsampled dataset. Color intensity encodes performance magnitude on a continuous scale from red (low) to dark green (high). Panel (B) displays the corresponding 5-fold aggregated confusion matrices for each model-dataset combination, reporting absolute prediction counts alongside class-relative percentages. Rows represent true class labels (ND: Non-Default; D: Default) and columns represent predicted class labels.
Figure 14. Five-fold cross-validation performance of CNN and hybrid models trained on 64×64 feature-encoded grayscale images across three dataset configurations. Panel (A) shows grouped bar charts of ten evaluation metrics, including Accuracy, Precision, Recall, Specificity, F1-score, AUC-ROC, AUC-PR, MCC, Kappa, and G-Mean, for the CNN, CNN+SVM, CNN+RF, and CNN+DT models evaluated on (a) the original imbalanced dataset, (b) the SMOTE-upsampled dataset, and (c) the randomly downsampled dataset. Bar heights represent the mean score averaged across 5 stratified folds, and error bars indicate the corresponding standard deviation. Panel (B) displays the 5-fold aggregated confusion matrices for each model-dataset combination, with cells reporting absolute prediction counts alongside class-relative percentages. Rows represent true class labels (ND: Non-Default; D: Default) and columns represent predicted class labels. All models operate on 64×64 grayscale images in which each row encodes the min-max normalized value of one of the 64 selected features, replicated uniformly across 64 columns.
Table 1. Structural characteristics and thematic feature categories of the LendingClub accepted loans dataset, including the number of features per category and representative variable names.
| Attribute | Detail |
|---|---|
| Dataset Name | ACCEPTED_LOANS |
| Total Records | 2,260,701 |
| Total Features | 151 |
| Source Files | 5 files (~500,000 records each) |
| Extraction Date | February 7, 2026 |
| Processing Speed | 512 rows/second |

| Feature Category (no. of features) | Representative Features |
|---|---|
| Loan Characteristics (9) | loan_amnt, int_rate, term, grade, sub_grade, installment |
| Borrower Information (7) | emp_title, emp_length, home_ownership, annual_inc, addr_state |
| Loan Status and Purpose (11) | loan_status, purpose, dti, issue_d, verification_status |
| Credit Profile (13) | fico_range_low, fico_range_high, delinq_2yrs, revol_bal, revol_util |
| Repayment History (17) | total_pymnt, total_rec_prncp, recoveries, out_prncp, last_pymnt_amnt |
| Advanced Credit Metrics (60) | num_bc_tl, num_il_tl, pct_tl_nvr_dlq, mo_sin_old_il_acct, bc_util |
| Joint Application and Secondary Applicant (13) | annual_inc_joint, dti_joint, sec_app_fico_range_low |
| Hardship and Settlement (21) | hardship_flag, debt_settlement_flag, settlement_amount, settlement_percentage |
Table 2. Description of engineered features derived from existing variables in the preprocessing pipeline.
| Feature | Formula / Derivation | Description |
|---|---|---|
| term_months | Extracted from term string | Loan term in months (36 or 60) |
| issue_year | $\text{year}(\text{issue\_d})$ | Calendar year of loan issuance |
| issue_month | $\text{month}(\text{issue\_d})$ | Calendar month of loan issuance |
| credit_history_years | $\text{issue\_year} - \text{year}(\text{earliest\_cr\_line})$ | Length of borrower's credit history in years |
| fico_avg | $\frac{\text{fico\_range\_low} + \text{fico\_range\_high}}{2}$ | Average FICO score at origination |
| loan_to_income | $\frac{\text{loan\_amnt}}{\text{annual\_inc} + 1}$ | Loan amount relative to annual income |
| emp_length_numeric | Ordinal mapping $[0, 10]$ | Employment length encoded as integer |
| installment_to_income | $\frac{\text{installment} \times 12}{\text{annual\_inc} + 1}$ | Annual installment burden relative to income |
| has_delinq | $\mathbf{1}[\text{delinq\_2yrs} > 0]$ | Binary indicator of recent delinquency |
| has_pub_rec | $\mathbf{1}[\text{pub\_rec} > 0]$ | Binary indicator of public record |
| has_inquiries | $\mathbf{1}[\text{inq\_last\_6mths} > 0]$ | Binary indicator of recent credit inquiries |
| dti_high | $\mathbf{1}[\text{dti} > 20]$ | Binary indicator of high debt-to-income ratio |
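The derivations in Table 2 translate almost line for line into code. The sketch below applies them to a single loan record held in a plain dictionary. It assumes `issue_d` and `earliest_cr_line` have already been parsed into `datetime.date` objects, and the mapping of `emp_length` strings ("< 1 year" to 0, "10+ years" to 10) is an assumption about the ordinal encoding, which the table only gives as the range [0, 10]. The function names are hypothetical.

```python
import re
from datetime import date


def emp_length_to_int(emp_length):
    """Map strings such as '< 1 year', '3 years', '10+ years' to 0..10 (assumed mapping)."""
    text = emp_length.strip()
    if text.startswith("<"):
        return 0
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"unrecognized emp_length: {emp_length!r}")
    return min(int(match.group()), 10)


def engineer_features(rec):
    """Compute the twelve engineered features of Table 2 for one record."""
    income_plus_one = rec["annual_inc"] + 1  # +1 guards against zero income
    issue_d = rec["issue_d"]
    return {
        "term_months": int(rec["term"].split()[0]),
        "issue_year": issue_d.year,
        "issue_month": issue_d.month,
        "credit_history_years": issue_d.year - rec["earliest_cr_line"].year,
        "fico_avg": (rec["fico_range_low"] + rec["fico_range_high"]) / 2,
        "loan_to_income": rec["loan_amnt"] / income_plus_one,
        "emp_length_numeric": emp_length_to_int(rec["emp_length"]),
        "installment_to_income": rec["installment"] * 12 / income_plus_one,
        "has_delinq": int(rec["delinq_2yrs"] > 0),
        "has_pub_rec": int(rec["pub_rec"] > 0),
        "has_inquiries": int(rec["inq_last_6mths"] > 0),
        "dti_high": int(rec["dti"] > 20),
    }


# Made-up record for illustration; values are not drawn from the dataset.
example = engineer_features({
    "term": " 36 months", "issue_d": date(2016, 3, 1),
    "earliest_cr_line": date(2001, 7, 1), "fico_range_low": 700,
    "fico_range_high": 704, "loan_amnt": 10000, "annual_inc": 49999,
    "emp_length": "10+ years", "installment": 300, "delinq_2yrs": 0,
    "pub_rec": 1, "inq_last_6mths": 2, "dti": 20.0,
})
```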
Table 3. Summary of the preprocessing pipeline applied to the LendingClub accepted loans dataset.
| Stage | Item | Value |
|---|---|---|
| Input | Total raw records | 2,260,701 |
| Filtering | Records after outcome filtering | 115,487 |
| Filtering | Non-default records (y = 0) | 88,158 (76.34%) |
| Filtering | Default records (y = 1) | 27,329 (23.66%) |
| Filtering | Default rate | 23.66% |
| Filtering | Class imbalance ratio (0:1) | 3.23 : 1 |
| Leakage Removal | Features removed | 24 |
| Feature Engineering | New features added | 12 |
| Encoding | Numerical features | 28 |
| Encoding | Categorical features | 8 |
| Encoding | Total features after encoding | 184 |
| Feature Selection | Final selected features | 128 |
| Normalization | Scaling method | Min-Max [0, 1] |
| Risk Stratification | Number of risk bands | 3 (Low / Medium / High) |
Table 4. Class distribution before and after each balancing strategy.
| Metric | Original | SMOTE Upsampled | Random Downsampled |
|---|---|---|---|
| Total Samples | 115,487 | 176,316 | 54,658 |
| Non-Default (Label = 0) | 88,158 (76.34%) | 88,158 (50.00%) | 27,329 (50.00%) |
| Default (Label = 1) | 27,329 (23.66%) | 88,158 (50.00%) | 27,329 (50.00%) |
| Imbalance Ratio | 3.23 : 1 | 1.00 : 1 | 1.00 : 1 |
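The counts in Table 4 follow directly from the two original class sizes. The short sketch below reproduces that arithmetic; it only computes the resulting sizes and does not perform SMOTE interpolation or sampling itself. The function name `balancing_summary` is hypothetical.

```python
def balancing_summary(n_majority, n_minority):
    """Derive the dataset sizes implied by balancing two classes to a 1:1 ratio."""
    gap = n_majority - n_minority
    return {
        "imbalance_ratio": round(n_majority / n_minority, 2),
        # Oversampling adds `gap` synthetic minority records.
        "smote_synthetic": gap,
        "smote_total": 2 * n_majority,
        # Undersampling discards `gap` majority records.
        "undersample_removed": gap,
        "undersample_total": 2 * n_minority,
    }


# Class sizes taken from Table 4 (Non-Default, Default).
summary = balancing_summary(88_158, 27_329)
```

The same gap of 60,829 records is either synthesized (SMOTE) or removed (random undersampling), which is why the two balanced datasets differ in size by more than a factor of three.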
Table 5. Number of 64 × 64 grayscale images generated per dataset and class.
| Dataset | Non-Default (Label = 0) | Default (Label = 1) | Total Images |
|---|---|---|---|
| Original | 88,158 | 27,329 | 115,487 |
| SMOTE Upsampled | 88,158 | 88,158 | 176,316 |
| Random Downsampled | 27,329 | 27,329 | 54,658 |
| Total | | | 346,461 |
Table 6. Hyperparameters for all classification models. DNN and tabular SVM/RF/DT operate on 64 raw features; CNN and hybrid models operate on 64 × 64 images or 128-dim CNN features.
| Parameter | DNN | SVM | RF | DT | CNN | CNN+SVM | CNN+RF | CNN+DT |
|---|---|---|---|---|---|---|---|---|
| Input dimension | 64 | 64 | 64 | 64 | 64 × 64 × 1 | 128 | 128 | 128 |
| Architecture | 256→128→64→32→1 | Linear | T = 50 | Depth 6 | Conv(32/64/128/256)+Dense(256/128) | Linear | T = 50 | Depth 6 |
| Regularization | L2 = $10^{-3}$ | c = 0.1 | - | - | L2 = $10^{-4}$ | c = 0.1 | - | - |
| Dropout | 0.35/0.30/0.25/0.20 | - | - | - | 0.10/0.15/0.20/0.40/0.30 | - | - | - |
| Max depth | - | - | 8 | 6 | - | - | 8 | 6 |
| Min samples leaf | - | - | 5 | 10 | - | - | 5 | 10 |
| Feature sampling | - | - | 64 | All | - | - | 128 | All |
| Loss | Weighted BCE | Hinge | Gini | Gini | Weighted BCE | Hinge | Gini | Gini |
| Optimizer | Adam | L-BFGS | Bagging | CART | Adam | L-BFGS | Bagging | CART |
| Learning rate | $10^{-3}$ | - | - | - | $10^{-3}$ | - | - | - |
| Batch size | 64 | - | - | - | 32 | - | - | - |
| Max epochs | 100 | 2000 iter | - | - | 50 | 2000 iter | - | - |
| Early stopping | 15 (val AUC) | - | - | - | 10 (val AUC) | - | - | - |
| LR scheduler | ×0.5 (p=7) | - | - | - | ×0.5 (p=5) | - | - | - |
| Class balancing | $w_1 = N_0/N_1$ | balanced | balanced | balanced | $w_1 = N_0/N_1$ | balanced | balanced | balanced |
| Probability output | Sigmoid | Isotonic (3-fold) | Leaf avg | Leaf avg | Sigmoid | Isotonic (5-fold) | Leaf avg | Leaf avg |
| Random seed | 42 | 42 | 42 | 42 | 42 | 42 | 42 | 42 |
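Table 6 lists a weighted binary cross-entropy loss with the positive-class weight $w_1 = N_0/N_1$ for the DNN and CNN. The following sketch shows one common way such a loss is written, assuming the weight multiplies only the default-class term and the loss is averaged over samples; the exact form used by the authors is not given in the table, and the function names are hypothetical.

```python
import math


def positive_class_weight(n_negative, n_positive):
    """w1 = N0 / N1: the weight applied to the minority (default) class."""
    return n_negative / n_positive


def weighted_bce(y_true, y_prob, w1, eps=1e-12):
    """Mean binary cross-entropy with the positive term scaled by w1.

    ASSUMPTION: the negative-class weight is 1 and the loss is a sample mean.
    Probabilities are clipped to [eps, 1 - eps] to keep the logarithm finite.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(w1 * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# With the class counts reported for the original dataset, w1 is about 3.23.
w1 = positive_class_weight(88_158, 27_329)
```

Under this form, a misclassified default contributes roughly 3.2 times as much to the loss as a misclassified non-default, mirroring the 3.23 : 1 imbalance ratio.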
Table 7. Formal definitions of evaluation metrics. TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
| Metric | Formula |
|---|---|
| Accuracy | $(TP + TN)/(TP + TN + FP + FN)$ |
| Precision | $TP/(TP + FP)$ |
| Recall | $TP/(TP + FN)$ |
| Specificity | $TN/(TN + FP)$ |
| F1-Score | $2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$ |
| AUC-ROC | Area under the ROC curve |
| AUC-PR | Area under the Precision-Recall curve |
| Brier Score | $\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$ |
| MCC | $(TP \cdot TN - FP \cdot FN)/\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}$ |
| G-mean | $\sqrt{\text{Recall} \times \text{Specificity}}$ |
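The threshold-based metrics in Table 7 depend only on the four confusion-matrix counts, so they can be checked by hand. The sketch below implements those formulas together with the Brier score; returning 0.0 when a denominator is zero is a convention chosen here for illustration, and the function names are hypothetical.

```python
import math


def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute the count-based metrics of Table 7 from confusion-matrix cells."""
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": _safe_div(2 * precision * recall, precision + recall),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
        "g_mean": math.sqrt(recall * specificity),
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy counts for illustration only.
toy = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```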
Table 8. Univariate statistical comparison between Non-Default and Default groups for 14 key features.
| Feature | Mean (Non-Default) | SD (Non-Default) | Mean (Default) | SD (Default) | Mean Diff | Mann-Whitney U | p-value | Sig. | $r_{pb}$ |
|---|---|---|---|---|---|---|---|---|---|
| loan_amnt | 13,717.21 | 9,135.96 | 15,464.97 | 9,380.39 | +1,747.75 | 1,062,316,670 | $3.14 \times 10^{-192}$ | *** | 0.0805 |
| int_rate | 13.11 | 4.91 | 16.23 | 5.55 | +3.12 | 793,141,197 | $< 10^{-300}$ | *** | 0.2528 |
| annual_inc | 80,343.64 | 84,498.68 | 72,422.30 | 60,660.81 | −7,921.34 | 1,318,645,490 | $5.74 \times 10^{-124}$ | *** | −0.0423 |
| dti | 18.08 | 13.76 | 20.38 | 19.12 | +2.30 | 1,038,266,196 | $1.41 \times 10^{-261}$ | *** | 0.0643 |
| fico_avg | 702.48 | 34.65 | 691.89 | 27.96 | −10.59 | 1,420,461,420 | $< 10^{-300}$ | *** | −0.1343 |
| credit_history_years | 15.997 | 7.489 | 15.570 | 7.750 | −0.427 | 1,259,306,828 | $6.00 \times 10^{-30}$ | *** | −0.0240 |
| revol_util | 45.35 | 24.92 | 50.26 | 24.32 | +4.91 | 1,064,570,179 | $5.14 \times 10^{-186}$ | *** | 0.0839 |
| installment | 422.11 | 278.64 | 479.82 | 292.96 | +57.71 | 1,051,769,202 | $3.60 \times 10^{-221}$ | *** | 0.0866 |
| loan_to_income | 11.90 | 535.21 | 13.33 | 582.26 | +1.43 | 955,278,711 | $< 10^{-300}$ | *** | 0.0011 |
| installment_to_income | 4.23 | 188.09 | 5.16 | 230.97 | +0.93 | 950,059,167 | $< 10^{-300}$ | *** | 0.0020 |
| emp_length_numeric | 5.964 | 3.624 | 5.746 | 3.582 | −0.218 | 1,246,436,353 | $7.53 \times 10^{-19}$ | *** | −0.0257 |
| total_acc | 24.47 | 12.20 | 23.40 | 11.97 | −1.065 | 1,270,177,129 | $3.19 \times 10^{-42}$ | *** | −0.0372 |
| open_acc | 11.559 | 5.714 | 11.667 | 5.716 | +0.108 | 1,187,266,921 | $3.01 \times 10^{-4}$ | *** | 0.0081 |
| delinq_2yrs | 0.314 | 0.902 | 0.369 | 0.981 | +0.055 | 1,175,396,754 | $1.29 \times 10^{-18}$ | *** | 0.0253 |
Mean Diff = $\bar{x}_{\text{Default}} - \bar{x}_{\text{Non-Default}}$. $r_{pb}$: Point-Biserial correlation coefficient. Significance based on Mann-Whitney U test. *** $p < 0.001$. All 14 features reached statistical significance.
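For readers who want to see how a point-biserial coefficient such as those in Table 8 is obtained, the sketch below uses the textbook formula $r_{pb} = \frac{\bar{x}_1 - \bar{x}_0}{s_x}\sqrt{\frac{n_1 n_0}{n^2}}$, where $s_x$ is the population standard deviation of the feature. The sign convention (Default minus Non-Default) matches the Mean Diff column. The function name is hypothetical, and this is not necessarily the routine used in the study.

```python
import math


def point_biserial(values, labels):
    """Point-biserial correlation between a numeric feature and a 0/1 label.

    Equivalent to the Pearson correlation of `values` with `labels`.
    """
    group1 = [v for v, y in zip(values, labels) if y == 1]
    group0 = [v for v, y in zip(values, labels) if y == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    mean_all = sum(values) / n
    # Population standard deviation (divide by n, not n - 1).
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("feature has zero variance")
    mean_diff = sum(group1) / n1 - sum(group0) / n0
    return mean_diff / sd * math.sqrt(n1 * n0 / (n * n))


# Tiny made-up sample: larger values coincide with label 1.
r = point_biserial([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```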
Table 9. Cross-validation performance of CNN and hybrid models across three dataset configurations.
| Dataset | Model | Accuracy | AUC-ROC | F1-Score | MCC |
|---|---|---|---|---|---|
| Original (Imbalanced) | CNN | 0.9849 ± 0.0013 | 1.0000 ± 0.0000 | 0.9887 ± 0.0019 | 0.9704 ± 0.0022 |
| Original (Imbalanced) | CNN+SVM | 0.9797 ± 0.0021 | 0.9990 ± 0.0006 | 0.9842 ± 0.0028 | 0.9601 ± 0.0026 |
| Original (Imbalanced) | CNN+RF | 0.9872 ± 0.0028 | 1.0000 ± 0.0000 | 0.9914 ± 0.0008 | 0.9772 ± 0.0008 |
| Original (Imbalanced) | CNN+DT | 0.9833 ± 0.0027 | 0.9989 ± 0.0007 | 0.9875 ± 0.0017 | 0.9689 ± 0.0051 |
| SMOTE Upsampled | CNN | 0.9602 ± 0.0062 | 0.9913 ± 0.0018 | 0.9604 ± 0.0031 | 0.9146 ± 0.0073 |
| SMOTE Upsampled | CNN+SVM | 0.9477 ± 0.0019 | 0.9850 ± 0.0019 | 0.9500 ± 0.0029 | 0.8981 ± 0.0032 |
| SMOTE Upsampled | CNN+RF | 0.9707 ± 0.0020 | 0.9946 ± 0.0005 | 0.9725 ± 0.0018 | 0.9481 ± 0.0044 |
| SMOTE Upsampled | CNN+DT | 0.9567 ± 0.0041 | 0.9868 ± 0.0032 | 0.9540 ± 0.0030 | 0.9097 ± 0.0110 |
| Random Downsampled | CNN | 0.9383 ± 0.0055 | 0.9840 ± 0.0017 | 0.9437 ± 0.0054 | 0.8759 ± 0.0070 |
| Random Downsampled | CNN+SVM | 0.9269 ± 0.0048 | 0.9764 ± 0.0034 | 0.9326 ± 0.0045 | 0.8600 ± 0.0085 |
| Random Downsampled | CNN+RF | 0.9557 ± 0.0022 | 0.9894 ± 0.0026 | 0.9566 ± 0.0034 | 0.9106 ± 0.0045 |
| Random Downsampled | CNN+DT | 0.9364 ± 0.0104 | 0.9796 ± 0.0043 | 0.9363 ± 0.0049 | 0.8679 ± 0.0138 |
Note. Results are Mean ± SD over 5 stratified folds. Bold values denote the best performance per metric within each dataset condition. MCC: Matthews Correlation Coefficient.
Table 10. Pairwise statistical comparison of CNN and hybrid models across three dataset configurations.
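| Dataset | Model Pair | $\lvert\Delta\text{AUC}\rvert$ | Wilcoxon p |
|---|---|---|---|
| Original (Imbalanced) | CNN vs. CNN+SVM | 0.0010 | 0.043* |
| Original (Imbalanced) | CNN vs. CNN+RF | 0.0000 | 0.083 |
| Original (Imbalanced) | CNN vs. CNN+DT | 0.0011 | 0.157 |
| Original (Imbalanced) | CNN+SVM vs. CNN+RF | 0.0010 | 0.037* |
| Original (Imbalanced) | CNN+SVM vs. CNN+DT | 0.0001 | 0.021* |
| Original (Imbalanced) | CNN+RF vs. CNN+DT | 0.0011 | 0.046* |
| Original (Imbalanced) | Friedman test | χ²(3) = 8.40 | p = 0.038 |
| SMOTE Upsampled | CNN vs. CNN+SVM | 0.0063 | 0.032* |
| SMOTE Upsampled | CNN vs. CNN+RF | 0.0033 | 0.041* |
| SMOTE Upsampled | CNN vs. CNN+DT | 0.0045 | 0.317 |
| SMOTE Upsampled | CNN+SVM vs. CNN+RF | 0.0096 | 0.027* |
| SMOTE Upsampled | CNN+SVM vs. CNN+DT | 0.0018 | 0.018* |
| SMOTE Upsampled | CNN+RF vs. CNN+DT | 0.0078 | 0.029* |
| SMOTE Upsampled | Friedman test | χ²(3) = 10.20 | p = 0.017 |
| Random Downsampled | CNN vs. CNN+SVM | 0.0076 | 0.028* |
| Random Downsampled | CNN vs. CNN+RF | 0.0054 | 0.044* |
| Random Downsampled | CNN vs. CNN+DT | 0.0044 | 0.412 |
| Random Downsampled | CNN+SVM vs. CNN+RF | 0.0130 | 0.022* |
| Random Downsampled | CNN+SVM vs. CNN+DT | 0.0032 | 0.032* |
| Random Downsampled | CNN+RF vs. CNN+DT | 0.0098 | 0.031* |
| Random Downsampled | Friedman test | χ²(3) = 9.60 | p = 0.022 |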
Note. |ΔAUC|: absolute difference in mean AUC-ROC between model pairs. Wilcoxon p: two-tailed signed-rank test on 5-fold AUC-ROC scores. * p < 0.05. Friedman test assesses global rank differences across all four models within each dataset condition.
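The Friedman statistic reported in Table 10 is computed from per-fold ranks of the four models. A minimal sketch of the standard formula, $\chi^2_F = \frac{12}{nk(k+1)}\sum_j R_j^2 - 3n(k+1)$ with $n$ folds, $k$ models, and rank sums $R_j$, is given below. Ties receive average ranks and no tie-correction factor is applied, which is a simplification; the fold scores in the example are hypothetical and are not the study's values.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square for `scores[fold][model]` (no tie correction)."""
    n = len(scores)
    k = len(scores[0])
    rank_sums = [0.0] * k
    for fold_scores in scores:
        for model, rank in enumerate(average_ranks(fold_scores)):
            rank_sums[model] += rank
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)


# Hypothetical AUC-ROC values for 5 folds x 4 models (illustration only).
hypothetical_auc = [
    [0.991, 0.985, 0.995, 0.987],
    [0.990, 0.986, 0.994, 0.984],
    [0.992, 0.983, 0.995, 0.988],
    [0.989, 0.984, 0.994, 0.990],
    [0.993, 0.987, 0.996, 0.986],
]
chi_square = friedman_statistic(hypothetical_auc)
```

With five folds and four models the statistic has 3 degrees of freedom, matching the χ²(3) notation in the table.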
Table 11. Summary of representative prior studies on credit default prediction using the LendingClub dataset or comparable P2P lending datasets.
| Ref. | Authors | Year | Dataset / Size | Models | Balancing | Key Metrics |
|---|---|---|---|---|---|---|
| [3] | Turiel & Aste | 2020 | LendingClub (~800K accepted loans) | LR, SVM, DNN | None | AUC up to 0.72 |
| [10] | Zhu et al. | 2019 | LendingClub (2019 Q1, ~15 features) | RF + SMOTE | SMOTE | RF outperforms SVM, DT |
| [53] | Souadda et al. | 2021 | P2P imbalanced dataset | XGBoost | SMOTE, NearMiss, RUS | AUC ~0.78 |
| [54] | Chen et al. (IEEE) | 2021 | P2P LendingClub-style, imbalanced | RF, DT, LR | SMOTE | AUC = 0.73–0.79 |
| [55] | Kim & Cho | 2019 | LendingClub (2007–2018) | CNN (1D), 5-fold CV | None | Repayment prediction improved over LR |
| [5] | Chang et al. | 2022 | LendingClub (multi-year) | LR, SVM, DT, RF, XGBoost, LightGBM, ANN | Grid search / CV | XGBoost best; AUC ~0.70 |
| [17] | Kvamme et al. | 2018 | Norwegian mortgages (20,989) | CNN (deep) | None | CNN outperforms RF on transaction data |
| [14] | Akinjole et al. | 2024 | LendingClub | RF, DT, SVM, XGBoost, AdaBoost, MLP | SMOTE+ENN | Acc = 93.7%, Precision = 95.6%, Recall = 95.5% |
| [11] | Monje et al. | 2023 | LendingClub (full history) | RF + SMOTE | SMOTE | F1-macro > 0.90 |
| [15] | Suram et al. | 2026 | LendingClub (full) | Deep learning (DL) + SMOTE | SMOTE | Improved AUC over baselines |
| [18] | Gür et al. | 2025 | Kaggle Credit Score dataset | CNN (DenseNet, ResNet, etc.) + ML hybrids | Not specified | CNN+NewFC best accuracy |
| [44] | Zhu et al. | 2021 | Drug screening (tabular-to-image) | CNN on IGTD images | None | CNN on image > tabular models |
| [56] | Yang et al. | 2024 | LendingClub | RF, DT, LR, ensemble | Not specified | Acc ~85–88% |
| [19] | Li et al. | 2025 | LendingClub (2007–2020) | LightGBM + SHAP/LIME | RFE | Acc = 0.87, XAI-enhanced |
| [53] | Souadda et al. | 2025 | LendingClub + Australia + Taiwan | LR, RF, XGBoost, LightGBM + HPO | Class weighting | LightGBM AUC = 70.77% |
| - | Present Study | 2025 | LendingClub (115,487 records, 64 features) | DNN, SVM, RF, DT, CNN, CNN+SVM, CNN+RF, CNN+DT | SMOTE, RUS | CNN+RF: AUC = 1.000, Acc = 0.987, F1 = 0.991, MCC = 0.977 |
Note. LR: Logistic Regression; RF: Random Forest; DT: Decision Tree; SVM: Support Vector Machine; DNN: Deep Neural Network; CNN: Convolutional Neural Network; SMOTE: Synthetic Minority Oversampling Technique; RUS: Random Undersampling; HPO: Hyperparameter Optimization; XAI: Explainable AI; AUC: AUC-ROC; Acc: Accuracy; MCC: Matthews Correlation Coefficient. Performance values are as reported in the original studies.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.