Class Cardinality as a Source of Prediction Uncertainty in E-Commerce Customer Analytics

Ishan Ghosh; Mourani Sinha; Partho Mallick; Jayanta Poray; Souvik Sarkar

doi:10.20944/preprints202606.1884.v1

Submitted: 24 June 2026

Posted: 25 June 2026


Abstract

Class imbalance and class cardinality both affect multiclass classification, but their influence on probabilistic estimation has been less explored. This study examines these effects using an e-commerce dataset. Multiclass classification tasks involving 4, 6, and 7 classes are evaluated with five supervised classifiers: Support Vector Machine, Gaussian Naive Bayes, Logistic Regression, Random Forest, and Decision Tree. Performance is assessed under 5-fold cross-validation using macro F1 score, log-loss, and accuracy. Results show that macro F1 score and accuracy decrease as class cardinality increases, reflecting greater classification difficulty. Tree-based models such as Random Forest exhibit more balanced performance across classes. Gaussian Naive Bayes obtains the lowest log-loss, indicating more accurate and reliable probability estimates. The effect of class cardinality is isolated by varying the number of classes while keeping the features and classifier fixed. Increasing class cardinality reduced posterior confidence and increased entropy and log-loss. These trends are confirmed using real categorical variables with naturally occurring class counts. Further experiments showed that class imbalance primarily affects minority-class performance, whereas class cardinality exerts a broader influence on probabilistic confidence and prediction uncertainty. The findings highlight the need to consider class cardinality, class imbalance, and probabilistic metrics when evaluating multiclass classification models.

Keywords: multiclass classification; class cardinality; class imbalance; Gaussian naive Bayes; e-commerce analytics

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

E-commerce has developed into a data-driven ecosystem in which advanced technologies facilitate personalized customer experiences and efficient decision making, leading to enhanced business performance. The rapid growth of e-commerce has increased the demand for intelligent systems that can analyze user behavior, improve product discovery, and support personalized shopping experiences. Study [1] discusses future research directions in e-commerce recommender systems and artificial intelligence for improving personalization and decision making. It identifies emerging areas such as knowledge graphs and content-based image retrieval as directions for effective e-commerce recommendation. The review [2] highlights agent-based modeling, network theory, and machine learning methods for enhancing personalized customer experiences and understanding market trends. It shows that e-commerce decision making can be improved by combining complex systems models with predictive analytics for better inventory management and pricing strategies. In e-commerce data analysis, class cardinality refers to the number of distinct categories (classes) of a categorical variable, whereas class imbalance describes unequal numbers of observations across those categories. Study [3] examines the challenges posed by imbalanced data in large e-commerce datasets, which conventional machine learning methods frequently misclassify. It proposes a hybrid approach that combines multiple data processing techniques and algorithms to improve classification accuracy and predictive performance. The survey [4] highlights that class imbalance remains a major challenge in machine learning and deep learning, affecting model performance across a wide range of real-world applications. It reviews recent approaches to handling imbalanced data in classification and regression and points to future research directions.
Study [5] demonstrates that class imbalance significantly reduces the effectiveness of machine learning models and biases predictions toward the majority class. Working with a two-class (binary) problem, that study reports improvements in metrics such as F1 score and accuracy.

In contrast to previous studies, which address only class imbalance, the present work investigates the joint impact of class cardinality and class imbalance on e-commerce classification performance. It examines how the number of classes influences the performance of machine learning models. Our approach provides an empirical understanding of the interaction between class cardinality and class imbalance in e-commerce classification, a topic that remains less explored in the literature.

E-commerce systems employ multiclass classifiers for customer segmentation, product recommendation, payment preference prediction, and purchase frequency analysis. As businesses introduce more specific customer categories, the number of target classes increases substantially, and more categories can degrade predictive confidence and reliability. Understanding how increasing class cardinality affects predictive confidence and uncertainty is therefore essential for designing reliable e-commerce decision support systems. Study [6] proposes a continuous-learning Naive Bayes framework for sentiment classification of e-commerce product reviews. The framework enables efficient processing of large-scale and continuously growing review data. The results show that the model transfers knowledge across domains and achieves improved adaptability and classification performance on reviews from different product categories and domains. However, that study does not address how classifiers behave under simultaneous multiclass and imbalanced conditions. In another study, a large-scale multi-label e-commerce customer review dataset comprising more than 50,000 reviews across three product categories was developed [7]. That study demonstrated that machine learning methods can effectively identify multiple aspects of customer opinions from a single review with high classification performance. Although it assigns multiple labels to each customer review, its focus is multi-label sentiment analysis rather than multiclass classification. Consequently, it does not examine how varying class cardinality or class imbalance affects classifier performance, which constitutes the primary focus and novelty of the present study.
Study [8] proposes a self-supervised learning framework combined with a structured domain knowledge model to automatically classify e-commerce products by their raw materials, reducing the need for extensive manual labeling. The proposed method achieves a reported accuracy of 91%, illustrating that e-commerce applications require efficient and reliable product classification. Although this research addresses product classification in e-commerce, it neither considers class imbalance nor evaluates classifier behavior under varying numbers of classes. The challenges of multiclass imbalanced classification therefore remain unexplored.

The major contribution of the present study is a detailed analysis of how both class cardinality and class imbalance affect multiclass classification performance. Unlike prior studies that primarily focus on binary or fixed-cardinality classification problems, this work evaluates how increasing the number of classes influences the effectiveness and stability of machine learning algorithms on imbalanced e-commerce datasets.

2. Materials and Methods

An e-commerce dataset comprising 3,900 customers and eight variables was analyzed to investigate the effects of class cardinality and class imbalance on multiclass classification performance. Two continuous variables, ‘Age’ and ‘Purchase Amount’, are selected as predictor features. Only two predictors are used in order to isolate the effects of class cardinality and class imbalance without the added complexity of a large number of input features. These two continuous variables serve as inputs to all classifiers, while the categorical variables serve as target variables. The categorical variables ‘Category’, ‘Season’, and ‘Size’ have 4 classes each, ‘Payment Method’ and ‘Shipping Type’ have 6 classes each, and ‘Frequency of Purchases’ has 7 classes. Table 1 and Table 2 give the class distribution (frequency and percentage) of the categorical variables. Table 1 shows a moderately imbalanced distribution for ‘Category’, with ‘Clothing’ clearly dominating. The four classes of ‘Season’ have nearly uniform frequencies, indicating a well-balanced variable suitable for fair multiclass classification. The variable ‘Size’ shows noticeable class imbalance and is dominated by size ‘M’. Table 2 summarizes the class-wise frequency distribution, with counts and percentages, of the three variables with 6 and 7 classes: ‘Payment Method’, ‘Shipping Type’, and ‘Frequency of Purchases’. ‘Payment Method’ has a well-balanced class distribution, minimizing bias toward any single payment method. ‘Shipping Type’ is likewise balanced and suitable for fair multiclass classification. ‘Frequency of Purchases’, with 7 classes, shows slight variations but remains largely balanced overall. Despite its higher class cardinality, no class dominates.
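The frequency and percentage columns of Tables 1 and 2 can be produced by a simple tabulation. The following minimal Python sketch tabulates a list of class labels and also reports an imbalance ratio (largest class count divided by smallest). The helper names and the toy labels are hypothetical and are not the study's data. The imbalance ratio is an added summary, not a statistic reported in the tables.

```python
from collections import Counter

def class_distribution(labels):
    """Return (class, count, percentage) rows, most frequent class first."""
    counts = Counter(labels)
    n = len(labels)
    return [(cls, c, 100.0 * c / n) for cls, c in counts.most_common()]

def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (1.0 means perfectly balanced)."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# Hypothetical 4-class toy labels, for illustration only.
toy = ["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1
for cls, count, pct in class_distribution(toy):
    print(f"{cls}  {count:>3}  {pct:5.1f}%")
print("imbalance ratio:", imbalance_ratio(toy))  # 5 / 1 = 5.0
```

A ratio close to 1 indicates a balanced variable. Larger values indicate stronger dominance by one class, as described above for ‘Category’ and ‘Size’.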

Five supervised machine learning classifiers, namely Gaussian Naive Bayes (GNB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), and Support Vector Machine (SVM), are evaluated using a 5-fold cross-validation framework [9,10]. Model performance is assessed using accuracy, macro F1 score, and log-loss to capture overall classification accuracy, class-balanced performance, and probabilistic calibration, respectively [11,12]. To isolate the effect of class cardinality, a controlled experiment is conducted by artificially merging class labels to construct 2, 4, 6, and 7 class scenarios while keeping the predictor variables and classifier fixed. Gaussian Naive Bayes is selected for this analysis because it directly estimates posterior probabilities from Bayes’ theorem, enabling the evaluation of posterior confidence, entropy, and log-loss under varying class cardinalities [13,14,15]. The results are further validated using real categorical variables with naturally occurring class counts. Finally, the impact of class imbalance is examined by comparing classification performance before and after balancing selected target variables through random undersampling. Statistical significance of model differences is assessed using paired t-tests and Wilcoxon signed-rank tests [16]. These tests apply when two continuous measurements are obtained from the same observations under two related conditions, here the fold-wise scores of two models evaluated on the same cross-validation folds. The paired t-test evaluates whether the mean difference between two related sets of observations is statistically significant, assuming that the differences are normally distributed. When this normality assumption is not satisfied, the Wilcoxon signed-rank test serves as a non-parametric alternative that compares the ranks of the paired differences rather than their actual values.

3. Results

This section presents the experimental results from six multiclass e-commerce classification tasks with varying levels of class cardinality and class imbalance. The analysis is organized into four parts: (i) comparative evaluation of machine learning classifiers, (ii) controlled investigation of class cardinality effects, (iii) real-variable validation of the observed trends, and (iv) assessment of the impact of class imbalance, together with statistical significance testing.

3.1. Comparative Performance of Machine Learning Classifiers

Six independent comparative experiments are conducted using two continuous features (‘Age’ and ‘Purchase Amount’). Each experiment targets one categorical variable with a class cardinality of 4, 6, or 7. Gaussian Naive Bayes, a probabilistic supervised classifier, is compared with widely used supervised classifiers, namely Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine. All models are trained and tested on the same dataset with the same 5-fold cross-validation scheme and identical evaluation metrics to ensure a fair comparison. Performance is assessed using accuracy, macro-averaged F1 score, and log-loss. Accuracy measures the proportion of correct predictions and is most informative when classes are balanced. Macro F1 score averages the per-class F1 scores with equal weight for every class, so it measures how well the model performs across all classes regardless of their frequency. For multiclass problems, this class-balanced view complements the overall correctness captured by accuracy. Log-loss evaluates how much probability the model assigns to the correct class, with lower values indicating better probabilistic predictions. In 5-fold cross-validation, the dataset is split into five mutually exclusive folds. In each iteration, four folds are used for training and the remaining fold for testing. Performance metrics are averaged across the five folds to obtain robust estimates and reduce the effect of sampling variability. Multiclass classification is handled in a model-dependent manner: Gaussian Naive Bayes, Decision Tree, and Random Forest natively support multiclass targets and are trained directly.
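The fold construction described above can be sketched in plain Python. This minimal sketch assumes shuffled, non-stratified folds, because the text does not state whether the folds were stratified by class. The helper names (`k_fold_indices`, `cross_validate`) and the seed are hypothetical. In scikit-learn, the same roles are played by `KFold(n_splits=5, shuffle=True)` or, for stratified folds, `StratifiedKFold`.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) for k mutually exclusive, near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # shuffle once so fold membership is random
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = idx[start:start + size]             # one fold held out for testing
        train = idx[:start] + idx[start + size:]   # remaining k - 1 folds for training
        yield train, test
        start += size

def cross_validate(score_fn, n_samples, k=5, seed=0):
    """Average a per-fold score; score_fn(train_idx, test_idx) is supplied by the caller."""
    scores = [score_fn(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores), scores

# With 3,900 customers, each test fold holds 780 samples.
folds = list(k_fold_indices(3900))
print([len(test) for _, test in folds])  # [780, 780, 780, 780, 780]
```

Any of the five classifiers can be plugged in through `score_fn`, which fits on `train_idx` and returns a metric computed on `test_idx`.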
Logistic Regression and Support Vector Machine, which are inherently binary classifiers, are extended to the multiclass setting using an Error-Correcting Output Codes (ECOC) framework. The framework uses linear logistic learners for Logistic Regression and RBF (radial basis function) kernel SVM learners for the SVM. This ensures consistent and fair evaluation across all classification tasks involving four to seven classes. The RBF kernel allows the SVM to model nonlinear relationships between customer age, purchase amount, and the target classes. Feature standardization is applied to ensure numerical stability and improve classification performance.
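The ECOC reduction can be illustrated with a small sketch. The text does not state which coding design or decoding rule was used. The code below assumes a one-vs-one coding matrix (one binary learner per pair of classes) with generalized Hamming decoding. It takes the ±1 outputs of the binary learners as given rather than training the logistic or RBF-kernel SVM learners themselves. The function names are hypothetical; scikit-learn provides related wrappers (`OneVsOneClassifier`, `OutputCodeClassifier`).

```python
from itertools import combinations

def one_vs_one_coding(n_classes):
    """Coding matrix M[k][l] in {+1, -1, 0}: learner l separates class a (+1) from b (-1)."""
    pairs = list(combinations(range(n_classes), 2))
    M = [[0] * len(pairs) for _ in range(n_classes)]
    for l, (a, b) in enumerate(pairs):
        M[a][l] = 1
        M[b][l] = -1   # classes outside the pair keep 0 (not used by learner l)
    return M

def hamming_decode(M, outputs):
    """Return the class whose codeword is closest to the learners' +1/-1 outputs.

    Each learner contributes (1 - m * s) / 2: 0 on agreement, 1 on disagreement,
    and 0.5 where the class was not used to train that learner (m = 0).
    """
    distances = [sum((1 - m * s) / 2 for m, s in zip(row, outputs)) for row in M]
    return min(range(len(M)), key=distances.__getitem__)

M = one_vs_one_coding(4)            # 4 classes -> 6 pairwise learners
outputs = [1, -1, -1, -1, 1, 1]     # hypothetical binary decisions for one sample
print(hamming_decode(M, outputs))   # 2
```

With 7 classes, the one-vs-one design needs 7·6/2 = 21 binary learners. This is one reason ECOC-based models become more expensive as class cardinality grows.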

Figure 1 and Figure 2 summarize the accuracy and macro F1 score of the five classifiers (GNB, LR, DT, RF, and SVM) on the six multiclass classification tasks. Figure 1 shows that classification accuracy decreases as class cardinality increases from 4 to 6 to 7 classes. Whereas accuracy measures the overall proportion of correct predictions, the macro F1 analysis (Figure 2) shows that the tree-based models (Decision Tree and Random Forest) provide better balance across classes and more effective identification of less frequent classes. Macro F1 scores also decrease as class cardinality increases from 4 to 7 classes, highlighting the growing difficulty of maintaining performance across all classes in higher-cardinality tasks. Higher accuracy therefore does not necessarily imply better multiclass classification performance.
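The gap between accuracy and macro F1 can be made concrete with a small worked example. The sketch below uses hypothetical labels and computes both metrics for a classifier that always predicts a dominant class holding 44% of the samples. Macro F1 is taken as the unweighted mean of the per-class scores F1 = 2TP / (2TP + FP + FN). F1 is set to 0 for a class that is neither present nor predicted.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN); each class counts equally."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: one class holds 44 of 100 samples.
classes = ["A", "B", "C", "D"]
y_true = ["A"] * 44 + ["B"] * 20 + ["C"] * 18 + ["D"] * 18
y_pred = ["A"] * len(y_true)   # a model that always predicts the dominant class
print(accuracy(y_true, y_pred))                      # 0.44
print(round(macro_f1(y_true, y_pred, classes), 3))   # 0.153
```

Only the dominant class receives a nonzero F1 (2·44 / (2·44 + 56) = 11/18), so macro F1 is 11/72 ≈ 0.153 even though accuracy is 0.44. These toy values are close to those reported for GNB, LR, and SVM on the 4-class targets. This is consistent with, though not proof of, the dominant-class explanation discussed for Table 3.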

Table 3 reports accuracy, macro F1 score, and log-loss for the six independent multiclass experiments described above. It covers five supervised classifiers across six categorical targets. The classifiers are Gaussian Naive Bayes (GNB), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM). The targets are ‘Category’, ‘Size’, and ‘Season’ (4 classes), ‘Payment Method’ and ‘Shipping Type’ (6 classes), and ‘Frequency of Purchases’ (7 classes).

For the 4-class variables, GNB, LR, and SVM show identical accuracies for ‘Category’ and ‘Season’ (approximately 0.44). This suggests either that these models are predicting the dominant class or that their decision boundaries cannot separate the classes using only two features. Decision Tree and Random Forest yield lower accuracies (0.34–0.36), indicating difficulty in learning stable splits from limited continuous inputs. Despite their similar accuracies, GNB, LR, and SVM exhibit very low macro F1 scores (approximately 0.15) for ‘Category’ and ‘Season’. This implies poor recognition of minority classes and bias toward the dominant class. Decision Tree and Random Forest achieve noticeably higher macro F1 scores (0.23–0.26), indicating better balance across classes even though their overall accuracy is lower. High accuracy is therefore misleading here: as reflected by macro F1, the tree-based models provide more reasonable multiclass predictions.

For the 4-class variables (‘Category’, ‘Size’, ‘Season’), GNB consistently shows low log-loss (1.2–1.4), indicating well-calibrated probability estimates. Logistic Regression and SVM exhibit extremely high log-loss (approximately 36), suggesting severely overconfident or numerically unstable probability outputs. Random Forest achieves moderate log-loss (3–4), balancing discrimination and calibration. For the higher-cardinality variables (6 and 7 classes), accuracy drops sharply across all models (0.14–0.18), reflecting the increased difficulty of classification with more classes. No model shows dominant accuracy, indicating that the two features have limited ability to distinguish between the classes. For the 6- and 7-class variables, Decision Tree and Random Forest outperform all other models in macro F1 score (0.15–0.17), demonstrating better learning across classes in high-cardinality settings. GNB maintains a modest macro F1 score (0.10–0.14) relative to LR and SVM. As the number of classes increases, tree-based models perform better in terms of class balance, even when accuracy remains low. Although GNB shows weaker classification performance, it provides more reliable probability estimates, with the lowest log-loss (1.8–2.0). Random Forest maintains moderate log-loss (6–9). Across all experiments, LR and SVM show high log-loss values (approximately 36), indicating unreliable probability estimates regardless of class structure.
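The very different log-loss magnitudes above follow from how log-loss penalizes the probability assigned to the true class. The minimal sketch below uses a hypothetical helper. It clips probabilities to [ε, 1 − ε], with ε equal to double-precision machine epsilon (2⁻⁵²). This clipping constant is an assumption, since the study does not state the one it used. Under this assumption, a uniform prediction over K classes costs ln K, while a hard, wrong prediction costs −ln ε = 52 ln 2 ≈ 36.04. A mean near 36 is what one would expect if nearly every sample received an essentially zero probability for its true class. This is one possible reading of the LR and SVM results, not a diagnosis.

```python
import math
import sys

EPS = sys.float_info.epsilon  # 2**-52, about 2.2e-16; assumed clipping constant

def log_loss(true_idx, proba, eps=EPS):
    """Mean of -ln p(true class); each probability is clipped to [eps, 1 - eps]."""
    total = 0.0
    for t, p in zip(true_idx, proba):
        q = min(max(p[t], eps), 1.0 - eps)  # avoid log(0) for hard 0/1 outputs
        total -= math.log(q)
    return total / len(true_idx)

# One sample whose true class is index 0, scored three different ways.
print(round(log_loss([0], [[0.9, 0.05, 0.05]]), 3))  # confident and correct: 0.105
print(round(log_loss([0], [[1/3, 1/3, 1/3]]), 3))    # uniform: ln 3 = 1.099
print(round(log_loss([0], [[0.0, 1.0, 0.0]]), 2))    # hard and wrong: 36.04
```

For reference, a uniform prediction yields ln 4 ≈ 1.386, ln 6 ≈ 1.792, and ln 7 ≈ 1.946 for 4, 6, and 7 classes, respectively.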

Accuracy alone is therefore not sufficient for evaluating multiclass problems with class imbalance or a large number of classes. Macro F1, which weights all classes equally, is the most informative score in this study. Log-loss reflects probabilistic reliability, and on this metric GNB performs best. Random Forest emerges as the best overall model, providing balanced performance across classes along with stable probability estimates.

Table 3 shows that increasing the number of classes substantially decreases classification performance for all models. LR and SVM show higher accuracy, but their low macro F1 scores and high log-loss values indicate poor class-wise performance and unreliable probability estimates. Random Forest, a tree-based model, achieves an effective balance across classes and stable probabilistic behavior. Although its classification performance is weak, Gaussian Naive Bayes attains the lowest log-loss, indicating better probability estimates.

3.2. Controlled Analysis of Class Cardinality Effects

Next, to isolate the impact of class cardinality on multiclass tasks, we selected a single categorical variable and kept all other factors fixed. The class labels were artificially merged to construct 2, 4, 6, and 7 class scenarios. GNB was selected because it directly models posterior probabilities via Bayes’ theorem, and it was trained on identical features in every scenario. Because the classifier is fixed and only the number of classes varies, the observed changes in confidence, entropy, and log-loss can be attributed primarily to class cardinality rather than to differences among learning algorithms. Figure 3 illustrates the resulting changes in probabilistic behavior as class cardinality increases. Mean posterior confidence declines monotonically, from approximately 0.58 (2 classes) to 0.16 (7 classes). Entropy and log-loss increase steadily, indicating greater uncertainty in the predicted class probabilities and reduced confidence in the classifier’s decisions. Table 4 confirms this monotonic decrease in posterior confidence, accompanied by increasing entropy and log-loss. Increasing class cardinality therefore systematically reduces posterior confidence and increases uncertainty in probabilistic classification.
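The quantities tracked in Figure 3 and Table 4 can be computed directly from the posterior vectors. The sketch below is a minimal version. It assumes that posterior confidence is the largest posterior probability and that entropy is the Shannon entropy in natural-log units; the text does not state the log base. It also prints the values for a uniform posterior, which gives the lowest possible confidence (1/K) and the highest possible entropy (ln K) with K classes. For example, the reported mean confidence of about 0.16 in the 7-class case can be compared with 1/7 ≈ 0.143.

```python
import math

def posterior_confidence(p):
    """Largest posterior probability, i.e. the probability of the predicted class."""
    return max(p)

def entropy(p):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

def summarize(proba):
    """Mean confidence and mean entropy over a list of posterior vectors."""
    n = len(proba)
    return (sum(posterior_confidence(p) for p in proba) / n,
            sum(entropy(p) for p in proba) / n)

# Uniform posteriors: the limiting case as probability mass spreads over K classes.
for k in (2, 4, 6, 7):
    u = [1.0 / k] * k
    print(k, round(posterior_confidence(u), 3), round(entropy(u), 3))
# 2 0.5 0.693
# 4 0.25 1.386
# 6 0.167 1.792
# 7 0.143 1.946
```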

Artificially controlling class cardinality isolates the effect of the class count alone. Comparing different variables (‘Category’, ‘Season’, ‘Payment Method’, etc.) instead introduces confounding factors beyond class cardinality, such as differences in class balance and in how well the features relate to each target.
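The text does not specify how the original labels were merged. The sketch below shows one hypothetical scheme. It maps an ordered list of seven original classes onto K contiguous groups of near-equal size (K = 2, 4, 6, 7) and leaves everything else unchanged. The function names, the placeholder class names, and the class ordering are assumptions. Any merging rule also changes the class proportions, so the choice of grouping can alter class balance.

```python
def merge_map(original_classes, k):
    """Hypothetical rule: split the ordered class list into k contiguous groups.

    The group index is floor(i * k / n), so every group is non-empty and
    group sizes differ by at most one.
    """
    n = len(original_classes)
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and the number of original classes")
    return {c: i * k // n for i, c in enumerate(original_classes)}

def merge_labels(labels, mapping):
    """Relabel every sample with its merged group index."""
    return [mapping[y] for y in labels]

original = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]   # placeholder class names
for k in (2, 4, 6, 7):
    m = merge_map(original, k)
    print(k, [m[c] for c in original])
# 2 [0, 0, 0, 0, 1, 1, 1]
# 4 [0, 0, 1, 1, 2, 2, 3]
# 6 [0, 0, 1, 2, 3, 4, 5]
# 7 [0, 1, 2, 3, 4, 5, 6]
```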

3.3. Real-Variable Validation of Class Cardinality Effects

In addition to the controlled variation of class cardinality, real-variable validation was performed using categorical variables with naturally occurring class counts (4, 6, or 7 classes). Figure 4 presents the class cardinality effect for these real categorical variables. It confirms the controlled experimental findings: increasing class cardinality is associated with reduced posterior confidence and increased entropy and log-loss. Table 5 shows that the four-class variables (‘Category’, ‘Size’, and ‘Season’) exhibit relatively higher confidence and lower uncertainty. The six-class variables (‘Payment Method’ and ‘Shipping Type’) and the seven-class variable (‘Frequency of Purchases’) display progressively lower confidence and higher entropy and log-loss. These consistent trends demonstrate that the influence of class cardinality extends beyond controlled experiments and is also evident in real-world categorical variables.

3.4. Impact of Class Imbalance and Statistical Significance Testing

Next, the impact of class imbalance on posterior probability behavior is examined using a Gaussian Naive Bayes classifier. Two categorical variables are analyzed under 5-fold cross-validation: one with a naturally imbalanced class distribution (‘Category’) and one with a more uniform distribution (‘Frequency of Purchases’). First, the model is trained and evaluated on the original data to obtain the macro F1 score, posterior probability variance, and confusion entropy. Next, a balanced dataset is created by randomly undersampling the majority classes. The model is then retrained on the balanced dataset and the evaluation metrics are recomputed. Comparing the imbalanced and balanced datasets shows how unequal class distributions affect probability confidence and uncertainty.
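The balancing step can be sketched as random undersampling, in line with the method description in Section 2. In this minimal sketch, each class is reduced to the size of the smallest class. The target size, the seed, the function name, and the toy rows are assumptions, since the text does not state them. The imbalanced-learn package offers the same operation as `RandomUnderSampler`.

```python
import random
from collections import Counter, defaultdict

def random_undersample(X, y, seed=0):
    """Randomly keep n_min samples per class, where n_min is the smallest class size."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    rng = random.Random(seed)
    keep = []
    for label in sorted(by_class):       # fixed order makes the result reproducible
        keep.extend(rng.sample(by_class[label], n_min))
    keep.sort()                          # restore the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical rows: (age, purchase amount) pairs with an imbalanced label.
X = [(25, 40), (31, 55), (47, 80), (52, 20), (38, 66), (29, 91), (60, 35)]
y = ["a", "a", "a", "a", "b", "b", "c"]
Xb, yb = random_undersample(X, y)
print(Counter(yb))                       # one sample of each class
```

In a cross-validated evaluation, resampling would typically be applied to the training folds only, so that each test fold keeps its original class distribution. The text does not state which variant was used.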

Table 6, which compares the original and balanced evaluations, clearly shows the impact of class imbalance on posterior probability behavior for the ‘Category’ variable. Balancing the class distribution increases the macro F1 score from 0.154 to 0.293, indicating improved recognition of less frequent classes. Balancing also reduced overconfident predictions: the higher posterior probability variance and entropy suggest more realistic probability estimates. In contrast, ‘Frequency of Purchases’ changes very little after balancing. Its macro F1 score increases only marginally, from 0.101 to 0.111, and the changes in probability variance and entropy are negligible. This stability is consistent with the variable already being balanced, so its prediction uncertainty is influenced more by the large number of classes than by class imbalance. These results show that class imbalance strongly affects prediction confidence and model performance when one class clearly dominates. For variables that are already balanced, resampling produces little change in performance or uncertainty.

Additional statistical tests were performed to check whether the observed differences between models were statistically significant or attributable to random variation. Using 5-fold cross-validation, Gaussian Naive Bayes was compared with the best-performing classifier on a target variable whose class distribution is already balanced. Accuracy and log-loss are computed for each fold, and paired statistical tests are applied across the folds. The paired t-test compares the mean performance of the two models under the assumption that the fold-wise differences are normally distributed. The Wilcoxon signed-rank test does not require this assumption. Applying both tests makes the conclusions about model performance differences more robust.
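The two paired tests can be sketched on fold-level scores. The helpers below are hypothetical stand-ins for library routines such as `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`. The first returns only the paired t statistic; its p-value would come from a t distribution with n − 1 degrees of freedom. The second computes the exact two-sided Wilcoxon signed-rank p-value by enumerating all 2ⁿ sign patterns, assuming no zero or tied differences. The fold scores are invented for illustration. When all five folds favor the same model, the exact p-value is 2/2⁵ = 0.0625, the smallest value attainable with five folds.

```python
import math
from itertools import product

def paired_t_statistic(a, b):
    """t = mean(d) / (sd(d) / sqrt(n)) for the paired differences d = a - b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    if var == 0:
        raise ValueError("t is undefined when all paired differences are equal")
    return mean / math.sqrt(var / n)

def wilcoxon_exact_p(a, b):
    """Exact two-sided Wilcoxon signed-rank p-value (assumes no zero or tied |d|)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = {i: r for r, i in enumerate(order, start=1)}
    w_plus = sum(ranks[i] for i in range(n) if d[i] > 0)  # rank sum of positive d
    total = n * (n + 1) // 2
    extreme = min(w_plus, total - w_plus)
    # Under H0, each rank is positive or negative with probability 1/2.
    null = [sum(r for r, s in zip(range(1, n + 1), signs) if s)
            for signs in product((0, 1), repeat=n)]
    hits = sum(1 for w in null if w <= extreme or w >= total - extreme)
    return hits / len(null)

# Hypothetical per-fold log-loss values (not the study's data).
gnb = [1.94, 1.96, 1.95, 1.93, 1.97]
rf = [8.10, 9.50, 8.70, 9.20, 9.10]
print(paired_t_statistic(gnb, rf) < 0)   # True: GNB has the lower log-loss
print(wilcoxon_exact_p(gnb, rf))         # 0.0625
```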

‘Frequency of Purchases’ is chosen as the target variable because its class distribution is already balanced, minimizing class imbalance effects. The models compared are Gaussian Naive Bayes (GNB) and Random Forest (RF), which is considered the best-performing model in this case. Accuracy and log-loss are computed for each cross-validation fold, and the paired t-test and Wilcoxon signed-rank test are applied as described above. The decision rule uses a 0.05 significance level. A p-value below 0.05 indicates a statistically significant difference, whereas p ≥ 0.05 means the difference cannot be distinguished from chance variation.

GNB and RF both achieve an accuracy of 0.1390, and both the paired t-test and the Wilcoxon signed-rank test give a p-value of 1.0. The two models achieve identical accuracy in every cross-validation fold, so all fold-wise differences are zero and there is no difference to detect. Consequently, neither the parametric nor the non-parametric test indicates statistical significance, and in terms of accuracy alone the two models are indistinguishable for this target variable.

For log-loss, GNB achieves 1.95 and RF achieves 8.93. The paired t-test gives a p-value of 3.07 × 10⁻⁵, and the Wilcoxon signed-rank test gives 0.0625. GNB attains a lower log-loss than RF, indicating better probability estimates. The paired t-test strongly rejects the null hypothesis, demonstrating a statistically significant improvement in log-loss under the assumption of normally distributed differences. The Wilcoxon p-value is slightly above the 0.05 significance level, but this reflects the small number of folds rather than inconsistent results. Under the exact form of the test, which the reported value matches, 0.0625 (= 2/2⁵) is the smallest two-sided p-value attainable with five paired observations. It is reached only when all five folds favor the same model. Although GNB and RF achieve the same accuracy, GNB therefore performs better in estimating class probabilities, as shown by its lower log-loss and the significant paired t-test result.

4. Discussion

The present study investigated how class cardinality and class imbalance affect multiclass classification using an e-commerce dataset. The comparative analysis of five machine learning classifiers showed that classification tasks become more difficult as the number of classes increases. For all models, classification accuracy and macro F1 scores declined as the number of classes increased from 4 to 7. This indicates that higher class cardinality introduces greater uncertainty into the decision-making process. GNB, LR, and SVM often achieved relatively high accuracy but lower macro F1 scores, suggesting that they were less successful at classifying less frequent classes. In contrast, DT and RF obtained higher macro F1 scores, showing more balanced performance across classes in multiclass problems.

One of the major findings of this study is that having more classes makes predictions more uncertain. The controlled experiments varied the number of classes while keeping the features and classifier fixed. They showed decreased posterior (prediction) confidence along with increased entropy and log-loss. This behavior occurs because the classifier must distribute its probability mass across a larger number of competing classes. This lowers the maximum posterior probability and increases entropy, indicating more uncertainty in the classification process [17,18]. The increase in log-loss shows that the model assigns less probability to the correct class as the number of classes increases. The study thus shows that increasing the number of classes increases prediction uncertainty even when the classifier and input features remain the same.

The validation experiments with real variables confirmed that the class cardinality effect extends beyond the controlled experiments. Categorical variables with 6 and 7 classes exhibited lower probability confidence along with higher entropy and log-loss values than those with 4 classes. The similar trends in the controlled and real-variable analyses strengthen the conclusion that class cardinality exerts a substantial influence on probabilistic classification behavior. This observation is especially important for e-commerce applications, where tasks such as customer segmentation, product classification, and purchase frequency estimation often involve many categories.

The study also found that class imbalance affects model performance differently from class cardinality. Balancing the imbalanced ‘Category’ variable improved the macro F1 score, indicating better classification of less frequent classes. By contrast, balancing the already well-distributed ‘Frequency of Purchases’ variable produced only marginal changes. These results suggest that class imbalance affects model performance mainly when some classes have many more samples than others. Class cardinality, in contrast, influences the overall uncertainty of the classification problem. Class cardinality and class imbalance should therefore be treated as distinct, though related, factors in multiclass classification.

The primary contribution of this study lies in isolating the effect of class cardinality on probabilistic confidence, entropy, and log-loss while also examining the role of class imbalance. Previous studies focus primarily on comparing machine learning classifiers or correcting class imbalance. The present work, by contrast, demonstrates that increasing class cardinality itself contributes to prediction uncertainty. The combination of controlled experiments and real-variable validation provides strong evidence that class cardinality should be considered an important factor when developing reliable e-commerce classification and decision support systems.

5. Business Implications

The results of this study have important implications for the design and deployment of e-commerce analytics systems. As the number of product categories increases, prediction confidence decreases and uncertainty increases, which can negatively affect recommendation and inventory management systems. Customer grouping approaches that create a large number of highly specific customer groups may reduce predictive reliability. This in turn makes it more difficult to identify customer preferences and purchasing behavior accurately. Study [19] demonstrates that advanced machine learning models improve customer segmentation and prediction performance, underscoring the importance of reliable classification for marketing and personalization strategies. In marketing analytics, businesses should consider more than classification accuracy when increasing the number of customer or product classes. Uncertainty metrics such as entropy and log-loss offer important insights into the confidence and reliability of model predictions. Study [20] proposes uncertainty-based classification techniques that explicitly quantify uncertainty in categories. It emphasizes the importance of uncertainty measures alongside accuracy when making business decisions. The results also suggest that personalization systems with many classes may need additional customer information, behavioral data, or improved classification methods to maintain good predictive performance. Overall, the study emphasizes the importance of balancing the number of categories against predictive reliability when designing data-driven decision support systems for e-commerce applications.

6. Conclusions

This study explored the effects of class cardinality and class imbalance on multiclass classification performance in an e-commerce setting using five common machine learning classifiers. The results demonstrate that increasing class cardinality significantly increases classification difficulty, leading to lower accuracy, reduced macro F1 scores, decreased posterior confidence, and higher entropy and log-loss values. Decision Tree and Random Forest models achieved more balanced performance across classes, with generally higher macro F1 scores, whereas Gaussian Naive Bayes consistently produced the lowest log-loss, indicating better probabilistic calibration and confidence estimation.

A major contribution of this work is the demonstration that class cardinality acts as an independent source of predictive uncertainty. In controlled experiments that varied the number of classes while keeping the classifier and predictor variables fixed, we observed a clear reduction in posterior confidence and a corresponding increase in entropy and log-loss as class cardinality increased. These findings were further validated using real categorical variables with naturally occurring class counts, confirming that the observed trends hold beyond the controlled setting.
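One way to read this trend is against a structural ceiling: for K classes, the entropy of any predictive distribution is at most ln K (in nats), the value reached by the uniform distribution, so the ceiling itself rises with class cardinality. The short sketch below compares the mean entropies reported in Table 4 with ln K, assuming those entropies are measured in natural logarithms; the helper `max_entropy` is introduced only for illustration.

```python
import math

# Mean entropy of Gaussian Naive Bayes predictions for each number of classes K (Table 4).
table4_entropy = {2: 0.681177521, 4: 1.348275726, 6: 1.746593782, 7: 1.9436029}


def max_entropy(k):
    """Entropy of the uniform distribution over k classes, in nats: ln k."""
    return math.log(k)


for k, h in table4_entropy.items():
    ratio = h / max_entropy(k)
    print(f"K={k}: mean entropy {h:.4f}, ln K = {max_entropy(k):.4f}, ratio {ratio:.3f}")
```

Each ratio lies between 0.97 and 1.00, so at every cardinality the mean predictive entropy stays close to its maximum possible value.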

The study also showed that class imbalance and class cardinality influence classification performance through different mechanisms. Class imbalance primarily affects the recognition of less frequent classes, whereas class cardinality exerts a broader influence on probabilistic confidence and prediction uncertainty, even in relatively balanced datasets.
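Macro F1 weights every class equally, which is why it is sensitive to how the less frequent classes are recognized. In the toy example below (hypothetical labels, not data from this study), misclassifying a single minority-class sample still leaves 90% accuracy, but that class contributes an F1 of zero and pulls macro F1 down to about 0.64. The helpers `macro_f1` and `accuracy` are written for illustration only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because c
        # occurs in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: six "A", three "B" and one minority-class "C" sample.
y_true = ["A"] * 6 + ["B"] * 3 + ["C"]
# The model gets every A and B right but labels the single C sample as A.
y_pred = ["A"] * 6 + ["B"] * 3 + ["A"]

print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.641
```

Accuracy barely registers the minority-class failure, whereas macro F1 loses a full third of its range, which is why the two metrics are reported together.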

Overall, the results emphasize the importance of considering both class cardinality and class imbalance, together with probabilistic evaluation measures, when assessing multiclass classification models. The results have practical implications for customer segmentation, product categorization, marketing analytics, and personalization systems in e-commerce: increasing the number of categories may improve business specificity, but it can also reduce predictive reliability and increase uncertainty. Future work may extend this analysis to larger datasets, additional machine learning and deep learning models, and hierarchical classification frameworks for managing high-cardinality prediction tasks.

Author Contributions

Conceptualization, M.S. and I.G.; methodology, M.S. and P.M.; software, I.G.; validation, P.M., I.G. and M.S.; formal analysis, J.P. and M.S.; investigation, P.M.; resources, I.G.; data curation, P.M. and S.S.; writing (original draft preparation), M.S.; writing (review and editing), P.M.; visualization, I.G.; supervision, P.M.; project administration, I.G.; funding acquisition, I.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors thank Techno India University, West Bengal, for its support during this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Valencia-Arias, A.; Uribe-Bedoya, H.; González-Ruiz, J.D.; Santos, G.S.; Ramírez, E.C.; Rojas, E.M. Artificial intelligence and recommender systems in e-commerce. Trends and research agenda. Intell. Syst. Appl. 2024, 24, 200435.
2. Madanchian, M. The Role of Complex Systems in Predictive Analytics for E-Commerce Innovations in Business Management. Systems 2024, 12(10), 415.
3. Anthoniraj, S.; Kumar, A.N.; Hemakumar Reddy, G.; Raju, M. Classification of Imbalanced Data in E-Commerce. In Proceedings of the International Conference on Smart and Sustainable Technologies in Energy and Power Sectors (SSTEPS), Mahendragarh, India, 2022; pp. 204–209.
4. Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Philip Chen, C.L. A survey on imbalanced learning: latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137.
5. Suguna, R.; Suriya Prakash, J.; Aditya Pai, H.; Mahesh, T.R.; Vinoth Kumar, V.; Yimer, T.E. Mitigating class imbalance in churn prediction with ensemble methods and SMOTE. Sci. Rep. 2025, 15(1), 16256.
6. Xu, F.; Pan, Z.; Xia, R. E-commerce product review sentiment classification based on a naïve Bayes continuous learning framework. Inf. Process. Manag. 2020, 57(5), 102221.
7. Deniz, E.; Erbay, H.; Coşar, M. Multi-Label Classification of E-Commerce Customer Reviews via Machine Learning. Axioms 2022, 11(9), 436.
8. Lei, B.; Wang, J.; Shen, C. Automatic classification method of e-commerce commodity raw materials through the introduction of self-supervised concepts and the construction of domain ontology. Sci. Rep. 2026, 16(1), 8058.
9. Rodriguez, J.D.; Perez, A.; Lozano, J.A. Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32(3), 569–575.
10. Abedin, T.; Xu, H.; Uddin, S. The impact of K selection in K-fold cross-validation on bias and variance in supervised learning models. Sci. Rep. 2026, 16(1), 6084.
11. Rainio, O.; Teuho, J.; Klén, R. Evaluation metrics and statistical tests for machine learning. Sci. Rep. 2024, 14(1), 6086.
12. Farhadpour, S.; Warner, T.A.; Maxwell, A.E. Selecting and Interpreting Multiclass Loss and Accuracy Assessment Metrics for Classifications with Class Imbalance: Guidance and Best Practices. Remote Sens. 2023, 16(3), 533.
13. Stern, H.S. Bayesian Statistics. In International Encyclopedia of the Social & Behavioral Sciences; Smelser, N.J., Baltes, P.B., Eds.; Pergamon, 2001; pp. 1052–1056. ISBN 9780080430768.
14. Ramos, D.; Franco-Pedroso, J.; Lozano-Diez, A.; Gonzalez-Rodriguez, J. Deconstructing Cross-Entropy for Probabilistic Binary Classifiers. Entropy 2018, 20(3), 208.
15. Warren, E.M.; Handley, J.C.; Sheets, H.D. Cross entropy and log likelihood ratio cost as performance measures for multi-conclusion categorical outcomes scales. J. Forensic Sci. 2025, 70(2), 589–606.
16. Cleophas, T.J.; Zwinderman, A.H. Paired Continuous Data (Paired T-Test, Wilcoxon Signed Rank Test, 10 Patients). In SPSS for Starters and 2nd Levelers; Springer: Cham, 2016.
17. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, 2006.
18. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley, 2006.
19. Balasundaram, E.; Aranganathan, P.; Annavajjala, K.S.; Sivakumar, R.; Arumugam, M.; Vinoth, A. A Hybrid Approach for Customer Segmentation and Loyalty Prediction in E-Commerce. Prabandhan Indian J. Manag. 2024, 17(10), 56–69.
20. Zhang, J.; Qiu, Y.; Dong, L. Conformal deep forest for uncertainty-aware classification. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 155.

Figure 1. Comparison of classification accuracy across six multiclass e-commerce prediction tasks.

Figure 2. Comparison of macro F1 scores across six multiclass e-commerce prediction tasks.

Figure 3. Effect of Class Cardinality on Posterior Confidence, Entropy, and Log-Loss.

Figure 4. Real-Variable Validation of the Effect of Class Cardinality on Confidence, Entropy, and Log-Loss.

Table 1. Summary of the categorical variables with 4 classes.

Category		Season		Size
Classes	Frequency	Classes	Frequency	Classes	Frequency
Accessories	1240 (31.79%)	Fall	975 (25.00%)	L	1053 (27.00%)
Clothing	1737 (44.54%)	Spring	999 (25.62%)	M	1755 (45.00%)
Footwear	599 (15.36%)	Summer	955 (24.49%)	S	663 (17.00%)
Outerwear	324 (8.31%)	Winter	971 (24.90%)	XL	429 (11.00%)
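Table 1 also yields two feature-free reference points for reading the Category results in Table 3. A classifier that always predicts the most frequent class (Clothing) has accuracy 1737/3900 ≈ 0.4454 and, because only that class receives a nonzero F1, a macro F1 of about 0.154. A predictor that ignores the features and always outputs the empirical class shares has a log-loss equal to the entropy of the class distribution, about 1.219 nats. For comparison, Table 3 reports an accuracy of 0.4454 and a macro F1 of 0.1540 for GNB, LR and SVM on Category, and a log-loss of 1.2204 for GNB. The sketch below computes these baselines on the full dataset rather than per cross-validation fold, so small differences from fold-averaged values are expected; `feature_free_baselines` is a hypothetical helper.

```python
import math

# Class counts for the Category variable, taken from Table 1.
category_counts = {"Accessories": 1240, "Clothing": 1737, "Footwear": 599, "Outerwear": 324}


def feature_free_baselines(counts):
    """Accuracy and macro F1 of always predicting the majority class, and the
    log-loss (in nats) of always predicting the empirical class shares."""
    n = sum(counts.values())
    shares = [c / n for c in counts.values()]
    p_major = max(shares)
    # Majority-class predictor: accuracy equals the majority share.
    accuracy = p_major
    # Only the majority class gets a nonzero F1 (precision p_major, recall 1);
    # every other class scores 0, and macro F1 averages over all classes.
    macro_f1 = (2 * p_major / (1 + p_major)) / len(shares)
    # Prior predictor: log-loss equals the entropy of the class distribution.
    prior_log_loss = -sum(p * math.log(p) for p in shares)
    return accuracy, macro_f1, prior_log_loss


acc, f1, ll = feature_free_baselines(category_counts)
print(f"accuracy {acc:.4f}, macro F1 {f1:.4f}, log-loss {ll:.4f}")
```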

Table 2. Summary of the categorical variables with 6 and 7 classes.

Payment method		Shipping type		Frequency of purchases
Classes	Frequency	Classes	Frequency	Classes	Frequency
Bank Transfer	612 (15.69%)	2-Day Shipping	627 (16.08%)	Annually	572 (14.67%)
Cash	670 (17.18%)	Express	646 (16.56%)	Bi-Weekly	547 (14.03%)
Credit Card	671 (17.21%)	Free Shipping	675 (17.31%)	Every 3 months	584 (14.97%)
Debit Card	636 (16.31%)	Next Day Air	648 (16.62%)	Fortnightly	542 (13.90%)
PayPal	677 (17.36%)	Standard	654 (16.77%)	Monthly	553 (14.18%)
Venmo	634 (16.26%)	Store Pickup	650 (16.67%)	Quarterly	563 (14.44%)
				Weekly	539 (13.82%)

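Table 3. Comparative analysis of five machine learning classifiers for six multiclass e-commerce classification tasks (GNB, Gaussian Naive Bayes; LR, Logistic Regression; DT, Decision Tree; RF, Random Forest; SVM, Support Vector Machine).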

	Category			Size			Season
Model	Accuracy	Macro F1	Log-loss	Accuracy	Macro F1	Log-loss	Accuracy	Macro F1	Log-loss
GNB	0.4454	0.1540	1.2204	0.2531	0.2357	1.3868	0.4500	0.1551	1.2594
LR	0.4454	0.1540	36.0437	0.2613	0.2399	36.0437	0.4500	0.1551	36.0437
DT	0.3608	0.2346	8.7996	0.2559	0.2551	11.0777	0.3415	0.2391	9.2554
RF	0.3438	0.2386	3.8950	0.2592	0.2586	3.6113	0.3395	0.2617	3.6257
SVM	0.4454	0.1540	36.0437	0.2551	0.2497	36.0437	0.4500	0.1551	36.0437
	Payment Method			Shipping Type			Frequency of Purchase
Model	Accuracy	Macro F1	Log-loss	Accuracy	Macro F1	Log-loss	Accuracy	Macro F1	Log-loss
GNB	0.1633	0.1208	1.7960	0.1690	0.1398	1.7948	0.1395	0.1030	1.9508
LR	0.1536	0.0974	36.0437	0.1669	0.1363	36.0437	0.1405	0.0967	36.0437
DT	0.1662	0.1653	16.0100	0.1692	0.1681	15.8230	0.1456	0.1439	17.4117
RF	0.1615	0.1609	7.1804	0.1649	0.1648	6.8832	0.1526	0.1522	9.0245
SVM	0.1608	0.1368	36.0437	0.1774	0.1707	36.0437	0.1467	0.1343	36.0437
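Every LR and SVM log-loss in Table 3 equals 36.0437, irrespective of task. This number coincides with −ln ε for the float64 machine epsilon ε = 2^−52 ≈ 2.22 × 10^−16, which is the largest per-sample loss a log-loss computation can return when predicted probabilities are clipped from below at ε. Since a mean equals its upper bound only when every term does, under such a clipping rule a value at this ceiling indicates that the true class received a clipped, effectively zero, probability for every evaluated sample. The sketch below illustrates this; the clipping rule is an assumption about how the metric was computed, and `clipped_log_loss` is a hypothetical helper.

```python
import math
import sys

# Float64 machine epsilon, 2**-52, used here as an assumed lower clip on probabilities.
EPS = sys.float_info.epsilon


def clipped_log_loss(p_true):
    """Mean negative log-probability of the true class, each clipped below at EPS."""
    return sum(-math.log(max(p, EPS)) for p in p_true) / len(p_true)


# Largest per-sample loss this rule can return: -ln(2**-52) = 52 ln 2.
ceiling = -math.log(EPS)
print(round(ceiling, 4))  # 36.0437

# The mean reaches the ceiling only if every true-class probability is clipped.
print(round(clipped_log_loss([0.0, 0.0, 0.0]), 4))  # 36.0437
print(round(clipped_log_loss([0.0, 0.5, 0.0]), 4))  # below the ceiling
```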

Table 4. Effect of Class Cardinality on Probability Confidence (Gaussian Naive Bayes).

Number of Classes	Mean Confidence	Mean Entropy	Log Loss
2	0.576442871	0.681177521	0.682370808
4	0.296332834	1.348275726	1.352318492
6	0.284825849	1.746593782	1.752378223
7	0.156502523	1.9436029	1.950618192

Table 5. Real-variable validation of the class cardinality effect (Gaussian Naive Bayes).

Variable	Number of Classes	Mean Confidence	Mean Entropy	Log Loss
Category	4	0.445714555	1.217489845	1.219867201
Size	4	0.271413979	1.383760718	1.387446215
Season	4	0.449923314	1.255456913	1.258927761
Payment Method	6	0.182427439	1.789578089	1.795229106
Shipping Type	6	0.182400476	1.789331466	1.793801015
Frequency of Purchases	7	0.156502523	1.9436029	1.950618192

Table 6. Impact of Class Imbalance on Probability Measures using Gaussian Naive Bayes.

Variable	Macro F1 Original	Macro F1 Balanced	Prob Variance Original	Prob Variance Balanced	Confusion Entropy Original	Confusion Entropy Balanced
Category	0.154035241	0.293478374	0.000155742	0.000165245	1.217532035	1.382238689
Frequency of Purchases	0.100641123	0.111198672	3.23E-05	3.21E-05	1.943562775	1.944018522

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.