Submitted:
13 October 2025
Posted:
14 October 2025
Abstract
Keywords:
1. Introduction
1.1. Motivation and Problem Definition
- Adaptive Uncertainty Calibration: Continuously refines confidence estimates using Wasserstein-based temperature scaling and cross-validation to align predictions with true uncertainty, mitigating overconfidence [12].
- Diversity-Driven Sampling: Employs multi-resolution hashing to select representative samples, overcoming the redundancy of uncertainty-based methods.
- Dynamic Pseudo-Labeling Thresholds: Adjusts thresholds in real-time via a composite objective function, reducing error propagation compared to static or heuristic thresholds [6].
1.2. Advances and Limitations in Semi-Supervised Learning
1.3. Integrating Advances Across Domains
1.4. Comprehensive Comparison with Recent Studies
1.5. Identified Gaps and Unresolved Challenges
- Overconfident Predictions: Skew uncertainty estimates, leading to unreliable pseudo-labels [3].
- Redundant Sample Selection: Uncertainty-driven methods select similar examples, reducing diversity [4].
- Static Thresholding: Fails to adapt to data or model changes, causing error propagation [5].
- Fragmented Approaches: Lack synergy across calibration, diversity, and thresholding [7].
- Scalability: Computationally intensive methods limit applicability to large datasets.
1.6. Synthesis of Contributions and Novel Framework
- Adaptive Calibration: Uses Wasserstein-based temperature scaling and cross-validation to refine confidence estimates, outperforming static calibration methods [12].
- Diversity-Promoting Sampling: Employs multi-resolution hashing to ensure representative sample selection, overcoming redundancy in uncertainty-based methods.
- Dynamic Thresholding: Optimizes a composite objective function via ternary search, reducing error propagation compared to fixed thresholds [6].
- Scalability and Integration: Combines efficient components to scale to large datasets, unlike computationally heavy alternatives.
1.7. Structure of the Paper
2. Proposed Model
2.1. Base Classifier: LightGBM
2.2. Adaptive Calibration

2.2.1. Theoretical Justification for Tanh-Based Scaling
- Bounded Output Range: The tanh function naturally bounds the temperature parameter T within a practical range for typical Wasserstein distances. This prevents two failure modes: (i) under-smoothing, in which the temperature stays too close to 1 and probabilities remain overconfident and miscalibrated; (ii) over-smoothing, in which the temperature grows too large and probabilities become uninformative and overly uniform. The bounded nature of tanh ensures stable behavior across diverse datasets.
- Smooth, Nonlinear Growth: Unlike linear scaling, the tanh function exhibits smooth, S-shaped growth. It is:
- More responsive to moderate miscalibration, where adjustment is most beneficial.
- Less sensitive to small distances, avoiding over-reaction to minor fluctuations.
- Saturating for extreme miscalibration, preventing unbounded temperature increases.
This smooth growth pattern aligns with empirical observations that calibration benefits diminish beyond a certain threshold, and excessive temperature can destroy informative predictions.
- Theoretical Grounding: The tanh transformation has been successfully employed in adaptive systems across multiple domains—as an activation function in neural networks for smooth, bounded nonlinearity, and in control theory for smooth response to error signals. Its mathematical properties (differentiability, monotonicity, bounded range) make it well-suited for adaptive parameter adjustment.
We conducted ablation experiments comparing tanh with three alternatives:
- Linear scaling: produced excessive temperatures at large Wasserstein distances, degrading F1 by 4.2%.
- Sigmoid scaling: was less sensitive to moderate miscalibration, reducing AUC by 2.8%.
- Exponential scaling: was unstable at large distances, causing convergence issues.
The tanh-based approach achieved the best balance, maintaining a low expected calibration error (ECE) while preserving discriminative performance (AUC on Sparkify). These results confirm that the smooth, bounded growth of tanh handles miscalibration severity without over- or under-smoothing.
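As a concrete illustration, the mapping from a Wasserstein miscalibration distance W to a temperature can be written in the form T = 1 + α·tanh(β·W). The sketch below uses illustrative constants α and β (not the paper's fitted values) to show the bounded, saturating behavior discussed above:

```python
import numpy as np

def adaptive_temperature(w_dist, alpha=2.0, beta=1.0):
    """Map a Wasserstein miscalibration distance to a temperature.

    Illustrative form T = 1 + alpha * tanh(beta * W): bounded in
    (1, 1 + alpha), responsive for moderate W, saturating for large W.
    alpha and beta are assumed constants, not the paper's exact values.
    """
    return 1.0 + alpha * np.tanh(beta * np.asarray(w_dist, dtype=float))

def temperature_scale(probs, T):
    """Soften class probabilities by temperature T (T > 1 smooths)."""
    logits = np.log(np.clip(probs, 1e-12, 1.0))
    z = logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# No miscalibration leaves probabilities untouched; moderate distances
# get a moderate temperature; extreme distances saturate near 1 + alpha.
print(adaptive_temperature(0.0))   # 1.0
print(adaptive_temperature(0.5))   # ~1.92
print(adaptive_temperature(10.0))  # ~3.0 (saturated)
```

Note how an overconfident pair like (0.9, 0.1) is softened toward (0.75, 0.25) at T = 2, without losing the class ordering.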
2.2.2. Empirical Validation of Calibration
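Calibration quality of the kind validated in this section is typically quantified by the expected calibration error (ECE); a minimal binned-ECE sketch for binary confidences (the 10-bin scheme is an illustrative assumption):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Standard binned ECE for binary positive-class probabilities.

    Bins predictions by confidence and averages the |accuracy -
    confidence| gap, weighted by each bin's population.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = np.maximum(probs, 1.0 - probs)      # confidence in the argmax class
    pred = (probs >= 0.5).astype(int)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    bins = np.digitize(conf, edges[1:-1])      # bin index 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

A perfectly confident and correct predictor has ECE 0; an overconfident one (e.g., 99% confidence, 50% accuracy) scores near 0.49.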
2.2.3. Comparative Analysis of Distance Metrics
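The trade-offs among the candidate metrics can be illustrated numerically; a minimal SciPy sketch with two made-up discrete distributions over a common support (the values are illustrative, not the paper's data):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# A peaked distribution p versus a uniform q on the same 5-point support.
support = np.arange(5)
p = np.array([0.1, 0.4, 0.3, 0.15, 0.05])
q = np.full(5, 0.2)

kl = np.sum(p * np.log(p / q))                        # KL(p || q), asymmetric
kl_rev = np.sum(q * np.log(q / p))                    # KL(q || p) differs
tv = 0.5 * np.sum(np.abs(p - q))                      # bounded in [0, 1]
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
w1 = wasserstein_distance(support, support, p, q)     # geometry-aware

print(round(kl, 4), round(kl_rev, 4), round(tv, 4),
      round(hellinger, 4), round(w1, 4))
```

The asymmetry of KL (kl != kl_rev) and the support-geometry sensitivity of the Wasserstein distance are exactly the properties contrasted in the comparison table.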
2.2.4. Rationale and Comparison: Adaptive Calibration
2.3. Diversity-Promoting Sampling

- Dimensionality Reduction with SVD: Initially, Singular Value Decomposition (SVD) [25] is applied to reduce the dimensionality of the unlabeled data. Specifically, we compute the truncated SVD of the n x d data matrix X, approximating it as X ≈ U_k Σ_k V_k^T, where U_k is a matrix of left singular vectors, Σ_k is a diagonal matrix of singular values, and V_k is a matrix of right singular vectors. We then project the data into a lower-dimensional subspace by computing Z = X V_k. The rank k is set to approx_rank, dynamically adjusted to be no larger than min(n, d) to ensure a valid decomposition. This step reduces the computational cost of subsequent hashing operations and captures the principal components of the data.
- Multi-Resolution Hashing: After dimensionality reduction, multi-resolution hashing [26] is performed to capture diversity at multiple scales. The reduced data Z is scaled by geometric factors of 0.5, 1.0, and 2.0, resulting in Z_s = s·Z for s ∈ {0.5, 1.0, 2.0}. For each scaled version, random hyperplanes are generated. Each hyperplane is defined by a normal vector w, sampled from a normal distribution and normalized by dividing by its norm: w = v/‖v‖ with v ~ N(0, I). The partition code for a data point z is given by c(z) = 1[wᵀz > 0], which yields a binary code [27]. This process generates a set of binary partition codes for each data point at each resolution.
- Unique Hash Selection: The binary partition codes from all scaling factors are horizontally stacked and converted to integers. A unique hash for each data point is computed by applying a hash function (e.g., Python’s built-in hash function) to the tuple of concatenated binary codes. These unique hash values are then used to identify a set of unique samples.
- Fallback Mechanism: In cases where the number of unique hashes is below the target sample size, a fallback mechanism is employed. Specifically, the sampler randomly selects additional samples from the unlabeled data until the desired number of samples is reached. This random sampling is performed without replacement to avoid duplicates, and typically activates when the target sample size exceeds 20% of the unlabeled pool, ensuring the diversity sampler always returns exactly T samples.
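The four steps above can be sketched compactly. In the sketch below, n_planes, the rank heuristic, and the seed are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def diversity_sample(X, target, n_planes=8, scales=(0.5, 1.0, 2.0), seed=0):
    """Sketch of the diversity-promoting sampler (assumes target <= len(X)).

    1) truncated SVD to reduce dimensionality,
    2) random-hyperplane sign codes at three geometric scales,
    3) keep one representative per unique concatenated code,
    4) random fallback (without replacement) to reach exactly `target`.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = min(8, min(n, d) - 1)                 # illustrative approx_rank
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                         # projection Z = X V_k
    codes = []
    for s in scales:
        W = rng.standard_normal((n_planes, k))
        W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit normals
        codes.append(((s * Z) @ W.T > 0).astype(np.uint8))
    codes = np.hstack(codes)                  # one binary row per point
    hashes = [hash(tuple(row)) for row in codes]
    _, first_idx = np.unique(hashes, return_index=True)
    chosen = list(first_idx[:target])
    if len(chosen) < target:                  # fallback: top up randomly
        rest = np.setdiff1d(np.arange(n), chosen)
        extra = rng.choice(rest, size=target - len(chosen), replace=False)
        chosen.extend(extra.tolist())
    return np.array(chosen[:target])
```

The returned indices are guaranteed distinct, and exactly `target` of them are produced whether or not the hash codes collide.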
2.3.1. Rationale and Comparison: Diversity Sampling
2.4. Dynamic Thresholding via an Adaptive Heuristic
2.4.1. Composite Objective Function
2.4.2. Adaptive Heuristic Parameter Adjustment
2.4.3. Algorithm and Workflow


2.4.4. Rationale and Comparison: Dynamic Threshold Optimization
2.5. Integrated Iterative Framework
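The integrated loop can be sketched end to end. The toy sketch below substitutes a nearest-centroid learner for LightGBM and a fixed confidence threshold for the adaptive one (both are stand-ins; the calibration, diversity sampling, and threshold optimization of Sections 2.2-2.4 are elided for brevity):

```python
import numpy as np

def centroid_fit(X, y):
    """Toy stand-in for the LightGBM base learner: class centroids."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def centroid_proba(model, X):
    """Softmax over negative distances to each class centroid."""
    d = np.stack([np.linalg.norm(X - mu, axis=1) for mu in model.values()], axis=1)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def iterative_framework(X_l, y_l, X_u, rounds=3, threshold=0.8):
    """Train, score the unlabeled pool, pseudo-label confident points,
    and retrain. Assumes labels are 0..K-1 (so argmax index == class)."""
    X_l, y_l, X_u = map(np.asarray, (X_l, y_l, X_u))
    for _ in range(rounds):
        if len(X_u) == 0:
            break
        model = centroid_fit(X_l, y_l)
        proba = centroid_proba(model, X_u)
        conf, pred = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= threshold              # fixed stand-in threshold
        if not keep.any():
            break
        X_l = np.vstack([X_l, X_u[keep]])     # absorb pseudo-labeled points
        y_l = np.concatenate([y_l, pred[keep]])
        X_u = X_u[~keep]
    return centroid_fit(X_l, y_l)
```

On two well-separated clusters, a single labeled point per class is enough for the loop to absorb the whole unlabeled pool.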

2.6. Computational Complexity Analysis
- Adaptive Calibration: Sorting the confidence scores requires O(n log n) time, and the subsequent temperature scaling and entropy computations operate in O(n).
- Diversity Sampling: The truncated SVD has complexity O(ndr), where the target rank r is significantly smaller than d. The multi-resolution hashing itself runs in time linear in the number of samples.
- Dynamic Threshold Optimization: The ternary search converges in O(log(1/ε)) iterations for a threshold precision ε. As each iteration involves calculating metrics over the evaluation set of size m, the cost per iteration is O(m).
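The threshold search itself is simple to sketch. The composite objective is abstracted as an arbitrary unimodal callable; the quadratic toy objective below is a stand-in, not the paper's function:

```python
def ternary_search(objective, lo=0.05, hi=0.95, tol=1e-3):
    """Maximize a unimodal objective(threshold) by ternary search.

    Converges in O(log((hi - lo) / tol)) iterations, each evaluating
    the objective twice; lo/hi/tol here are illustrative defaults.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1          # maximum lies in [m1, hi]
        else:
            hi = m2          # maximum lies in [lo, m2]
    return (lo + hi) / 2.0

# Toy unimodal stand-in for the composite objective, peaking at 0.6:
best = ternary_search(lambda t: -(t - 0.6) ** 2)
print(round(best, 3))
```

Because each iteration discards a third of the interval, roughly 17 iterations suffice to shrink a 0.9-wide range below 1e-3.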
Summary
3. Dataset Information, Preprocessing, and Feature Engineering
3.1. Dataset Selection Rationale
3.2. Dataset Information
3.2.1. Sparkify Dataset
3.2.2. IBM Telco Dataset
3.2.3. Bank Churn Dataset
3.2.4. CrowdAnalytix Dataset
3.3. Data Preprocessing
- Sparkify: Flattened nested JSON, removed users with missing IDs, converted timestamps, and encoded categorical values.
- IBM Telco: Dropped irrelevant columns such as Customer ID, Churn Label, and Churn Score, and encoded categorical variables using label encoding.
- Bank Churn: Encoded categorical features using one-hot encoding, normalized continuous variables, and dropped irrelevant columns like CustomerId and Surname.
- CrowdAnalytix: Categorical variables were label-encoded, missing numerical values were imputed using mean values, and all features were standardized.
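A minimal preprocessing sketch along these lines, with hypothetical column names standing in for the real schemas:

```python
import pandas as pd

# Toy frame; "contract"/"tenure"/"churn" are hypothetical stand-ins.
df = pd.DataFrame({
    "contract": ["monthly", "yearly", "monthly", "two-year"],
    "tenure":   [1.0, None, 24.0, 36.0],
    "churn":    [1, 0, 0, 0],
})

# Label-encode categoricals (as for IBM Telco / CrowdAnalytix).
df["contract"] = df["contract"].astype("category").cat.codes

# Mean-impute missing numerics (as for CrowdAnalytix).
df["tenure"] = df["tenure"].fillna(df["tenure"].mean())

# Standardize continuous features.
df["tenure"] = (df["tenure"] - df["tenure"].mean()) / df["tenure"].std()
```

Encoders and imputation statistics should of course be fitted on the labeled/training split only and then applied to the rest.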
3.4. Feature Engineering
- Sparkify: Session entropy, monthly session duration, average listen time, ad interruptions, and offline listening.
- IBM Telco: Service bundle compatibility score, contract duration bins, and monthly charge volatility.
- Bank Churn: Credit utilization ratio, customer activity score, and tenure-product interaction terms.
- CrowdAnalytix: Derived features such as call behavior metrics (dropped calls, unanswered calls, inbound-/outbound ratios), and tenure-based segmentation.
3.5. Feature Selection
- IBM Telco: Used Mutual Information [35] to identify 10 key features.
- Bank Churn: Used Mutual Information to identify the top 10 features, which included NumOfProducts, Age and IsActiveMember.
- CrowdAnalytix: Used Mutual Information to identify the top 10 features, which included AgeHH1, DroppedCalls, DroppedBlockedCalls, BlockedCalls and InboundCalls.
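Mutual-information ranking of discrete features can be sketched with a plug-in estimator (scikit-learn's mutual_info_classif offers an equivalent off-the-shelf route for mixed feature types):

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in MI (in nats) between two discrete variables:
    I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

# A feature identical to the label carries maximal information
# (log 2 nats for a balanced binary label); an independent one, none.
y = np.array([0, 0, 1, 1])
print(mutual_information(y, y))                       # ~0.693
print(mutual_information(np.array([0, 1, 0, 1]), y))  # 0.0
```

Ranking features by this score and keeping the top 10 reproduces the selection step in outline.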
4. Experiment and Results
4.1. Hyperparameter Tuning and Baseline Models
4.2. Performance Evaluation
5% Labeled Data:
10% Labeled Data:
4.3. Computational Efficiency
4.4. Summary of Findings
4.5. Robustness to Label Noise
5. Ablation Study
- No WCE: The Wasserstein-based adaptive calibration was replaced with a standard, non-adaptive calibration step, and uncertainty was measured using standard entropy.
- No ATO: The Adaptive Threshold Optimizer was replaced with a fixed pseudo-labeling threshold of 0.5.
- No Diversity: The diversity-promoting sampler was replaced by simple random sampling to select from the pool of uncertain samples.
6. Statistical Comparison & Performance Analysis
6.1. Statistical Comparison of the Proposed Model and Baselines
- Training with 5% and 10% labeled data scenarios.
- Five independent runs with different random seeds to ensure statistical reliability.
- Evaluation on consistent test sets to enable paired comparisons.
- Collection of multiple performance metrics (AUC, F1, TPR, FPR) for comprehensive analysis.
- Proposed vs. Pseudo-Labeling: Proposed model [0.9107, 0.9326] (mean AUC across seeds at 5% and 10% labeled data, respectively) vs. Pseudo-Labeling [0.8501, 0.8633].
- Proposed vs. BAAL: Proposed model [0.9107, 0.9326] vs. BAAL [0.9106, 0.9236].
- Proposed vs. Core Set Selection: Proposed model [0.9107, 0.9326] vs. Core Set Selection [0.9022, 0.9188].
- Proposed vs. CoMatch: Proposed model [0.9107, 0.9326] vs. CoMatch [0.8174, 0.8071].
- The proposed model significantly outperforms Pseudo-Labeling and CoMatch, as indicated by the Wilcoxon test (p < 0.001) and large effect sizes (Cohen’s d = 12.21 and 19.00, respectively), with significance confirmed across multiple experimental runs.
- Compared to Core Set Selection, the proposed model also demonstrates a significant improvement (p < 0.001), though with a more moderate effect size (Cohen’s d = 1.73).
- Against BAAL, the differences are not statistically significant (Wilcoxon p = 0.595, Cohen’s d = 0.03), suggesting comparable performance on this dataset.
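The paired-comparison procedure above can be reproduced in outline; a minimal SciPy sketch using made-up per-run AUCs for eight seeds (not the paper's measurements):

```python
import numpy as np
from scipy.stats import wilcoxon

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (a.mean() - b.mean()) / pooled

# Illustrative per-run AUCs (made-up numbers for eight seeds).
baseline = np.array([0.8501, 0.8493, 0.8510, 0.8497,
                     0.8505, 0.8489, 0.8502, 0.8496])
proposed = baseline + np.array([0.060, 0.061, 0.059, 0.062,
                                0.058, 0.063, 0.057, 0.064])

stat, p = wilcoxon(proposed, baseline)   # paired, per-seed comparison
d = cohens_d(proposed, baseline)
print(round(p, 4), round(d, 1))
```

With a consistent gap across all seeds, the exact two-sided p-value is 2/2^8 and the effect size is very large, mirroring the pattern reported for Pseudo-Labeling and CoMatch.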
6.2. Performance Analysis via Conditional Entropy and ROC Curves
7. Evaluation on Imbalanced Home Credit Dataset
7.1. Dataset Description and Preprocessing
7.2. Performance Comparison
7.3. Sensitivity Analysis
7.4. Scalability Analysis
7.5. Discussion
8. Conclusions, Limitations, and Future Work
Author Contributions
Funding
Data Availability Statement
Declaration of Generative AI and AI-Assisted Technologies in the Writing Process Statement
Conflicts of Interest
References
- Zheng, Y., Liu, Y., Qing, D., Zhang, W., Pan, X., Li, G.: A novel ensemble label propagation with hierarchical weighting for semi-supervised learning. Knowledge and Information Systems, 1–22 (2024).
- Wen, Z., Pizarro, O., Williams, S.: Active self-semi-supervised learning for few labeled samples. Neurocomputing 614, 128772 (2025).
- Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D., Lucic, M.: Revisiting the calibration of modern neural networks. NeurIPS (2021).
- Ash, J.T., Zhang, C., Krishnamurthy, A., Langford, J., Agarwal, A.: Deep batch active learning by diverse, uncertain gradient lower bounds. In: ICLR (2020).
- Zhou, Y., Wang, Y., Tang, P., Bai, S., Shen, W., Fishman, E., Yuille, A.: Semi-supervised medical image segmentation via uncertainty rectified pyramid consistency. Medical Image Analysis 64, 101744 (2020).
- Qu, A., Wu, Q., Yu, L., Li, J., Liu, J.: Class-specific thresholding for imbalanced semi-supervised learning. IEEE Signal Processing Letters (2024).
- Rizve, M.N., Duarte, K., Rawat, Y.S., Shah, M.: In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In: ICLR (2021).
- Xia, Y., Yang, D., Yu, Z., Liu, F., Cai, J., Yu, L., Zhu, Z., Xu, D., Yuille, A., Roth, H.: Uncertainty-aware deep co-training for semi-supervised medical image segmentation. Medical Image Analysis 65, 101766 (2020).
- Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.: Mixmatch: A holistic approach to semi-supervised learning. In: NeurIPS (2019).
- Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E.D., Kurakin, A., Zhang, H., Raffel, C.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In: NeurIPS (2020).
- Zhu, J., Fan, C., Yang, M., Qian, F., Mahalec, V.: A semi-supervised learning algorithm for high and low-frequency variable imbalances in industrial data. Computers & Chemical Engineering 193, 108933 (2025).
- Andéol, L., Kawakami, Y., Wada, Y., Kanamori, T., Müller, K.-R., Montavon, G.: Learning domain invariant representations by joint wasserstein distance minimization. Neural Networks 167, 233–243 (2023).
- Li, X., Hu, X., Qi, X., Zhao, L., Zhang, S.: Noise-robust semi-supervised learning via label propagation with uncertainty estimation. IEEE Transactions on Medical Imaging 41(5), 1123–1135 (2022).
- Luo, X., Zhao, Y., Qin, Y., Ju, W., Zhang, M.: Towards semi-supervised universal graph classification. IEEE Transactions on Knowledge and Data Engineering 36(1), 416–428 (2023).
- Xie, Q., Luong, M.-T., Hovy, E., Le, Q.V.: Self-training with noisy student improves imagenet classification. In: CVPR (2020).
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y.: Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017).
- Wu, B., Hong, S., Lin, R.: Explainable prediction for business process activity with transformer neural networks. Knowledge and Information Systems, 1–32 (2025).
- Mohr, F., Rijn, J.N.: Learning curves for decision making in supervised machine learning: a survey. Machine Learning 113, 8371–8425 (2024). [CrossRef]
- Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 1321–1330. PMLR (2017). https://proceedings.mlr.press/v70/guo17a.html.
- Platt, J., et al.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10(3), 61–74 (1999).
- Balanya, S.A., Maroñas, J., Ramos, D.: Adaptive temperature scaling for robust calibration of deep neural networks. Neural Computing and Applications 36(14), 8073–8095 (2024).
- Xu, L., Bai, L., Jiang, X., Tan, M., Zhang, D., Luo, B.: Deep rényi entropy graph kernel. Pattern Recognition 111, 107668 (2021).
- Hao, J., Ho, T.K.: Machine learning made easy: a review of scikit-learn package in python programming language. Journal of Educational and Behavioral Statistics 44(3), 348–361 (2019).
- Brennan, D.G.: Linear diversity combining techniques. Proceedings of the IRE 47(6), 1075–1102 (2007).
- Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., Lochbaum, K.E.: Information retrieval using a singular value decomposition model of latent semantic structure. SIGIR Forum 51(2), 90–105 (2017). [CrossRef]
- Wu, Q., Bauer, D., Doyle, M.J., Ma, K.-L.: Interactive volume visualization via multi-resolution hash encoding based neural representation. IEEE Transactions on Visualization and Computer Graphics 30(8), 5404–5418 (2024) . [CrossRef]
- Yang, H.-F., Tu, C.-H., Chen, C.-S.: Learning binary hash codes based on adaptable label representations. IEEE Transactions on Neural Networks and Learning Systems 33(11), 6961–6975 (2021).
- Zhang, Y., Xu, C., Yang, H.H., Wang, X., Quek, T.Q.: Dpp-based client selection for federated learning with non-iid data. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023). IEEE.
- JOY, U.: Customer Churn Dataset. IEEE Dataport (2023). [CrossRef]
- IBM: Telco customer churn. IBM (2019). https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113.
- Badole, S.: Banking Customer Churn Prediction Dataset. kaggle (2024). https://www.kaggle.com/datasets/saurabhbadole/bank-customer-churn-prediction-dataset/data.
- CrowdAnalytix: Why Customer Churn. CrowdAnalytix (2012). https://www.crowdanalytix.com/contests/why-customer-churn.
- McHugh, M.L.: The chi-square test of independence. Biochem Med (Zagreb) 23(2), 143–149 (2013). [CrossRef]
- Rückstieß, T., Osendorfer, C., Smagt, P.: Sequential feature selection for classification. In: Wang, D., Reynolds, M. (eds.) AI 2011: Advances in Artificial Intelligence, pp. 132–141. Springer, Berlin, Heidelberg (2011).
- Ross, B.C.: Mutual information between discrete and continuous data sets. PLOS ONE 9(2), 1–5 (2014) . [CrossRef]
- Çakır, M.Y., Şirin, Y.: Enhanced autoencoder-based fraud detection: a novel approach with noise factor encoding and smote. Knowledge and Information Systems 66(1), 635–652 (2024).
- Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of machine learning research 13(2) (2012).
- Sharma, M., Bilgic, M.: Evidence-based uncertainty sampling for active learning. Data Mining and Knowledge Discovery 31, 164–202 (2017).
- Hino, H., Eguchi, S.: Active learning by query by committee with robust divergences. Information Geometry 6(1), 81–106 (2023).
- Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B.B., Chen, X., Wang, X.: A survey of deep active learning. ACM computing surveys (CSUR) 54(9), 1–40 (2021).
- Kim, Y., Shin, B.: In defense of core-set: A density-aware core-set selection for active learning. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’22, pp. 804–812. Association for Computing Machinery, New York, NY, USA (2022). [CrossRef]
- Cacciarelli, D., Kulahci, M.: Active learning for data streams: a survey. Machine Learning 113(1), 185–239 (2024) . [CrossRef]
- Tanha, J., Van Someren, M., Afsarmanesh, H.: Semi-supervised self-training for decision tree classifiers. International Journal of Machine Learning and Cybernetics 8, 355–370 (2017).
- Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems 32 (2019).
- Rizve, M.N., Duarte, K., Rawat, Y.S., Shah, M.: In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=-ODN6SbiUU.
- Zhou, Z.-H., Li, M.: Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering 17(11), 1529–1541 (2005) . [CrossRef]
- Chen, X., Wang, T.: Combining active learning and semi-supervised learning by using selective label spreading. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 850–857 (2017). IEEE.
- Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30 (2017).
- Li, J., Xiong, C., Hoi, S.C.: Comatch: Semi-supervised learning with contrastive graph regularization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9475–9484 (2021).
- Sheng, G., Zhang, Z., Tang, X., Xie, K.: A first arrival picking method of microseismic signals based on semi-supervised learning using freematch and ms-picking. Computers & Geosciences 196, 105844 (2025).
- Montoya, A., inversion, KirillOdintsov, Kotek, M.: Home Credit Default Risk. https://kaggle.com/competitions/home-credit-default-risk. Kaggle (2018).










| Metric | Key Strengths | Limitations |
|---|---|---|
| KL Divergence | Sensitive to distribution shape | Asymmetric; unstable with low probabilities |
| TV Distance | Bounded and straightforward | Ignores underlying data geometry |
| Hellinger Distance | Symmetric; robust to small changes | Limited in capturing spatial structure |
| Wasserstein Distance | Integrates magnitude and geometry | Higher computational cost (in >1D) |
| Method | Adaptivity | Robustness | Complexity |
|---|---|---|---|
| Static Temperature Scaling | No | Low | |
| Platt Scaling | No | Moderate | |
| Isotonic Regression | No | High (data-dependent) | |
| Proposed Adaptive Calibration | Yes | Very High | |
| Method | Scalability | Diversity Guarantee | Additional Notes |
|---|---|---|---|
| Random Sampling | | Low | No explicit diversity |
| Clustering-Based Sampling (e.g., k-means) | | Moderate | Sensitive to initialization |
| DPP-based Sampling | | High | Computationally prohibitive for large n |
| Proposed Multi-Resolution Hashing | (excl. SVD) | Competitive | Scales well with large datasets |
| Method | Adaptivity | Robustness | Search/Optimization | Computational Cost |
|---|---|---|---|---|
| Fixed Threshold | No | Low | N/A | |
| Heuristic Tuning | Partial | Moderate | Manual/Ad hoc | |
| Grid Search | No | Moderate | Exhaustive Search | |
| Bayesian Optimization | Yes | High | Probabilistic Modeling | |
| Proposed Adaptive Method | Yes | Very High | Ternary Search over the threshold range | |
| Dataset | Key Characteristics | Selection Rationale |
|---|---|---|
| Sparkify | Large-scale streaming logs (12.5GB, 26M records, 22K users) | Tests model scalability and temporal behavior modeling |
| IBM Telco | Structured telecom customer data (7,043 customers, 36 features) | Baseline comparison with demographic and service-based features |
| Bank Churn | Banking customer data (10K records, 14 financial features) | Evaluates financial churn prediction and high-cardinality categorical data |
| CrowdAnalytix | High-dimensional service metrics | Assesses feature selection techniques on a large feature space |
| Dataset | Original Non-Churn | Original Churn | Balanced Non-Churn | Balanced Churn |
|---|---|---|---|---|
| Sparkify | 17,274 | 5,003 | 17,274 | 17,274 |
| IBM Telco | 5,174 | 1,869 | 5,174 | 5,174 |
| Bank Churn | 7,963 | 2,037 | 7,963 | 7,963 |
| CrowdAnalytix | 36,336 | 14,711 | 36,336 | 36,336 |
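The balanced counts above are consistent with oversampling the minority class to the majority size; a minimal sketch of that strategy (the actual resampler used in the pipeline may differ, e.g., SMOTE-style synthesis):

```python
import numpy as np

def balance_by_oversampling(X, y, seed=0):
    """Balance a dataset by resampling each minority class with
    replacement until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n_c in zip(classes, counts):
        members = np.flatnonzero(y == c)
        if n_c < n_max:                       # top up the minority class
            extra = rng.choice(members, size=n_max - n_c, replace=True)
            members = np.concatenate([members, extra])
        idx.append(members)
    idx = rng.permutation(np.concatenate(idx))
    return X[idx], y[idx]
```

Applied to a 7:3 split, this yields a 7:7 dataset, matching the pattern of the balanced columns.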
| Dataset | Model | AUC | F1 | TPR | FPR |
|---|---|---|---|---|---|
| Sparkify | Proposed Model | 0.9107 | 0.8296 | 0.8643 | 0.2177 |
| | Pseudo-Labeling | 0.8501 | 0.8569 | 0.8138 | 0.1001 |
| | Tri-Training | 0.8518 | 0.8576 | 0.8210 | 0.1058 |
| | Label Spreading | 0.7037 | 0.7198 | 0.6676 | 0.2281 |
| | Uncertainty Sampling | 0.9036 | 0.8299 | 0.8068 | 0.1336 |
| | Query-by-Committee | 0.8995 | 0.8194 | 0.7868 | 0.1326 |
| | Diversity Sampling | 0.8946 | 0.8018 | 0.7458 | 0.1136 |
| | Core Set Selection | 0.9022 | 0.8160 | 0.7754 | 0.1243 |
| | BAAL | 0.9106 | 0.8399 | 0.8068 | 0.1136 |
| | Self-Training | 0.8447 | 0.7470 | 0.7127 | 0.1941 |
| | MixMatch | 0.8403 | 0.7330 | 0.6633 | 0.1453 |
| | Mean Teacher | 0.8427 | 0.7470 | 0.6984 | 0.1701 |
| | CoMatch | 0.8174 | 0.7219 | 0.6915 | 0.2226 |
| IBM Telco | Proposed Model | 0.8902 | 0.8282 | 0.8389 | 0.1899 |
| | Pseudo-Labeling | 0.8413 | 0.8294 | 0.8945 | 0.2356 |
| | Tri-Training | 0.8434 | 0.8314 | 0.8984 | 0.2356 |
| | Label Spreading | 0.7909 | 0.7588 | 0.9012 | 0.3836 |
| | Uncertainty Sampling | 0.9333 | 0.8532 | 0.8495 | 0.1937 |
| | Query-by-Committee | 0.9262 | 0.8516 | 0.8696 | 0.1823 |
| | Diversity Sampling | 0.9272 | 0.8543 | 0.8571 | 0.1748 |
| | Core Set Selection | 0.9355 | 0.8632 | 0.8830 | 0.1864 |
| | BAAL | 0.9275 | 0.8595 | 0.8504 | 0.1791 |
| | Self-Training | 0.8664 | 0.8068 | 0.7948 | 0.1876 |
| | MixMatch | 0.8249 | 0.7276 | 0.6836 | 0.2013 |
| | Mean Teacher | 0.6899 | 0.6469 | 0.6472 | 0.3621 |
| | CoMatch | 0.8893 | 0.7896 | 0.7287 | 0.2188 |
| Banking | Proposed Model | 0.8417 | 0.7630 | 0.7907 | 0.2682 |
| | Pseudo-Labeling | 0.7143 | 0.7211 | 0.7147 | 0.2715 |
| | Tri-Training | 0.7150 | 0.7220 | 0.7147 | 0.2707 |
| | Label Spreading | 0.6981 | 0.7025 | 0.7057 | 0.3007 |
| | Query-by-Committee | 0.8600 | 0.7725 | 0.8036 | 0.2671 |
| | Diversity Sampling | 0.8564 | 0.7873 | 0.8281 | 0.2686 |
| | Core Set Selection | 0.8557 | 0.7883 | 0.8152 | 0.2928 |
| | BAAL | 0.8572 | 0.7794 | 0.8088 | 0.2696 |
| | Self-Training | 0.8491 | 0.7756 | 0.8100 | 0.2819 |
| | MixMatch | 0.8416 | 0.7629 | 0.7906 | 0.2731 |
| | Mean Teacher | 0.8415 | 0.7628 | 0.7905 | 0.2793 |
| | CoMatch | 0.7431 | 0.6867 | 0.7366 | 0.3889 |
| CrowdAnalytix | Proposed Model | 0.7065 | 0.6701 | 0.9781 | 0.9547 |
| | Pseudo-Labeling | 0.6119 | 0.6912 | 0.4859 | 0.1035 |
| | Tri-Training | 0.6927 | 0.7254 | 0.6163 | 0.1655 |
| | Label Spreading | 0.5443 | 0.5627 | 0.5191 | 0.3937 |
| | Uncertainty Sampling | 0.7214 | 0.6447 | 0.5975 | 0.2597 |
| | Query-by-Committee | 0.7142 | 0.6378 | 0.5927 | 0.2695 |
| | Diversity Sampling | 0.7137 | 0.6355 | 0.5881 | 0.2664 |
| | Core Set Selection | 0.7160 | 0.6373 | 0.5945 | 0.2749 |
| | BAAL | 0.7214 | 0.6447 | 0.5975 | 0.2597 |
| | Self-Training | 0.6853 | 0.6111 | 0.5699 | 0.2995 |
| | MixMatch | 0.5763 | 0.5704 | 0.5848 | 0.4724 |
| | Mean Teacher | 0.5722 | 0.5382 | 0.5156 | 0.4063 |
| | CoMatch | 0.5769 | 0.5615 | 0.5707 | 0.4688 |
| Dataset | Model | AUC | F1 | TPR | FPR |
|---|---|---|---|---|---|
| Sparkify | Proposed Model | 0.9326 | 0.8650 | 0.8472 | 0.1107 |
| | Pseudo-Labeling | 0.8633 | 0.8697 | 0.8245 | 0.0851 |
| | Tri-Training | 0.8634 | 0.8676 | 0.8393 | 0.1041 |
| | Label Spreading | 0.7371 | 0.7456 | 0.7156 | 0.2243 |
| | Uncertainty Sampling | 0.9236 | 0.8487 | 0.8178 | 0.1087 |
| | Query-by-Committee | 0.9180 | 0.8372 | 0.8065 | 0.1194 |
| | Diversity Sampling | 0.9179 | 0.8349 | 0.7960 | 0.1101 |
| | Core Set Selection | 0.9188 | 0.8390 | 0.8094 | 0.1191 |
| | BAAL | 0.9236 | 0.8487 | 0.8178 | 0.1087 |
| | Self-Training | 0.8833 | 0.7825 | 0.7467 | 0.1606 |
| | MixMatch | 0.8526 | 0.7631 | 0.7382 | 0.1952 |
| | Mean Teacher | 0.8514 | 0.7687 | 0.7801 | 0.2477 |
| | CoMatch | 0.8071 | 0.7284 | 0.7281 | 0.2690 |
| IBM Telco | Proposed Model | 0.9232 | 0.8445 | 0.8667 | 0.1889 |
| | Pseudo-Labeling | 0.8445 | 0.8349 | 0.8366 | 0.2181 |
| | Tri-Training | 0.8441 | 0.8359 | 0.8210 | 0.2084 |
| | Label Spreading | 0.8046 | 0.7843 | 0.8600 | 0.3096 |
| | Uncertainty Sampling | 0.9384 | 0.8401 | 0.8582 | 0.1890 |
| | Query-by-Committee | 0.9381 | 0.8344 | 0.8155 | 0.1801 |
| | Diversity Sampling | 0.9335 | 0.8378 | 0.8541 | 0.1990 |
| | Core Set Selection | 0.9364 | 0.8320 | 0.8619 | 0.1894 |
| | BAAL | 0.9396 | 0.8299 | 0.8793 | 0.2098 |
| | Self-Training | 0.9266 | 0.8629 | 0.8593 | 0.2195 |
| | MixMatch | 0.9255 | 0.8427 | 0.8217 | 0.1996 |
| | Mean Teacher | 0.9328 | 0.8444 | 0.8266 | 0.2217 |
| | CoMatch | 0.8946 | 0.8054 | 0.7661 | 0.1383 |
| Banking | Proposed Model | 0.8998 | 0.8185 | 0.8332 | 0.1929 |
| | Pseudo-Labeling | 0.7793 | 0.7846 | 0.7798 | 0.2107 |
| | Tri-Training | 0.7790 | 0.7845 | 0.7785 | 0.2094 |
| | Label Spreading | 0.7626 | 0.7674 | 0.7663 | 0.2315 |
| | Uncertainty Sampling | 0.8572 | 0.7794 | 0.8088 | 0.2535 |
| | Query-by-Committee | 0.8600 | 0.7725 | 0.8036 | 0.2633 |
| | Diversity Sampling | 0.8564 | 0.7873 | 0.8281 | 0.2621 |
| | Core Set Selection | 0.8557 | 0.7883 | 0.8152 | 0.2407 |
| | BAAL | 0.8572 | 0.7794 | 0.8088 | 0.2535 |
| | Self-Training | 0.8491 | 0.7756 | 0.8100 | 0.2652 |
| | MixMatch | 0.8925 | 0.8072 | 0.8545 | 0.2498 |
| | Mean Teacher | 0.8864 | 0.8010 | 0.8487 | 0.2572 |
| | CoMatch | 0.7361 | 0.6823 | 0.7592 | 0.4434 |
| CrowdAnalytix | Proposed Model | 0.7948 | 0.7137 | 0.5801 | 0.0463 |
| | Pseudo-Labeling | 0.5283 | 0.6570 | 0.3835 | 0.0690 |
| | Tri-Training | 0.6811 | 0.7324 | 0.5700 | 0.1052 |
| | Label Spreading | 0.5637 | 0.5773 | 0.5426 | 0.3880 |
| | Uncertainty Sampling | 0.7348 | 0.6591 | 0.6117 | 0.2481 |
| | Query-by-Committee | 0.7307 | 0.6546 | 0.6099 | 0.2571 |
| | Diversity Sampling | 0.7261 | 0.6488 | 0.6019 | 0.2569 |
| | Core Set Selection | 0.7322 | 0.6578 | 0.6180 | 0.2646 |
| | BAAL | 0.7348 | 0.6591 | 0.6117 | 0.2481 |
| | Self-Training | 0.7117 | 0.6170 | 0.5437 | 0.2216 |
| | MixMatch | 0.5955 | 0.5509 | 0.5215 | 0.3771 |
| | Mean Teacher | 0.6022 | 0.5720 | 0.5621 | 0.4090 |
| | CoMatch | 0.5772 | 0.5409 | 0.5258 | 0.4241 |
| Dataset | Execution Time, 5% Labeled (s) | Execution Time, 10% Labeled (s) |
|---|---|---|
| Sparkify Churn Dataset | 89.39 | 82.67 |
| IBM Telecom Churn Dataset | 78.24 | 19.73 |
| Bank Churn Dataset | 11.84 | 27.17 |
| CrowdAnalytix Telecom Churn Dataset | 220.41 | 79.30 |
| Noise Level | F1 Score | AUC | TPR | FPR |
|---|---|---|---|---|
| 0% | 0.8469 | 0.9192 | 0.8170 | 0.1116 |
| 5% | 0.8333 | 0.9089 | 0.8164 | 0.1419 |
| 10% | 0.7912 | 0.8874 | 0.7237 | 0.1050 |
| 15% | 0.7911 | 0.8809 | 0.7382 | 0.1272 |
| Variant | Final AUC | Final F1 | Final TPR | Final FPR | Execution Time (s) |
|---|---|---|---|---|---|
| Proposed Model | 0.9107 | 0.8296 | 0.8643 | 0.2177 | 89.39 |
| No WCE | 0.5322 | 0.6650 | 1.0000 | 0.4981 | 128.41 |
| No ATO | 0.5339 | 0.5517 | 0.5997 | 0.5108 | 101.97 |
| No Diversity | 0.5522 | 0.6649 | 0.9997 | 0.4981 | 123.43 |
| Variant | Final AUC | Final F1 | Final TPR | Final FPR | Execution Time (s) |
|---|---|---|---|---|---|
| Proposed Model | 0.9326 | 0.8650 | 0.8472 | 0.1107 | 82.67 |
| No WCE | 0.4632 | 0.6646 | 0.9988 | 0.4980 | 82.45 |
| No ATO | 0.5007 | 0.4719 | 0.4372 | 0.5124 | 70.34 |
| No Diversity | 0.4711 | 0.6643 | 0.9983 | 0.4978 | 80.24 |
| Comparison | Wilcoxon Signed-Rank | Bootstrapped AUC Difference | Cohen’s d |
|---|---|---|---|
| Proposed vs. Pseudo-Labeling | (0.0000, 0.0000) | 0.0604 [0.0599, 0.0608] | 12.2104 |
| Proposed vs. BAAL | (245398.0000, 0.5953) | 0.0002 [-0.0002, 0.0006] | 0.0341 |
| Proposed vs. Core Set | (20555.0000, 0.0000) | 0.0087 [0.0083, 0.0091] | 1.7321 |
| Proposed vs. CoMatch | (0.0000, 0.0000) | 0.0937 [0.0932, 0.0941] | 18.9980 |
| Model | AUC | F1 | TPR | FPR |
|---|---|---|---|---|
| FreeMatch | 0.5703 ± 0.0029 | 0.0506 ± 0.0439 | 0.0359 ± 0.0322 | 0.0212 ± 0.0191 |
| Proposed | 0.5566 ± 0.0199 | 0.1291 ± 0.0225 | 0.4521 ± 0.2941 | 0.2117 ± 0.1116 |
| Seed | F1 | AUC | TPR | FPR |
|---|---|---|---|---|
| 42 | 0.0435 ± 0.0018 | 0.5571 ± 0.0027 | 0.0266 ± 0.0012 | 0.0172 ± 0.0002 |
| 123 | 0.0284 ± 0.0034 | 0.5542 ± 0.0015 | 0.0162 ± 0.0020 | 0.0106 ± 0.0002 |
| 456 | 0.0005 ± 0.0004 | 0.5418 ± 0.0038 | 0.0002 ± 0.0002 | 0.0002 ± 0.0000 |
| Seed | F1 | AUC | TPR | FPR |
|---|---|---|---|---|
| 42 | 0.1023 ± 0.0011 | 0.5347 ± 0.0009 | 0.1063 ± 0.0013 | 0.0850 ± 0.0004 |
| 123 | 0.1450 ± 0.0000 | 0.5600 ± 0.0008 | 0.6500 ± 0.0000 | 0.3000 ± 0.0000 |
| 456 | 0.1400 ± 0.0000 | 0.5550 ± 0.0016 | 0.6000 ± 0.0000 | 0.2500 ± 0.0000 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).