Submitted:
17 September 2025
Posted:
19 September 2025
Abstract
Keywords:
1. Introduction
- How does the amount of failure data influence prediction performance?
- Which preprocessing method is most suitable for dealing with unbalanced classification in the context of valve plate failure prediction?
- What are the limitations of the use of data balancing methods?
- How do oversampling strategies influence predictions, as revealed by the changes in feature importance across the different methods?
2. Related Work
2.1. Machine Learning in Pump Failures Prediction
2.2. Imbalanced Data Classification
- Data-Level Methods: The first group of methods focuses on direct modification of the dataset before it is used to train a model. These techniques can be further divided into three sub-groups: undersampling, oversampling, and hybrid methods.
- Undersampling
- Undersampling techniques aim to reduce the number of samples in the majority class to match the number of samples in the minority class. While simple random undersampling removes majority samples at random, more sophisticated methods make the process more strategic, typically by relying on nearest neighbor analysis. Among the most popular are the Edited Nearest Neighbor (ENN) algorithm [17] and the Tomek Links algorithm [18]. The ENN algorithm does not guarantee a perfectly balanced dataset, but it effectively prunes noisy and border samples. It examines each majority class sample and marks it for removal if it is misclassified by its k nearest neighbors from the entire training set. The Tomek Links algorithm also prunes border samples from the majority class, but in a different way. It begins by identifying "Tomek Links," which are pairs of samples from opposite classes that are each other’s nearest neighbors (e.g., A is the nearest neighbor of B, and B is the nearest neighbor of A). After such links are identified, all majority class samples involved are removed, thereby cleaning the border between the classes. These methods have also served as a foundation for other techniques, such as RIUS [19], which focuses on retaining the most relevant majority samples while discarding the rest.
- Oversampling
- These methods focus on increasing the number of samples in the minority classes. This can be achieved naively by sampling with replacement until a desired number of samples is reached. However, more sophisticated methods have been developed, most notably the SMOTE family of algorithms [20]. The basic concept behind SMOTE is to generate a new synthetic sample by randomly selecting a minority class sample (sample A), then randomly selecting one of its k nearest neighbors (sample B) from the same class, and placing the new sample at a random position on the line segment connecting A and B. The SMOTE family has undergone rapid development, with advancements such as Borderline-SMOTE [21,22], which concentrates sample generation in the crucial border areas, k-means SMOTE [23], and SMOTE with approximate nearest neighbors [24]. Another popular oversampling method is Adaptive Synthetic Sampling (ADASYN) [25,26]. The idea behind ADASYN is to shift the learning algorithm’s focus toward difficult minority instances that lie near the decision boundary. It works similarly to SMOTE but first identifies these hard-to-classify minority samples and then generates a greater number of new samples around them to bolster their representation.
- Hybrid Methods
- Algorithm-Level Methods: This group of methods consists of algorithm-level modifications that make the learning model more sensitive to the minority class without altering the dataset itself. The most standard approach within this group is cost-sensitive learning, which uses class weights to modify the decision function. For example, popular models such as XGBoost [29], LightGBM [27], and SVM [30] provide parameters for class weights or cost function modification that allow the user to prioritize the minority class during training.
- Ensemble Methods: Ensemble methods build prediction models by combining multiple submodels into a single, robust model. Each submodel is trained on balanced or nearly balanced data. A popular member of this group is the Bagging-Based Ensemble, which includes the Balanced Random Forest algorithm. This method adapts the standard Random Forest by building the training set for each tree in the forest through balanced undersampling and oversampling, ensuring that each individual tree is trained on a more balanced subset of the data. This strategy can also be applied to other base classifiers, such as neural networks. Additionally, Boosting-Based Ensembles, such as AdaBoost, XGBoost, and LightGBM, often inherently handle imbalanced data. These algorithms concentrate each subsequent iteration on the samples that the current model handles poorly (AdaBoost by increasing the weights of misclassified samples, gradient boosting by fitting the remaining errors), forcing the model to focus on the more difficult, and often minority, samples.
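The two neighbor-based undersampling rules described above can be illustrated with a minimal pure-Python sketch. It follows the textual description only: the function names, the Euclidean distance, `k=3`, the tie handling, and the toy points are assumptions made for illustration and are not taken from the authors' scripts.

```python
from math import dist


def nearest_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i], excluding i itself."""
    others = (j for j in range(len(X)) if j != i)
    return sorted(others, key=lambda j: dist(X[i], X[j]))[:k]


def tomek_link_majority(X, y, majority):
    """Majority-class indices that take part in a Tomek link."""
    to_remove = set()
    for i in range(len(X)):
        j = nearest_indices(X, i, 1)[0]
        # Tomek link: i and j are mutual nearest neighbours with different labels.
        if y[i] != y[j] and nearest_indices(X, j, 1)[0] == i:
            to_remove.add(i if y[i] == majority else j)
    return sorted(to_remove)


def enn_majority(X, y, majority, k=3):
    """Majority-class indices outvoted by their k nearest neighbours."""
    to_remove = []
    for i in range(len(X)):
        if y[i] != majority:
            continue
        votes = [y[j] for j in nearest_indices(X, i, k)]
        # Assumption: a sample is misclassified when fewer than half of its
        # neighbours share its label; exact ties keep the sample.
        if votes.count(majority) * 2 < k:
            to_remove.append(i)
    return to_remove


# Toy data (hypothetical): label 0 is the majority class, label 1 the minority.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, 3.0),
     (3.1, 3.0), (3.0, 3.5), (3.6, 3.0)]
y = [0, 0, 0, 0, 0, 1, 1, 1]

tomek_removed = tomek_link_majority(X, y, majority=0)
enn_removed = enn_majority(X, y, majority=0, k=3)
```

In this toy set, the only majority sample flagged by either rule is the one sitting inside the minority cluster (index 4). Neither rule removes enough samples to balance the classes, which matches the remark that ENN does not guarantee a perfectly balanced dataset.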
3. Experiment Setup
3.1. Test Bench
- Sensor 1 horizontally, perpendicular to the pump shaft
- Sensor 2 vertically, perpendicular to the pump shaft
- Sensor 3 along the pump shaft
| Dataset | Number of records | Pump condition | Class |
|---|---|---|---|
| OT | 7237 | No failure | Negative |
| UT1 | 16272 | Damaged valve plate - failure 1 | Positive |
| UT2 | 22593 | Damaged valve plate - failure 2 | Positive |
| UT3 | 21632 | Damaged valve plate - failure 3 | Positive |
3.2. Datasets Used in the Experiments
3.3. Model Evaluation Method
3.4. Methods Used in the Experiments
- Undersampling
  - ENN,
  - Tomek Links,
- Oversampling
  - SMOTE,
  - Borderline-SMOTE,
  - ADASYN,
- Hybrid
  - SMOTE+ENN,
  - SMOTE+Tomek Links.
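The interpolation step that the SMOTE-based methods in this list build on (Section 2.2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `smote`, the parameters `n_new`, `k`, and `seed`, the Euclidean distance, and the toy points are all hypothetical.

```python
import random
from math import dist


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Sample A: a randomly chosen minority point.
        i = rng.randrange(len(minority))
        a = minority[i]
        # Sample B: one of A's k nearest minority neighbours, chosen at random.
        others = (j for j in range(len(minority)) if j != i)
        neighbours = sorted(others, key=lambda j: dist(a, minority[j]))[:k]
        b = minority[rng.choice(neighbours)]
        # New point: a random position on the segment connecting A and B.
        gap = rng.random()
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


# Toy minority class (hypothetical): the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5, k=2)
```

Because every synthetic point lies between two existing minority points, the generated samples never leave the region already spanned by the minority class. With very few failure samples, that region may cover only a small part of the space in which failures can occur.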
3.5. Basic Analysis
3.6. Dataset-Balancing Models Comparison
3.7. Metrics Used for Model Evaluation
- Precision is the number of true positives divided by the number of true positives plus false positives. It answers the question: "Of all the instances predicted as positive, how many were actually positive?"
- Recall is the number of true positives divided by the number of true positives plus false negatives. It answers the question: "Of all the instances that were actually positive, how many did the model correctly identify?"
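As a worked illustration of the two definitions above, the following minimal sketch counts true positives, false positives, and false negatives directly. The function name, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of the positive class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example (hypothetical): 4 actual failures, 5 predicted failures.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here three of the five predicted failures are real (precision 0.6), and three of the four real failures are found (recall 0.75).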
3.8. Tools Used in the Experiments
4. Results and Discussion
4.1. The Influence of Data Imbalance on Model’s Prediction Performance
4.2. Comparison of Data Balancing Methods
4.2.1. Performance on
4.2.2. Performance on and
4.2.3. Discussion
4.3. Feature Importance Analysis
5. Conclusions
- The assessment of data-balancing methods using a standard cross-validation procedure should be interpreted with caution. The results can be misleading, especially when dealing with extremely low balance rates, as the test set may not accurately represent the data distribution of the entire space in which the model may operate.
- The lowest effective balance rate at which data-balancing methods provided benefits ranged from 5% to 1%. While for , the methods did not yield significant gains at 1%, for , the performance improvement remained substantial.
- For the lowest balance rates (), the use of data-balancing methods proved detrimental to the models, leading to a decrease in overall prediction performance.
- The use of data-balancing methods, particularly oversampling and hybrid methods, causes small changes to the knowledge extracted from the data. As the balance rate decreases, the oversampling methods become unable to preserve the true data distribution, leading to a gradual decrease in performance. This effect is visible as a decline in the feature importance correlation plot.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
1. Carvalho, T.P.; Soares, F.A.; Vita, R.; Francisco, R.d.P.; Basto, J.P.; Alcalá, S.G. A systematic literature review of machine learning methods applied to predictive maintenance. Computers & Industrial Engineering 2019, 137, 106024.
2. Rojek, M.; Blachnik, M. A Dataset and a Comparison of Classification Methods for Valve Plate Fault Prediction of Piston Pump. Applied Sciences 2024, 14, 7183.
3. Bykov, A.; Voronov, V.; Voronova, L. Machine learning methods applying for hydraulic system states classification. In Proceedings of the 2019 Systems of Signals Generating and Processing in the Field of on Board Communications. IEEE, 2019, pp. 1–4.
4. Tang, S.; Zhu, Y.; Yuan, S. A novel adaptive convolutional neural network for fault diagnosis of hydraulic piston pump with acoustic images. Advanced Engineering Informatics 2022, 52, 101554.
5. Tang, S.; Zhu, Y.; Yuan, S. An improved convolutional neural network with an adaptable learning rate towards multi-signal fault diagnosis of hydraulic piston pump. Advanced Engineering Informatics 2021, 50, 101406.
6. Tang, S.; Khoo, B.C.; Zhu, Y.; Lim, K.M.; Yuan, S. A light deep adaptive framework toward fault diagnosis of a hydraulic piston pump. Applied Acoustics 2024, 217, 109807.
7. Tang, S.; Zhu, Y.; Yuan, S. Intelligent fault diagnosis of hydraulic piston pump based on deep learning and Bayesian optimization. ISA transactions 2022, 129, 555–563.
8. Guo, R.; Li, Y.; Zhao, L.; Zhao, J.; Gao, D. Remaining useful life prediction based on the Bayesian regularized radial basis function neural network for an external gear pump. IEEE Access 2020, 8, 107498–107509.
9. Li, Z.; Jiang, W.; Zhang, S.; Xue, D.; Zhang, S. Research on prediction method of hydraulic pump remaining useful life based on KPCA and JITL. Applied Sciences 2021, 11, 9389.
10. Yu, H.; Li, H. Pump remaining useful life prediction based on multi-source fusion and monotonicity-constrained particle filtering. Mechanical Systems and Signal Processing 2022, 170, 108851.
11. Sharma, A.K.; Punj, P.; Kumar, N.; Das, A.K.; Kumar, A. Lifetime prediction of a hydraulic pump using ARIMA model. Arabian Journal for Science and Engineering 2024, 49, 1713–1725.
12. Ding, Y.; Ma, L.; Wang, C.; Tao, L. An EWT-PCA and extreme learning machine based diagnosis approach for hydraulic pump. IFAC-PapersOnLine 2020, 53, 43–47.
13. Buabeng, A.; Simons, A.; Frempong, N.K.; Ziggah, Y.Y. Hybrid intelligent predictive maintenance model for multiclass fault classification. Soft Computing 2023, pp. 1–22.
14. Shao, Y.; Chao, Q.; Xia, P.; Liu, C. Fault severity recognition in axial piston pumps using attention-based adversarial discriminative domain adaptation neural network. Physica Scripta 2024, 99, 056009.
15. Surucu, O.; Gadsden, S.A.; Yawney, J. Condition monitoring using machine learning: A review of theory, applications, and recent advances. Expert Systems with Applications 2023, 221, 119738.
16. Kaur, H.; Pannu, H.S.; Malhi, A.K. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM computing surveys (CSUR) 2019, 52, 1–36.
17. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 2007, pp. 408–421.
18. Tomek, I. Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics 1976, 6, 769–772.
19. Hoyos-Osorio, J.; Alvarez-Meza, A.; Daza-Santacoloma, G.; Orozco-Gutierrez, A.; Castellanos-Dominguez, G. Relevant information undersampling to support imbalanced data classification. Neurocomputing 2021, 436, 136–146.
20. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of artificial intelligence research 2018, 61, 863–905.
21. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In Proceedings of the International conference on intelligent computing. Springer, 2005, pp. 878–887.
22. Sun, Y.; Que, H.; Cai, Q.; Zhao, J.; Li, J.; Kong, Z.; Wang, S. Borderline smote algorithm and feature selection-based network anomalies detection strategy. Energies 2022, 15, 4751.
23. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information sciences 2018, 465, 1–20.
24. Juez-Gil, M.; Arnaiz-Gonzalez, A.; Rodriguez, J.J.; Lopez-Nozal, C.; Garcia-Osorio, C. Approx-SMOTE: fast SMOTE for big data on apache spark. Neurocomputing 2021, 464, 432–437.
25. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE, 2008, pp. 1322–1328.
26. Tekkali, C.G.; Natarajan, K. An advancement in AdaSyn for imbalanced learning: An application to fraud detection in digital transactions. Journal of Intelligent & Fuzzy Systems 2024, 46, 11381–11396.
27. Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 2022, 22, 3246.
28. Nizam-Ozogur, H.; Orman, Z. A heuristic-based hybrid sampling method using a combination of SMOTE and ENN for imbalanced health data. Expert Systems 2024, 41, e13596.
29. Wang, C.; Deng, C.; Wang, S. Imbalance-XGBoost: leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost. Pattern recognition letters 2020, 136, 190–197.
30. Yang, C.Y.; Yang, J.S.; Wang, J.J. Margin calibration in SVM class-imbalanced learning. Neurocomputing 2009, 73, 397–411, Timely Developments in Applied Neural Computing (EANN 2007)/Some Novel Analysis and Learning Methods for Neural Networks (ISNN 2008)/Pattern Recognition in Graphical Domains.
31. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830.
32. Lemaître, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 2017, 18, 1–5.
33. Marcin, R.; Marcin, B. Research scripts. https://github.com/mblachnik/2025_Data_balancers_pumps.











| Pos. | Description | Symbol | Manufacturer |
|---|---|---|---|
| 1 | Temperature sensor Pt1000 150 °C | TA2105 | IFM, Germany |
| 2 | Pressure sensor -1…1 bar | PA3509 | IFM, Germany |
| 3 | Electric motor 37 kW | FCMP 225S-4/PHE | AC-Motoren, Germany |
| 4 | Temperature sensor Pt1000 150 °C | TA2105 | IFM, Germany |
| 5 | Pressure sensor 400 bar | PT5400 | IFM, Germany |
| 6 | Turbine flow meter | PPC-04/12-SFM-015 | Stauff, Germany |
| 7 | Check valve | S8A1.0 | Ponar, Poland |
| 8 | Temperature sensor Pt1000 150 °C | TA2105 | IFM, Germany |
| 9 | Pressure sensor 10 bar | PT5404 | IFM, Germany |
| 10 | Piston pump | HSP10VO45DFR | Hydraut, Italy |
| 11 | Gear wheel flow meter | DZR-10155 | Kobold, Germany |
| 12 | Temperature sensor Pt1000 150 °C | TA2105 | IFM, Germany |
| 13 | Pressure sensor 10 bar | PT5404 | IFM, Germany |
| 14 | Hydraulic motor | F12 060 MF | Parker, USA |
| 15 | Torque meter | T22/1KNM | HBM, Germany |
| 16 | Electric motor 170 kW | LSRPM250ME1 | Emerson, USA |
| 17 | Filter | FS1 | Ponar, Poland |
| 18 | Pressure sensor 250 bar | PT5401 | IFM, Germany |
| 19 | Temperature sensor Pt1000 150 °C | TA2105 | IFM, Germany |
| 20 | Vibration sensors 1 | VSA001 | IFM, Germany |
| 21 | Vibration diagnostic converter 1 | VSE100 | IFM, Germany |
| Dataset | Records Normal | Records Failure | Records Total |
|---|---|---|---|
| | 4349 | 4349 | 8698 |
| | 4349 | 2174 | 6523 |
| | 4349 | 1087 | 5436 |
| | 4349 | 434 | 4783 |
| | 4349 | 217 | 4566 |
| | 4349 | 130 | 4479 |
| | 4349 | 43 | 4392 |
| | 4349 | 21 | 4370 |
| | 2887 | 2887 | 5774 |
| | 2887 | 2887 | 5774 |
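The record counts in the first eight rows of the table above can be turned into balance rates with a few lines of arithmetic. The sketch below assumes that "balance rate" means the ratio of failure records to normal records, which is an interpretation of the table rather than a definition quoted from the text. It also shows the class weights that a cost-sensitive learner would receive under the common "balanced" heuristic `n_samples / (n_classes * n_class_samples)`; the use of that heuristic here is purely illustrative.

```python
# (normal, failure) record counts copied from the first eight table rows.
rows = [(4349, 4349), (4349, 2174), (4349, 1087), (4349, 434),
        (4349, 217), (4349, 130), (4349, 43), (4349, 21)]

balance_rates = []    # failure / normal, in percent (assumed definition)
failure_weights = []  # "balanced" class weight of the failure class

for normal, failure in rows:
    total = normal + failure
    balance_rates.append(round(100 * failure / normal, 1))
    # Balanced heuristic for two classes: total / (2 * class count).
    failure_weights.append(total / (2 * failure))

for (normal, failure), rate, weight in zip(rows, balance_rates, failure_weights):
    print(f"{failure:>5} failures: balance rate {rate:>5}% "
          f"-> failure class weight {weight:.1f}")
```

Under these assumptions, the rows correspond to balance rates of roughly 100%, 50%, 25%, 10%, 5%, 3%, 1%, and 0.5%, and the weight needed to compensate for the imbalance grows from 1 to more than 100 at the lowest rate.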
| Model | | | |
|---|---|---|---|
| MLP((40,20)) | 90.14 | 89.50 | 71.56 |
| Model | | | | Rank | Rank | Rank | Average Rank |
|---|---|---|---|---|---|---|---|
| Borderline-SMOTE | 0.99461 | 0.86911 | 0.69706 | 2.0 | 1.0 | 3.0 | 2.0 |
| SMOTE+Tomek Links | 0.99441 | 0.86634 | 0.70032 | 4.0 | 2.0 | 1.0 | 2.3 |
| SMOTE | 0.99464 | 0.86627 | 0.69471 | 1.0 | 3.0 | 5.0 | 3.0 |
| ENN | 0.99433 | 0.86284 | 0.69898 | 6.0 | 7.0 | 2.0 | 5.0 |
| SMOTE+ENN | 0.99449 | 0.86327 | 0.69376 | 3.0 | 4.0 | 8.0 | 5.0 |
| ADASYN | 0.99438 | 0.86037 | 0.69580 | 5.0 | 8.0 | 4.0 | 5.7 |
| no balancer | 0.99406 | 0.86320 | 0.69456 | 7.0 | 5.5 | 6.5 | 6.3 |
| Tomek Links | 0.99400 | 0.86320 | 0.69456 | 8.0 | 5.5 | 6.5 | 6.7 |
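The rank columns of the table above can be reproduced from the three score columns. The sketch below assumes that higher scores are better, that tied scores share the mean of the ranks they occupy, and that the average rank is rounded to one decimal place; the function name `average_ranks` is hypothetical.

```python
def average_ranks(scores):
    """Rank scores in descending order (1 = best); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


# Scores copied from the three unnamed score columns of the table.
table = {
    "Borderline-SMOTE": (0.99461, 0.86911, 0.69706),
    "SMOTE+Tomek Links": (0.99441, 0.86634, 0.70032),
    "SMOTE": (0.99464, 0.86627, 0.69471),
    "ENN": (0.99433, 0.86284, 0.69898),
    "SMOTE+ENN": (0.99449, 0.86327, 0.69376),
    "ADASYN": (0.99438, 0.86037, 0.69580),
    "no balancer": (0.99406, 0.86320, 0.69456),
    "Tomek Links": (0.99400, 0.86320, 0.69456),
}

names = list(table)
per_column = [average_ranks([table[n][c] for n in names]) for c in range(3)]
mean_rank = {n: round(sum(col[i] for col in per_column) / 3, 1)
             for i, n in enumerate(names)}
```

The tied scores of "no balancer" and "Tomek Links" in the second and third columns are what produce the fractional ranks 5.5 and 6.5.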
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).