Submitted: 28 March 2024
Posted: 29 March 2024
Abstract
Keywords:
1. Introduction
- The evolution of code smells.
- The definition of code smells.
- The understanding, interest, and ability of software development teams to fix code smells.
- The effect on source code before and after fixing code smells.
2. Related Works
3. Methodology
3.1. Proposed Model Structure
3.1.1. First Model (3-Layer Stacked CNN-BiLSTM-LSTM)
- Layer 1: Input layer
- Layer 2 (permute layer): Used to permute the dimensions of the input.
- Layer 3 (convolutional 1D layer): Consists of 128 filters with a kernel size of 3 and padding of 1. This layer reads input data from the input layer (layer 1) and sends its output to the convolutional layer (layer 5).
- Layer 4 (convolutional 1D layer): Has 128 filters with a kernel size of 3 and padding of 1. This layer reads input data from the permute layer (layer 2) and sends its output to the convolutional layer (layer 6).
- Layer 5 (convolutional 1D layer): Has 64 filters with a kernel size of 1 and padding of 1. This layer mirrors layer 6 and reads input data from the convolutional layer (layer 3), sending its output to the BiLSTM layer (layer 7).
- Layer 6 (convolutional 1D layer): Has 64 filters with a kernel size of 1 and padding of 1. This layer reads input data from the convolutional layer (layer 4) and sends its output to the BiLSTM layer (layer 8).
- Layer 7 (BiLSTM layer): Has 128 units and serves as the integrated layer for convolutional layer 5. It reads input data from the convolutional layer (layer 5) and sends its output to the LSTM layer (layer 9).
- Layer 8 (BiLSTM layer): Has 128 units and serves as the integrated layer for convolutional layer 6. It reads input data from the convolutional layer (layer 6) and sends its output to the LSTM layer (layer 10).
- Layer 9 (LSTM layer): Has 128 units and serves as the integrated layer for BiLSTM layer 7. It reads input data from the BiLSTM layer (layer 7) and sends its output to the concatenate layer (layer 11).
- Layer 10 (LSTM layer): Has 128 units and serves as the integrated layer for BiLSTM layer 8. It reads input data from the BiLSTM layer (layer 8) and sends its output to the concatenate layer (layer 11).
- Layer 11 (encoded columns): A concatenate layer used to create a dense feature map. It concatenates the feature maps obtained from LSTM layers (layer 9 and layer 10) and passes them to dropout layer (layer 12).
- Layer 12 (dropout layer): Applies a dropout rate of 20% to encoded columns (layer 11) to prevent overfitting by randomly setting 20% of the input units to 0 during training.
- Layer 13 (dense layer): A fully connected layer where neurons receive inputs from the dropout layer (layer 12) and predict class probabilities for each input sample. This layer uses the softmax activation function in model 1.
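
The wiring of the first model can be written down as a small dependency graph, which makes the two parallel branches easier to follow. The sketch below is illustrative only: the dictionary name `MODEL1_INPUTS` and the helper `branch_of` are hypothetical, and the edges assume that the permute layer reads from the input layer, that each branch runs Conv1D, Conv1D, BiLSTM, LSTM, and that the dense layer reads from the dropout layer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed wiring: each key is a layer number, each value lists the layers it reads from.
MODEL1_INPUTS = {
    1: [],        # input layer
    2: [1],       # permute layer (assumed to read from the input)
    3: [1],       # Conv1D, 128 filters, kernel size 3
    4: [2],       # Conv1D, 128 filters, kernel size 3
    5: [3],       # Conv1D, 64 filters, kernel size 1
    6: [4],       # Conv1D, 64 filters, kernel size 1
    7: [5],       # BiLSTM
    8: [6],       # BiLSTM
    9: [7],       # LSTM
    10: [8],      # LSTM
    11: [9, 10],  # concatenate ("encoded columns")
    12: [11],     # dropout, rate 0.2
    13: [12],     # dense layer with softmax
}


def branch_of(layer, graph):
    """Follow the first input of each layer back to the input layer."""
    path = [layer]
    while graph[path[-1]]:
        path.append(graph[path[-1]][0])
    return path[::-1]


# A valid evaluation order exists only if the wiring has no cycles.
order = list(TopologicalSorter(MODEL1_INPUTS).static_order())

print("evaluation order:", order)
print("branch A:", branch_of(9, MODEL1_INPUTS))
print("branch B:", branch_of(10, MODEL1_INPUTS))
```

Under these assumptions, branch A passes through layers 3, 5, 7, and 9, branch B passes through layers 2, 4, 6, 8, and 10, and both meet at the concatenate layer.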

3.1.2. Second Model (2-Layer Stacked CNN-BiLSTM)
- Layer 1: Input layer with shape (60, 1).
- Layer 2: Convolutional 1D layer with 128 filters, a kernel size of 3, padding of 1, and the ReLU activation function. Outputs 60 features.
- Layer 3: Max pooling 1D layer with a pool size of 2 and padding of 1. Outputs a feature vector, reducing the dimension to 30.
- Layer 4: Convolutional 1D layer with 128 filters, a kernel size of 3, padding of 1, and the ReLU activation function. Outputs 30 features.
- Layer 5 (encoder): Max pooling 1D layer with a pool size of 2 and padding of 1. Encodes the feature vector, reducing the dimension to 15.
- Layer 6: Convolutional 1D layer with 128 filters, a kernel size of 3, padding of 1, and the ReLU activation function. Outputs 15 features.
- Layer 7: Max pooling 1D layer with a pool size of 2 and padding of 1. Outputs a feature vector.
- Layer 8: Convolutional 1D layer with 128 filters, a kernel size of 3, padding of 1, and the ReLU activation function. Outputs 64 features.
- Layer 9 (encoder): Max pooling 1D layer with a pool size of 2 and padding of 1. Encodes the feature vector.
- Layer 10: BiLSTM layer with 128 units, serving as the integrated layer for convolutional layer 2.
- Layer 11: BiLSTM layer, mirror of layer 4.
- Layer 12 (dense layer): Fully connected layer where neurons receive inputs from layer 11 and predict class probabilities for each input sample. The sigmoid activation function is used in model 2.
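
The sequence lengths quoted for layers 2 to 6 (60, 30, 30, 15, 15) can be checked with a short calculation. The sketch below is a rough illustration, not the authors' code: it assumes that "padding of 1" behaves like "same" padding, so a stride-1 convolution keeps the length and a pool of size 2 gives the ceiling of half the length. The function names are hypothetical, and the lengths it prints after layer 6 follow from that assumption rather than from the text.

```python
import math


def conv1d_same(length):
    # Assumption: stride 1 with "same" padding leaves the sequence length unchanged.
    return length


def maxpool1d_same(length, pool=2):
    # Assumption: "same" padding with stride equal to the pool size.
    return math.ceil(length / pool)


def trace_lengths(length, blocks=4):
    """Return (stage name, sequence length) pairs for stacked conv + pool blocks."""
    stages = [("input", length)]
    for i in range(1, blocks + 1):
        length = conv1d_same(length)
        stages.append((f"conv{i}", length))
        length = maxpool1d_same(length)
        stages.append((f"pool{i}", length))
    return stages


for name, length in trace_lengths(60):
    print(f"{name:>6}: {length}")
```

With an input length of 60, the first two blocks give 60, 30, 30, and 15, which matches the values listed for layers 2 to 5.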
4. Results and Discussion
4.1. Data Cleaning
4.2. Classification Results for God-Class Dataset
4.2.1. Test Results for Proposed and ML Models
4.2.2. Test Results for Proposed and ML Models
4.3. Classification Results for Data-Class Dataset
4.3.1. Test Results for Proposed and ML Models
4.3.2. Test Results for Proposed and ML Models
4.4. Classification Results for Long-Method Dataset
4.4.1. Test Results for Proposed and ML Models
4.4.2. Test Results for Proposed and ML Models
4.5. Classification Results for Feature-Envy Dataset
4.5.1. Test Results for Proposed and ML Models
4.5.2. Test Results for Proposed and ML Models
5. Conclusion

| Machine Learning Algorithm | Validation Accuracy (%) |
|---|---|
| K-Nearest Neighbor (KNN) | 96.03 |
| Support Vector Machine (SVM) | 93.65 |
| Decision Tree (DT) | 94.44 |
| Stochastic Gradient Descent (SGD) | 94.44 |
| Logistic Regression (LR) | 94.44 |
| eXtreme Gradient Boosting (XGBoost) | 96.83 |
| Proposed Model 1 (3-Layer Stacked Autoencoder) | 97.62 |
| Proposed Model 2 (2-Layer Stacked Autoencoder) | 97.62 |

| Metric | Class | CNN | LSTM | 3-Layer Stacked Autoencoder | 2-Layer Stacked Autoencoder |
|---|---|---|---|---|---|
| Test Loss | | 0.1502 | 0.2635 | 0.1198 | 0.3361 |
| Test Acc. (%) | | 95.24 | 92.86 | 97.62 | 97.62 |
| Precision | 0 | 0.96 | 0.96 | 0.99 | 0.99 |
| Precision | 1 | 0.93 | 0.86 | 0.95 | 0.95 |
| Recall | 0 | 0.96 | 0.93 | 0.98 | 0.98 |
| Recall | 1 | 0.93 | 0.93 | 0.97 | 0.97 |
| F1-Score | 0 | 0.96 | 0.95 | 0.98 | 0.98 |
| F1-Score | 1 | 0.93 | 0.89 | 0.96 | 0.96 |
| Training/Testing Time | | 19.71 | 106.387 | 265.744 | 27.952 |

| Machine Learning Algorithm | Validation Accuracy (%) |
|---|---|
| K-Nearest Neighbor (KNN) | 90.48 |
| Support Vector Machine (SVM) | 95.24 |
| Decision Tree (DT) | 97.62 |
| Stochastic Gradient Descent (SGD) | 92.86 |
| Logistic Regression (LR) | 94.05 |
| eXtreme Gradient Boosting (XGBoost) | 97.62 |
| Proposed Model 1 (3-Layer Stacked Autoencoder) | 98.81 |
| Proposed Model 2 (2-Layer Stacked Autoencoder) | 98.81 |

| Metric | Class | CNN | LSTM | 3-Layer Stacked Autoencoder | 2-Layer Stacked Autoencoder |
|---|---|---|---|---|---|
| Test Loss | | 0.2096 | 0.2220 | 0.0539 | 0.0883 |
| Test Acc. (%) | | 95.24 | 90.48 | 98.81 | 98.81 |
| Precision | 0 | 0.95 | 0.98 | 0.98 | 0.98 |
| Precision | 1 | 0.89 | 0.73 | 1.00 | 1.00 |
| Recall | 0 | 0.97 | 0.89 | 1.00 | 1.00 |
| Recall | 1 | 0.85 | 0.95 | 0.95 | 0.95 |
| F1-Score | 0 | 0.96 | 0.93 | 0.99 | 0.99 |
| F1-Score | 1 | 0.87 | 0.83 | 0.97 | 0.97 |
| Training/Testing Time | | 19.71 | 19.278 | 136.968 | 300.45 |

| Machine Learning Algorithm | Validation Accuracy (%) |
|---|---|
| K-Nearest Neighbor (KNN) | 90.48 |
| Support Vector Machine (SVM) | 92.86 |
| Decision Tree (DT) | 98.81 |
| Stochastic Gradient Descent (SGD) | 95.24 |
| Logistic Regression (LR) | 95.24 |
| eXtreme Gradient Boosting (XGBoost) | 98.81 |
| Proposed Model 1 (3-Layer Stacked Autoencoder) | 98.81 |
| Proposed Model 2 (2-Layer Stacked Autoencoder) | 98.81 |

| Metric | Class | CNN | LSTM | 3-Layer Stacked Autoencoder | 2-Layer Stacked Autoencoder |
|---|---|---|---|---|---|
| Test Loss | | 0.1903 | 0.2439 | 0.0472 | 0.1091 |
| Test Acc. (%) | | 92.86 | 92.86 | 98.81 | 97.62 |
| Precision | 0 | 0.95 | 0.98 | 0.98 | 0.98 |
| Precision | 1 | 0.89 | 0.73 | 1.00 | 1.00 |
| Recall | 0 | 0.97 | 0.89 | 1.00 | 1.00 |
| Recall | 1 | 0.85 | 0.95 | 0.95 | 0.95 |
| F1-Score | 0 | 0.96 | 0.93 | 0.99 | 0.99 |
| F1-Score | 1 | 0.87 | 0.83 | 0.97 | 0.97 |
| Training/Testing Time | | 6.200 | 23.81 | 310.45 | 32.3 |

| Machine Learning Algorithm | Validation Accuracy (%) |
|---|---|
| K-Nearest Neighbor (KNN) | 90.48 |
| Support Vector Machine (SVM) | 92.86 |
| Decision Tree (DT) | 98.81 |
| Stochastic Gradient Descent (SGD) | 95.24 |
| Logistic Regression (LR) | 95.24 |
| eXtreme Gradient Boosting (XGBoost) | 98.81 |
| Proposed Model 1 (3-Layer Stacked Autoencoder) | 98.81 |
| Proposed Model 2 (2-Layer Stacked Autoencoder) | 98.81 |

| Metric | Class | CNN | LSTM | 3-Layer Stacked Autoencoder | 2-Layer Stacked Autoencoder |
|---|---|---|---|---|---|
| Test Loss | | 0.1903 | 0.2439 | 0.0472 | 0.1091 |
| Test Acc. (%) | | 92.86 | 92.86 | 98.81 | 97.62 |
| Precision | 0 | 0.95 | 0.98 | 0.98 | 0.98 |
| Precision | 1 | 0.89 | 0.73 | 1.00 | 1.00 |
| Recall | 0 | 0.97 | 0.89 | 1.00 | 1.00 |
| Recall | 1 | 0.85 | 0.95 | 0.95 | 0.95 |
| F1-Score | 0 | 0.96 | 0.93 | 0.99 | 0.99 |
| F1-Score | 1 | 0.87 | 0.83 | 0.97 | 0.97 |
| Training/Testing Time | | 6.200 | 23.81 | 310.45 | 32.3 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).