5. Discussion
The comparison among the three evaluated machine learning classifiers indicates that Logistic Regression was the most suitable model for the multiclass classification task. Across the selected evaluation metrics, it consistently achieved better performance than Decision Tree and Naive Bayes, which suggests that the decision boundaries required for this task were effectively captured by a linear classifier. This result also indicates that increasing classifier complexity did not necessarily lead to improved performance under the evaluated conditions. Therefore, Logistic Regression was retained as the reference classifier for the subsequent analysis.
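As an illustration of this kind of comparison, the following minimal sketch scores the three classifier families with stratified cross-validation. It is not the pipeline used in this study: the data are synthetic stand-ins generated with `make_classification`, and the Gaussian Naive Bayes variant, the five-fold protocol, and all hyperparameters are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the embedding matrix and the three smell labels
# (assumption: the real features and labels are not reproduced here).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)

# Assumed model settings; the Naive Bayes variant is a guess for illustration.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: balanced accuracy = {score:.3f}")
```

The ranking printed for synthetic data says nothing about the ranking reported here; the sketch only shows how the three models can be compared under identical folds and an identical metric.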
Once Logistic Regression was selected as the reference classifier, an additional comparison was conducted between CodeBERT and OpenAI embeddings. The stronger results obtained with CodeBERT embeddings suggest that they provided a representation more closely aligned with the characteristics of the task. This may indicate that embeddings derived from a model specialized in source code captured more discriminative information than the more general-purpose OpenAI embeddings in this setting.
These results can be further interpreted in relation to Attentionsmelling as a closely related reference built on the same empirical basis [10]. However, the comparison should be understood at the methodological level rather than as a strict experimental replication, since Attentionsmelling evaluates a prompt-based GPT-4o pipeline, whereas the present study examines multiclass discriminative learning over fixed vector representations. Even under these differences, both studies suggest that code smell detection is strongly influenced by the quality of the information provided to the model. In Attentionsmelling, performance improves when the model receives refined prompts and structured code metrics, whereas in the present study performance improves when the classification pipeline relies on code-specialized embeddings rather than more general-purpose representations. At the class level, both studies also indicate that Feature Envy remains the most challenging category, reinforcing the view that this smell requires richer contextual information for reliable discrimination.
The co-occurrence analysis provides additional context for interpreting the classification results. As shown in Table 5, the most frequent cross-type combinations were God Class + Long Method and God Class + Feature Envy, whereas Feature Envy + Long Method was rarely observed. However, the confusion matrix analysis showed that the main source of misclassification was precisely the pair Feature Envy–Long Method, and the class-wise results further indicated that Feature Envy was the most difficult category in both representation approaches. This contrast suggests that classification difficulty was not determined only by the explicit frequency of co-occurrence in the dataset, but also by structural similarities captured in the source code representations. In this sense, the present findings are consistent with prior evidence that code smell co-occurrences are a relevant phenomenon and that relationships between smell categories cannot be reduced to isolated occurrences alone [3]. From this perspective, the co-occurrence results complement the predictive evaluation by indicating that overlap among code smell categories may arise both from their joint presence in the same code context and from shared characteristics that reduce class separability in the embedding space.
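The kind of cross-type counting summarized above can be sketched as follows. The function name `cooccurrence_counts`, the input layout (a mapping from a code unit to the set of smells annotated in it), and the example records are hypothetical and serve only to illustrate the idea; they do not reproduce the dataset or the counts of this study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(smells_by_unit):
    """Count how often each unordered pair of smell types shares a code unit."""
    counts = Counter()
    for smells in smells_by_unit.values():
        # Sorting makes ("A", "B") and ("B", "A") fall into the same bucket.
        for pair in combinations(sorted(set(smells)), 2):
            counts[pair] += 1
    return counts


# Hypothetical annotations, not taken from the evaluated dataset.
example = {
    "ClassA": {"God Class", "Long Method"},
    "ClassB": {"God Class", "Feature Envy"},
    "ClassC": {"God Class", "Long Method", "Feature Envy"},
    "ClassD": {"Long Method"},
}

for pair, n in cooccurrence_counts(example).most_common():
    print(" + ".join(pair), n)
```

Comparing such pair counts with the off-diagonal cells of a confusion matrix is one way to check whether frequent co-occurrence and frequent confusion involve the same pairs.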
Taken together, these findings allow the two research questions of this study to be addressed.
5.1. RQ1. Which of the Evaluated Embeddings Provides the Best Separation Among the Three Classes Considered?
The results indicate that CodeBERT provided the best separation among the three analyzed classes. This conclusion is supported by the overall classification results, where CodeBERT achieved higher balanced accuracy, macro F1-score, and MCC than OpenAI. The same pattern was observed in the confusion matrices, where CodeBERT reduced the number of misclassifications, particularly between Feature Envy and Long Method, which represented the main source of confusion in both approaches.
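For reference, the three aggregate metrics mentioned above can be derived from a single confusion matrix. The sketch below uses the standard definitions (balanced accuracy as the mean per-class recall, macro F1 as the unweighted mean of per-class F1, and the multiclass generalization of MCC); the helper name `summary_metrics` and the example matrix are hypothetical and do not correspond to the values reported in this study.

```python
import numpy as np


def summary_metrics(cm):
    """Return (balanced accuracy, macro F1, multiclass MCC).

    cm[i, j] counts instances of true class i predicted as class j.
    Assumes every class has at least one true instance.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    true_totals = cm.sum(axis=1)  # instances per true class
    pred_totals = cm.sum(axis=0)  # instances per predicted class
    total = cm.sum()

    recall = tp / true_totals
    # Precision and F1 are set to 0 where they would be undefined.
    precision = np.divide(tp, pred_totals,
                          out=np.zeros_like(tp), where=pred_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)

    # Multiclass MCC computed from the confusion-matrix marginals.
    num = tp.sum() * total - (pred_totals * true_totals).sum()
    den = np.sqrt((total**2 - (pred_totals**2).sum())
                  * (total**2 - (true_totals**2).sum()))
    mcc = num / den if den > 0 else 0.0
    return recall.mean(), f1.mean(), mcc


# Hypothetical 3x3 confusion matrix, for illustration only.
cm_example = [[40, 8, 2],
              [10, 35, 5],
              [3, 4, 43]]
bal_acc, macro_f1, mcc = summary_metrics(cm_example)
print(f"balanced accuracy={bal_acc:.3f}, macro F1={macro_f1:.3f}, MCC={mcc:.3f}")
```

All three metrics weight the classes equally or account for the full matrix, which is why they are preferable to plain accuracy when class sizes differ.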
This difference can also be interpreted in light of the nature of the compared embedding models. Because CodeBERT is pretrained on source code corpora, it is more likely to encode structural and syntactic regularities that are directly relevant to code smell discrimination. In contrast, OpenAI embeddings, while still able to capture meaningful distributional patterns, are not explicitly optimized for software engineering tasks. The superiority of CodeBERT in the present study therefore suggests that domain-specific pretraining provides a more suitable latent space for separating smell categories, especially when the task requires distinguishing between structurally related smells such as Long Method and Feature Envy.
The dimensionality reduction analyses were consistent with these results. In the PCA projection, the CodeBERT embeddings showed a more structured spatial organization, with Feature Envy concentrated mainly in the lower region of the plane and Long Method more frequently located in the upper and central regions. In the t-SNE projection, the same general tendency was observed, with a clearer grouping of Feature Envy and a broader distribution of Long Method, while God Class remained between both classes with partial overlap. Although some overlap is still present, both visualizations suggest that CodeBERT generates a representation space with a clearer class organization than OpenAI.
A complementary insight emerges from the cumulative variance analysis of the PCA decomposition. While the two-dimensional projections shown in Figure 5 capture only a limited portion of the total variance, the number of components required to reach 80% of the explained variance differs substantially between the two embedding models. In particular, CodeBERT requires only nine principal components to reach this threshold, whereas the OpenAI embeddings require sixteen components.
This result suggests that CodeBERT organizes the information relevant to code smell discrimination in a more compact latent space. In other words, a smaller number of orthogonal directions is sufficient to capture most of the variability of the representation. From a machine learning perspective, this property facilitates the separation of classes using relatively simple classifiers, which is consistent with the superior performance observed for CodeBERT in the classification experiments. Conversely, the higher number of components required by the OpenAI embeddings indicates a more dispersed representation space, where discriminative information is spread across a larger number of dimensions.
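The component count used in this argument can be obtained as in the following sketch, which computes the explained variance ratios from the singular values of the centered data matrix. The function name `components_for_variance` and the two synthetic matrices are assumptions introduced for illustration; they are not the embeddings analyzed here, and the sketch assumes the input is not constant.

```python
import numpy as np


def components_for_variance(X, threshold=0.80):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    ratio = s**2 / np.sum(s**2)                  # explained variance ratios
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(cumulative))


# Synthetic stand-ins: one matrix with rapidly decaying feature scales
# ("compact") and one with equal scales in every direction ("dispersed").
rng = np.random.default_rng(0)
compact = rng.normal(size=(500, 50)) * np.linspace(5.0, 0.1, 50)
dispersed = rng.normal(size=(500, 50))

print("compact:", components_for_variance(compact))
print("dispersed:", components_for_variance(dispersed))
```

A lower count means that the variance is concentrated in fewer orthogonal directions, which is the property attributed to the CodeBERT representation in the paragraph above.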
This observation is also consistent with the t-SNE visualizations presented in Figure X, where the CodeBERT embeddings exhibit clearer local grouping patterns among the code smell categories.
Additional insight can be obtained from the confusion matrix analysis presented in Section 4.3. The main source of misclassification for both embedding approaches occurs between Feature Envy and Long Method. This pattern suggests that these two smells share structural characteristics that make them difficult to distinguish using purely distributional representations of source code. Long methods frequently involve multiple interactions with external objects or classes, which may generate token patterns similar to those observed in Feature Envy instances.
Nevertheless, the confusion matrices also show that CodeBERT substantially reduces this ambiguity. In particular, the number of Long Method instances misclassified as Feature Envy decreases markedly when using CodeBERT embeddings. This result indicates that code-specialized representations capture structural relationships in the source code more effectively than general-purpose embeddings. In contrast, the God Class category exhibits relatively low confusion with the other smells, suggesting that its structural characteristics are more distinct and easier to capture within the embedding space. This observation is also consistent with the PCA and t-SNE visualizations, where partial overlap between Long Method and Feature Envy can be observed, particularly in the OpenAI embedding space.
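The dominant confusion pair discussed above can be identified programmatically by summing the errors in both directions for each pair of classes. The helper `most_confused_pair` and the example matrix below are hypothetical; the counts are invented for illustration and are not those of Section 4.3.

```python
import numpy as np


def most_confused_pair(cm, labels):
    """Return the pair of classes with the most mutual misclassifications.

    cm[i, j] counts instances of true class i predicted as class j.
    """
    cm = np.asarray(cm)
    sym = cm + cm.T            # add errors in both directions
    np.fill_diagonal(sym, 0)   # ignore correct predictions
    i, j = np.unravel_index(np.argmax(sym), sym.shape)
    i, j = sorted((int(i), int(j)))
    return labels[i], labels[j], int(sym[i, j])


# Hypothetical confusion matrix, for illustration only.
labels = ["Feature Envy", "God Class", "Long Method"]
cm_example = [[30, 3, 12],
              [2, 45, 3],
              [9, 2, 44]]

print(most_confused_pair(cm_example, labels))
```

Running the same helper on the matrices of two embedding approaches makes it straightforward to compare how many errors each one concentrates in the same pair.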
Therefore, the results consistently indicate that CodeBERT provides the most effective separation among the three code smell classes considered in this study.
5.2. RQ2. Is This Difference Maintained When Performance Is Analyzed Separately for Each Code Smell?
The class-wise results indicate that the advantage of CodeBERT remains when performance is analyzed separately for each code smell. For Long Method, both approaches achieved strong results, but CodeBERT obtained higher recall and F1-score, indicating a more accurate identification of instances in this class. For God Class, OpenAI reached the highest precision, whereas CodeBERT achieved higher recall and F1-score, which reflects a more balanced performance. The largest difference was observed for Feature Envy, where CodeBERT improved precision, recall, and F1-score with respect to OpenAI.
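A class-wise breakdown of this kind can be computed with scikit-learn's `precision_recall_fscore_support`, as in the minimal sketch below. The label vectors are hypothetical toy data chosen so that the values are easy to verify by hand; they are not predictions from the evaluated models.

```python
from sklearn.metrics import precision_recall_fscore_support

labels = ["Feature Envy", "God Class", "Long Method"]

# Hypothetical ground truth and predictions, for illustration only.
y_true = ["Feature Envy"] * 3 + ["God Class"] * 3 + ["Long Method"] * 4
y_pred = [
    "Feature Envy", "Feature Envy", "Long Method",                # one FE -> LM error
    "God Class", "God Class", "God Class",                        # all correct
    "Long Method", "Long Method", "Long Method", "Feature Envy",  # one LM -> FE error
]

# With the default average=None, one value per class is returned,
# in the order given by `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

for name, p, r, f, n in zip(labels, precision, recall, f1, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f} (n={n})")
```

Reporting precision and recall separately, as done above for God Class, shows when one approach is more conservative and the other more complete for the same class.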
These class-wise patterns are also consistent with the confusion matrix analysis. In particular, the largest reduction in errors with CodeBERT was observed in the confusion between Feature Envy and Long Method, which supports the improvement detected for Feature Envy in the class-wise metrics. This result is especially relevant because Feature Envy was the most challenging class in both approaches.
A possible explanation for these differences can be related to the structural characteristics of the analyzed code smells. Long Method and God Class are primarily associated with size-related properties of the code, such as excessive method length or large classes containing many responsibilities. These characteristics tend to generate relatively consistent structural patterns that can be captured by both embedding approaches. In contrast, Feature Envy is defined by an abnormal dependency structure in which a method relies excessively on the data of another class. This behavior may produce more subtle contextual patterns that are harder to capture using general-purpose embeddings. Consequently, the stronger performance of CodeBERT for Feature Envy suggests that code-specialized representations are better suited to modeling these structural relationships between program elements.
Therefore, the results confirm that the difference observed at the global level is preserved in the class-wise analysis: CodeBERT achieved the higher F1-score for all three code smell categories. The magnitude of the improvement varies with the structural characteristics of each smell, being more pronounced for Feature Envy than for Long Method and God Class.