The following subsections analyze the results obtained at each stage of the proposed process, including a detailed comparison of the models and their quantitative performance metrics.
5.1. Direct Image-Based Classification
The results in Table 2 correspond to the classifiers’ outputs when the original images, resized but without any feature extraction, are used directly as input. Various fuzzy classification methods and membership functions were evaluated under these conditions. The reported metrics (sensitivity, specificity, precision, F1-score, true negatives, false positives, false negatives, true positives) provide a comprehensive assessment of each configuration’s performance.
Overall, the results indicate limited effectiveness of the classifiers on raw image data. While some metrics appear high in certain cases, a deeper analysis reveals misleading performance. For example, the Fuzzy DT classifier exhibits a sensitivity of 100% and specificity of 0% in several configurations, indicating that it classifies all samples as positive and fails to detect negative cases, which is a clear bias rather than effective classification.
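The bias described above can be illustrated with a minimal sketch that derives the four reported metrics from confusion-matrix counts. The counts used here (70 positives, 30 negatives) are hypothetical and are not taken from Table 2; the function name `confusion_metrics` is likewise an illustrative choice.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return sensitivity, specificity, precision, and F1 from confusion counts."""
    def ratio(num, den):
        # Guard against empty denominators (e.g. no predicted positives).
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Hypothetical counts: a classifier that labels every sample as positive
# on a test set with 70 positives and 30 negatives.
always_positive = confusion_metrics(tn=0, fp=30, fn=0, tp=70)
```

Under these assumed counts, sensitivity is 1.0 and specificity is 0.0, yet the F1-score is still about 0.82, which shows why a single high metric cannot be read in isolation.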
Similarly, Fuzzy BN and Fuzzy CM show consistently poor results. For instance, Fuzzy BN with a trapezoidal membership function achieves only 38.03% sensitivity and 48.40% F1-score. Fuzzy CM yields one of the lowest performances, with an F1-score of 30.94%, far below acceptable levels.
Although some configurations of Fuzzy SVM and Fuzzy KNN yield better results—such as Fuzzy SVM with a Gaussian membership function achieving an F1-score of 95.52%—these are exceptions. Most methods present critical shortcomings in at least one key metric (sensitivity, specificity, or precision).
These findings suggest that using the original images directly as input to fuzzy classifiers does not provide reliable results. Significant improvements are necessary, both in the choice of fuzzy method and in the tuning of membership functions and other parameters. Additionally, isolated high metric values should be interpreted cautiously and in the context of all relevant measures to properly assess model performance.
5.2. Feature Extraction Based on Images
5.2.1. Feature Selection Analysis Based on Quality Metric Q
This section presents the results of applying a GA for feature selection using the quality metric Q as the fitness function. The process was performed on features extracted from five different CNN models: MobileNetV2, ResNet50, VGG16, EfficientNetV2, and a custom CNN. The objective is to identify the most representative and discriminative subsets of features that enhance class separability and improve classification performance in image-based tasks.
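As a rough illustration of GA-based feature selection, the sketch below evolves boolean feature masks on synthetic data. The quality metric Q is not defined in this section, so the fitness used here (`separability`, the distance between class centroids divided by the mean within-class spread) is only a stand-in assumption, and the operators (tournament selection, uniform crossover, bit-flip mutation, elitism) and all parameter values are illustrative choices rather than the configuration used in the study.

```python
import math
import random

def separability(X, y, mask):
    """Stand-in fitness (NOT the paper's Q): centroid distance over spread."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset has no discriminative value
    groups = {0: [], 1: []}
    for row, label in zip(X, y):
        groups[label].append([row[j] for j in cols])
    centroids, spreads = [], []
    for pts in groups.values():
        c = [sum(col) / len(pts) for col in zip(*pts)]
        centroids.append(c)
        spreads.append(sum(math.dist(p, c) for p in pts) / len(pts))
    return math.dist(centroids[0], centroids[1]) / (sum(spreads) + 1e-12)

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Evolve boolean feature masks; return the best mask and fitness history."""
    rng = random.Random(seed)
    n_features = len(X[0])

    def fitness(mask):
        return separability(X, y, mask)

    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        children = [best[:]]  # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Uniform crossover followed by bit-flip mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness), history

# Synthetic data: two informative features followed by four noise features.
data_rng = random.Random(1)
X, y = [], []
for i in range(40):
    label = i % 2
    informative = [data_rng.gauss(3.0 * label, 0.5) for _ in range(2)]
    noise = [data_rng.gauss(0.0, 1.0) for _ in range(4)]
    X.append(informative + noise)
    y.append(label)

mask, history = ga_select(X, y)
```

Because the best mask is carried over unchanged in every generation, the recorded `history` can never decrease, which is the property that makes a fitness-evolution plot meaningful.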
Feature Quality Visualization Based on Metric Q
Table 3 presents PCA-based visualizations of the selected features for each evaluated CNN model. Each subfigure shows the projection of the data onto the first two principal components, PC1 and PC2, allowing visualization of class distribution and separability in the reduced feature space.
Table 3(a) displays the selected features extracted by the custom CNN model, showing a clear separation between classes.
Table 3(b) corresponds to EfficientNetV2, where clusters exhibit some degree of overlap.
Table 3(c) illustrates the feature representation for MobileNetV2, while Table 3(d) shows the projection for ResNet50. Finally, Table 3(e) presents the feature distribution for VGG16, demonstrating high class separability.
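The projection onto PC1 and PC2 used in these visualizations can be sketched as follows. This is a dependency-free illustration that finds the two leading eigenvectors of the covariance matrix by power iteration; the function name `pca_2d`, the iteration count, and the synthetic three-feature data are assumptions, and a library routine would normally be used in practice.

```python
import math
import random

def pca_2d(X, iters=500, seed=0):
    """Project rows of X onto the first two principal components."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    centered = [[v - m for v, m in zip(row, means)] for row in X]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            # Power iteration step: multiply by the covariance matrix.
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            # Project out components already found to keep them orthogonal.
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        components.append(v)
    scores = [[sum(a * b for a, b in zip(row, c)) for c in components]
              for row in centered]
    return scores, components

# Synthetic stand-in for a selected feature subset (not CNN features):
# two strongly correlated features plus one low-variance feature.
data_rng = random.Random(42)
X = []
for _ in range(50):
    t = data_rng.gauss(0.0, 1.0)
    X.append([t, 2.0 * t + data_rng.gauss(0.0, 0.3), data_rng.gauss(0.0, 0.1)])

scores, components = pca_2d(X)
```

Each row of `scores` holds the (PC1, PC2) coordinates of one sample; plotting these coordinates colored by class label gives a scatter plot of the kind described above.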
Table 4 reports the maximum Q value achieved by each model alongside the number of features selected by the GA. The Q metric evaluates the quality of clustering produced by the selected features, with higher values indicating better class separability and representational capacity.
The results indicate that the VGG16 model attained the highest clustering quality, with 95 selected features (its Q value is listed in Table 4). This suggests that the feature space of VGG16, when optimized by the GA, offers strong class discrimination. In comparison, MobileNetV2 selected a relatively large number of features (74), but its maximum Q score was notably lower, reflecting less effective class separability.
A visual comparison of these outcomes is presented in Figure 6, which illustrates both the clustering quality Q and the number of selected features for all evaluated CNN models.
Quantitative Comparison of Feature Quality and Dimensionality:
VGG16 obtained the highest clustering quality, showing the best class separability in its selected features.
MobileNetV2 and VGG16 selected a similar number of features (74 and 95), but VGG16 achieved significantly better quality, indicating stronger feature discrimination.
ResNet50 and EfficientNetV2 had lower Q scores, with ResNet50 performing the worst, suggesting weaker feature separability.
The Custom CNN model selected fewer features (44) and achieved a moderate quality score, balancing feature compactness and discriminative ability.
Overall, the feature subset from VGG16 was the most effective, achieving the highest clustering quality among the evaluated models.
5.2.2. Feature Selection Using a GA Based on Clustering Metrics (Silhouette Index, Calinski–Harabasz, and Davies–Bouldin)
To evaluate the effectiveness of the feature subsets selected by the GA, three widely recognized clustering validation metrics were utilized: the Silhouette Score, Calinski–Harabasz Index, and Davies–Bouldin Index. These metrics were computed on the selected feature subsets obtained from each CNN-based representation following dimensionality reduction.
Table 5 summarizes the clustering quality metrics and the number of selected features for each CNN model after feature selection using the GA.
The VGG16 model stands out as the best performer overall, achieving the highest Silhouette Score (0.7343), a strong Calinski–Harabasz score (8233.05), and a low Davies–Bouldin Index (0.4292). This combination reflects excellent cluster compactness and separation, indicating that the selected features from VGG16 provide the most discriminative representation for the classification task.
MobileNetV2 showed competitive results, with a Silhouette Score close to that of VGG16 (0.7229). However, it required a larger number of selected features (44), which may imply lower efficiency in feature reduction compared to VGG16’s 16 features.
The Custom CNN model exhibited the highest Calinski–Harabasz score (9038.60) and the lowest Davies–Bouldin Index (0.4254), suggesting very well-separated clusters. However, its slightly lower Silhouette Score (0.6541) implies that cluster cohesion was somewhat reduced relative to VGG16.
EfficientNetV2 demonstrated moderate and balanced performance across all three metrics but did not excel in any particular measure. This suggests an average clustering quality for its selected features.
Finally, ResNet50 showed the weakest clustering quality among the evaluated models, with the lowest Silhouette Score (0.5678) and a comparatively higher Davies–Bouldin Index, indicating less effective clustering and feature separability.
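To make the direction of the two index-type metrics concrete, the following sketch computes the Calinski–Harabasz and Davies–Bouldin indices from their standard definitions on a four-point toy example. The helper names and the toy data are illustrative assumptions; the values have no relation to those in Table 5.

```python
import math

def _centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def _group(X, labels):
    clusters = {}
    for point, label in zip(X, labels):
        clusters.setdefault(label, []).append(point)
    return list(clusters.values())

def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion; higher is better."""
    clusters = _group(X, labels)
    n, k = len(X), len(clusters)
    overall = _centroid(X)
    between = sum(len(c) * math.dist(_centroid(c), overall) ** 2
                  for c in clusters)
    within = sum(math.dist(p, _centroid(c)) ** 2 for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(X, labels):
    """Mean worst-case scatter-to-separation ratio; lower is better."""
    clusters = _group(X, labels)
    cents = [_centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i) for i in range(k)]
    return sum(worst) / k

# Toy example: two tight clusters that are far apart.
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
ch = calinski_harabasz(X, labels)   # 200.0 for this toy data
db = davies_bouldin(X, labels)      # 0.1 for this toy data
```

Tight, distant clusters push the Calinski–Harabasz index up and the Davies–Bouldin index down, which is the pattern the comparison above relies on.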
Figure 7 provides a visual comparison of the three clustering validation metrics (Silhouette Score, Calinski–Harabasz Index, and Davies–Bouldin Index) across the CNN-based feature subsets. This visualization further highlights the differences in clustering quality among the models and supports the quantitative results presented in Table 5.
In summary, these clustering metrics confirm that VGG16 provides the most effective feature subset for this problem, balancing cluster compactness and separation while using a compact feature set.
Visual Analysis of Cluster Structure:
Table 6 presents PCA visualizations of the selected feature subsets from each CNN model. These 2D projections illustrate the distribution and separability of clusters based on the selected features.
Dimensionality reduction via Principal Component Analysis (PCA) was applied to the selected feature subsets to visually assess their clustering structure. Feature representations from VGG16 and the Custom CNN model resulted in clearly separated clusters, consistent with their favorable clustering metrics. For MobileNetV2, the projected data revealed reasonably defined clusters with some overlap, indicating moderate separability. In contrast, ResNet50 exhibited poorly separated clusters in the PCA space, aligning with its lower performance on clustering validation metrics.
Overall, among the evaluated CNN models, VGG16 demonstrates the greatest ability to produce compact and well-separated clusters while using a relatively small number of features, making it a particularly suitable candidate for subsequent classification tasks.
Fitness Evolution Across Generations
The convergence behavior of the GA was monitored by tracking the best fitness value—represented by the Silhouette Score—over successive generations for each CNN model.
Figure 8 illustrates this evolution. VGG16 exhibited a rapid improvement in fitness, reaching its optimal value within the early generations and maintaining stability thereafter. In contrast, MobileNetV2 demonstrated a more gradual convergence pattern. Although its fitness increased steadily, the final value remained slightly lower compared to VGG16. ResNet50 showed limited improvement throughout the generations, suggesting that the quality of its extracted features may have hindered effective optimization. Meanwhile, EfficientNetV2 and the Custom CNN model displayed smooth and consistent growth in fitness values, indicating stable convergence behavior and the presence of moderately informative features.
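Since the fitness tracked here is the Silhouette Score, a minimal sketch of that score may help in reading the curves. The code follows the standard definition with Euclidean distance; the implementation actually used in the study is not specified in this section, and the four-point data set is purely illustrative.

```python
import math

def silhouette_score(X, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    clusters = {}
    for point, label in zip(X, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(X, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(point, q) for q in pts) / len(pts)
                for other, pts in clusters.items() if other != label)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(X)

X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(X, [0, 0, 1, 1])   # labels match the geometry
mixed = silhouette_score(X, [0, 1, 0, 1])  # labels cut across the geometry
```

With labels that match the geometry the score is close to 0.9, while the mismatched labeling yields a negative score. A GA that maximizes this value is therefore rewarded for feature subsets in which samples lie closer to their own class than to the other.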
5.2.3. Fuzzy Classification Using Optimal Feature Set
Based on the feature selection process described in previous sections, the feature subset that yielded the best clustering quality and dimensionality balance—obtained via a GA—was used as input for various fuzzy classification methods. The classification was performed using different membership functions (Gaussian, Trapezoidal, and Triangular) to evaluate the sensitivity of each method to fuzzification strategies.
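For reference, the three membership function shapes can be sketched in their usual textbook parameterization. The parameter names and the example values below are assumptions chosen for illustration; the parameters used in the experiments are not given in this section.

```python
import math

def gaussian_mf(x, mean, sigma):
    """Bell-shaped membership centered at `mean` with width `sigma`."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def triangular_mf(x, a, b, c):
    """Feet at a and c, peak at b (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def trapezoidal_mf(x, a, b, c, d):
    """Feet at a and d, plateau of full membership on [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

# Membership degree of the same normalized feature value under each shape.
x = 0.25
degrees = {
    "gaussian": gaussian_mf(x, mean=0.5, sigma=0.2),
    "triangular": triangular_mf(x, a=0.0, b=0.5, c=1.0),
    "trapezoidal": trapezoidal_mf(x, a=0.0, b=0.2, c=0.8, d=1.0),
}
```

The same input receives a different membership degree under each shape (0.5 for the triangular and 1.0 for the trapezoidal function in this example), which is why classifier behavior can vary with the fuzzification strategy.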
Performance Summary:
Table 7 presents the classification results of the fuzzy methods for each membership function, broken down by class and method. The reported metrics are Sensitivity, Specificity, Precision, F1-Score, Accuracy, and the confusion matrix components (TN, FP, FN, TP). Accuracy is identical across classes for each method because it reflects overall performance, while the other metrics provide insight into per-class behavior. Notably, Fuzzy KNN and Fuzzy DT with triangular and trapezoidal membership functions maintain high and balanced metrics, whereas FGK shows very low sensitivity despite high specificity, indicating a bias toward negative classification.
Among the methods evaluated, Fuzzy KNN consistently achieved high classification performance across all membership functions, with F1-Scores exceeding 95% for Class 1 and balanced sensitivity and specificity values above 90%. Notably, the triangular and trapezoidal membership functions yielded nearly identical results, indicating robustness to the choice of fuzzification shape.
Fuzzy Decision Tree (Fuzzy DT) also demonstrated strong performance, especially with triangular and trapezoidal membership functions, achieving F1-Scores above 95% for Class 1 and balanced precision-recall trade-offs. Sensitivity and specificity were well aligned, supporting reliable classification across classes.
Fuzzy SVM showed competitive results, with the Gaussian membership function delivering a strong balance between sensitivity (around 95%) and precision (around 96%), while trapezoidal and triangular functions provided slightly lower but still acceptable metrics. Accuracy values were consistently above 93%.
In contrast, Fuzzy Bayesian Network (Fuzzy BN) exhibited variable results. While it performed well with the Gaussian function (F1-Score around 96% for Class 1), the triangular membership function led to a sharp decline in performance (F1-Score around 78% for Class 1 and below 67% for Class 0), likely due to poor generalization and high false negatives in one of the classes. The trapezoidal function showed moderate performance between these extremes.
FGK (Fuzzy Genetic Kernel) produced highly imbalanced results, with excellent specificity (around 99%) but very low sensitivity (around 17%), resulting in an overall low F1-Score (approximately 28%). This indicates a strong bias toward negative classification and poor detection of positive instances.
Finally, Fuzzy CM achieved high F1-Scores close to 96%, with balanced sensitivity and specificity around 93–94%, and consistent accuracy above 93%, making it a competitive alternative.
Overall, the results underscore the importance of selecting both an appropriate fuzzy classification model and membership function. Fuzzy KNN and Fuzzy DT, particularly with triangular or trapezoidal membership functions, stand out as the most effective approaches due to their robust and balanced classification performance across both classes.
Comparative Analysis Using ROC and Precision-Recall Curves
To evaluate the performance of fuzzy classifiers applied directly to images without prior feature extraction, two graphical tools were used: the Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve (PRC). These curves show how well the models differentiate between classes and maintain precision across decision thresholds.
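The two summary values reported for these curves, AUC and Average Precision (AP), can be computed as in the sketch below. It assumes binary labels (1 for the positive class) and distinct classifier scores; the labels and scores shown are hypothetical and do not come from the experiments.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample.

    Assumes distinct scores; tied scores would need threshold grouping.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0],
                    reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical labels and classifier scores for six samples.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, scores)            # 8 of 9 positive/negative pairs ranked correctly
ap = average_precision(y_true, scores)   # (1/1 + 2/2 + 3/4) / 3
```

Under this pairwise reading, an AUC of 0.5 corresponds to random ranking, and a value below 0.5 means the scores rank negatives above positives more often than the reverse.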
The Fuzzy KNN models with trapezoidal and triangular membership functions showed the best overall performance. Both configurations achieved an area under the curve (AUC) close to 0.94 in the ROC curve and an Average Precision (AP) of 0.96 in the PRC. This reflects excellent capability to distinguish between classes, even in imbalanced data contexts, as well as strong stability in prediction precision.
Regarding Fuzzy Naive Bayes, performance strongly depended on the membership function used. With a Gaussian function, the model reached an AUC of approximately 0.93 and an AP of 0.96, indicating good discrimination and high precision. However, with trapezoidal and triangular functions, performance dropped significantly, with AUCs of 0.63 and 0.52 and APs of 0.80 and 0.74, respectively. This highlights the sensitivity of the model to the shape of the membership function.
The Fuzzy CM model achieved an AUC close to 0.94, comparable to the best models, but its AP was lower, around 0.76. This suggests that, although it can discriminate between classes, its prediction precision may be less reliable depending on the decision threshold.
Finally, the FGK model exhibited the weakest performance, with an AUC of about 0.16 and an AP of 0.69, reflecting poor discriminative ability and limited predictive precision. This indicates that FGK is not suitable for use directly on raw images without preprocessing or feature extraction, possibly due to its high sensitivity to the structure of the input space.
These results are visually summarized in Table 8, Table 9 and Table 10, which include confusion matrices, ROC curves, and precision-recall curves for each fuzzy classifier and membership function.