4. Experimental Results and Discussion
In this section, we evaluate the proposed ResNet-CBAM framework and compare it to alternative architectures. We perform both quantitative and qualitative evaluations of performance, stability and interpretability. These analyses cover traditional performance metrics such as accuracy, precision, recall, and F1-score, as well as more sophisticated metrics such as the Receiver Operating Characteristic (ROC) area under the curve (AUC), Cohen’s Kappa, Matthews Correlation Coefficient (MCC), and Jaccard Index.
Confusion matrices and Grad-CAM heatmaps are also included to provide insight into model behaviour and failure cases. The findings show that incorporating attention mechanisms improves feature selectivity and boosts classification accuracy for several types of waste.
4.1. Experimental Setup and Software Configuration
We used the PyTorch 2.x framework running on a Python 3.10 environment for all experiments. The proposed model and baseline models were trained on Google Colab Premium with an NVIDIA A100 Tensor Core GPU (40 GB VRAM), which facilitated fast training and convergence.
The ResNet-CBAM model and the baseline models were developed using the torch, torchvision, and torch.utils.data libraries. Data augmentation and preprocessing (resizing, normalization and random geometric transformations) were done with torchvision.transforms. OpenCV and Pillow were used for image processing, while Matplotlib and Seaborn were used to plot the training loss curves, confusion matrix and Grad-CAM heatmaps.
Models were trained with the Adam optimizer with initial learning rate , a batch size of 32, for 25 epochs. To ensure a like-for-like comparison, all models were trained from scratch with the same settings.
To improve the consistency of the experiments, we repeated training and testing with different random initializations for each model and report the average performance. This approach reduces the influence of any single initialization and supports the replicability of the results.
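The repeated-run protocol described above can be sketched as follows. This is a minimal illustration only: train_and_evaluate is a hypothetical stand-in that simulates a score instead of training a network, and the number of seeds is an assumption, since the text does not state how many runs were performed.

```python
import random
import statistics


def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for one full training and test run.

    A real implementation would seed the framework, train the model and
    return its test accuracy. Here a seeded random number simulates a score.
    """
    rng = random.Random(seed)
    return 0.90 + 0.04 * rng.random()  # simulated accuracy in [0.90, 0.94)


def summarize_runs(seeds):
    """Run one experiment per seed and return (mean, sample std) of the scores."""
    scores = [train_and_evaluate(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


# Assumed seed list; the actual number of repetitions is not given in the text.
mean_acc, std_acc = summarize_runs([0, 1, 2, 3, 4])
print(f"accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```

Reporting the standard deviation alongside the mean makes it easier to judge whether differences between models exceed run-to-run variation.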
4.2. Comparative Evaluation Using Performance Metrics
To benchmark the proposed ResNet-CBAM design, we compared its performance with four baseline models: ResNet-101, EfficientNet-B0, MobileNet-V3 and LSTM. The models were evaluated using the standard classification metrics of accuracy, precision, recall and F1-score, which together give an overall picture of model reliability, sensitivity and performance across the waste types.
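As an illustration of how these four metrics relate to a multi-class confusion matrix, the sketch below computes them in plain Python. Macro averaging (an unweighted mean over classes) is an assumption here, and the 3-class matrix is toy data, not taken from the experiments.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_report(cm):
    """Accuracy and macro precision/recall/F1 from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[k][k] for k in range(n)]
    fp = [sum(cm[i][k] for i in range(n)) - cm[k][k] for k in range(n)]
    fn = [sum(cm[k]) - cm[k][k] for k in range(n)]
    precision = [safe_div(tp[k], tp[k] + fp[k]) for k in range(n)]
    recall = [safe_div(tp[k], tp[k] + fn[k]) for k in range(n)]
    f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]
    return {
        "accuracy": safe_div(sum(tp), total),
        "macro_precision": sum(precision) / n,
        "macro_recall": sum(recall) / n,
        "macro_f1": sum(f1) / n,
    }


# Toy 3-class confusion matrix (illustrative values only).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
report = classification_report(toy_cm)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```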
The results are shown in Table 6. Our proposed ResNet-CBAM model performs best across all metrics, with an accuracy of 0.9309, precision of 0.9044, recall of 0.9262 and F1-score of 0.9122. This indicates a good balance between precision and recall, which suggests that the features learned by the model discriminate effectively among the different types of waste.
The ResNet-101 baseline model achieves competitive results (accuracy = 0.92, F1-score = 0.91), showcasing the power of deep residual networks for capturing features. However, this model lacks a dedicated attention mechanism, so it does not explicitly emphasize the informative regions, especially in visually complex scenes.
EfficientNet-B0 and MobileNet-V3 perform moderately as a result of their efficiency-accuracy trade-off. These models are highly efficient but have relatively lower recall and F1-scores, suggesting that they are less sensitive to subtle variations when distinguishing between waste types.
The LSTM model consistently performs the worst across all metrics, suggesting that sequence-based networks are not well suited to spatially rich image classification.
In summary, the strong performance of the proposed ResNet-CBAM model may be attributed to the use of channel and spatial attention mechanisms for feature refinement, which improve feature discrimination in the presence of inter-class similarity and intra-class variance. These findings indicate that attention-based feature refinement leads to performance gains over conventional deep learning models in the context of MSW classification.
4.3. Training Dynamics and Convergence Behavior
We examined the convergence stability, generalization ability and convergence efficiency of the proposed ResNet-CBAM model over 25 training epochs to understand training dynamics. Figure 3 illustrates the training and validation accuracy and loss curves.
As Figure 3 shows, the model exhibits fast convergence in the early stages of training. The training accuracy rises rapidly during the first five epochs, from around 50% to more than 85%, suggesting successful initial feature extraction. Similarly, the validation accuracy increases rapidly, remaining close to the training curve.
Once the training process reaches the 10th epoch, the training and validation accuracy curves start to plateau at around 92%-94%. The narrow separation between the two curves indicates a balance between model bias and variance, and no significant overfitting.
This is also reflected in the loss curves. Training and validation losses drop steadily with each epoch, showing significant improvement initially and then leveling off towards the end of training. There are slight variations in the validation loss in the middle epochs, which can be attributed to the noise introduced by mini-batch training and data augmentation. However, these fluctuations do not result in a gap between the training and validation curves.
Crucially, the absence of a noticeable discrepancy between training and validation loss suggests robust generalization. The convergence of the two curves at a low loss value indicates that the model learns discriminative features from the data without overfitting the training set.
In summary, the convergence behavior indicates that the combined use of CBAM attention mechanisms enhances, rather than impairs, the optimization process. The observed performance is consistent between the training and validation data, suggesting that the model is suitable for real-world challenges in complex MSW classification.
4.4. Graphical Analysis
To gain a fuller picture of classification performance, we considered several graphical and statistical measures, including Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves, and agreement measures such as Cohen’s Kappa, Matthews Correlation Coefficient (MCC) and Jaccard Index. These metrics offer distinct perspectives on class separability, the effects of class imbalance and the general consistency of prediction.
4.4.1. Receiver Operating Characteristic (ROC) Analysis
The ROC curves for the proposed ResNet-CBAM model, shown in Figure 4, provide class-level results. The ROC analysis measures the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR) at different classification thresholds, and the Area Under the Curve (AUC) provides an overall measure of the model’s discriminative performance.
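A minimal sketch of class-level AUC is given below. It assumes a one-vs-rest treatment of each class and uses the rank interpretation of AUC (the probability that a positive sample scores higher than a negative one, which equals the area under the TPR versus FPR curve). The scores and labels are toy values, and the function names are illustrative.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs, y_true, num_classes):
    """Per-class AUC, treating class k as positive and all others as negative."""
    aucs = []
    for k in range(num_classes):
        scores = [row[k] for row in probs]
        labels = [1 if y == k else 0 for y in y_true]
        aucs.append(roc_auc(scores, labels))
    return aucs


# Toy predicted probabilities for six samples and three classes.
y_true = [0, 0, 1, 1, 2, 2]
probs = [
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.20, 0.60, 0.20],
    [0.45, 0.30, 0.25],
    [0.10, 0.20, 0.70],
    [0.30, 0.30, 0.40],
]
aucs = one_vs_rest_auc(probs, y_true, num_classes=3)
print(aucs)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for illustration; a sort-based implementation would be preferable for large test sets.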
Our model exhibits high AUC scores across all classes, suggesting good separability. Food waste, leaf waste and metal cans show close-to-optimal separability, while paper waste and plastic bottles show a small degree of feature overlap (lower separability), likely due to their visual similarities.
The clustering of the curves in the top-left corner of the ROC space shows high sensitivity and a low false-positive rate for the majority of categories, indicating precise classification across a range of thresholds.
4.4.2. Precision-Recall (PR) Analysis
The class-wise Precision-Recall curves presented in Figure 5 are useful in situations where data are imbalanced. In PR analysis, the precision (positive predictive value) and recall (sensitivity) are of particular interest.
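The area under a PR curve is commonly summarized as average precision (AP). The sketch below computes AP for one class as the mean of the precision values at the rank of each positive sample. This is one common definition and an assumption about how AP was obtained; tied scores are ordered by input position, which is a simplification. The data are toy values.

```python
def average_precision(scores, labels):
    """Average precision for one class.

    Samples are ranked by descending score; AP is the mean of the precision
    values measured at the rank of each positive sample.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos


# Toy scores for one class treated as positive (1) against the rest (0).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
ap = average_precision(toy_scores, toy_labels)
print(f"AP = {ap:.4f}")
```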
The proposed model achieves average precision (AP) values above 0.92 for most classes, with several approaching 1.0. Leaf waste and food waste exhibit near-perfect precision-recall relationships, whereas the lower AP scores for wood waste and plastic bottles suggest somewhat harder classification tasks due to greater class overlap.
More significantly, the consistently high precision values at higher recall levels indicate robust classification even when sensitivity is prioritized. This is likely due to the attention mechanism’s ability to select discriminative features.
4.4.3. Agreement and Overlap Metrics
In Figure 6, we report agreement and overlap metrics, namely the Jaccard Index (JI), Matthews Correlation Coefficient (MCC) and Cohen’s Kappa.
Our model has a macro Jaccard Index of 0.8911, showing a large overlap between predicted and ground truth labels. The MCC of 0.9146, which uses all components of the confusion matrix, indicates a high correlation between predicted and true labels. Furthermore, the Cohen’s Kappa of 0.9142 indicates a high level of agreement beyond chance.
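For reference, the three measures can be derived from a confusion matrix as in the sketch below. It uses the standard multi-class definitions (chance-corrected agreement for Cohen's Kappa, the multi-class generalization of MCC, and the per-class intersection over union averaged across classes). The matrix is toy data and the function name is illustrative.

```python
import math


def agreement_metrics(cm):
    """Cohen's Kappa, multi-class MCC and macro Jaccard from a confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    true_counts = [sum(cm[k]) for k in range(n)]                       # row sums
    pred_counts = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # column sums

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    p_observed = correct / total
    p_expected = sum(t * p for t, p in zip(true_counts, pred_counts)) / total**2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    # Multi-class MCC.
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total**2 - sum(p * p for p in pred_counts))
        * (total**2 - sum(t * t for t in true_counts))
    )
    mcc = numerator / denominator if denominator else 0.0

    # Macro Jaccard: TP / (TP + FP + FN) per class, then the unweighted mean.
    jaccards = []
    for k in range(n):
        tp = cm[k][k]
        union = true_counts[k] + pred_counts[k] - tp
        jaccards.append(tp / union if union else 0.0)
    return kappa, mcc, sum(jaccards) / n


# Toy 3-class confusion matrix (illustrative values only).
toy_cm = [
    [9, 1, 0],
    [2, 6, 2],
    [0, 1, 9],
]
kappa, mcc, macro_jaccard = agreement_metrics(toy_cm)
print(f"kappa={kappa:.4f}  mcc={mcc:.4f}  jaccard={macro_jaccard:.4f}")
```

Because the Jaccard Index penalizes both false positives and false negatives in the denominator, it is typically lower than accuracy or F1 for the same predictions.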
These metrics indicate that the model’s performance is balanced between minority and majority classes, suggesting that the model is not strongly affected by class imbalance.
In summary, the visual assessment shows that the proposed ResNet-CBAM architecture exhibits effective discriminative ability, precision/recall balance, and inter-class agreement, making it suitable for multi-class MSW classification.
4.5. Error Analysis
A class-wise confusion matrix (Figure 7) was used to perform a more in-depth error analysis of the proposed ResNet-CBAM model and to better understand its classification performance. The confusion matrix offers a detailed breakdown of prediction results by showing the distribution of actual and predicted class labels, allowing us to detect potential misclassification patterns.
As can be seen, a large proportion of samples lie on the diagonal and are therefore predicted correctly, showing good overall predictive accuracy. Some classes, such as biodegradable leaf waste (389 correct predictions), biodegradable wood waste (57), and non-biodegradable metal cans (68), are nearly perfectly classified, as shown by the few off-diagonal entries. This indicates that these classes have visual characteristics that are successfully modeled.
However, there are some misclassifications, which occur primarily between visually similar categories. For example, errors between biodegradable food waste and paper waste or wood waste could be due to similarities in texture, color and organic matter under different environmental conditions. Likewise, confusion is seen between plastic bottles and paper bags and between plastic bags and paper bags, where shape deformation, transparency and brightness variations may limit the ability to discriminate between these classes.
A striking observation is that plastic bottles are misclassified as paper waste 16 times, suggesting that reflective or crushed plastic waste may be identified as paper. Further, there is some confusion between e-waste and wood or paper, which may indicate that complex textures or occlusions affect feature extraction.
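Misclassification patterns of this kind can be extracted programmatically by ranking the off-diagonal entries of the confusion matrix. The sketch below is illustrative: the class names are shortened and the counts are toy values, not the entries of Figure 7.

```python
def top_confusions(cm, class_names, k=3):
    """Return the k largest off-diagonal entries as (count, true, predicted)."""
    pairs = []
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((count, class_names[i], class_names[j]))
    # Sort by count, largest first; ties keep their original order.
    pairs.sort(key=lambda item: item[0], reverse=True)
    return pairs[:k]


# Toy confusion matrix: rows are true classes, columns are predictions.
names = ["paper", "plastic", "food"]
toy_cm = [
    [50, 3, 2],
    [6, 40, 1],
    [4, 0, 45],
]
worst = top_confusions(toy_cm, names, k=2)
for count, true_name, pred_name in worst:
    print(f"{true_name} predicted as {pred_name}: {count}")
```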
Although the misclassifications are spread across several classes, their number is small and they have limited effect on the global performance scores. Furthermore, the types of errors observed are typical of real-world challenges when classifying waste, including class overlap and intra-class variability.
These results show that the enhanced architecture of the proposed model improves feature discrimination, but some visually complex examples remain challenging. Potential enhancements may include the use of multi-feature representations, higher-resolution images or targeted data augmentation techniques to minimise class overlap.
In conclusion, confusion matrix analysis shows that the model has good per-class performance, with clear interpretability of remaining challenges in classification.
4.6. Quantitative Comparison Between Baseline and Proposed Framework
To evaluate the performance of the proposed ResNet-CBAM model, a comparative analysis was performed against four baseline models (ResNet-101, EfficientNet-B0, MobileNet-V3 and LSTM). The assessment includes traditional classification metrics (Accuracy, Precision, Recall and F1-score) and additional statistical measures (Jaccard Index, Cohen’s Kappa and Matthews Correlation Coefficient (MCC)) to examine not only the accuracy of the models but also their robustness and generalisability.
4.6.1. Comparison Using Standard Performance Metrics
The results of the standard model performance metrics are shown in Figure 8. The proposed ResNet-CBAM model has the best performance across all the metrics, achieving Accuracy = 0.93, Precision = 0.91, Recall = 0.93, and F1-score = 0.92. The closely matched precision and recall indicate that the model is both reliable in its positive predictions and highly sensitive.
The baseline model ResNet-101 also performs well, with reasonably high results. This suggests the power of deep residual learning for multi-scale feature learning. But the lack of attention mechanisms hampers its ability to concentrate on relevant areas, especially in complex visual contexts.
Modest results from EfficientNet-B0 and MobileNet-V3 reflect efficiency-performance trade-offs. Though these models are suitable for lightweight applications, their reduced recall and F1-scores imply less responsiveness to fine-scale details that discriminate between different types of waste.
The lowest metrics were obtained with the LSTM-based model, which suggests that recurrent models may not be well adapted to tasks with intricate spatial patterns in the image.
4.6.2. Comparison Using Advanced Evaluation Metrics
The comparison of the advanced statistical metrics (Jaccard Index, Cohen’s Kappa and MCC) is shown in Figure 9. These measures offer further information on the classification performance, agreement between classes, and overall predictor consistency.
The proposed model outperforms all others in terms of these three metrics, achieving Jaccard Index = 0.84, Cohen’s Kappa = 0.92, and MCC = 0.91. The high values show high agreement between actual and predicted labels, and consistency across multiple classes.
ResNet-101 exhibits slightly weaker but stable performance, while EfficientNet-B0 and MobileNet-V3 show moderate agreement. The LSTM model displays the lowest performance once again, reinforcing that this approach is poorly suited to image classification.
The high Cohen’s Kappa and MCC scores of the proposed model indicate that its predictions reflect learned information rather than class imbalance or chance agreement. This is likely due to the channel and spatial attention blocks, which help the model prioritise valuable features while filtering out irrelevant details.
Taken together, the findings suggest that the proposed ResNet-CBAM model offers better classification performance and stability than traditional models, especially under the conditions of class imbalance and high visual similarity.
4.7. Ablation Study
The effect of the components of the proposed approach was evaluated through an ablation study. It assesses the benefits of attention mechanisms on the classification task by incrementally adding channel and spatial attention to a ResNet-50 backbone. Four models were evaluated:
ResNet-50 (Baseline): Original residual network without attention.
ResNet + Channel Attention: Adds channel attention to highlight relevant channels.
ResNet + Spatial Attention: Employs spatial attention to focus on informative locations in the image.
ResNet-CBAM (Ours): Utilises channel and spatial attention together in a single module.
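For reference, the four variants can be related through the standard CBAM formulation, in which a feature map F is refined first by a channel attention map and then by a spatial attention map. The equations below follow the original CBAM design; the 7x7 convolution kernel and the shared MLP are properties of that design, and whether the present implementation uses the same kernel size and reduction ratio is an assumption.

```latex
% Sequential refinement of a feature map F (\otimes is element-wise multiplication)
F'  = M_c(F) \otimes F, \qquad
F'' = M_s(F') \otimes F'

% Channel attention: shared MLP over average- and max-pooled channel descriptors
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)

% Spatial attention: 7x7 convolution over channel-wise average and max maps
M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)])\big)
```

Under this reading, the channel-only variant applies only the first refinement step, the spatial-only variant applies M_s directly to F, and the baseline applies neither.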
We start with a ResNet-50 model as the baseline network, which shows good performance (see Table 7) and confirms that residual learning is effective for feature extraction. Adding channel attention improves all metrics, suggesting better feature channel selection.
Likewise, spatial attention also leads to improved performance, as it allows the model to concentrate on informative regions while ignoring background noise. This is especially advantageous in waste classification, where cluttered backgrounds can interfere with recognition.
The use of the CBAM attention mechanism provides the strongest gains, leading to 2.6% higher accuracy compared to the baseline. This improvement indicates that the attention mechanisms used for channel and spatial features are complementary, and collectively enhance feature representation and feature localization.
In summary, the ablation study shows that attention mechanisms play a positive role in improving model performance. The use of CBAM boosts discriminative feature learning and increases robustness particularly in cases where classes are similar and there exists visual variance.
4.8. Explainability and Feature Visualization Using Grad-CAM
To enhance the explainability of the ResNet-CBAM model, Gradient-weighted Class Activation Mapping (Grad-CAM) is used to show the areas of the model’s attention. Grad-CAM produces class-specific saliency maps that highlight the most important regions in the image for a given prediction, by calculating gradients flowing into the final convolutional layer before the global average pooling operation.
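The computation described above can be written compactly using the standard Grad-CAM definition. Here A^k denotes the k-th feature map of the chosen convolutional layer, y^c the pre-softmax score for class c, and Z the number of spatial positions in the feature map; the symbols are introduced for illustration.

```latex
% Importance weight of feature map k for class c:
% global average of the gradients of the class score
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}}

% Class-specific localization map: weighted sum of feature maps,
% keeping only positive evidence for class c
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big( \sum_{k} \alpha_k^c A^k \Big)
```

The resulting map has the spatial resolution of the chosen layer and is usually upsampled to the input size before being overlaid on the image.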
Figure 10 shows Grad-CAM results for two examples of biodegradable paper waste and biodegradable leaf waste. In the figures, the input image is displayed along with its corresponding heat map, with warmer regions representing areas that are more important for the prediction.
In the case of biodegradable paper waste, the heatmap mostly highlights text, folded parts, and the boundaries of the paper. These regions represent texture boundaries and patterns that differentiate paper from visually similar non-paper objects (e.g. plastic). However, some diffuse activation is also seen in the remaining parts of the image, which might imply that global context also partially contributes to the decision.
For the biodegradable leaf waste, the model highlights veins, irregularities and discolourations. These characteristics are potentially related to biological features of organic waste. The activation is mainly restricted to the object region, which suggests strong spatial localization.
In summary, the visualizations indicate that the model focuses on relevant, class-specific features rather than relying solely on background information. This behaviour is consistent with the combined use of channel and spatial attention mechanisms within the CBAM module, which help the model to focus on discriminative regions of an object while suppressing the impact of the background.
Although the Grad-CAM visualizations offer a qualitative understanding, it is important to keep in mind that these visualizations are only approximate and may not reflect the model’s decision-making entirely. However, they provide additional evidence that our architecture learns useful and meaningful representations for the task of waste classification.
These results support the interpretability of the proposed approach, which is key to the success of future applications in real-life AI-driven waste classification systems.
4.9. Comparative Analysis and Discussion
The comparative analysis in Table 8 summarizes recent studies in waste classification, showing the variation in datasets, model approaches and performance. It is important to note that, although the studies address related tasks, the differences in dataset size, number of classes, class distribution and evaluation metrics limit any direct comparison.
Models based on traditional machine learning algorithms, as used by Alsabt et al. (2024), provide reasonably high accuracy (85%) but are limited in their ability to learn complicated spatial features in image data. Combined approaches incorporating deep learning and traditional classifiers, such as ResNet50 with an SVM (Qiao, 2024), enhance feature learning but do not perform attention-aware feature selection.
Deep convolutional networks (DenseNet169; Zhang et al., 2021) and lightweight CNN variants (Nnamoko et al., 2022) showcase the power of hierarchical feature representation learning but may struggle with intra-class variability and inter-class similarity issues. Likewise, practical systems such as that of Yi and Kim (2024), designed with a focus on application, may have lower accuracy.
The ResNet-CBAM model proposed in this work reaches an accuracy of 93.09% on the Kaggle Waste Segregation dataset. Although this accuracy is higher than that of the other selected studies, two studies used different set-ups from this one, so the comparison should be read as indicative. The higher accuracy may be explained by the use of the attention module, which enables the model to place emphasis on relevant features while ignoring irrelevant background detail.
In conclusion, this comparative study sheds light on how attention-based frameworks can be used to enhance waste classification tasks using deep learning. However, future research should incorporate cross-dataset validation and benchmarking to allow for more reliable and fair comparisons between different models.