1. Introduction
Colorectal cancer (also known as colon or rectal cancer) [1] is one of the most common causes of cancer-related death worldwide, and detecting gastrointestinal (GI) polyps plays an important role in its prevention [
2,
3]. With the rapid advancement of Artificial Intelligence (AI) and neural networks, the application of these technologies in the medical field is becoming increasingly promising.
Certain GI polyps, particularly adenomatous and serrated types, carry a risk of progressing to cancer if not identified and removed in time. Regular screening and timely intervention are essential to reduce the likelihood of malignant transformation [
4].
Therefore, integrating AI into diagnostic processes can provide significant support for physicians, enabling them to identify errors more quickly and accurately. The goal is not to replace human expertise but to complement it by providing specialists with modern tools that enhance their work and reduce human error rates [
5].
Compared to classical Machine Learning (ML) methods [
6,
7,
8], which perform well with structured and diverse datasets, Deep Learning (DL) architectures [
9], like artificial neural networks, are themselves biomimetic in nature, modeled after the human brain’s layered processing, making them powerful for interpreting biological data in domains such as medical image diagnostics or facial recognition [
10].
Artificial Neural Networks (ANNs) [
11] are an important subset of ML algorithms, inspired by the structure and functioning of the human brain [
12]. They consist of layers of artificial neurons that process data through weighted connections, allowing the system to learn and generalize from complex data.
Convolutional Neural Networks (CNNs) [
13] are a specialized category of artificial neural networks, primarily used in image processing and automatic visual pattern recognition. These networks are inspired by how the human brain processes visual information, enabling them to learn spatial and hierarchical features from raw images [
14,
15].
A CNN structure includes several types of layers: convolutional layers, activation layers (e.g., ReLU), pooling layers (e.g., max-pooling), followed by one or more fully connected layers. Convolutional layers apply filters (kernels) over images, automatically extracting important features such as edges, complex shapes, or text regions [
16].
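As an illustration of this layer structure, the following minimal Keras sketch stacks convolution, ReLU activation, max-pooling, and fully connected layers for an eight-class problem; the filter counts are illustrative and do not correspond to any specific model evaluated in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal CNN sketch: convolution + ReLU, max-pooling, then fully connected layers.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # an RGB endoscopic image
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolutional layer with ReLU activation
    layers.MaxPooling2D((2, 2)),                   # pooling layer (max-pooling)
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dense(8, activation="softmax"),         # one probability per class
])
model.summary()
```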
CNNs have rapidly evolved due to their ability to automatically extract visual features without human intervention. Initially, models like
Alex Convolutional Neural Network (AlexNet) [
15],
Visual Geometry Group Network (VGGNet) [
16], and
Inception Network by Google (GoogLeNet) [
17] were used for simple binary classification tasks.
Later, deeper networks such as
Residual Network (ResNet) [
18] and
Densely Connected Convolutional Network (DenseNet) [
19] significantly improved the classification [
20], detection, and segmentation of brain lesions, including gliomas, meningiomas, and other tumors, in multimodal Magnetic Resonance Imaging (MRI) scans such as T1-weighted, T2-weighted, Fluid Attenuated Inversion Recovery (FLAIR), and contrast-enhanced sequences.
Recent research has focused on hybrid neural architectures such as
Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) models, which combine spatial feature extraction with temporal pattern recognition [
13,
21,
22]. These models are particularly effective in medical imaging tasks where both spatial and sequential data characteristics are present. Transfer learning techniques have also gained traction, leveraging pre-trained models like VGGNet, ResNet, and Inception trained on large-scale datasets such as ImageNet (over 14 million images) [
23,
24] and adapting them to smaller medical datasets, often containing fewer than 10,000 annotated images. This approach has yielded improvements in classification accuracy, sensitivity, and specificity, especially in tasks such as brain tumor grading, breast cancer subtype identification, and lesion segmentation in MRI and histopathology images [
25].
In medical imaging, transparency in AI decisions is crucial. Explainable Artificial Intelligence (XAI) techniques such as
Gradient-weighted Class Activation Mapping (Grad-CAM), Local Interpretable Model-Agnostic Explanations (LIME), and
SHapley Additive exPlanations (SHAP) enable visualization of image regions that significantly influence model predictions, helping radiologists interpret how decisions are made [
26]. The study [
11] contributes to the evolving field of AI-driven medical image analysis by systematically comparing multiple CNN architectures, both classical and advanced, including
Capsule Network (CapsNet) and FastAI-based models. These architectures are applied to a specialized endoscopic image dataset. Studies highlight the applicability of Grad-CAM in localizing suspicious lesions in mammograms or Computed Tomography (CT) scans [
20], while LIME and SHAP have been successfully used in predicting COVID-19 severity from lung radiographs [
7].
Numerous studies have demonstrated the efficiency of CNNs in medical image analysis, especially in detecting and classifying anomalies in endoscopic procedures [
6]. For instance, a recent study [
27] introduced the Kvasir dataset and evaluated several automatic classification methods for GI images, providing an essential starting point for further research.
This research presents an integrative deep learning framework of advanced CNN architectures for the automatic classification of gastrointestinal lesions from endoscopic images. The proposed methodology integrates biomimetic modeling with Explainable Artificial Intelligence (XAI) techniques, including Grad-CAM, LIME, and SHAP.
The architectures evaluated in this work include classical models such as
AlexNet and
VGGNet, as well as more advanced frameworks like
GoogLeNet (Inception Network),
ResNet (Residual Network),
DenseNet (Densely Connected Convolutional Network), and
CapsNet (Capsule Network). These models were benchmarked using the Kvasir dataset [
27], and Transfer learning was employed to adapt large-scale pre-trained models. The classes used represent a mix of anatomical landmarks, pathological findings, and procedural outcomes and are defined as follows:
Dyed-lifted-polyps: Images showing polyps that have been stained and elevated using submucosal injection, aiding in visual contrast and resection planning.
Dyed-resection-margins: Post-polypectomy images highlighting the margins of resected areas, stained to assess completeness of removal.
Esophagitis: Inflammatory lesions of the esophageal mucosa, often appearing as erythematous streaks or erosions near the Z-line.
Polyps: Unstained mucosal protrusions, typically benign growths that may serve as precursors to colorectal cancer.
Ulcerative-colitis: Chronic inflammatory changes in the colon, characterized by mucosal ulceration, granularity, and vascular pattern loss.
Normal-cecum: Anatomical landmark at the beginning of the large intestine, often used to confirm complete colonoscopy.
Normal-pylorus: The muscular opening between the stomach and duodenum, appearing as a round, symmetric structure in healthy individuals.
Normal-z-line: The gastroesophageal junction, where the squamous epithelium of the esophagus transitions to the columnar epithelium of the stomach.
The novelty of this study lies in its biomimetic approach to model selection, wherein biologically inspired design principles, such as hierarchical layer organization and local feature abstraction, are used to evaluate and compare advanced CNN architectures. The goal is to identify the most computationally efficient model for the automatic classification of gastrointestinal lesions in endoscopic images, as applied to the Kvasir dataset.
The main contributions of this study to the state-of-the-art in medical image classification are as follows:
Benchmarking of CNN Architectures: A set of CNN models, including classical architectures like
AlexNet and
VGGNet, and advanced designs such as
GoogLeNet (Inception),
ResNet,
DenseNet, and
CapsNet were tested on the Kvasir dataset [
27], composed of real-world endoscopic images, to assess their effectiveness in gastrointestinal lesion categorization.
Optimizations: DenseNet121 and ResNet50 were fine-tuned using Transfer learning and dynamic class weighting, while CapsNet was enhanced with attention mechanisms to improve feature localization and reduce overfitting, especially in classes with limited samples. The choice of ResNet50 itself represents a contribution of this study, guided by both hyperparameter tuning and empirical performance metrics, including validation accuracy and loss behavior across multiple folds.
Biomimetic Model Selection: The study introduces a biomimetic framework for selecting CNN architectures, inspired by the hierarchical and layered processing of the human visual cortex [
14]. This approach guided the prioritisation of models that emulate biological feature abstraction, such as residual and capsule-based networks.
Explainability Integration: To improve performance on sparse and imbalanced medical datasets, this study combines Explainable Artificial Intelligence (XAI) techniques, namely Grad-CAM, LIME, and SHAP with Transfer learning from large-scale datasets such as ImageNet. The novelty lies in the task-specific adaptation of XAI methods to guide model and error analysis, enabling iterative feedback during training.
Benchmarking: Comparative analysis showed that ResNet50, DenseNet121, and MobileNetV2 offered the best trade-off between accuracy, inference speed, and generalisation. In contrast, deeper models such as NASNetLarge and EfficientNetB8 showed signs of overfitting and slower inference.
The remainder of the work is divided into five sections.
Section 1 provides a general introduction to the topic, explains why the subject was chosen, and presents the current context of AI in the medical field. It also describes general objectives, methodology, and dataset used.
Section 2 covers theoretical foundations of machine learning and neural networks, focusing on CNNs and Transfer learning, with details on the CNN architectures used, describes their selection process, and mentions models excluded after preliminary testing. It also explains the research methodology, dataset, experimental stages, infrastructure, and applied evaluation methods.
Section 3 presents accuracy and loss graphs for each tested CNN architecture. These graphs illustrate each model's behavior during training and validation.
Section 4 highlights observed limitations and includes case studies and real-world applications of CNNs in medical practice.
Section 5 concludes the paper with a summary of findings, practical significance of the research, and possible directions for future studies.
2. Materials and Methods
2.1. Dataset and Methodology
For this research, a public dataset called
Kvasir [
27] was used: a multi-class image dataset for computer-aided GI disease detection, containing 4,000 GI images divided into eight classes. It was created for research purposes to develop automatic disease detection algorithms for gastrointestinal conditions. The dataset was published by Simula Research Laboratory in collaboration with Vestre Viken Health Trust in Norway, a network of four hospitals that provides medical care to approximately 470,000 people [
28].
The dataset includes manually annotated and verified images from real diagnoses by endoscopy specialists. The images originate from actual diagnostic procedures and include a variety of anatomical landmarks, pathological findings, or endoscopic procedures observed in upper digestive tract examinations.
This dataset consists of two parts: one containing the images, grouped into eight classes:
Dyed-lifted-polyps,
Dyed-resection-margins,
Esophagitis,
Polyps,
Ulcerative-colitis,
Normal-cecum,
Normal-pylorus,
Normal-z-line. Each class contains 500 images of variable dimensions, captured under different lighting conditions and covering key categories such as polyps, ulcers, esophageal conditions, hemorrhages, and others. In addition to the raw images, the dataset includes a separate file containing extracted features for each image [
29].
Dataset Collection and Annotation. The images in the Kvasir dataset were collected under real clinical conditions at partnering gastroenterology centers. Each image is accompanied by expert-level annotations, performed in collaboration with certified endoscopy specialists and validated by the research team involved in this study [
All images were uniformly scaled during this study to 224 × 224 pixels (or 299 × 299 for Inception-based models), enabling standardised input dimensions for CNN-based analysis and preserving spatial integrity across architectures.
In
Figure 1, the graphical abstract presents a biologically inspired pipeline for deep learning-based classification of gastrointestinal lesions. The workflow begins with raw endoscopic image acquisition (Kvasir dataset), followed by a preprocessing stage including augmentation and normalization. Subsequently, CNN models such as ResNet50, DenseNet121, MobileNetV2, and others are trained and fine-tuned to perform multi-class classification of gastrointestinal pathologies. The predicted class labels (e.g., "dyed-lifted-polyps", "esophagitis", "ulcerative-colitis") are evaluated using a confusion matrix and a suite of performance metrics.
2.2. Pre-Trained Models in Endoscopic Imaging
Transfer learning is an essential technique in ML, allowing the reuse of knowledge acquired by a model trained on a large dataset (such as ImageNet) [
23,
24] to solve a new task, typically with a smaller dataset. In medical image classification, Transfer learning is particularly important due to the difficulty of collecting large and balanced datasets. These challenges are further compounded by small sample sizes and class imbalance, which can significantly affect the reliability of classification tasks.
The following pre-trained neural network architectures were examined during this study:
ResNet50 [
18],
DenseNet121 [
31],
MobileNetV2 [
32],
InceptionV3 [
33],
Xception [
34],
VGG16, VGG19 [
35],
SqueezeNet [
36], and FastAI’s
xResNet18 [
37].
Residual Network (ResNet) [
18] is a deep learning architecture that solves the problem of vanishing gradients through the use of
residual connections (or shortcut links). These connections allow data and gradients to flow more easily across layers, enabling effective training of very deep networks.
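The idea of a shortcut connection can be summarised in a few lines of Keras code; the block below is a generic identity residual block (filter counts and input shape are illustrative), not an excerpt from the ResNet50 implementation used here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Identity residual block: output = F(x) + x, where x bypasses the block."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])   # the shortcut lets gradients flow around the block
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)  # filters must match the input channels for the identity add
```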
ResNet50 [
18] was selected for this study after preliminary tuning and benchmarking trials. It is a specific implementation of the ResNet family, composed of
50 layers, including convolutional, pooling, and fully connected layers. The number
50 refers to the depth of the network. Deeper variants such as
ResNet18,
ResNet101 [
38] and
ResNet152 indicate the total layer count and demonstrate stronger feature representation but suffer from increased training time, susceptibility to overfitting, and resource constraints. Lighter models like
ResNet18 lacked sufficient feature depth for subtle texture discrimination. This architecture allows the model to learn complex features without losing performance on validation sets.
DenseNet121 [
31] connects each layer to all previous layers, ensuring efficient reuse of extracted features and better information propagation. The number
121 refers to the total number of layers, including convolutional, pooling, and fully connected components. This depth supports rich hierarchical representations [
14] while maintaining parameter efficiency due to its dense connectivity. Thus, DenseNet121 was chosen as an optimal architecture, capable of capturing subtle visual differences between gastrointestinal classes with balanced generalization and training performance.
MobileNetV2 [
32] is optimised for computational efficiency, using depthwise separable convolutions and inverted residual blocks. Unlike DenseNet, MobileNet does not encode its depth in the name; instead,
V2 simply refers to the
second major version of the MobileNet architecture [
39]. It introduces structural improvements over the original MobileNet, achieving better accuracy with lower latency. MobileNetV2 provides a strong balance between speed, low memory consumption, and accuracy, making it suitable for deployment on low-resource hardware such as mobile and embedded devices.
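The parameter savings of depthwise separable convolutions can be verified with a short sketch (input shape and filter counts are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(112, 112, 32))

# Standard 3x3 convolution producing 64 channels.
standard = layers.Conv2D(64, 3, padding="same")(inputs)

# Depthwise separable equivalent: per-channel filtering followed by 1x1 channel mixing.
depthwise = layers.DepthwiseConv2D(3, padding="same")(inputs)
pointwise = layers.Conv2D(64, 1, padding="same")(depthwise)

print(tf.keras.Model(inputs, standard).count_params())   # ~18,500 parameters
print(tf.keras.Model(inputs, pointwise).count_params())  # ~2,400 parameters
```

The roughly eightfold parameter reduction in this toy example is what makes MobileNet-style architectures attractive for low-resource deployment.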
2.3. Pre-trained Architectures and Transfer Learning
All the architectures used in this study were pre-trained on the
ImageNet dataset [
23,
24], which enabled faster and more efficient fine-tuning on the Kvasir dataset. This approach leverages learned representations from large-scale image classification tasks and is particularly valuable when working with smaller or imbalanced medical datasets.
The following CNN architectures were evaluated: InceptionV3 (GoogLeNet Inception version 3), VGG16/VGG19 (Visual Geometry Group networks with 16 and 19 layers), Xception (Extreme Inception), SqueezeNet (a lightweight CNN with fire modules), and FastAI xResNet18 (an extended residual network with 18 layers).
Among these, the most efficient and stable models, based on validation accuracy and reduced overfitting, were ResNet50, DenseNet121, and MobileNetV2.
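A minimal sketch of the transfer-learning setup, assuming an ImageNet-pretrained ResNet50 backbone whose convolutional layers are initially frozen while a new eight-class head is trained (layer choices such as the dropout rate are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                       input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained features during the first training phase

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(8, activation="softmax"),  # the eight Kvasir classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

After the new head converges, selected deeper layers of the backbone can be unfrozen for fine-tuning at a lower learning rate.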
2.4. Inception Architecture
The suffix V3 in InceptionV3 refers to the third major version of the Inception architecture, originally developed by Google. Earlier versions include:
InceptionV1 (GoogLeNet): Introduced parallel convolutions of varying sizes (1×1, 3×3, 5×5) and auxiliary classifiers to improve gradient flow.
InceptionV2: Replaced expensive 5×5 convolutions with stacked 3×3 layers, introduced batch normalization, and improved computational efficiency.
InceptionV3: Built upon V2 with additional optimisations such as
factorised 7×7 convolutions,
label smoothing, and
RMSprop optimisation (see the compile-time sketch after this list). It also included deeper modules and more efficient grid size reduction strategies, resulting in improved accuracy and reduced training cost [
33].
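Two of these refinements, label smoothing and RMSprop, are directly available in Keras, as the sketch below shows (the smoothing factor and learning rate are illustrative, not the values used in the original Inception work):

```python
import tensorflow as tf

loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)  # soften one-hot targets
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)

# InceptionV3 expects 299x299 inputs by default.
model = tf.keras.applications.InceptionV3(weights=None, classes=8)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
```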
The selection of deep learning architectures in this study was initially guided by bibliographic research [
15,
19,
40], and the selected models confirmed their generalization and stability during training on medical image classification tasks. Evaluated through extensive experimentation on the Kvasir dataset, ResNet50 achieved the highest validation accuracy (90%), followed by DenseNet121 (87.5%) and MobileNetV2 (86.5%).
To ensure comparability, all models were trained and validated using the same dataset split, image preprocessing pipeline, and evaluation metrics, including accuracy, precision, recall, and F1-score.
Additionally, the selected models demonstrated excellent compatibility with the development environment used (TensorFlow/Keras), enabling the easy implementation of components such as augmentation and metric monitoring. These architectures benefit from pre-trained versions on ImageNet [
23,
24], and the use of Transfer learning significantly reduced training time.
2.5. Preprocessing Techniques
During the preprocessing stage, images were normalized so that pixel values lie between 0 and 1 and were subjected to augmentation operations, including translations, zooming, rotations, and horizontal and vertical flips. These transformations are not merely visual tricks: they help models become more flexible and capable of recognizing structures even when they appear in positions or orientations different from those in the training set [
28].
The data were pre-processed through normalization, augmentation, and feature extraction to improve model generalization and robustness. For training, the Kvasir dataset was randomly split into three subsets: 70% for training, 15% for validation, and 15% for testing, ensuring that each class was equally represented across all splits. Multiple CNN models, including ResNet50, DenseNet121, MobileNetV2, InceptionV3, VGG16/VGG19, FastAI xResNet18, and Xception, were trained using the TensorFlow, Keras, and FastAI frameworks, leveraging state-of-the-art optimization techniques.
The performance of each model was compared, and the best-performing models were tested again to verify their final score.
Essentially, augmentation allows the model to see multiple variations of the same image, contributing to better learning and reducing the risk of overfitting on the dataset.
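A minimal Keras sketch of this preprocessing and augmentation pipeline, assuming the Kvasir images have been organised into class-named subfolders (the directory names below are illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] and apply the geometric augmentations described above.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.1,   # small translations
    height_shift_range=0.1,
    zoom_range=0.15,
    horizontal_flip=True,
    vertical_flip=True,
)
eval_gen = ImageDataGenerator(rescale=1.0 / 255)  # validation/test images are only rescaled

train_data = train_gen.flow_from_directory(
    "kvasir/train", target_size=(224, 224), batch_size=32, class_mode="categorical")
val_data = eval_gen.flow_from_directory(
    "kvasir/val", target_size=(224, 224), batch_size=32, class_mode="categorical")
```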
In addition to the architectures selected based on a bibliographic review of state-of-the-art CNN models for medical image classification, approximately ten other architectures were evaluated during preliminary experimentation conducted as part of this research. These trials allowed comparative benchmarking on the Kvasir dataset, enabling the identification of models with optimal performance in terms of accuracy, generalization, and computational efficiency. These included
NASNetLarge [
41,
42], EfficientNetB0-B8 [
43], ConvNeXt [
44], VGG11 [
35], InceptionResNetV2 [
45], AlexNet [
15], ResNet101 [
38], DenseNet201 [
46], Xception [
34] with extensive augmentation, and several
custom CNNs [
25]. These models were ultimately excluded due to severe overfitting, reflected in significant discrepancies between training and validation accuracy.
Certain architectures, such as NASNetLarge and EfficientNetB7, performed well on the training set but significantly worse on the validation set, indicating poor generalization to new data (overfitting). Other networks, such as ConvNeXt or DenseNet201, were too complex and required excessive time and memory for training.
3. Experimental Results and Performance Analysis
3.1. Experimental Environment
All experiments were conducted on a Lenovo IdeaPad Gaming 3 15IAH7 laptop running Windows 11 (64-bit). The system configuration included an Intel Core i5-12500H CPU (12 cores, up to 4.50 GHz), 16 GB DDR4 RAM, and a dedicated NVIDIA GeForce RTX 3050 GPU with 4 GB GDDR6 VRAM. The GPU is based on the Ampere architecture, supporting CUDA cores and mixed-precision training via Tensor Cores. Deep learning libraries such as TensorFlow 2.13, Keras, and cuDNN were used with GPU acceleration enabled for all CNN models.
3.2. Model Training-Validation-Testing
Model Training Duration. Training time ranged from 2 to 4 hours per model, depending on architectural depth and internal complexity. Lightweight networks such as MobileNetV2 and SqueezeNet converged within 2 hours, while deeper architectures like DenseNet121, ResNet50, and Xception required up to 4 hours. These values reflect experiments conducted with optimised batch sizes, early stopping, and adaptive learning rate scheduling.
The development of automatic image classification methods followed experimental steps, where the dataset was split into training (70%), validation (15%), and testing (15%) sets. This distribution provides a good balance between model learning and performance evaluation.
In the first stage, the CNN model learns to identify characteristic patterns for each class in the dataset. The training set contains labeled images, and the learning algorithm gradually adjusts the model’s internal weights to reduce classification errors. Training is performed over multiple epochs (between 20 and 50 epochs), and performance is monitored using loss metrics and accuracy. After each training epoch, the model’s performance is evaluated on a validation set. This step checks whether the model generalizes well to new data and prevents overfitting.
As suggested in prior literature [
47,
48], adjustments to hyperparameters such as learning rate and batch size can improve model generalization. In the present study, these adjustments were performed iteratively based on the validation score and training dynamics, ensuring optimal convergence during experimentation.
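Early stopping, adaptive learning-rate scheduling, and the class weighting discussed earlier can be combined in a single training call, as sketched below; the sketch reuses the hypothetical model and data generators from the previous sketches, and the patience values and epoch budget are illustrative rather than the exact settings of this study.

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

# Class weights computed from the training labels to counteract class imbalance.
labels = train_data.classes  # one integer class index per training image
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels), y=labels)
class_weight = dict(enumerate(weights))

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]

history = model.fit(train_data,
                    validation_data=val_data,
                    epochs=50,
                    class_weight=class_weight,
                    callbacks=callbacks)
```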
The following figures display the accuracy and loss plots for each tested CNN architecture. These graphs show model behavior during training and validation. For the ResNet50 model, the accuracy reached 88% on the training data and 90% on the test data. The accuracy curve is plotted in
Figure 2a.
The plot in
Figure 2b shows the categorical cross-entropy loss values computed on the training set across successive epochs for the ResNet50 model. The downward trend reflects effective learning and progressive error minimization during model training.
For the DenseNet121 model, the accuracy obtained during training was 84% and on the validation set it reached 87%. This model provided a good balance between performance and computational efficiency. The evolution of the accuracy during training is presented in
Figure 2c.
The MobileNetV2 model achieved an accuracy of 86% on training data and 87% on validation data. Due to its small size and low resource consumption, it is suitable for deployments on devices with limited computing power.
Figure 2d illustrates the accuracy, and
Figure 2e shows the corresponding losses.
The FastAI xResNet18 model achieved an accuracy of 64% on training and 70% on validation data. Although FastAI offers automated hyperparameter adjustment, this model did not provide competitive results compared to the other models. Its accuracy plot is shown in
Figure 2f.
InceptionV3 provided an accuracy of 83% on the training set and 85% on the validation set. This model was distinguished by its ability to learn features at multiple scales, but the training time was longer.
Figure 3a and
Figure 3b show the accuracy and loss plots.
The accuracy obtained by VGG16 and VGG19 models was 60% on training and 68% on validation. These architectures have a simple structure, but the large number of parameters led to a pronounced overfitting tendency. Their plot is represented in
Figure 3c.
SqueezeNet had an accuracy of 57% on training and 63% on validation. Although extremely compact and fast, the model failed to capture essential features of medical images sufficiently well.
Figure 3d shows the accuracy plot.
The Xception model achieved an accuracy of 82% on training and 85% on validation. Due to its architecture based on convolution separation, it provided a good balance between speed and performance.
Figure 3e and
Figure 3f show the evolution of accuracy and loss.
In addition to the final accuracy, other important aspects of each network were also tracked, such as the stability of the curves over time, the difference between the maximum and minimum values during training, and the consistency between epochs. The most stable models, ResNet50 and DenseNet121, had smooth loss curves without major fluctuations, indicating steady learning. In contrast, networks such as VGG16 or SqueezeNet showed instabilities and a greater tendency towards overfitting.
3.3. Evaluation Methodology: Accuracy, Confusion Matrix and Inference Time
Model performance evaluation is conducted using multiple classification metrics to better understand the behaviour of the CNNs, such as ResNet50, DenseNet121, and MobileNetV2 on unseen data from the test set.
After training and validation, the final step is model testing. The model is evaluated on a test set composed of unseen images to simulate its performance in real scenarios. The following evaluation metrics were calculated during this stage: Accuracy, Confusion Matrix, Precision, Sensitivity (Recall), and F1-score for each class.
These metrics were chosen to identify different aspects of model performance for multi-class medical image classification, where class imbalance and clinical risk make single-metric evaluation insufficient.
Accuracy provides a general overview of correct classifications but can be misleading when classes are imbalanced. For example, if one polyp type is overrepresented, a high accuracy may mask poor performance on rare classes.
Confusion Matrix visualises misclassifications across all classes. It reveals patterns such as false positives or confusion between visually similar polyp types, guiding further refinement or reannotation.
Precision reflects the proportion of true positives among predicted positives for each class. In clinical settings, high precision is crucial to minimise false diagnoses.
Sensitivity (Recall) indicates the proportion of true positives detected among all actual instances of a class. This is especially important in medicine, where failing to detect a pathology (false negative) can be more dangerous than over-detection.
F1-score balances precision and recall. It is particularly valuable when the dataset is unbalanced or when both false positives and false negatives carry clinical risk.
While all metrics contribute to a comprehensive evaluation, Sensitivity and F1-score are often considered more critical in medical imaging tasks. This is because missing a lesion or polyp (false negative) may have direct implications for diagnosis and treatment outcomes.
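Given predicted and true class indices on the test set (here the hypothetical arrays y_pred and y_true), all of these metrics can be obtained from scikit-learn in a few lines:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# y_true: ground-truth class indices; y_pred: argmax of the softmax outputs.
print("Accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))                  # rows: true class, columns: predicted class
print(classification_report(y_true, y_pred, digits=3))   # per-class precision, recall, F1-score
```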
Model Comparison Inference. Once trained, most models demonstrated rapid inference performance. Specifically, MobileNetV2, DenseNet121, and ResNet50 consistently processed individual images in less than 1 second, even on consumer-grade hardware without TPU or multi-GPU setups. Heavier models such as EfficientNetB7, NASNetLarge, and InceptionResNetV2 exceeded 1 second per image, making them unsuitable for real-time deployment under constrained resources.
For each model, the time required to process a single image was measured. Inference averaged under 0.5 seconds per image, even on consumer-grade hardware, demonstrating the accessibility of these models on limited resources. As the confusion-matrix analysis below shows, the models achieve a high classification rate for some classes but frequent misclassifications between others.
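Per-image inference time was measured in the spirit of the following sketch, using a warm-up call so that one-time graph construction is not counted (the input here is a random placeholder rather than a real endoscopic image):

```python
import time
import numpy as np

image = np.random.rand(1, 224, 224, 3).astype("float32")  # placeholder for one preprocessed image
model.predict(image, verbose=0)                            # warm-up call

start = time.perf_counter()
model.predict(image, verbose=0)
print(f"Inference time: {time.perf_counter() - start:.3f} s per image")
```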
The
ResNet50 model obtained the highest accuracy on the validation set, followed by
MobileNetV2 and
DenseNet121, according to the findings in
Table 1. The models’ performance was compared and evaluated using confusion matrix and accuracy analyses. Bolded values indicate highest performance per column.
Although the training time was significant (between 2 and 4 hours per model), the results were stable, and the differences between training and validation accuracy were small in the case of well-optimised models. This indicates good generalisation capability.
Confusion Matrix Interpretation. To evaluate the performance of each model, the
confusion matrix was used. This matrix provides insights into the best and worst classified classes [
49,
50].
The ratio of correct predictions to the total number of tested examples provides an initial overview of model efficiency, making it easy to interpret. However, when dealing with
imbalanced classes, accuracy can be misleading; therefore, it must be complemented by other metrics [
47].
Figure 4.
Confusion matrix.
The values on the right-hand side of the figure are the colour-bar scale: they define the range used for the heatmap shading rather than the data themselves. Each cell in the confusion matrix represents a prediction count (e.g., the number of samples from class A that were predicted as class B), and the heatmap colours encode those counts. The negative tick value (-20) is therefore a plotting artifact of the colour-bar scaling, since confusion matrices do not contain negative values. As can be observed, ResNet50, the best-performing model in this analysis, reaches high classification rates for several classes while still confusing visually similar ones. Overlapping textures or colour profiles, especially between polyps and inflamed mucosa, may explain these misclassifications. In contrast, visually distinctive and well-represented classes, such as the normal classes or ulcerative-colitis, consistently showed good precision and recall.
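Pinning the heatmap range to the non-negative prediction counts avoids such misleading colour-bar ticks; the plotting sketch below (not the exact script used for Figure 4, with hypothetical y_true, y_pred, and class_names) illustrates the idea:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            vmin=0, vmax=cm.max(),            # colour bar starts at 0, so no negative ticks
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.tight_layout()
plt.show()
```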
4. Discussion
4.1. Improvements
To improve CNN model performance on gastrointestinal polyp classification from endoscopic images, the study implemented several targeted improvements:
Customised Data Augmentation: While augmentation is a standard technique, this study applies a task-specific augmentation pipeline, including rotation, scaling, flipping, and contrast adjustment—optimised for gastrointestinal polyp morphology. This approach reduces overfitting.
Dynamic Class Weighting: A novel weighting scheme was implemented based on the real-time class distribution during training, rather than static frequency-based weights. This improves learning stability across imbalanced classes.
Explainability: Grad-CAM was integrated both as a visualization tool and as a feedback mechanism during model refinement; this dual use helped identify misclassified regions and guided architectural adjustments (see the sketch after this list) [
51].
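A condensed Grad-CAM sketch for a Keras model is shown below; it assumes the named convolutional layer (here conv5_block3_out, the last convolutional block of ResNet50) is directly accessible on the model, so for a nested backbone it would be applied to the backbone sub-model instead.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, layer_name="conv5_block3_out"):
    """Return a coarse heatmap of the regions that most influenced the predicted class."""
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_idx = int(tf.argmax(preds[0]))
        class_score = preds[:, class_idx]
    grads = tape.gradient(class_score, conv_out)            # gradient of the class score w.r.t. feature maps
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)  # weighted sum of the feature maps
    cam = tf.maximum(cam, 0) / (tf.reduce_max(cam) + 1e-8)  # ReLU and normalisation to [0, 1]
    return cam.numpy()
```

The resulting low-resolution map is upsampled and overlaid on the original endoscopic image to highlight the regions driving the prediction.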
While VGG16, SqueezeNet, and NASNetLarge showed promising learning rates early in training, extensive augmentation and architectural regularisation were critical to overcoming their overfitting behavior.
4.2. Error Analysis
Errors during CNN training were classified into two major types:
Confusions between visually similar classes: The most frequent misclassifications involved visually similar classes, such as polyps and inflamed mucosa, which shared overlapping color gradients, mucosal textures, and ill-defined boundaries.
Class Imbalance Sensitivity: Further errors include inter-class confusion (e.g., misidentifying hyperplastic polyps as adenomatous ones) and incorrect localisation or attention to irrelevant image regions. Precision and recall were lower for classes with fewer representative samples, such as uncommon polyp subtypes, a consequence of bias introduced during training on imbalanced data.
These classification challenges were compounded by limitations in dataset diversity, inconsistent lighting, and the absence of pixel-level annotations. Solutions may include using advanced sampling strategies, synthetic data generation, or incorporating spatial attention mechanisms to focus learning on relevant image regions.
4.3. Limitations
Despite rigorous experimentation, the study encountered several constraints:
Dataset Size: The Ksavir dataset contains a relatively limited number of images for certain polyp subtypes, which can impair generalisation and lead to classifier bias.
Overfitting in Complex Models: Architectures like EfficientNetB8, InceptionResNetV2 and NASNetLarge demonstrated high variance between training and validation metrics, indicating overfitting. These models were excluded from the final evaluation.
Domain Limitation: The trained models performed well on Ksavir data, but may experience performance degradation when applied to endoscopic images from different institutions, due to lighting conditions, device variability, and annotation inconsistency.
Hardware Constraints: Training and tuning were conducted on an RTX 3050 GPU (4GB), which limited the ability to perform extensive hyperparameter tuning or ensemble testing across large architectures.
4.4. Proposals for Future Work (Integration with Segmentation, Multi-Label Classification)
This work focused exclusively on image classification, but the following directions for future development could be considered:
- Integration of a segmentation component: where the model not only classifies but also highlights the affected area in the image. This could be achieved with U-Net or Mask R-CNN type models, which would add significant value to the system.
- Multi-label classification: multiple features or lesion types may appear in some images, so future models should be trained to recognize multiple classes simultaneously in a single image.
- Creation of a graphical user interface (GUI): allowing the clinician to load an image and get instant prediction, with the option to visualize the probability of each class.
- Model Explainability (XAI): to increase the confidence of physicians, techniques such as Grad-CAM could be integrated to visually show which region in the image was the basis for the network decision.
- Use of Vision Transformer (ViT) models: future work could investigate the replacement of CNNs with ViTs, attention-based models that provide competitive results in medical imaging, especially in fine lesion detection [
52].
- Deployment of generative models (GANs): GANs can be used to generate synthetic medical images, increasing training sets and improving model generalization, especially in sparse or imbalanced classes [
53].
- Implementation of MLOps solutions for clinical deployment: future research should include the development of automated pipelines for the integration of models into a real medical system, with monitored and updateable versions.
- Multimodal applications: future research may include models that combine medical images with clinical textual or demographic data, increasing decision context and accuracy.
Future studies might research cross-hospital training using federated learning, semi-supervised learning to integrate unlabeled data, and diagnostic integration through explainable AI interfaces customized based on physician input.
5. Conclusions
In this study, a biomimetic DL-based image classification model was developed to automatically identify GI lesions from endoscopic images using CNNs.
Several deep learning models were compared for the automatic classification of gastrointestinal polyps in endoscopic images, with the aim of identifying the architecture that provides the best results in accuracy and generalization ability. State-of-the-art CNN models such as ResNet50 [
18], DenseNet121 [
31], MobileNetV2 [
32] were used for training. Other architectures were also tested, some of which returned worse results due to overfitting.
Results show that the ResNet50 model obtained the best accuracy on the validation set, followed by MobileNetV2 and DenseNet121. The models were evaluated using accuracy and confusion matrix analyses, and their performances were compared in the previous section.
In order to assist users such as physicians, in
Figure 6, an interface was also designed that allows the selection of one of the trained models mentioned above. The user can then load an image and visualize the class it is assigned to by the selected model.
Even though the training time was significant (between 2 and 4 hours for each model), some stable results were obtained, and the differences between the accuracy on the training and validation set were small for the well-optimized models, indicating good generalization.
In practice, the models performed well even on consumer-grade hardware, being tested on a Lenovo IdeaPad Gaming 3 laptop (RTX 3050, i5-12500H). The average inference time for an image was under one second, allowing these systems to be integrated into a real-time workflow without significant delays. It was observed that most classifications were correct, especially for classes well represented in the training data. Classes with large variations or few examples were harder to classify, indicating the need for additional data.
In conclusion, this research presents an efficient biomimetic AI-based pipeline for medical image classification, demonstrating strong potential for aiding colorectal cancer screening and improving diagnostics in clinical environments.