Submitted: 19 May 2025
Posted: 23 May 2025
Abstract
Keywords:
1. Introduction
2. Literature Review
2.1. Deep Learning in Skin Lesion Classification
3. Materials and Methods
3.1. Datasets
- Dermofit Image Library (Dermofit) – A dermoscopic image set from the University of Edinburgh containing 1,300 high-quality lesion images across 10 classes [28]. It includes a substantial number of SK images (as part of the benign keratosis class) along with other diagnoses (e.g., melanocytic nevus, basal cell carcinoma). All images are biopsy-proven and collected under standardized conditions with consistent image size. We used the SK and non-SK images from Dermofit, with the non-SK category comprising the other pigmented lesions.
- BCN20000 Dataset (Barcelona Dermoscopic) – A large dataset of 18,946 dermoscopic images from Hospital Clínic de Barcelona, spanning 8 diagnostic categories plus an out-of-distribution category [20]. This dataset, published as "Dermoscopic Lesions in the Wild," reflects more real-world conditions, with varied image quality and lesions in challenging locations (nails, mucosa, etc.) [20]. It contains a benign keratosis class that covers SK and related benign lesions. We included images labeled as seborrheic keratosis/benign keratosis (BKL) as positives and images from other classes (melanoma, nevus, etc.) as negatives for our binary classification.
- Buenos Aires Dermatology Dataset (BuenosAires) – A dataset of 2,652 clinical lesion images collected in Argentina, published in Scientific Data [21]. This dataset is noteworthy for its inclusion of diverse skin types and an under-represented (Latin American) patient population. It contains images of common skin tumors, including SK, and was intended for evaluating AI tools in that population [21]. We used it as an external source of clinical (non-dermoscopic) images to test model generalization. SK images from this dataset were included as positive examples, with other diagnoses (e.g., melanoma, basal cell carcinoma, nevus) as negative examples. A minimal sketch of how the three sources can be pooled into binary labels is shown after this list.
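The exact preprocessing pipeline is not reproduced here; the following is a minimal sketch, assuming hypothetical per-dataset metadata files and label vocabularies (file names, column names, and label strings are illustrative, not taken from the paper), of how images from the three sources could be pooled into a single SK vs. non-SK index.

```python
# Minimal pooling sketch: merge three per-dataset metadata files into one
# binary-labeled index (1 = SK, 0 = non-SK). All file/column names are assumed.
import pandas as pd

SOURCES = {
    "dermofit":     ("dermofit_labels.csv",     {"seborrhoeic keratosis"}),
    "bcn20000":     ("bcn20000_labels.csv",     {"BKL"}),
    "buenos_aires": ("buenos_aires_labels.csv", {"seborrheic keratosis"}),
}

frames = []
for name, (csv_path, sk_labels) in SOURCES.items():
    df = pd.read_csv(csv_path)  # expected columns: image, diagnosis
    df["label"] = df["diagnosis"].isin(sk_labels).astype(int)  # 1 = SK
    df["source"] = name
    frames.append(df[["image", "label", "source"]])

index = pd.concat(frames, ignore_index=True)
print(index.groupby("source")["label"].value_counts())  # class balance per source
```

A pooled index like this also makes it easy to hold out one source (e.g., the clinical BuenosAires images) as an external generalization test, as described above.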
3.2. Model Architectures and Implementation
- ResNet-34 – A 34-layer deep residual network as proposed by He et al. [25]. ResNets incorporate identity shortcut connections (residual links) that help train deeper networks by mitigating vanishing gradient issues. We chose ResNet-34 for its balance of depth and computational load; it has shown strong performance on image classification tasks with moderate complexity and was a top performer for overall AUC in our experiments.
- EfficientNet-B1 – A CNN architecture from Tan and Le (2019) that scales depth, width, and resolution in a balanced way [27]. EfficientNet-B1 is one of the smaller models in the family, with about 7.8 million parameters. We selected EfficientNet-B1 for its high accuracy per parameter; these models achieved state-of-the-art results on ImageNet with far fewer parameters than traditional CNNs. We hypothesized that its superior feature extraction might yield high specificity, as was indeed observed.
- VGG16 – A 16-layer CNN by Simonyan and Zisserman (2015) known for its simplicity (stacked convolutional and pooling layers) [26]. VGG16 served as a representative of older, high-capacity networks. It has over 138 million parameters and a high learning capacity but can overfit on small datasets. In our study, VGG16 exhibited the highest sensitivity (recall for SK), perhaps because its large capacity captured subtle SK features, though at the expense of more false positives.
- Vision Transformer (ViT) – A transformer-based image classification model introduced by Dosovitskiy et al. (2021) [24]. We used a ViT model with a Base configuration (12 transformer layers, 768-dimensional embeddings, 12 attention heads) pre-trained on ImageNet. The ViT splits an image into fixed-size patches and processes them with a pure transformer encoder, relying on self-attention to model global relationships. ViTs require large training datasets to generalize well; to make the model effective on our data, we applied extensive augmentation and fine-tuning. We also incorporated improvements from recent literature (such as optimization tweaks and early stopping), which led to dramatically improved performance compared to a naive ViT training attempt. The ViT model in its improved form achieved the highest accuracy in our experiments. A minimal sketch of how these four backbones can be loaded and re-headed for the binary task follows this list.
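The sketch below is not the authors' training code; it only illustrates, under stated assumptions (torchvision backbones with ImageNet weights, and the ViT-B/16 patch size, which the text above does not specify), how the four architectures could be instantiated with a two-class head for SK vs. non-SK fine-tuning.

```python
# Minimal model-construction sketch (assumed torchvision backbones and weights).
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2  # SK vs. non-SK

def build(name: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and replace its classifier head."""
    if name == "resnet34":
        m = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
        m.fc = nn.Linear(m.fc.in_features, NUM_CLASSES)
    elif name == "efficientnet_b1":
        m = models.efficientnet_b1(weights=models.EfficientNet_B1_Weights.IMAGENET1K_V1)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, NUM_CLASSES)
    elif name == "vgg16":
        m = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        m.classifier[6] = nn.Linear(m.classifier[6].in_features, NUM_CLASSES)
    elif name == "vit_b_16":  # patch size 16 is an assumption for this sketch
        m = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
        m.heads.head = nn.Linear(m.heads.head.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unknown architecture: {name}")
    return m

backbones = {n: build(n) for n in ("resnet34", "efficientnet_b1", "vgg16", "vit_b_16")}
```

Each model can then be fine-tuned with the same augmented training loader, which keeps the comparison across architectures as controlled as possible.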
4. Results
- ResNet-34: This model had the highest AUC (0.9742) among the original CNNs, indicating excellent overall discrimination ability [29]. Its accuracy was 91.58%, and it achieved a high specificity of 94.55%. In practice, ResNet-34 was very effective at correctly identifying non-SK lesions (few false positives), which is valuable for avoiding misclassification of other lesions as SK. However, the sensitivity was 0.6287, meaning it only detected about 62.9% of true SK cases. This relatively low sensitivity suggests that ResNet-34 missed a significant fraction of SK lesions (false negatives), an area for improvement since missing an actual melanoma (if misdiagnosed as SK) is a serious concern. The strong specificity and AUC indicate that ResNet-34 learned features that separate classes well on average, but it struggled with some SK examples, possibly those that looked atypical or very similar to other lesion types.
- EfficientNet-B1: EfficientNet-B1 achieved the highest validation accuracy during training (94.41% on the test set, matching its validation) and the highest specificity of all models (98.49%). Its performance profile shows it was extremely conservative in labeling lesions as SK, resulting in very few false positives. This model is thus particularly useful for ruling out SK; in a clinical scenario, it would seldom mislabel a benign lesion as SK (which could lead to unnecessary concern or biopsy). However, the sensitivity was only 0.4388, the lowest among the models. In fact, EfficientNet-B1 missed more than half of the SK cases. Its AUC was also somewhat lower (0.7118 in the initial validation result, though in our test with improvements it was higher at 0.9718, as shown in Table 2), indicating that the issue was more one of threshold choice than of underlying ranking (see the metric sketch after this list). The pattern here is a high-threshold model that prioritized specificity at the expense of sensitivity [30]. This behavior might be attributable to the training process: EfficientNet, being very accurate overall, may have latched onto features that distinguish obvious non-SK lesions and become overconfident in classifying borderline cases as non-SK. While the high accuracy and specificity are encouraging, the low sensitivity would be problematic for a standalone diagnostic tool because it could miss actual SK lesions (or, in a broader sense, could miss malignant lesions if the misclassification occurred in the reverse direction).
- Vision Transformer (ViT): The ViT model, after applying our improvements, showed a dramatically different performance compared to the initial baseline ViT (which, in earlier experiments without those improvements, had only around 68.63% accuracy [16]). The improved ViT achieved 97.28% accuracy and an AUC of approximately 0.99, indicating near-perfect discrimination. Notably, the ViT balanced both sensitivity and specificity: sensitivity was about 0.95 and specificity about 0.98. This means the transformer correctly identified 95% of SK cases and 98% of non-SK cases. Both false negatives and false positives were minimal. This result is a significant finding of our study: it highlights that transformer-based models, when properly fine-tuned even on a moderate-sized dataset, can outperform traditional CNNs in this domain. The ViT’s high sensitivity is particularly important; it implies the model rarely misses SK lesions. In a context where SK is benign but melanoma is deadly, a high sensitivity ensures that lesions that could be melanoma (and not SK) are not falsely dismissed. The ViT also maintained very high specificity, so it did not over-call SK either. We attribute the ViT’s success to its ability to capture global texture patterns and contextual details of lesions that CNNs might overlook. SKs often have a characteristic “stuck on” appearance with keratinous surface texture that might span large portions of the image; a transformer can integrate information across the entire lesion region via self-attention. Furthermore, our use of extensive data augmentation likely helped the ViT generalize better given the limited SK examples available.
- VGG16: This model performed the worst overall, with 73.01% accuracy and an AUC of 0.7761. Despite being a deep network, VGG16 appears to have overfit the training data (it achieved high training accuracy but much lower validation/test accuracy) and did not generalize well. Interestingly, VGG16 had the highest sensitivity (0.7172) among the three CNNs, meaning it caught about 71.7% of SK cases, outperforming ResNet and EfficientNet in that regard. This suggests VGG16 was somewhat more "aggressive" in labeling SK (perhaps due to overfitting on SK features during training), which led to many false positives, as reflected by its low specificity of 0.7311. In other words, it often predicted lesions to be SK even when they were not, yielding many misclassifications in the non-SK group. In a clinical context, VGG16's behavior would result in numerous benign lesions or other lesion types being incorrectly flagged as SK, which could be acceptable since SK is benign; in terms of melanoma triage, however, those false SK calls could include melanomas wrongly labeled as SK, a dangerous error. The relatively poor performance of VGG16 in our study highlights the importance of modern architectures and regularization; its large number of parameters and lack of batch normalization (in the original VGG design) likely made it less effective given our dataset size and diversity.
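To make the reported quantities concrete, the following is a minimal sketch (variable names are illustrative) of how accuracy, AUC, sensitivity, and specificity can be computed from a model's predicted SK probabilities, and of the threshold effect discussed for EfficientNet-B1: moving the decision threshold trades sensitivity against specificity while leaving the AUC, which depends only on the ranking of scores, unchanged.

```python
# Minimal metric/threshold sketch using scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def summarize(y_true: np.ndarray, p_sk: np.ndarray, threshold: float = 0.5) -> dict:
    """Summarize binary SK classification at a given decision threshold."""
    y_pred = (p_sk >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "threshold":   threshold,
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "auc":         roc_auc_score(y_true, p_sk),  # threshold-independent
        "sensitivity": tp / (tp + fn),                # recall for SK
        "specificity": tn / (tn + fp),
    }

# Example usage with hypothetical arrays of ground-truth labels and SK probabilities:
# for t in (0.3, 0.5, 0.7):
#     print(summarize(y_true, p_sk, threshold=t))
# Lower thresholds raise sensitivity at the cost of specificity; AUC is unchanged.
```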
4.1. Analysis of Results in Context
5. Discussion
5.1. Comparison with Existing Work
5.2. Clinical Implications
5.3. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Kondejkar, T.; Al-Heejawi, S.M.A.; Breggia, A.; Ahmad, B.; Christman, R.; Ryan, S.T.; Amal, S. Multi-Scale Digital Pathology Patch-Level Prostate Cancer Grading Using Deep Learning: Use Case Evaluation of DiagSet Dataset. Bioengineering 2024, 11, 624.
2. Balasubramanian, A.A.; Al-Heejawi, S.M.A.; Singh, A.; Breggia, A.; Ahmad, B.; Christman, R.; Amal, S. Ensemble Deep Learning-Based Image Classification for Breast Cancer Subtype and Invasiveness Diagnosis from Whole Slide Image Histopathology. Cancers 2024, 16, 2222.
3. Mudavadkar, G.R.; Deng, M.; Al-Heejawi, S.M.A.; Arora, I.H.; Breggia, A.; Ahmad, B.; Christman, R.; Amal, S. Gastric Cancer Detection with Ensemble Learning on Digital Pathology: Use Case of Gastric Cancer on GasHisSDB Dataset. Diagnostics 2024, 14, 1746.
4. Jain, M.N.; Al-Heejawi, S.M.A.; Azzi, J.R.; Amal, S. Digital Pathology and Ensemble Deep Learning for Kidney Cancer Diagnosis: Dartmouth Kidney Cancer Histology Dataset. Appl. Biosci. 2025, 4, 8.
5. Wu, J.; Hu, W.; Wen, Y.; Tu, W.; Liu, X. Skin Lesion Classification Using Densely Connected Convolutional Networks with Attention Residual Learning. Sensors 2020, 20, 7080.
6. Azeem, M.; Kiani, K.; Mansouri, T.; Topping, N. SkinLesNet: Classification of Skin Lesions and Detection of Melanoma Cancer Using a Novel Multi-Layer Deep Convolutional Neural Network. Cancers 2024, 16, 108.
7. Lopez, A.R.; Giro-i-Nieto, X.; Burdick, J.; Marques, O. Skin lesion classification from dermoscopic images using deep learning techniques. In Proceedings of the 13th IASTED International Conference on Biomedical Engineering (BioMed); IEEE: Piscataway, NJ, USA, 2017.
8. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542, 115–118.
9. Zhang, J.; Xie, Y.; Wu, Q.; Xia, Y. Medical image classification using synergic deep learning. Med. Image Anal. 2019, 54, 10–19.
10. Gessert, N.; Sentker, T.; Madesta, F.; Schmitz, R.; Kniep, H.; Baltruschat, I.; Werner, R.; Schlaefer, A. Skin lesion classification using CNNs with patch-based attention and diagnosis-guided loss weighting. IEEE Trans. Biomed. Eng. 2019, 67, 495–503.
11. Yan, Y.; Kawahara, J.; Hamarneh, G. Melanoma recognition via visual attention. In Information Processing in Medical Imaging; Springer: Cham, Switzerland, 2019; pp. 793–804.
12. Roy, V.K.; Thakur, V.; Baliyan, N.; Goyal, N.; Nijhawan, R. A framework for seborrheic keratosis skin disease identification using Vision Transformer. In Machine Learning for Cyber Security; Malik, P., Nautiyal, L., Ram, M., Eds.; De Gruyter: Berlin/Boston, 2023; pp. 117–128.
13. Shakya, M.; Patel, R.; Joshi, S. A comprehensive analysis of deep learning and transfer learning techniques for skin cancer classification. Sci. Rep. 2025, 15, 4633.
14. Khan, S.; Ali, H.; Shah, Z. Identifying the role of vision transformer for skin cancer—A scoping review. Front. Artif. Intell. 2023, 6, 1202990.
15. Zhang, X.; Liu, Y.; Ouyang, G.; Chen, W.; Xu, A.; Hara, T.; Zhou, X.; Wu, D. DermViT: Diagnosis-Guided Vision Transformer for Robust and Efficient Skin Lesion Classification. Bioengineering 2025, 12, 421.
16. Yang, G.; Luo, S.; Greer, P. Boosting Skin Cancer Classification: A Multi-Scale Attention and Ensemble Approach with Vision Transformers. Sensors 2025, 25, 2479.
17. Harangi, B. Skin lesion classification with ensembles of deep convolutional neural networks. J. Biomed. Inform. 2018, 86, 25–32.
18. Haenssle, H.A.; Fink, C.; Schneiderbauer, R.; Toberer, F.; Buhl, T.; Blum, A.; Hammon, M.; Engelmann, U.; Hölzel, C.; et al. Man against machine: Diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 2018, 29, 1836–1842.
19. Brinker, T.J.; Hekler, A.; Enk, A.H.; Berking, C.; Hauschild, A.; Schilling, B.; Haferkamp, S.; Schadendorf, D.; Holland-Letz, T.; Klode, J. Deep neural networks are superior to dermatologists in melanoma image classification. Eur. J. Cancer 2019, 119, 11–17.
20. Hernández-Pérez, C.; Combalia, M.; Podlipnik, S.; Rotemberg, V.; Halpern, A.C.; Reiter, O.; Carrera, C.; Barreiro, A.; Helba, B.; Puig, S.; et al. BCN20000: Dermoscopic Lesions in the Wild. Sci. Data 2024, 11, 641.
21. Ricci Lara, M.A.; Rodríguez Kowalczuk, M.V.; Eliceche, M.L.; Ferraresso, M.G.; Luna, D.R.; Benítez, S.E.; Mazzuoccolo, L.D. A dataset of skin lesion images collected in Argentina for the evaluation of AI tools in this population. Sci. Data 2023, 10, 712.
22. Nie, Y.; Sommella, P.; Carratù, M.; O’Nils, M.; Lundgren, J. A Deep CNN Transformer Hybrid Model for Skin Lesion Classification of Dermoscopic Images Using Focal Loss. Diagnostics 2022, 12, 3031.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR); 2021.
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
26. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556.
27. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning; PMLR: Long Beach, CA, USA, 2019; pp. 6105–6114.
28. Ballerini, L.; Fisher, R.B.; Aldridge, B.; Rees, J. A color and texture based hierarchical K-NN approach to the classification of non-melanoma skin lesions. In Color Medical Image Analysis; Springer: Dordrecht, The Netherlands, 2013; pp. 63–86.
29. Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset: A large collection of multi-source dermatoscopic images of common pigmented skin lesions. arXiv 2018, arXiv:1803.10417.
30. Codella, N.C.F.; Rotemberg, V.; Tschandl, P.; Celebi, M.E.; Dusza, S.W.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.A.; et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv 2019, arXiv:1902.03368.
Table 1. Representative prior studies on skin lesion classification compared with our study.

| Study | Focus Area | Dataset(s) | Model/Method | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| Wu et al. (2020) | General lesions (melanoma vs. others) | Dermoscopic images (ISIC) | DenseNet + Attention Residual | ∼94% | High | Moderate |
| Azeem et al. (2024) | Melanoma vs. Nevus vs. SK (mobile) | PAD-UFES-20 (smartphone images) | SkinLesNet (custom 4-layer CNN) | 96% | Not reported | Not reported |
| Lopez et al. (2017) | Skin lesion classification (small set) | Small dermoscopy dataset | Transfer Learning (VGG16, ResNet) | ∼90% | Not detailed | Not detailed |
| Esteva et al. (2017) | General skin cancer (incl. SK) | ISIC + Clinical images (129k) | Inception v3 (pre-trained) | Dermatologist-level | – | – |
| Zhang et al. (2019) | Multi-class lesion (ensemble) | ISIC 2017 | Synergic CNN Ensemble | ∼91% | Moderate | Moderate |
| Gessert et al. (2019) | Lesion classification with attention | ISIC + others | CNNs with Patch-Based Attention | 89–95% | Improved | Improved |
| Yan et al. (2019) | Melanoma recognition | ISIC | CNN + Visual Attention | – | – | – |
| Our Study (2025) | SK vs. Other lesions | DermoFit, BCN20000, Argentine | Multiple: ResNet-34, EfficientNet-B1, VGG16, ViT | 97.28% (ViT) | 0.9500 | 0.9800 |
Table 2. Test-set performance of the four evaluated models.

| Model | Accuracy (%) | AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| ResNet-34 | 91.58 | 0.9742 | 0.6287 | 0.9455 |
| EfficientNet-B1 | 94.41 | 0.9718 | 0.4388 | 0.9849 |
| ViT (Improved) | 97.28 | 0.9920 | 0.9500 | 0.9800 |
| VGG16 | 73.01 | 0.7761 | 0.7172 | 0.7311 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
