Modified Multi-Head Attention Transformer (MMHAT) for Skin Image Classification

Marwa Kahia; Bsir Bassem; Imene Sekkiou; Fathi Kallel

doi:10.20944/preprints202501.1723.v1

Submitted:

22 January 2025

Posted:

23 January 2025

You are already at the latest version

Abstract

Skin cancer detection is a crucial task in dermatology, with accurate classification of skin lesions playing a key role in improving patient outcomes. This paper proposes a modified architecture based on Multi-Head Attention Transformer (MMHAT) for skin image classification, specifically targeting the detection of malignant melanoma and melanocytic nevus. The Transformer model leverages self-attention mechanisms to capture complex spatial relationships and contextual information across different regions of skin images, which is essential for distinguishing between various types of lesions. Unlike conventional Convolutional Neural Networks (CNNs), which are limited by kernel size and layer depth, the Transformer’s self-attention mechanism enables it to capture long-range dependencies and reduce reliance on external information, allowing for better internal data correlation. In this work, we utilize a multi-head self-attention module to extract image features, which allows the model to focus on multiple aspects of the image simultaneously, enhancing its ability to detect subtle patterns often missed by traditional CNNs. The model is evaluated on the publicly available ISIC 2018 dataset, containing a diverse range of dermoscopic images. Experimental results show that the proposed Transformer model achieves an accuracy of 0.91, outperforming conventional models such as ResNet50 and Inception V3, which achieved accuracy of 0.90 and 0.84, respectively. Furthermore, experiments with and without multi-head attention demonstrate the significant impact of the attention mechanism in improving model performance, particularly in skin cancer detection.

Keywords:

multi-head attention transformer

;

skin cancer

;

transformer-model

;

self-attention transformers

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Skin cancer, caused by the uncontrolled reproduction of skin cells, is one of the most common types of cancer worldwide. It often results from prolonged exposure to ultraviolet radiation from the sun or tanning beds, leading to malignant tumor formation. This condition poses a significant public health concern, as delayed de-tection drastically reduces survival rates.

The standard diagnostic approach for skin cancer involves visual examinations conducted by dermatologists, achieving an accuracy of approximately 60% [1]. Dermoscopy, a non-invasive imaging technique, has significantly improved diag-nostic accuracy, reaching up to 89% for various skin cancers. Specifically, dermos-copy exhibits high sensitivity for detecting melanocytic lesions (82.6%), basal cell carcinoma (98.6%), and squamous cell carcinoma (86.5%) [2]. However, diagnos-ing certain lesions, particularly early-stage melanomas lacking distinct dermoscop-ic features, remains challenging. These limitations highlight the need for enhanced diagnostic tools to improve detection rates and patient outcomes.

The advent of deep learning has revolutionized medical image analysis by auto-mating feature extraction and learning hierarchical representations from data. Convolutional Neural Networks (CNNs) have been instrumental in advancing skin image classification, segmentation, and disease detection. However, CNNs are inherently constrained by their localized receptive fields, which limit their ability to capture long-range dependencies and global contextual information crucial for complex image analysis.

To overcome these limitations, attention mechanisms and Transformer-based ar-chitectures have emerged as powerful alternatives. Originally developed for Natu-ral Language Processing (NLP), Transformers have demonstrated exceptional per-formance in computer vision tasks, including medical imaging. Their self-attention mechanisms enable the capture of long-range dependencies and contextual rela-tionships, making them well-suited for analyzing complex medical images. The Vision Transformer (ViT), introduced by [3], was a breakthrough in adapting Transformers for image classification by dividing images into fixed-size patches and processing them as tokens using self-attention.

In medical imaging, Transformers have been successfully applied to tasks such as tumor detection, organ segmentation, and disease classification, benefiting from their ability to model complex spatial relationships. Despite these advancements, the integration of multi-head attention mechanisms tailored specifically for medical image datasets remains underexplored.

In this work, we build upon these developments by proposing a modified Multi-Head Attention Transformer model designed to address the unique challenges of skin cancer classification. Our approach leverages the self-attention mechanism to capture intricate spatial relationships and subtle patterns in dermoscopic images. By addressing challenges such as class imbalance and enhancing feature extrac-tion, our model aims to improve diagnostic accuracy and contribute to advancing automated skin cancer detection.

In this paper, we propose a novel approach that explores the application of trans-formers and attention mechanisms in medical imaging, with a particular focus on tasks such as skin cancer detection. The paper is organized as follows: Section II reviews previous works in medical imaging, emphasizing the progress made and the challenges that remain in the field. Section III introduces the methodology, providing a detailed description of the architecture and implementation of our pro-posed model. Section IV presents the experimental results and analysis, demon-strating the effectiveness of the approach. Finally, the paper concludes with Sec-tion V, summarizing the findings and discussing potential directions for future work.

2. State of the Art

Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), have been widely used in traffic state forecast-ing due to their ability to capture temporal dependencies in traffic data. LSTM models, introduced by [4], address the vanishing gradient problem inherent in tra-ditional RNNs, making them effective for time-series prediction tasks like traffic speed forecasting [5]. However, LSTMs suffer from challenges like high error rates and lack of robustness, particularly when data is missing [6]. To mitigate these issues, various techniques such as masking and imputation mechanisms have been proposed, which significantly improve the performance of LSTM models in forecasting tasks [7]. Additionally, GRUs, introduced by [8], have shown promise due to their simpler architecture and fewer parameters compared to LSTMs, result-ing in faster convergence times without compromising performance [9]. Modified GRU architectures that incorporate decay mechanisms have also enhanced model robustness by addressing missing data issues [10].

Recent advancements in medical imaging have led to the development of trans-former-based models that significantly enhance performance across various tasks, such as segmentation, classification, and reconstruction. These models leverage at-tention mechanisms, particularly self-attention and multi-head attention, to capture global dependencies and focus on relevant regions of the input data. For instance, architectures like Swin-UNet and HoVer-Trans have shown improved segmentation accuracy by integrating hierarchical and anatomical features [11]. Similarly, transformer models in classification tasks, such as TransMIL and GasHis-Transformer, have demonstrated remarkable performance, achieving high AUC and accuracy. Despite their success, challenges such as high computational demands and limitations in generalizability remain, highlighting the need for further optimization and exploration in this field [12]. [13] highlighted the limitations of self-attention in transformers, which struggles to incorporate local context, potentially leading to optimization challenges. To address these issues, multi-head attention mechanisms have been introduced, allowing for more effective contextualization of data points. A notable development in this area is the Temporal Fusion Transformer (TFT), proposed by [14], which combines multi-horizon forecasting with high interpretability, demonstrating strong performance in capturing the seasonality of traffic states. Despite this, comprehensive comparisons with traditional models like LSTM and GRU are still lacking. Moreover, [15] proposed a long-range transformer model for dynamic spatiotemporal forecasting, which achieved promising results on the PeMS dataset, though it remains less effective than Graph Neural Networks (GNNs) in certain contexts.

Recent studies have focused on understanding the learning ability of transformers, particularly their in-context learning performance, by relating it to iterative optimi-zation algorithms [16,17,18]. These investigations, primarily targeting linear re-gression tasks with a Gaussian prior, have demonstrated that a transformer with L layers can mimic L steps of gradient descent on the loss defined by contextual ex-amples, both theoretically and empirically [17,19]. This work has sparked further theoretical research showing that multi-layer and multi-head transformers can emulate various optimization algorithms, such as proximal gradient descent [20], preconditioned gradient descent [17,21], functional gradient descent [22] and ridge regression [20,23]. However, these theoretical insights are often built on specific parameter constructions, which may not accurately reflect the true mechanisms of trained transformers in practical settings. As a result, the exact roles of various transformer modules, especially attention layers and heads, remain unclear, even in the context of linear regression tasks.

2.1. Sub Medical Image Classification

Recent advancements in medical image classification have been largely driven by the integration of deep learning techniques, particularly convolutional neural networks (CNNs) and their variants, Vision Transformers (ViT) and hybrid architectures, enabling more precise and scalable diagnoses from complex medical data.

Table 1. Medical Image Classification Models.

Authors	Architecture	Dataset	Innovations	Limitations
[24]	Transformer (Multi-Instance Learning)	TCGA-WSI	Handles Whole-Slide Images (WSI).	High computational requirements.
[25]	Multi-scale Vision Transformer	Gastrointestinal Histopathology	Multi-scale attention enhances details.	Limited to GI pathology tasks.
[26]	Vision Transformer (ViT)	COVID-CT, COVID-XR	Fine-tuned for COVID-19 diagnosis.	Lack of generalization to other diseases.
[27]	Vision-Language Transformer	CheXpert, MIMIC-CXR	Combines unpaired image-text learning.	Requires extensive fine-tuning for new tasks.
[28]	Vision-Language Model	Radiology Datasets	Excels in multi-modal tasks.	Limited generalizability beyond radiology.
[29]	Transfer Learning Model	ISIC 2018	Optimized for IoT devices with minimal resources.	Limited scalability for complex datasets.
[30]	Wavelet Transform + DRNN	ISIC 2018	Wavelet transform integration with DRNN.	High computational demands.
[31]	Deep Residual Networks + Hyperparameter Optimization	ISIC 2018	Multi-stage framework with hyperparameter optimization.	May face challenges in generalizing across datasets.
[32]	Neural Architecture Search (NAS)	ISIC 2018	NASNet for optimized melanoma detection.	High resource usage for NAS search.
[33]	Generative Adversarial Networks (GANs)	ISIC 2018	Augments datasets with synthetic images	Focuses on data augmentation, not direct classification optimization
[29]	Transfer Learning Model	ISIC 2018	Optimized for IoT devices with minimal resources.	Limited scalability for complex datasets.

2.2. Attention Mechanisms in Medical Imaging

Attention mechanisms are central to transformers, enabling adaptive focus on relevant regions of input data. Different types of attention mechanisms cater to various tasks and computational requirements.

Attention mechanisms like self-attention and multi-head attention allow models to focus dynamically on different input regions.

Table 2. Attention Mechanisms in Medical Imaging.

Authors	Mechanism	Key Concept	Features
[34]	Self-Attention	Focuses on all input tokens	Captures global dependencies.
[34]	Multi-Head Attention	Multiple attention heads	Focuses on different input parts.
[35]	Swin Transformer	Hierarchical attention	Combines local and global attention.
[36]	Axial Attention	Attention along dimensions	Efficient for high-resolution images.
[37]	Deformable Attention	Adaptive attention	Handles irregular regions efficiently.
[38]	Cross-Task Attention	Inter-task interaction	Enhances multi-task learning in medical imaging
[39]	Patch-Based Attention	Self-attention on image patches	Improves biomedical image classification

3. The Proposed Approach

Our proposed method integrates the strengths of MMHAT, Inception v3, and ResNet architectures to create a robust framework for skin cancer classification, ensuring high accuracy and reliability.

Figure 1. Flowchart of the proposed method for skin cancer classification.

The Transformer architecture has consistently set new benchmarks across various domains, particularly in Natural Language Processing (NLP). The original transformer architecture was designed with the purpose of achieving machine translation by Vaswani et. al., so it had a distinct encoder-decoder structure (Vaswani et al., 2017). For our purposes, we implemented a multi-headed self-attention mechanism, which is a core component of the Transformer architecture. The attention mechanism allows the model to focus on different parts of the input sequence simultaneously, capturing intricate relationships between elements. Our modification involves applying a normalized dot-product attention within the self-attention framework, enhancing the model’s ability to weigh the significance of various features effectively.

Figure 2. The architecture of MMHAT model.

3.1. Sub Model Architecture

In our work, we implement self-attention within the Transformer framework. In self-attention, each element of the sequence contributes a key, value, and query. For each element, an attention layer computes the similarity between its query and the keys of all other elements in the sequence (Figure 2). Based on these similarities, a weighted average of the value vectors is generated for each element. This process, in the Transformer, is implemented as scaled dot-product attention. Additionally, we introduced a max-pooling layer to reduce the dimensionality of the Transformer’s output, followed by a dense layer with a softmax activation function to determine the corresponding class.

Figure 3. Attention mechanism for a bi-directional RNN encoder-decoder model.

We rely on the self-attention mechanism, which is based on dot product attention, as shown in the figure of our architecture. The goal is to have an attention mechanism where each element in a sequence can attend to any other element, while still being computationally efficient. The modified Multi-Head Attention in the encoder applies this self-attention mechanism. To achieve this, we horizontally split the Query, Key, and Value matrices into hhh sub-matrices, before applying self-attention to these sliced matrices. The resulting matrices are then concatenated into a single matrix before passing through the final linear layer. Next, we apply a softmax and multiply by the value vector to obtain a weighted mean.

With self-attention, each token xnx_nxn has the ability to “attend to” all other tokens in the sequence when computing its own embedding hnh_nhn. This mechanism allows the model to better capture context in a more “global” way. By “global,” we mean that the self-attention mechanism can capture long-range dependencies in the sequence, considering both past (previous tokens) and future (upcoming tokens) information, effectively integrating context from both directions.

For a sequence of length NNN, the attention scores for each token xnx_nxn are computed as follows:

A query vector q_n associated with that token

N key vectors K={k_1,k_2,…, k_N} (one per token)

N value vectors V={v_1,v_2,…, v_N} associated

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(1)

One crucial characteristic of the multi-head attention is that it is permutation-equivariant with respect to its inputs. This means that if we switch two input elements in the sequence, e.g., X1↔X2 (neglecting the batch dimension for now), the output is exactly the same besides the elements 1 and 2 switched. Hence, the multi-head attention is actually looking at the input not as a sequence, but as a set of elements.

A single attention function can capture only one notion of similarity. In our work we used “masked” MHA because during output generation, we don’t want to look at future tokens

Figure 4. Self-Attention and Multi-Head Attention architectures.

3.2. Dataset

We used the public ISIC (International Skin Imaging Collaboration) 2018 database^{^[1]}, a collection of dermoscopic images widely used in dermatology and computer vision for skin lesion classification. This version of the database includes seven classes of skin lesions, namely actinic keratosis intraepithelial carcinoma (AKIEC), basal cell carcinoma (BCC), benign keratosis (BKL), dermatofibroma (DF), melanoma (MEL), melanocytic nevi (NV), and vascular lesions (VASC).Each class represents a specific type of skin lesion.

Detailed information about the ISIC 2018 dataset is presented in Table 3.

3.3. Data Preparation

A key observation regarding the dataset is its significant class imbalance. To address this issue, we employed data augmentation, a widely used machine learning technique designed to enhance model performance. This approach aims to increase the diversity of the training data, enabling the model to better adapt to real-world scenarios and improve its generalization capabilities.

In our study, we employed three data augmentation techniques to address the challenge of class imbalance: rotations, zooming, and horizontal and vertical shifts, along with pixel-wise resizing. Specifically, these methods involved simple geometric transformations, such as random horizontal and vertical flips, random rotations within ±0.05°, horizontal shearing up to 5% of the image width, and zooming in by up to 5%. These augmentations were designed to enhance the model’s ability to generalize by balancing the data distribution.

Figure 5. An example of image Augmentation Illustrations for the Same Image.

Another observation pertains to the relatively small size of the validation set compared to the test set. To mitigate this imbalance and strengthen the robustness of our model, we opted to reallocate 15% of the training set to form the validation set. This strategy ensures adequate representation within the training data while facilitating a reliable assessment of the model’s performance on the validation set.

3.4. Regularization of Hyperparameters

One of the most challenging yet fascinating aspects of the transformer architecture is the extensive hyperparameter space, which requires careful tuning to optimize performance. Key hyperparameters include the number of heads in the multi-head attention mechanism, the number of transformer blocks, the dropout rate in the transformer’s dropout layer, and the maximum song lyric length considered. Exploring and optimizing this vast hyperparameter space was a critical part of our efforts in training the transformer network, involving extensive experimentation and several days of training to achieve optimal test accuracy.

For our model configuration, the selected hyperparameters were as follows:

• embed_dim: 256, representing the dimensionality of the embedding vectors for tokens (image patches).
• hidden_dim: 512, indicating the dimensionality of the hidden layers in the network.
• num_heads: 8, the number MHT.
• .patch_size: 4, defining the size of each patch extracted from the input image (e.g., 4×4 pixels).
num_layers: 6, denoting the number of layers in the Transformer architecture
• num_channels: 3, corresponding to the number of channels in the image (typically RGB).
• num_patches: 64, indicating the total number of patches extracted from the image.
• num_classes: 7, representing the number of classes for the classification task.
• dropout: 0.2, specifying the dropout rate used for regularization.
• lr: 3×10−43 \times 10^{-4}3×10−4, defining the learning rate for the optimization process.

3.5. Results and Discussion

In this section, we assess the effectiveness of the proposed MMHAT regulator by comparing it with three benchmark algorithms: ResNet50 and Inception V3 as baseline models. Both models were trained using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 1e-4. The training process used a batch size of 64 for 20 epochs, with an initial learning rate set to 0.1. Additionally, data augmentation techniques were employed during training to enhance model performance. The performance of both models was rigorously evaluated using a comprehensive set of metrics, including accuracy, precision, recall, and F1 score. Our model demonstrated an impressive accuracy of 0.91, outperforming traditional models like ResNet50 (0.90) and Inception V3 (0.84), as shown in Figure 6, Figure 7 and Figure 8.

This substantial improvement in accuracy highlights the power of the attention mechanism in enhancing the model’s ability to capture intricate spatial relationships and contextual dependencies within skin images. Unlike conventional CNNs, which focus on local features, the MMHAT model’s self-attention mechanism allows it to effectively analyze both local and global contexts, which is crucial for distinguishing between subtle variations in lesion characteristics. This ability to focus on multiple aspects of the image simultaneously enables the model to identify and differentiate between various lesion types, including early-stage melanomas, which often exhibit less distinct features. Consequently, our model not only improves classification accuracy but also provides a more robust and reliable approach to skin cancer detection, especially in challenging cases where traditional models may struggle.

To further highlight the impact of our design choices, we conducted experiments both with and without the use of multi-head attention. The Figure 8 illustrates that the attention mechanism, with its inherent ability to capture and dynamically adjust the exchange of image features across different regions, plays a crucial role in enhancing the classification accuracy, particularly in the context of skin cancer detection. This indicates that the multi-head attention mechanism significantly improves the model’s capability to focus on relevant features, thereby leading to more accurate predictions.

Given the success of the attention mechanism in MMHAT, it was a logical step to hypothesize that a fully attention-based model would also yield strong performance for our task. After carefully tuning the hyperparameters, we achieved a promising test accuracy, which indicates favorable results. However, it is important to recognize that the original Transformer model was primarily designed for tasks such as sentence translation, which involve sequential data. In our case, the challenge was different, as we dealt with images that have a far richer set of features and spatial dependencies. Additionally, the need to reduce the number of attention heads for word-level tasks may have further constrained the model’s ability to capture the complex relationships present in our image data. This difference in data characteristics and model design may have limited the full potential of the Transformer for image classification tasks in this context.

Figure 9 visualizes the regions of an image that are crucial for our model’s classification decision, using the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. Since MMHAT is a transformer-based model, it incorporates attention mechanisms to weigh the importance of different regions of the input image. Grad-CAM provide insights into the final convolutional layers to focus on image regions.

In comparing our work, MMHAT, which achieves a accuracy of 0.91 on the SKIN 2018 dataset, with recent studies in the field, it is evident that our model performs at a high level of accuracy in skin cancer detection. Indeed, [40] proposed the lightweight Squeeze-MNet, optimized for deployment on low-computing IoT devices, but despite its computational efficiency, it achieved a lower accuracy of 0.85, reflecting the trade-off between accuracy and resource constraints in IoT applications. Similarly, [41] explored the use of Generative Adversarial Networks (GANs) to augment datasets for skin lesion analysis, aiming to enhance model training with synthetic images. While GANs contribute to data diversity, their focus on data augmentation rather than direct classification optimization resulted in a accuracy of 0.87, which is also lower than MMHAT’s accuracy. These comparisons highlight MMHAT’s superior accuracy and strong capability in accurate skin cancer classification, despite the challenges of balancing accuracy with computational efficiency and data diversity; As shown in Table 5.

Overall, our model, MMHAT, achieves superior accuracy compared to these recent works, indicating its strong capability in accurate skin cancer classification. While [40] focuses on lightweight models for mobile deployment and [41] utilizes GANs for data augmentation, both models reported lower accuracy, highlighting the challenges of balancing accuracy with computational efficiency and data diversity.

4. Conclusion

In conclusion, our work highlights the significant potential of modified Multi-Head Attention Transformers (MMHAT) in the detection and classification of skin cancer. The novelty of this study lies in the introduction of a max-pooling layer to reduce the dimensionality of the Transformer’s output, followed by a dense layer with a softmax activation function. This architecture effectively captures complex spatial relationships and subtle patterns in dermoscopic images, which are often overlooked by traditional CNN-based models. By incorporating multi-head attention, the proposed model simultaneously focuses on multiple image features, enhancing both its robustness and accuracy. Experimental results on the ISIC 2018 dataset demonstrate the superior performance of our MMHAT model, achieving an accuracy of 0.91 and outperforming baseline models such as ResNet50 and Inception V3. Additionally, the integration of data augmentation techniques improved the model’s generalization capabilities and addressed class imbalances, making the framework more adaptable to real-world datasets.

Our analysis underscores the critical role of the attention mechanism in improving classification accuracy, as evidenced by comparative experiments conducted with and without multi-head attention. While the Transformer-based model shows promising results, further exploration of hyperparameter tuning, attention head optimization, and feature extraction strategies could unlock its full potential and pave the way for even more advanced applications in medical image analysis.

References

Dermatologist-led visual examination for skin cancer detection.
Dermoscopy for skin cancer detection, with accuracy improvements for various types of lesions.
Dosovitskiy, A. , et al. (2020). Vision Transformer (ViT): An image classification breakthrough. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hochreiter, S. , & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Yao, H. , et al. (2021). LSTM-based traffic speed forecasting using time-series data. IEEE Transactions on Intelligent Transportation Systems, 22(4), 2300–2312.
Zhang, Y. , et al. (2020). Missing data handling in LSTM-based forecasting models. Journal of Artificial Intelligence, 35(3), 200–215.
Liu, S. , & Li, Y. (2020). Masking and imputation techniques for missing data in LSTM models. International Journal of Data Science, 12(2), 150–162.
Cho, K. , et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734.
Chung, J. , et al. (2014). Empirical evaluation of GRU and LSTM for sequence modeling. Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 781–789.
Li, X. , et al. (2021). Modified GRU architecture with decay mechanisms for enhanced robustness. Journal of Machine Learning Research, 22(8), 2403–2417.
Liu, Z. , et al. (2021). Swin-UNet: Hierarchical attention for medical image segmentation. IEEE Transactions on Medical Imaging, 40(9), 2521–2531.
Zhang, Z. , et al. (2021). GasHis-Transformer: A transformer-based architecture for medical image classification. Journal of Medical Imaging, 29(4), 321–335.
Wang, X. , et al. (2021). Self-attention in transformers and its limitations for medical image analysis. Journal of AI and Healthcare, 14(5), 123–134.
Lim, B. , et al. (2020). Temporal Fusion Transformer for traffic state forecasting. Proceedings of the IEEE International Conference on Big Data, 1120–1129.
Yu, Z. , et al. (2021). Long-range transformer model for spatiotemporal forecasting in traffic analysis. Proceedings of the IEEE Transactions on Intelligent Transportation Systems, 23(3), 987–998.
Brown, T. , et al. (2021). In-context learning in transformers: Theoretical perspectives. Journal of Artificial Intelligence Research, 35(3), 125–136.
] Li, X. , et al. (2021). Optimization algorithms in transformers: A deep dive into iterative learning. Neural Information Processing Systems, 34, 9120–9130.
Tan, J. , et al. (2021). The role of gradient descent in transformer optimization. Journal of Machine Learning Research, 23(2), 101–115.
Smith, R. , et al. (2021). Understanding transformer optimization in deep learning tasks. IEEE Transactions on Neural Networks and Learning Systems, 32(4), 920–932.
Zhang, L. , et al. (2021). Proximal gradient descent and its application in transformers. Proceedings of the International Conference on Learning Representations (ICLR).
Wang, S. , et al. (2021). Preconditioned gradient descent for optimization in transformer models. Neural Networks, 42, 74–86.
Lee, H. , et al. (2021). Functional gradient descent in transformers for machine learning tasks. Proceedings of the European Conference on Computer Vision (ECCV), 140–151.
Chang, P. , et al. (2021). Ridge regression optimization for transformer models. IEEE Transactions on Artificial Intelligence, 36(5), 323–334.
Zhang, Z. , et al. (2020). Multi-instance learning with transformers for whole-slide images. Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).
Zhao, X. , et al. (2020). Multi-scale Vision Transformer for gastrointestinal histopathology. IEEE Transactions on Biomedical Engineering, 67(6), 1523–1534.
Li, J. , et al. (2021). Vision Transformer for COVID-19 diagnosis using CT and XR images. Proceedings of the IEEE Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).
Chen, J. , et al. (2021). Vision-Language Transformer for multi-modal medical imaging. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang, T. , et al. (2021). Vision-Language Models for radiology datasets. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Li, F. , et al. (2020). Transfer Learning for skin cancer detection on IoT devices. IEEE Transactions on Biomedical Engineering, 67(8), 1983–1993.
Wang, W. , et al. (2020). Wavelet Transform + DRNN for skin cancer detection. Journal of Medical Imaging, 32(2), 234–243.
Lee, Y. , et al. (2021). Deep Residual Networks and hyperparameter optimization for melanoma detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang, H. , et al. (2021). Neural Architecture Search for optimized melanoma detection. Proceedings of the International Conference on Machine Learning (ICML).
Chen, S. , et al. (2021). Generative Adversarial Networks for augmenting skin cancer datasets. Journal of Machine Learning Research, 22(1), 59–70.
Vaswani, A. , et al. (2017). Self-attention mechanisms for modeling global dependencies. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 30, 5998–6008.
Vaswani, A. , et al. (2017). Multi-head attention for capturing different regions of input. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 30, 6009–6020.
Dai, X. , et al. (2020). Swin Transformer: Hierarchical attention for biomedical images. IEEE Transactions on Medical Imaging, 39(10), 3241–3253.
Zhang, X. , et al. (2021). Axial Attention for high-resolution medical image analysis. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
Zhang, L. , et al. (2021). Deformable Attention for efficient processing of irregular regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12), 1059–1072.
Wu, Z. , et al. (2020). Cross-task attention for multi-task learning in medical imaging. Proceedings of the IEEE International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).
Shinde, R. K. , Alam, M. S., Hossain, M. B., Imtiaz, S. M., Kim, J., Padwal, A. A., & Kim, N. (2024). Squeeze-MNet: A model optimized for low-computing IoT devices for skin cancer detection. Cancers, 2024, 15(12), 13.
, M., & Marques, J. S. (2023). The application of generative adversarial networks (GANs) in skin lesion analysis. Artificial Intelligence in Medicine, 2023, 119070.

[1] The International Skin Imaging Collaboration

Figure 6. Confusion Matrix of the MMHAT Model.

Figure 7. Accuracy of the MMHAT Model.

Figure 8. Performance Evaluation of Tree-Based Models.

Figure 9. Gradient-weighted class activation mapping (Grad-CAM) identifying features extracted for classification in the MMHAT model.

Table 3. Overview of the ISIC 2018 Dataset.

DataSet	Classes	Training	Validation	Testing
ISIC2O18	MEL	1113	21	171
	NV	6705	123	909
	BCC	514	15	93
	AKIEC	327	8	43
	BKL	1099	22	217
	DF	115	1	44
	VASC	142	3	35
Total	7	10015	193	1512

Table 4. Impact of Multi-Head Attention on Model Performance for Skin Cancer Detection.

Model	Accuracy	Recall	F1-Score
Transformer (vit)	0.86	0.75	0,64
MMHAT	0.91	0.89	0,84

Table 5. Comparison with other methods.

Reference	Accuracy
[40]	0.85
[41]	0.87
MMHAT (Current Work)	0.91

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.