Submitted: 30 October 2025
Posted: 03 November 2025
Abstract
Keywords:
1. Introduction
1.1. Convolutional Neural Networks (CNNs)
- ResNet (Residual Network): Introduced to address the vanishing gradient problem in very deep networks, ResNet [1] employs residual connections (shortcuts) that let gradients bypass layers, enabling the training of networks with hundreds or even thousands of layers. This architecture showed that, once very deep networks can be optimized reliably, additional depth can substantially improve model performance.
- EfficientNet: This model family introduced a more principled approach to model scaling. Instead of scaling only one dimension (depth, width, or resolution), EfficientNet uses a compound scaling method to uniformly scale all three dimensions, achieving a better balance between accuracy and computational resources.
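
The compound scaling idea described above can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the authors' code: the base coefficients `alpha`, `beta`, and `gamma` are the values reported in the original EfficientNet paper, and the function names are hypothetical helpers introduced only for illustration.

```python
# Illustrative sketch of EfficientNet-style compound scaling.
# Assumption: alpha, beta, gamma are the base coefficients from the
# EfficientNet paper (1.2, 1.1, 1.15); phi is the compound coefficient.

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for a given phi."""
    depth = alpha ** phi        # more layers
    width = beta ** phi         # more channels per layer
    resolution = gamma ** phi   # larger input images
    return depth, width, resolution


def flops_multiplier(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi, alpha, beta, gamma)
    return depth * width ** 2 * resolution ** 2


if __name__ == "__main__":
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

With these coefficients, each unit increase in `phi` raises the estimated compute cost by a factor close to 2, which is the constraint the scaling rule is built around.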
1.2. Vision Transformers (ViT)
2. Related Work
3. Methodology
3.1. Datasets
- MNIST: A dataset of 60,000 training and 10,000 test images of handwritten digits (0-9). The images are grayscale and 28x28 pixels.
- Fashion-MNIST: A dataset with the same structure as MNIST, but containing images of 10 classes of clothing items (e.g., T-shirt, Trouser, Ankle boot).
- CIFAR-10: A dataset of 50,000 training and 10,000 test color images (32x32 pixels) across 10 classes (e.g., Airplane, Automobile, Dog, Cat).
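
For reference, the dataset properties listed above can be collected in a small structure. This is only a sketch: `DatasetSpec` and `channels_to_add` are hypothetical names, and the idea that grayscale images are expanded to three channels before being fed to ImageNet-pretrained models is an assumption about preprocessing that the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Static facts about a benchmark, as listed in Section 3.1."""
    name: str
    train_images: int
    test_images: int
    height: int
    width: int
    channels: int
    num_classes: int

    @property
    def values_per_image(self):
        # Number of raw pixel values in one image.
        return self.height * self.width * self.channels


DATASETS = [
    DatasetSpec("MNIST", 60_000, 10_000, 28, 28, 1, 10),
    DatasetSpec("Fashion-MNIST", 60_000, 10_000, 28, 28, 1, 10),
    DatasetSpec("CIFAR-10", 50_000, 10_000, 32, 32, 3, 10),
]


def channels_to_add(spec, target_channels=3):
    """Channels missing relative to an RGB-pretrained backbone (assumed preprocessing)."""
    return max(0, target_channels - spec.channels)


if __name__ == "__main__":
    for spec in DATASETS:
        print(spec.name, spec.values_per_image, channels_to_add(spec))
```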
3.2. Models and Fine-Tuning Strategies
- ResNet50 & EfficientNet-B0: For these CNNs, we adopt the traditional classifier-only fine-tuning approach. We load models pre-trained on ImageNet, freeze all convolutional backbone layers, and replace the final fully-connected layer with a new classifier head tailored to the 10 classes of each dataset. Only the parameters of this new head are trained.
- ViT-B/16 with LoRA: We use a pre-trained Vision Transformer Base model with 16x16 patches. Instead of full fine-tuning, we apply Low-Rank Adaptation (LoRA). The original model weights are frozen. We inject small, trainable low-rank matrices into the query and value projection matrices of the self-attention mechanism in each Transformer Encoder layer. The update to a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ is represented by a low-rank decomposition $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable matrices, and the rank $r \ll \min(d, k)$. For this study, we use a fixed rank $r$ and an alpha scaling factor of 32. Similar to the CNNs, the final classifier head is also replaced and trained.
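
The LoRA update described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the `alpha / rank` scaling follows the convention of the original LoRA formulation, the hidden size of 768 and 12 encoder layers are the standard ViT-B configuration, and the rank of 8 used in the demo is a hypothetical value chosen only for the example. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_delta(B, A, alpha, rank):
    """Scaled low-rank update (alpha / rank) * B @ A, added to the frozen weight."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


def lora_param_count(d, k, rank):
    """Trainable parameters for one adapted d x k matrix (B is d x r, A is r x k)."""
    return rank * (d + k)


def vit_lora_param_count(rank, hidden=768, layers=12, adapted_per_layer=2):
    """LoRA parameters when query and value projections are adapted in every layer."""
    return layers * adapted_per_layer * lora_param_count(hidden, hidden, rank)


if __name__ == "__main__":
    # Tiny example: d = 2, k = 2, rank = 1.
    B = [[1.0], [2.0]]
    A = [[3.0, 4.0]]
    print(lora_delta(B, A, alpha=2, rank=1))

    # Hypothetical rank of 8, used only to show the order of magnitude.
    print(vit_lora_param_count(rank=8))
```

Because the count grows linearly in the rank, the adapter stays small relative to the frozen backbone for any rank much smaller than the hidden size.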
3.3. Training and Evaluation
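
The result tables report validation accuracy and F1-score for each model. As a minimal sketch of how such metrics can be derived from a confusion matrix, the snippet below computes accuracy and a macro-averaged F1. The macro averaging mode is an assumption, since the text does not say which average the reported F1-score uses, and the function names are hypothetical.

```python
def accuracy(cm):
    """Fraction of correct predictions; cm[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def macro_f1(cm):
    """Unweighted mean of per-class F1 scores (assumed averaging mode)."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(cm[c]) - tp                       # true c, predicted otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


if __name__ == "__main__":
    # Toy two-class confusion matrix, not data from the paper.
    toy = [[8, 2],
           [1, 9]]
    print(f"accuracy={accuracy(toy):.4f}, macro-F1={macro_f1(toy):.4f}")
```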
4. Results
4.1. Benchmark 1: MNIST


4.2. Benchmark 2: Fashion-MNIST


4.3. Benchmark 3: CIFAR-10


5. Conclusions
References
- Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems 2021, 34, 12116–12128.
- Dingeto, H.; Kim, J. Comparative Study of Adversarial Defenses: Adversarial Training and Regularization in Vision Transformers and CNNs. Electronics 2024, 13, 2534.
- Bbouzidi, S.; Hcini, G.; Jdey, I.; Drira, F. Convolutional neural networks and vision transformers for Fashion MNIST classification: A literature review. arXiv preprint arXiv:2406.03478, 2024.
- Mukhamediev, R.I. State-of-the-Art Results with the Fashion-MNIST Dataset. Mathematics 2024, 12, 3174.
- Nocentini, O.; Kim, J.; Bashir, M.Z.; Cavallo, F. Image classification using multiple convolutional neural networks on the Fashion-MNIST dataset. Sensors 2022, 22, 9544.
- Abd Alaziz, H.M.; Elmannai, H.; Saleh, H.; Hadjouni, M.; Anter, A.M.; Koura, A.; Kayed, M. Enhancing fashion classification with vision transformer (ViT) and developing recommendation fashion systems using DINOVA2. Electronics 2023, 12, 4263.
- Chen, F.; Chen, N.; Mao, H.; Hu, H. Assessing four neural networks on handwritten digit recognition dataset (MNIST). arXiv preprint arXiv:1811.08278, 2018.
- Krizhevsky, A.; Hinton, G.; et al. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript 2010, 40, 1–9.
- Huynh, E. Vision transformers in 2022: An update on Tiny ImageNet. arXiv preprint arXiv:2205.10660, 2022.
- Ulku, I.; Tanriover, O.O.; Akagündüz, E. LoRA-NIR: Low-Rank Adaptation of Vision Transformers for Remote Sensing with Near-Infrared Imagery. IEEE Geoscience and Remote Sensing Letters 2024.

Results for Benchmark 1: MNIST (Section 4.1)

| Model | Val Accuracy | F1-Score | AUC-Micro | AUC-Macro | Trainable Parameters |
|---|---|---|---|---|---|
| vit_b16_lora | 0.9912 | 0.9912 | 1.0000 | 1.0000 | 0.58% |
| resnet50 | 0.9902 | 0.9902 | 0.9999 | 0.9999 | 8.53% |
| efficientnet_b0 | 0.9882 | 0.9882 | 0.9999 | 0.9999 | 10.96% |

Results for Benchmark 2: Fashion-MNIST (Section 4.2)

| Model | Val Accuracy | F1-Score | AUC-Micro | AUC-Macro | Trainable Parameters |
|---|---|---|---|---|---|
| vit_b16_lora | 0.9501 | 0.9500 | 0.9982 | 0.9982 | 0.58% |
| efficientnet_b0 | 0.9231 | 0.9229 | 0.9960 | 0.9960 | 10.96% |
| resnet50 | 0.9202 | 0.9201 | 0.9953 | 0.9953 | 8.53% |

Results for Benchmark 3: CIFAR-10 (Section 4.3)

| Model | Val Accuracy | F1-Score | AUC-Micro | AUC-Macro | Trainable Parameters |
|---|---|---|---|---|---|
| vit_b16_lora | 0.9729 | 0.9729 | 0.9992 | 0.9992 | 0.58% |
| resnet50 | 0.8341 | 0.8336 | 0.9859 | 0.9858 | 8.53% |
| efficientnet_b0 | 0.8048 | 0.8038 | 0.9806 | 0.9805 | 10.96% |
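
To make the comparison across the three tables concrete, the validation accuracies can be reduced to a single margin per benchmark: the ViT-B/16 + LoRA accuracy minus the best CNN accuracy. This is a small sketch over the reported numbers. It assumes the three tables correspond, in order, to MNIST, Fashion-MNIST, and CIFAR-10, which is inferred from the order of the results sections; `lora_margin` is a hypothetical helper name.

```python
# Validation accuracies copied from the three result tables.
# Assumption: the tables map, in order, to MNIST, Fashion-MNIST, CIFAR-10.
RESULTS = {
    "MNIST": {"vit_b16_lora": 0.9912, "resnet50": 0.9902, "efficientnet_b0": 0.9882},
    "Fashion-MNIST": {"vit_b16_lora": 0.9501, "resnet50": 0.9202, "efficientnet_b0": 0.9231},
    "CIFAR-10": {"vit_b16_lora": 0.9729, "resnet50": 0.8341, "efficientnet_b0": 0.8048},
}


def lora_margin(scores, lora_key="vit_b16_lora"):
    """Accuracy of the LoRA model minus the best CNN accuracy, rounded to 4 places."""
    best_cnn = max(acc for name, acc in scores.items() if name != lora_key)
    return round(scores[lora_key] - best_cnn, 4)


if __name__ == "__main__":
    for dataset, scores in RESULTS.items():
        print(f"{dataset}: margin over best CNN = {lora_margin(scores):+.4f}")
```

Under that mapping, the margin grows from about 0.1 percentage points on MNIST to 2.7 points on Fashion-MNIST and roughly 13.9 points on CIFAR-10.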
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).