Preprint Article, Version 1 (preserved in Portico; this version is not peer-reviewed)

Optimizing Mobile Vision Transformers for Land Cover Classification

Version 1 : Received: 2 October 2023 / Approved: 3 October 2023 / Online: 3 October 2023 (10:39:42 CEST)

How to cite: Rozario, P.; Gadgil, R.; Gomes, R.; Lee, J.; Keller, P.; Sipos, G.; McDonnell, G.; Impola, W.; Rudolph, J. Optimizing Mobile Vision Transformers for Land Cover Classification. Preprints 2023, 2023100126. https://doi.org/10.20944/preprints202310.0126.v1

Abstract

Classification of Remote Sensing and Geographic Information System (GIS) images containing various land-cover classes is essential for efficient and sustainable land-use estimation, as well as for other tasks such as object detection, localization, and segmentation. Deep Learning (DL) techniques have shown tremendous potential in the GIS domain. While Convolutional Neural Networks (CNNs) have dominated most of the image analysis domain, a newer architecture, the transformer, has proved to be a unifying solution for several AI-based processing pipelines. Vision Transformers (ViTs), a variant of transformers, can achieve comparable and in some cases better accuracy than CNNs. However, they suffer from a significant drawback: an excessive number of trainable parameters. In this research, we explore several modifications to vision transformer architectures, especially MobileViT, that reduce parameters while boosting accuracy. To verify the proposed approach, these new architectures are trained on four land-cover datasets: AID, EuroSAT, UC-Merced, and WHU-RS19. Experiments reveal that a combination of lightweight convolutional layers, including ShuffleNet blocks, depthwise separable convolutions, and average pooling can reduce the trainable parameters by 17.85% while still achieving higher accuracy than the base MobileViT. It is also observed that combining convolution layers with multi-headed self-attention layers in the MobileViT variants captures local and global features better than the standalone ViT architecture, which uses almost 95% more parameters than the proposed MobileViT variant.
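
To make the parameter-reduction argument concrete, the following is a minimal sketch (in PyTorch; this is not the authors' implementation, and the layer sizes, block layout, and activation choice are illustrative assumptions) showing why a depthwise separable convolution, optionally followed by a ShuffleNet-style channel shuffle, uses far fewer trainable parameters than a standard 3x3 convolution:

# Minimal sketch, assuming PyTorch and hypothetical channel sizes.
# Not the paper's architecture; it only illustrates the parameter savings
# of depthwise separable convolutions plus a ShuffleNet-style channel shuffle.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: interleave channels across groups."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)          # hypothetical feature map
    standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
    separable = DepthwiseSeparableConv(64, 128)
    count = lambda m: sum(p.numel() for p in m.parameters())
    print("standard 3x3 conv params:   ", count(standard))   # 73,728
    print("depthwise separable params: ", count(separable))  # 9,024 incl. BatchNorm
    y = channel_shuffle(separable(x), groups=4)
    print("output shape:", tuple(y.shape))                   # (1, 128, 32, 32)

For this 64-to-128-channel example the separable block needs roughly an eighth of the parameters of the standard convolution, which is the kind of saving the abstract attributes to the lightweight MobileViT variants; the exact figures in the paper depend on its specific architecture.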

Keywords

vision transformers; MobileViT; ShuffleNet; CNN; land cover classification

Subject

Environmental and Earth Sciences, Remote Sensing
