Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Fine-Grained Image Recognition by Integrating Transformer Encoder Blocks in a Robust Single Stage Object Detector

Version 1 : Received: 2 June 2023 / Approved: 2 June 2023 / Online: 2 June 2023 (08:09:14 CEST)

A peer-reviewed article of this Preprint also exists.

Ali, U.; Oh, S.; Um, T.-W.; Hann, M.; Kim, J. Fine-Grained Image Recognition by Means of Integrating Transformer Encoder Blocks in a Robust Single-Stage Object Detector. Appl. Sci. 2023, 13, 7589.

Abstract

Fine-grained image classification remains an ongoing challenge in computer vision; it aims to identify objects at the sub-category level. The task is difficult because inter-class variance is minimal while intra-class variance is substantial. Current methods address the issue by first locating selective regions with Region Proposal Networks (RPNs), object localization, or part localization, and then applying a CNN or SVM classifier to those regions. Our approach, in contrast, simplifies the process by implementing single-stage end-to-end feature encoding with a localization method, which improves the feature representations of individual tokens/regions by integrating transformer encoder blocks into the Yolov5 backbone. These transformer encoder blocks, with their self-attention mechanism, effectively capture global dependencies and enable the model to learn relationships between distant regions, improving its ability to understand context and capture long-range spatial relationships in the image. We also replace the Yolov5 detection heads with three transformer heads at the output, which perform object recognition using the discriminative and informative feature maps produced by the transformer encoder blocks. We establish the potential of a single-stage detector for fine-grained image recognition by achieving state-of-the-art 93.4% accuracy and outperforming the existing Yolov5 model. The effectiveness of our approach is assessed on the Stanford Cars dataset, which includes 16,185 images of 196 vehicle classes with highly similar visual appearances.
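The core idea of the abstract — self-attention over spatial positions of a CNN feature map capturing global, long-range dependencies — can be illustrated with a minimal sketch. This is not the authors' implementation: it uses NumPy, a single attention head, and hypothetical random projection weights (a trained model would learn `Wq`, `Wk`, `Wv`), purely to show how a feature map is flattened into tokens and how each token attends to every other spatial position.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k, seed=0):
    """Single-head scaled dot-product self-attention over a set of tokens.

    tokens: (n_tokens, d_model) array, one token per spatial position.
    The projection weights are random stand-ins for learned parameters.
    """
    rng = np.random.default_rng(seed)
    d_model = tokens.shape[1]
    Wq = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
    Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    # (n_tokens, n_tokens): every position attends to every other position,
    # which is what gives the encoder block its global receptive field.
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return attn @ V

# Flatten a hypothetical 8x8 backbone feature map with 32 channels
# into 64 tokens, then mix them globally with self-attention.
feature_map = np.random.default_rng(1).standard_normal((8, 8, 32))
tokens = feature_map.reshape(-1, 32)          # (64, 32)
out = self_attention(tokens, d_k=16)          # (64, 32)
print(out.shape)
```

In a full encoder block this attention output would be followed by a residual connection, layer normalization, and a feed-forward sub-layer; the sketch isolates only the attention step that provides the long-range spatial mixing described above.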

Keywords

Fine-grained Image Recognition; Yolov5; Transformer Encoder Block; Attention Mechanism

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
