Submitted: 02 June 2023
Posted: 02 June 2023
Abstract
Keywords:
1. Introduction
2. Related Work
3. Proposed Model
3.1. Yolov5 Backbone
3.2. Improved Backbone with Vision Transformer
3.2.1. Patch Embedding
3.2.2. Transformer Encoder Block
3.3. Improved Yolov5
4. Model Training and Results
4.1. Dataset
4.2. Experimental Environment
4.3. Evaluation Metrics and Model Training
4.4. Model Adaptability over Test Images
4.5. Comparison with the State of the Art
5. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References











| Parameter | Values |
|---|---|
| Batch Size | 16 |
| Learning Rate | 0.01 |
| Learning Rate Decay | 0.999 |
| Momentum | 0.937 |
| Learning Rate Decay Step | 5.e-4 |
| Epoch | 300 |
| Workers | 8 |
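
For reference, the hyperparameters above can be gathered into a single configuration object. The sketch below is illustrative only: the `TrainConfig` class, its field names, and the `steps_per_epoch` helper are hypothetical and are not taken from the authors' code or from the YOLOv5 repository. The values are copied from the table as listed, including the `5e-4` entry labelled "Learning Rate Decay Step".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in the table (field names are hypothetical)."""

    batch_size: int = 16
    learning_rate: float = 0.01
    learning_rate_decay: float = 0.999
    momentum: float = 0.937
    learning_rate_decay_step: float = 5e-4  # value and label as given in the table
    epochs: int = 300
    workers: int = 8

    def steps_per_epoch(self, num_images: int) -> int:
        """Optimizer steps needed to see num_images once (last batch may be partial)."""
        return math.ceil(num_images / self.batch_size)


cfg = TrainConfig()
# 1000 is a made-up dataset size, used only to exercise the helper.
print(cfg.steps_per_epoch(1000))
```

Keeping the values in one frozen object makes it harder for a run to drift from the reported settings by accident.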

| Model | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| Yolov5l_SGD | 0.890 | 0.885 | 0.912 | 0.863 |
| Yolov5l_ADAM | 0.911 | 0.880 | 0.912 | 0.864 |
| Yolov5x_SGD | 0.896 | 0.889 | 0.917 | 0.874 |
| Yolov5x_ADAM | 0.901 | 0.891 | 0.919 | 0.868 |
| Yolov5_tr_SGD | 0.931 | 0.892 | 0.921 | 0.873 |
| Yolov5l_tr_ADAM | 0.934 | 0.895 | 0.927 | 0.878 |
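
Precision and recall can be combined into a single F1 score, the harmonic mean of the two. F1 is not reported in the table, so the sketch below is a derived illustration rather than a result from the paper. The function name `f1_score` and the `results` dictionary are hypothetical; the precision and recall values are copied from the rows above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
results = {
    "Yolov5l_SGD": (0.890, 0.885),
    "Yolov5l_ADAM": (0.911, 0.880),
    "Yolov5x_SGD": (0.896, 0.889),
    "Yolov5x_ADAM": (0.901, 0.891),
    "Yolov5_tr_SGD": (0.931, 0.892),
    "Yolov5l_tr_ADAM": (0.934, 0.895),
}

f1_by_model = {name: f1_score(p, r) for name, (p, r) in results.items()}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.3f}")
```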

| Model | Pre-process (ms) | Inference (ms) | NMS per Image (ms) | Image Size |
|---|---|---|---|---|
| Yolov5l_SGD | 0.3 | 15.5 | 0.6 | 640 × 640 |
| Yolov5l_ADAM | 0.3 | 15.5 | 0.6 | 640 × 640 |
| Yolov5x_SGD | 0.3 | 28.7 | 0.7 | 640 × 640 |
| Yolov5x_ADAM | 0.4 | 28.4 | 0.7 | 640 × 640 |
| Yolov5_tr_SGD | 0.8 | 39.2 | 0.9 | 640 × 640 |
| Yolov5l_tr_ADAM | 0.8 | 39.2 | 0.9 | 640 × 640 |
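
The three per-image timings can be summed to estimate end-to-end latency and throughput. The sketch below assumes the stages run one after another on a single image, which the table does not state explicitly, so the totals are derived estimates rather than reported results. The names `timings_ms`, `total_latency_ms`, and `frames_per_second` are hypothetical.

```python
# (pre-process, inference, NMS) times in ms per image, copied from the table above.
timings_ms = {
    "Yolov5l_SGD": (0.3, 15.5, 0.6),
    "Yolov5l_ADAM": (0.3, 15.5, 0.6),
    "Yolov5x_SGD": (0.3, 28.7, 0.7),
    "Yolov5x_ADAM": (0.4, 28.4, 0.7),
    "Yolov5_tr_SGD": (0.8, 39.2, 0.9),
    "Yolov5l_tr_ADAM": (0.8, 39.2, 0.9),
}


def total_latency_ms(stages: tuple[float, ...]) -> float:
    """Sum of stage times, assuming the stages run sequentially."""
    return sum(stages)


def frames_per_second(latency_ms: float) -> float:
    """Throughput implied by a per-image latency given in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


for name, stages in timings_ms.items():
    latency = total_latency_ms(stages)
    print(f"{name}: {latency:.1f} ms/image, {frames_per_second(latency):.1f} FPS")
```

Under this assumption the transformer variants take about 40.9 ms per image against about 16.4 ms for the Yolov5l baselines.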

| Methods | Train Annotation | Backbone | Image Resolution | Accuracy |
|---|---|---|---|---|
| RA-CNN | | VGG-19 | 448 × 448 | 92.5% |
| BoT | | Alex-Net | Not given | 92.5% |
| WPA | BBox | CaffeNet | 224 × 224 | 92.6% |
| MA-CNN | | VGG-19 | 448 × 448 | 92.8% |
| PA-CNN | | VGG-19 | 448 × 448 | 93.3% |
| M2DRL | | VGG-16 | 448 × 448 | 93.3% |
| Yolov5-Trans | BBox | CSP-Darknet53 | 640 × 640 | 93.4% |
| DFL-CNN | | VGG-16 | 448 × 448 | 93.8% |
| TASN | | ResNet-50 | 224 × 224 | 93.8% |
| Hsnet | Parts | GoogleNet | 224 × 224 | 93.9% |
| MGE-CNN | | ResNet-50 | 448 × 448 | 93.9% |
| NTS-Net | | ResNet-50 | 448 × 448 | 93.9% |
| GCL | | ResNet-50+BN | 448 × 448 | 94.0% |
| FDL | | ResNet-50 | 448 × 448 | 94.3% |
| S3N | | ResNet-50 | 448 × 448 | 94.7% |
| DF-GMM | | ResNet-50 | 448 × 448 | 94.8% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).