Submitted:
14 May 2024
Posted:
15 May 2024
Abstract
Keywords:
1. Introduction
2. Review of Traditional Computer Vision Algorithms
2.1. The Basic Theory of Computer Vision
2.2. Visual Framework Based on ECO for Target Detection and Long-term Tracking
2.3. Adaptive Long Term Tracking Framework Based on Computer Vision


3. Review of Deep Learning-Based Computer Vision
3.1. Computer Vision Datasets and Metrics

3.2. Review of Deep Learning Computer Vision Based on Convolutional Neural Networks
3.3. ACDet: A Vector Detection Model for Drug Packaging Based on Convolutional Neural Network
3.4. Exploration and Future Trends
4. Review of Visual Simultaneous Localization and Mapping (SLAM) Algorithms
4.1. The Basic Principles of SLAM
4.2. The Basic Principles of SLAM
4.3. Current Challenges and Future Research Directions of Visual SLAM
4.4. Visual Framework for Unmanned Factory Applications with Multi-Driverless Robotic Vehicles and UAVs
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- P. Dollár, C. Wojek, B. Schiele, and P. Perona, ‘‘Pedestrian detection: An evaluation of the state of the art,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 743–761, Apr. 2012. [CrossRef]
- A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driving? The KITTI vision benchmark suite,’’ in Proc. Int. Conf. Pattern Recognit., Jun. 2012, pp. 3354–3361. [CrossRef]
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in context,’’ in Computer Vision—ECCV, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham, Switzerland: Springer, 2014, pp. 740–755.
- A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari, ‘‘The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,’’ 2018, arXiv:1811.00982. [Online]. Available: https://arxiv.org/abs/1811.00982.
- P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, ‘‘Vision meets drones: A challenge,’’ 2018, arXiv:1804.07437. [Online]. Available: https://arxiv.org/abs/1804.07437.
- P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I–I. [CrossRef]
- N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893. [CrossRef]
- P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
- Bolme D S, Beveridge J R, Draper B A, et al. Visual object tracking using adaptive correlation filters[C].2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,2010: 2544-2550.
- Henriques J F, Caseiro R, Martins P, et al.High-Speed Tracking with Kernelized Correlation Filters[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2015, 37 (3): 583-596. [CrossRef]
- Xu T, Feng Z H, Wu X J, et al.Learning Adaptive Discriminative Correlation Filters via Temporal Consistency Preserving Spatial Feature Selection for Robust Visual Object Tracking[J].IEEE Transactions on Image Processing,2019, 28 (11): 5596-5609. [CrossRef]
- Huang Z, Fu C, Li Y, et al. Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking[C].2019 IEEE/CVF International Conference on Computer Vision (ICCV),2019: 2891-2900.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
- J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2017.
- A. Aboah, B. Wang, U. Bagci and Y. Adu-Gyamfi, "Real-time Multi-Class Helmet Violation Detection Using Few-Shot Data Sampling Technique and YOLOv8," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 2023, pp. 5350-5358.
- Chen B X, Sahdev R, Tsotsos J K. Person Following Robot Using Selected Online Ada-Boosting with Stereo Camera[C].2017 14th Conference on Computer and Robot Vision (CRV),2017: 48-55.
- Liu, X., & Zhang, Z. (2021). A vision-based target detection, tracking, and positioning algorithm for unmanned aerial vehicle. Wireless Communications and Mobile Computing, 2021, 1-12. [CrossRef]
- Teed Z, Deng J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras[J]. arXiv preprint arXiv:2108.10869, 2021.
- Evjemo, L. D., Gjerstad, T., Grøtli, E. I., & Sziebig, G. (2020). Trends in smart manufacturing: Role of humans and industrial robots in smart factories. Current Robotics Reports, 1, 35-41. [CrossRef]
- L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikainen, “Deep learning for generic object detection: A survey,” arXiv preprint arXiv:1809.02165, 2018.
- S. Agarwal, J. O. D. Terrail, and F. Jurie, “Recent advances in object detection in the age of deep convolutional neural networks,” arXiv preprint arXiv:1809.03193, 2018.
- A. Andreopoulos and J. K. Tsotsos, “50 years of object recognition: Directions forward,” Computer vision and image understanding, vol. 117, no. 8, pp. 827–891, 2013. [CrossRef]
- J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., “Speed/accuracy trade-offs for modern convolutional object detectors,” in IEEE CVPR, vol. 4, 2017.
- K. Grauman and B. Leibe, “Visual object recognition (Synthesis Lectures on Artificial Intelligence and Machine Learning),” Morgan & Claypool, 2011.
- Zou Z, Chen K, Shi Z, et al. Object detection in 20 years: A survey[J]. Proceedings of the IEEE, 2023, 111(3): 257-276. [CrossRef]
- Jiao L, Zhang F, Liu F, et al. A survey of deep learning-based object detection[J]. IEEE access, 2019, 7: 128837-128868. [CrossRef]
- Liu Y, Dai Q. A survey of computer vision applied in aerial robotic vehicles[C]//2010 International Conference on Optics, Photonics and Energy Engineering (OPEE). IEEE, 2010, 1: 277-280.
- Danelljan M, Bhat G, Khan F S, et al. ECO: Efficient Convolution Operators for Tracking[C].2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),2017: 6931-6939.
- Chen L, Li G, Zhao K, et al. A Perceptually Adaptive Long-Term Tracking Method for the Complete Occlusion and Disappearance of a Target[J]. Cognitive Computation, 2023, 15(6): 2120-2131. [CrossRef]
- M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010. [CrossRef]
- M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International journal of computer vision, vol. 111, no. 1, pp. 98–136, 2015. [CrossRef]
- Mueller M, Smith N, Ghanem B. A benchmark and simulator for uav tracking. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing; 2016. p. 445–61.
- Wu Y, Lim J, Yang MH. Online object tracking: a benchmark. Proceedings of the IEEE conference on computer vision and pattern recognition; 2013. p. 2411–8.
- R. De Charette and F. Nashashibi, “Real time visual traffic lights recognition based on spot light detection and adaptive traffic lights templates,” in Intelligent Vehicles Symposium, 2009 IEEE. IEEE, 2009, pp. 358–363.
- A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
- R. Timofte, K. Zimmermann, and L. Van Gool, “Multi-view traffic sign detection, recognition, and 3D localisation,” Machine Vision and Applications, vol. 25, no. 3, pp. 633–647, 2014. [CrossRef]
- S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “Detection of traffic signs in real-world images: The German traffic sign detection benchmark,” in Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 2013, pp. 1–8. [CrossRef]
- B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain, “Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1931–1939.
- S. Yang, P. Luo, C.-C. Loy, and X. Tang, “WIDER FACE: A face detection benchmark,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5525–5533.
- G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016. [CrossRef]
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Computer Vision— ECCV, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer, 2016, pp. 21–37.
- C. P. Papageorgiou, M. Oren, and T. Poggio, “A general framework for object detection,” in Computer vision, 1998. sixth international conference on. IEEE, 1998, pp. 555–562. [CrossRef]
- K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang et al., “T-cnn: Tubelets with convolutional neural networks for object detection from videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2896–2907, 2018. [CrossRef]
- D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004. [CrossRef]
- Y. Freund, R. Schapire, and N. Abe, “A short introduction to boosting,” Journal-Japanese Society For Artificial Intelligence, vol. 14, no. 771-780, p. 1612, 1999.
- T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble of exemplar-svms for object detection and beyond,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 89–96.
- R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
- Wang S, Clark R, Wen H, et al. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks[C]//2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017: 2043-2050.
- Xiao L, Wang J, Qiu X, et al. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment[J]. Robotics and Autonomous Systems, 2019, 117: 1-16. [CrossRef]
- Duan R, Feng Y, Wen C Y. Deep pose graph-matching-based loop closure detection for semantic visual SLAM[J]. Sustainability, 2022, 14(19): 11864. [CrossRef]

| Dataset/Year | Cites | Description | Frames | URL |
|---|---|---|---|---|
| TLR [38] 2009 | 164 | Traffic scenes in Paris | 20,200 | http://www.lara.prd.fr/benchmarks/trafficlightsrecognition |
| KITTI [39] 2012 | 2620 | Traffic scene analysis in Germany | 16,000 | http://www.cvlibs.net/datasets/kitti/index.php |
| BelgianTSD [40] 2012 | 224 | Traffic sign annotations of 269 types, with 3D locations | 138,300 | https://btsd.ethz.ch/shareddata/ |
| GTSDB [41] 2013 | 259 | Traffic scenes in different climates | 2,100 | http://benchmark.ini.rub.de/?section=gtsdb&subsection=news |
| IJB [42] 2015 | 279 | IJB scenes for recognition and detection tasks | 50,000 | https://www.nist.gov/programs-projects/face-challenges |
| WiderFace [43] 2016 | 193 | Face detection scenes | 32,000 | http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/ |
| NWPU-VHR10 [44] 2016 | 204 | Remote sensing detection scenarios | 4,600 | http://jiong.tea.ac.cn/people/JunweiHan/NWPUVHR10dataset.html |
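Detection benchmarks such as these score predictions by their overlap with ground-truth boxes. As a minimal sketch of the standard intersection-over-union (IoU) matching criterion (the `(x1, y1, x2, y2)` corner format is an illustrative assumption; each dataset defines its own annotation format):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```

A detection is then counted as a true positive when its IoU with an unmatched ground-truth box exceeds a threshold (commonly 0.5).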
| Our own dataset (Year) | Train images | Train objects | Description |
|---|---|---|---|
| WMTIS (2019) | 1632 | 1808 | Infrared simulated weak-target dataset built from the infrared characteristics of military targets; includes challenging samples with scale changes, such as fighter jets, tanks, and warships, against desert, coastal, inland, and urban backgrounds |
| MB (2023) | 3345 | 9612 | Medicine boxes of various materials covering all mainstream pharmacy types, including challenges such as reflections from waterproof plastic film |
| EP (2022) | 25127 | 60393 | Comprehensive samples covering all package types in the logistics and express-delivery industry, with sizes from 5 centimeters to 3 meters and heights from 0.5 millimeters to 1.2 meters, in various shapes |
| FPP (2022~2024) | 9716 | 17435 | Multi-target samples in complex industrial scenes with challenges such as frequent occlusion, uneven illumination, inconsistent imaging quality, and open scenes; includes production personnel and various product types, collected with ground robotic vehicles and UAVs |
| Algorithm | Data | Backbone | AP |
|---|---|---|---|
| Fast R-CNN [46] | train | VGG-16 | 19.7 |
| Faster R-CNN [47] | trainval | VGG-16 | 21.9 |
| SSD321 [48] | trainval35k | ResNet-101 | 28.0 |
| YOLOv3 [49] | trainval35k | DarkNet-53 | 33.0 |
| RefineDet512+ [50] | trainval35k | ResNet-101 | 41.8 |
| NAS-FPN [51] | trainval35k | AmoebaNet | 48.0 |
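The AP column reports average precision. As a rough sketch of how AP is obtained from ranked detections at a single IoU threshold (COCO additionally averages over thresholds and classes; the function and argument names are illustrative, and matching detections to ground truth is assumed to have been done already):

```python
def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP from per-detection confidences,
    true-positive flags, and the number of ground-truth boxes."""
    # Rank detections by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make the precision envelope monotonically non-increasing.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Integrate precision over recall.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```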
| Dataset | Release date | Sensors |
|---|---|---|
| KITTI | 2011 | Camera/Radar/IMU/GNSS |
| TUM RGBD | 2012 | RGB Camera/Depth Camera |
| EUROC | 2016 | Binocular Camera/RGB Camera/UAV |
| TUM VI | 2018 | RGB Camera/IMU |
| OpenLORIS | 2019 | RGB Camera/IMU/Radar |
| UrbanLoco | 2020 | Multiple Cameras/IMU/Radar/GNSS |
| TartanAir | 2020 | Outdoor simulation dataset |
| Brno Urban | 2020 | RGB Camera/IMU/Radar/Infrared |
| TUM-VIE | 2021 | Binocular Camera/IMU |
| M2DGR | 2022 | Multiple Cameras/Laser/GNSS |
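SLAM accuracy on benchmarks such as TUM RGB-D and EuRoC is commonly reported as absolute trajectory error (ATE). A minimal sketch of the ATE RMSE computation, under the simplifying assumption that the estimated and ground-truth trajectories are already time-associated and expressed in the same frame (benchmark tools additionally solve for a rigid alignment first); names are illustrative:

```python
import math

def ate_rmse(gt_positions, est_positions):
    """RMSE of per-pose translational error between two equally long,
    time-associated lists of (x, y, z) positions."""
    sq_errors = [
        sum((g - e) ** 2 for g, e in zip(p_gt, p_est))
        for p_gt, p_est in zip(gt_positions, est_positions)
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))
```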
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).