1. Related Work
Before deep learning was applied to object detection and recognition, traditional algorithms relied primarily on image-based features [25]. These methods combined manually designed image features, sliding-window search, and a Support Vector Machine (SVM) classifier. Specifically, features were extracted by hand for object recognition, and regression algorithms were then used for object localization. Representative features include Histogram of Oriented Gradients (HOG) [3], Scale-Invariant Feature Transform (SIFT) [5], and Haar-like features [6].
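To make the notion of a manually designed feature concrete, the following is a minimal sketch of the core step of a HOG-style descriptor: a magnitude-weighted histogram of unsigned gradient orientations over one cell. The function name `orientation_histogram`, the choice of nine bins, and the use of central differences are illustrative assumptions; block normalization and vote interpolation, which a full HOG descriptor includes, are omitted.

```python
import math


def orientation_histogram(patch, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `patch` is a 2D list of grey values for one cell. Only interior pixels
    are used, so that central differences are always defined.
    """
    bin_width = 180.0 / num_bins
    hist = [0.0] * num_bins
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central-difference gradients in x and y.
            gx = patch[r][c + 1] - patch[r][c - 1]
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Guard against rounding that would push the index out of range.
            index = min(int(angle / bin_width), num_bins - 1)
            hist[index] += magnitude
    return hist


# A horizontal intensity ramp: every gradient points along the x axis.
horizontal_ramp = [[c for c in range(4)] for _ in range(4)]
# A vertical intensity ramp: every gradient points along the y axis.
vertical_ramp = [[r for _ in range(4)] for r in range(4)]

h_hist = orientation_histogram(horizontal_ramp)
v_hist = orientation_histogram(vertical_ramp)
```

In such a pipeline, histograms of this kind from many cells would be concatenated into one feature vector and passed to the SVM classifier.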
However, traditional object detection methods face several challenges in practical applications. First, they rely on manually designed features, so detection performance depends heavily on the designer's expertise, which makes it difficult to exploit large-scale data. Second, the sliding-window approach exhaustively traverses all possible object positions, resulting in high redundancy and high time and space complexity. As a result, traditional algorithms struggle to meet the requirements for high accuracy and real-time performance in practical applications. Furthermore, they perform well only for specific classes of objects.
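The cost of exhaustive traversal can be illustrated with a short count of the windows a sliding-window detector must classify. This is a sketch under assumed settings: the image size, window size, stride, and pyramid parameters below are hypothetical and are not taken from any of the cited methods.

```python
def count_windows(image_w, image_h, win_w, win_h, stride):
    """Number of window positions in one exhaustive scan of an image."""
    if win_w > image_w or win_h > image_h:
        return 0
    cols = (image_w - win_w) // stride + 1
    rows = (image_h - win_h) // stride + 1
    return cols * rows


def count_pyramid_windows(image_w, image_h, win_w, win_h, stride,
                          scale_step, num_scales):
    """Total windows when the scan is repeated on a downscaled image pyramid."""
    total = 0
    for level in range(num_scales):
        factor = scale_step ** level
        # Each pyramid level shrinks the image; the window size stays fixed.
        total += count_windows(int(image_w / factor), int(image_h / factor),
                               win_w, win_h, stride)
    return total


# Hypothetical settings: a 640x480 image, a 64x128 window, a stride of 8 pixels.
single_scale = count_windows(640, 480, 64, 128, 8)
two_scales = count_pyramid_windows(640, 480, 64, 128, 8,
                                   scale_step=2, num_scales=2)
```

Every counted window requires its own feature extraction and classification, which is why the number of positions and scales dominates the running time.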
Currently, deep learning-based object detection algorithms fall into two main types according to their network structure: two-stage algorithms [25] and end-to-end algorithms, also known as one-stage algorithms [26]. Two-stage algorithms generally achieve higher accuracy, whereas one-stage algorithms are usually faster because they skip the separate region proposal step.
Figure 1 illustrates the development roadmap of object recognition algorithms in recent years.
The two-stage algorithm. Two-stage algorithms, also known as region-based object detection algorithms, proceed in three steps. First, a large number of candidate regions are selected from the input image. Then, a convolutional neural network (CNN) extracts features from each candidate region. Finally, a classifier determines the category of each candidate region. In 2014, Girshick et al. proposed R-CNN [8], which established the framework for two-stage object detection and was the first to combine deep learning with traditional candidate region generation algorithms. However, R-CNN has clear drawbacks: it generates approximately 2000 candidate regions per image, and each region is passed through the CNN separately and then classified by an SVM, so detection is slow even with GPU acceleration. In 2015, He et al. addressed these limitations with SPP-Net [9], which computes the feature map only once for the entire image and then pools the features inside each arbitrary sub-region into a fixed-length representation. SPP-Net accelerated R-CNN testing by a factor of 10 to 100 and reduced training time by a factor of three. In the same year, Girshick built on the idea of SPP-Net to design Fast R-CNN [10], which overcame the drawbacks of R-CNN and SPP-Net while improving training speed and accuracy. Both R-CNN and Fast R-CNN still rely on traditional image processing algorithms to generate candidate regions from the original image, which leads to high algorithmic complexity. To address this limitation, Ren et al. introduced Faster R-CNN [28], which uses a Region Proposal Network (RPN) to generate candidate regions, delegating the selection process to the neural network. This significantly reduces the search time for candidate regions and achieves a detection accuracy of 75.9% on the PASCAL VOC test set. Numerous improvements to Faster R-CNN have since emerged. Dai et al. designed Region-based Fully Convolutional Networks (R-FCN) [2], and Ren et al. demonstrated the importance of carefully designed deep networks for object classification and rebuilt Faster R-CNN on the then-latest ResNet backbone [11].
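The fixed-length pooling idea attributed to SPP-Net above can be sketched in a few lines. The example pools a single-channel feature map of arbitrary size into a vector whose length depends only on the pyramid levels. The function name `spp_max_pool`, the levels (1, 2, 4), and the bin-boundary rule are illustrative assumptions, not the configuration used in [9].

```python
def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into a fixed-length vector.

    For each pyramid level n, the map is divided into an n x n grid of bins
    and the maximum of each bin is kept, so the output always has
    sum(n * n for n in levels) entries regardless of the input size.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin rows span floor(i*H/n) to ceil((i+1)*H/n), never empty.
            r0 = (i * height) // n
            r1 = ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0 = (j * width) // n
                c1 = ((j + 1) * width + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two regions of different sizes yield vectors of the same length.
small_region = [[r * 7 + c for c in range(7)] for r in range(5)]
large_region = [[r * 9 + c for c in range(9)] for r in range(13)]
small_vec = spp_max_pool(small_region)
large_vec = spp_max_pool(large_region)
tiny_vec = spp_max_pool([[1, 2], [3, 4]], levels=(1, 2))
```

Because the output length no longer depends on the region size, the same fully connected layers can follow regions of any shape, which is what allows the convolutional features to be computed once per image.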
The one-stage algorithm. Two-stage algorithms divide the detection process into two stages: region proposal and detection. Many researchers have improved their structure by pruning redundant parts, but the two-stage design itself limits speed. Researchers have therefore studied end-to-end one-stage algorithms, in which the image is fed into the neural network once and the positions and categories of the objects are obtained directly at the output. In 2016, Redmon et al. proposed the YOLO (You Only Look Once) algorithm. It divides the image into a grid of 7×7 cells, and each cell uses a deep CNN to predict rectangular bounding boxes containing objects together with class probabilities. Each bounding box is described by five values: the two coordinates of its center, its width, its height, and a confidence score. Because YOLO integrates detection and recognition and avoids a separate region proposal step, it is fast, reaching up to 45 frames per second. YOLO also generalizes to non-natural images. However, compared with earlier detection algorithms, it tends to produce more localization errors. Subsequently, Redmon et al. drew on the idea of Faster R-CNN and introduced the anchor mechanism: better anchor templates are computed from the training set by K-means clustering, resulting in an improved version called YOLO9000. In 2018, the same authors proposed YOLOv3, which introduced a new multi-scale Darknet53 network architecture to improve the model's feature extraction capability. In 2020, Bochkovskiy et al. further optimized the network and proposed YOLOv4. By adopting the Cross-Stage Partial Network (CSPNet) idea in the CSPDarkNet backbone, YOLOv4 improved the propagation of feature information through the convolutional network. It also incorporated a spatial pyramid pooling module in the neck to enhance recognition performance.
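The grid-based prediction scheme of the original YOLO described above can be sketched as follows. The example maps a ground-truth box to the grid cell containing its center and to normalized box values, and computes the length of the prediction vector. The function names are hypothetical, and the settings of two boxes per cell and 20 classes are assumptions corresponding to the PASCAL VOC configuration of the original YOLO paper; the confidence score is predicted by the network and is therefore not part of this encoding.

```python
def encode_box(cx, cy, w, h, image_w, image_h, grid_size=7):
    """Assign a box to its grid cell and normalize its geometry.

    Returns (row, col, x_offset, y_offset, w_norm, h_norm), where the
    offsets are relative to the responsible cell and the width and height
    are relative to the whole image.
    """
    # Center position expressed in grid units.
    gx = cx / image_w * grid_size
    gy = cy / image_h * grid_size
    # Clamp so that a center on the right or bottom border stays in the grid.
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)
    return row, col, gx - col, gy - row, w / image_w, h / image_h


def prediction_length(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Number of output values: each cell predicts B boxes of 5 values plus C class scores."""
    return grid_size * grid_size * (boxes_per_cell * 5 + num_classes)


# A 112x56 box centered in a 448x448 image falls into the middle cell.
center_box = encode_box(224, 224, 112, 56, 448, 448)
# A center on the right border is clamped to the last column.
border_box = encode_box(448, 100, 50, 50, 448, 448)
output_size = prediction_length()
```

Since every cell predicts only a small fixed number of boxes, nearby small objects compete for the same cell, which is consistent with the localization errors noted above.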
YOLOv5 builds on YOLOv4 and offers better accuracy and detection speed. Compared with YOLOv4, it reduces model size and inference time while improving detection accuracy, and it trains faster and generalizes better, making it suitable for real-time object detection. To make YOLOv5 applicable to a wider range of practical problems, its authors designed four models of different sizes: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x [29]. Among them, YOLOv5x is the largest network and offers the highest detection accuracy, but it is also the slowest and is better suited to environments with more capable hardware. The performance of the four models is compared in Table 1, where mAP denotes the mean average precision measured on the COCO dataset.
The comparison shows that YOLOv5s trades a slight decrease in accuracy for a significant increase in detection speed. It also has the fewest parameters, which makes it easier to deploy on onboard devices with limited computing power. This study therefore focuses on improving YOLOv5s to achieve better detection of small and densely packed objects in unmanned aerial vehicle (UAV) imagery.