1. Introduction
Infrared small target detection is a key technology for detecting and recognizing tiny targets in infrared images. Compared with visible-light sensors, infrared sensors have strong penetrating power, resist external interference, and work continuously in both day and night conditions. The technology is widely used in the military field, including nighttime ship inspection [1] and military early warning [2]. In the civilian field, it is used in scenarios such as disaster monitoring [3], crop detection [4], and fault detection [5].
Infrared small targets typically have very low contrast (less than 0.15) and a low signal-to-noise ratio (less than 1.5), and they occupy less than 0.15% of the image's total pixels [6]. These characteristics make their detection a challenging problem. Two primary detection approaches are currently employed: detection-based tracking, which detects targets in single-frame images and then tracks them, and tracking-based detection, which tracks candidates across consecutive frames and then detects them [7]. However, because both target and background change rapidly in real environments, especially when the infrared sensor itself moves quickly, the trajectory captured by tracking may deviate from the target's actual path. This deviation increases computational complexity and limits the practical application of tracking-based detection [8]. Studying single-frame detection methods is therefore particularly important.
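To make the three thresholds above concrete, the following sketch computes them for a synthetic frame. The definitions used here, contrast as (mu_t - mu_b) / mu_b and signal-to-noise ratio as (mu_t - mu_b) / sigma_b, are assumptions chosen for illustration, since this section does not state which definitions [6] uses. The function name `target_statistics` and the synthetic frame are likewise hypothetical.

```python
import numpy as np

def target_statistics(image, mask):
    """Return (pixel_fraction, contrast, snr) for the target marked by mask.

    Assumed definitions (not taken from the paper):
      contrast = (mu_t - mu_b) / mu_b
      snr      = (mu_t - mu_b) / sigma_b
    where mu_t is the mean target intensity and mu_b, sigma_b are the
    mean and standard deviation of the background.
    """
    image = np.asarray(image, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    target = image[mask]
    background = image[~mask]
    mu_t, mu_b = target.mean(), background.mean()
    sigma_b = background.std()
    pixel_fraction = mask.sum() / mask.size
    contrast = (mu_t - mu_b) / mu_b
    snr = (mu_t - mu_b) / sigma_b
    return pixel_fraction, contrast, snr

# Synthetic 256x256 frame: a deterministic +/-10 checkerboard around 100
# stands in for background clutter, plus a dim 3x3 target raised by 12.
rows, cols = np.indices((256, 256))
frame = 100.0 + 10.0 * np.where((rows + cols) % 2 == 0, -1.0, 1.0)
mask = np.zeros((256, 256), dtype=bool)
mask[100:103, 50:53] = True
frame[mask] += 12.0

fraction, contrast, snr = target_statistics(frame, mask)
print(f"pixel fraction: {fraction:.4%}")
print(f"contrast: {contrast:.3f}, SNR: {snr:.2f}")
```

Under these assumed definitions, the synthetic 3x3 target falls below all three thresholds quoted above, which illustrates how little signal a detector has to work with.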
Depending on their underlying principles, single-frame infrared target detection approaches fall into three main categories [9]: filtering-based methods, local contrast-based methods, and data structure-based methods. Although filtering-based methods are widely applied [10,11,12], they work well only on smooth, slowly changing backgrounds, lack robustness to variations in target size, and struggle with targets that have low signal-to-noise ratios. Local contrast-based methods [13] rely on a significant difference between target and background but are prone to false positives when the background contains substantial interference. Data structure-based methods, such as low-rank matrix recovery [14], dictionary learning [15], and tensor modeling [16], improve detection performance but handle heterogeneous and dynamic backgrounds poorly and are sensitive to noise.
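The idea behind local contrast-based detection can be illustrated with a simplified center-surround measure. This is a minimal sketch and not the method of [13] or of any other cited work: it compares the mean of a central cell with the strongest of its eight neighbouring cells, and the function names `cell_means` and `local_contrast_map` are hypothetical.

```python
import numpy as np

def cell_means(image, cell):
    """Mean of every cell x cell window (valid positions only)."""
    # Summed-area table with a zero row and column prepended.
    sat = np.pad(image, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    sums = (sat[cell:, cell:] - sat[:-cell, cell:]
            - sat[cell:, :-cell] + sat[:-cell, :-cell])
    return sums / (cell * cell)

def local_contrast_map(image, cell=3):
    """Centre-cell mean minus the largest neighbouring-cell mean, clipped at 0.

    Output index (r, c) corresponds to the centre cell whose top-left
    pixel is (r + cell, c + cell) in the input image.
    """
    image = np.asarray(image, dtype=np.float64)
    means = cell_means(image, cell)
    h, w = means.shape
    center = means[cell:h - cell, cell:w - cell]
    neighbours = []
    for dr in (-cell, 0, cell):
        for dc in (-cell, 0, cell):
            if dr == 0 and dc == 0:
                continue
            neighbours.append(
                means[cell + dr:h - cell + dr, cell + dc:w - cell + dc]
            )
    strongest = np.max(neighbours, axis=0)
    return np.clip(center - strongest, 0.0, None)

# Flat background of 50 with a 3x3 target of intensity 70.
image = np.full((32, 32), 50.0)
image[15:18, 15:18] += 20.0

contrast_map = local_contrast_map(image)
peak = tuple(int(i) for i in
             np.unravel_index(np.argmax(contrast_map), contrast_map.shape))
print("peak (output index):", peak, "value:", contrast_map[peak])
```

On a flat background the response peaks at the target. Because the measure responds to any cell brighter than all of its neighbours, bright clutter of a similar size would produce a comparable response, which is consistent with the false positives noted above.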
Unlike traditional machine learning methods, which require expert knowledge and experience to design feature extractors manually, deep learning algorithms learn feature extraction automatically from datasets through end-to-end training [13]. Current deep learning algorithms for target detection can be categorized into two-stage and single-stage algorithms. Two-stage algorithms, such as the region-based convolutional neural network [17] and its faster variant [18], generate candidate regions before performing target classification and localization. Single-stage algorithms, including the single shot multibox detector [19] and You Only Look Once (YOLO) [20], predict the target class and the corresponding bounding box location directly from the image. Compared with two-stage algorithms, single-stage algorithms offer simplicity, speed, and efficiency. However, directly applying deep learning methods to infrared small target detection still faces several challenges.
(i) The low resolution of infrared images and the small size of targets make it difficult to clearly distinguish and locate the targets within the images.
(ii) Repeated downsampling leads to the loss of small target information in deeper feature maps, significantly reducing the detection capability of the model.
(iii) There is a significant imbalance between targets and background in the images, causing the model to focus on background features rather than target features.
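Challenges (ii) and (iii) can be quantified with simple arithmetic. The sketch below assumes a 640 x 640 input, a 4-pixel-wide target, and detection strides of 8, 16, and 32, with stride 4 shown for comparison; these values are illustrative assumptions rather than figures from this paper, and both helper functions are hypothetical.

```python
def cells_per_side(target_px, stride):
    """Side length of a target, in feature-map cells, at a given stride."""
    return target_px / stride

def background_cells_per_target(input_size, stride, num_targets=1):
    """Number of background cells for every target cell on the feature map."""
    grid = input_size // stride
    return (grid * grid - num_targets) / num_targets

input_size = 640   # assumed square input resolution
target_px = 4      # assumed target side length in input pixels

for stride in (4, 8, 16, 32):
    grid = input_size // stride
    span = cells_per_side(target_px, stride)
    ratio = background_cells_per_target(input_size, stride)
    print(f"stride {stride:>2}: {grid}x{grid} grid, "
          f"target spans {span:.3f} cells per side, "
          f"{ratio:.0f} background cells per target")
```

Under these assumptions the target covers half a cell per side at stride 8 and an eighth of a cell at stride 32, so its response is increasingly diluted by surrounding background in deeper feature maps, while background cells outnumber the single target cell by thousands to one.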
To address these problems, this paper proposes a new network model that improves on the 8th version of YOLO (YOLOv8) [21] according to the characteristics of infrared small targets. The main work is as follows. First, the space-to-depth convolution (SPDConv) module replaces the max-pooling and downsampling layers [22], which improves feature representation by capturing global feature information and thus enhances the detection of small targets. Second, a pyramid level 2 (P2) feature map, which has the smallest receptive field, is added to the neck of the model to reduce the loss of position information during feature map downsampling; an x-small detection head [23] is also added to further improve detection accuracy for small targets. Third, an adaptive threshold focal loss function [24] is proposed to replace the cross-entropy loss function of the original network model. By dynamically adjusting the loss weights, it strengthens the model's focus on difficult-to-detect targets and significantly improves overall detection performance. The proposed network model is named SPT-YOLO, where S refers to the SPDConv structure, P stands for the P2 feature map added to the feature pyramid network, and T denotes the adaptive threshold focal loss function. Finally, a self-collected infrared small target dataset, infrared small object detection (IR-SOD), is constructed; it covers diverse scenes and target classes, including cars and ships.
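The space-to-depth rearrangement at the core of SPDConv can be sketched as follows. This is a minimal NumPy illustration under the assumption of a scale factor of 2 and a channel-first layout; it omits the non-strided convolution that follows the rearrangement and is not the implementation used in SPT-YOLO. The function name `space_to_depth` is hypothetical. A 2x2 max-pooling step is included only for comparison.

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange (C, H, W) into (C*scale*scale, H/scale, W/scale).

    Every input value is kept: spatial resolution is traded for channels
    instead of being discarded as in pooling or strided convolution.
    """
    c, h, w = x.shape
    if h % scale or w % scale:
        raise ValueError("H and W must be divisible by scale")
    x = x.reshape(c, h // scale, scale, w // scale, scale)
    x = x.transpose(2, 4, 0, 1, 3)  # (row offset, col offset, C, H/s, W/s)
    return x.reshape(c * scale * scale, h // scale, w // scale)

# One-channel 4x4 feature map with values 0..15.
feature = np.arange(16, dtype=np.float32).reshape(1, 4, 4)

spd = space_to_depth(feature)                             # shape (4, 2, 2)
pooled = feature.reshape(1, 2, 2, 2, 2).max(axis=(2, 4))  # 2x2 max pooling

print("space-to-depth shape:", spd.shape, "values kept:", spd.size)
print("max-pool shape:", pooled.shape, "values kept:", pooled.size)
```

Both operations halve the spatial resolution, but the rearrangement keeps all 16 values across four channels, whereas max pooling keeps only 4. For a target that occupies a few pixels, this difference determines whether its response survives downsampling.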
In summary, the contributions of this work can be outlined as follows.
(1) The SPDConv module, which uses a non-strided convolution, replaces the traditional max-pooling and strided convolution modules, enhancing the model's ability to capture complex details.
(2) The P2 feature map, which has the smallest receptive field, is integrated into the feature pyramid network, and an additional x-small detection head is introduced, effectively improving the detection performance for small targets.
(3) An adaptive threshold focal loss function replaces the cross-entropy loss function in the original model, dynamically adjusting weights to enhance focus on hard-to-detect targets, significantly improving overall performance. Furthermore, a self-collected infrared small object dataset, IR-SOD, is constructed, covering diverse scenes and target types, providing crucial support for model training and validation.
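The effect of reweighting the loss toward hard examples, as described in contribution (3), can be illustrated with the standard focal loss. Note that this is the standard formulation, not the adaptive threshold variant proposed here, whose formula is not given in this section; the class-balancing factor alpha is omitted, and gamma = 2 and the example probabilities are assumed values.

```python
import numpy as np

def binary_cross_entropy(p, y):
    """Per-sample binary cross-entropy for predicted probability p and label y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(p, y, gamma=2.0):
    """Standard focal loss without alpha: -(1 - p_t)^gamma * log(p_t)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)  # probability assigned to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

# One easy background sample (p_t = 0.95) and one hard target sample (p_t = 0.2).
p = np.array([0.05, 0.2])
y = np.array([0, 1])

bce = binary_cross_entropy(p, y)
fl = focal_loss(p, y)

print("cross-entropy hard/easy ratio:", bce[1] / bce[0])
print("focal loss hard/easy ratio:   ", fl[1] / fl[0])
```

The modulating factor shrinks the loss of the well-classified background sample far more than that of the poorly classified target, so the hard-to-easy loss ratio grows substantially. Setting gamma to 0 recovers plain cross-entropy.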