1. Introduction
Thanks to its unique advantages, such as all-day, all-weather, high-resolution, and long-range detection, synthetic aperture radar (SAR) has been widely used in various fields, such as land analysis and target detection. Vehicle detection in SAR automatic target recognition (SAR-ATR) is of great significance for urban traffic monitoring, hotspot target focusing, and other applications.
In recent years, with the development of artificial intelligence, deep learning-based object detection algorithms [1,2] have dominated the field with their powerful capability for automatic feature extraction. Deep learning is data-hungry: historical experience has shown that big data is an important driver of its flourishing development across fields. With the rapid development of aerospace and sensor technology, an increasing number of high-resolution remote sensing images can be obtained. In the remote sensing field, visible-light object detection has developed vigorously since the release of DOTA [3]. As the first publicly available SAR ship dataset, SSDD [4] directly promoted the application of deep learning to SAR object detection and led to the emergence of more SAR ship datasets [5,6,7,8,9]; it remains one of the standard detection benchmarks and exhibits strong vitality to this day.
However, because the imaging mechanism of SAR differs from that of visible light, SAR images are unintuitive for the human eye to interpret. Ground clutter and scattering from object corner points can seriously interfere with human interpretation. As a result, the detection objects of current SAR datasets are mainly large targets, such as ships and planes, in relatively clean backgrounds. In contrast, SAR datasets for vehicles are very rare. The community has long relied on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset [10], released by the Sandia National Laboratory in the last century. However, the vehicle images in MSTAR are separated from large-scene images and appear as small patches. Lacking complex backgrounds, MSTAR is only suitable for classification tasks, and its classification accuracy has already exceeded 99%. To date, in the SAR-ATR field, MSTAR has been used more widely for few-shot learning and semi-supervised learning [11,12]. Meanwhile, SAR datasets that contain vehicle images with large scenes are very limited in volume. The reason is that the small area of vehicles demands higher resolution for SAR-ATR than aircraft and ships, which raises data acquisition costs. Moreover, vehicles occur in more complex clutter backgrounds, which increases the difficulty of manual interpretation and reduces the accuracy of target annotation.
Table 1 lists detailed information on existing public SAR vehicle datasets with large scenes. Unfortunately, none of these datasets provides official localization annotations, so manual annotation is required. Due to strong noise interference, the FARAD X BAND [13] and FARAD KA BAND [14] datasets make it too difficult for humans to identify the positions of vehicles, so the annotations cannot meet accuracy requirements. The Spotlight SAR dataset [15] contains only a very small number of vehicles, and its image pairs were taken at the same locations in different time periods. The Mini SAR dataset [16] includes more vehicles but contains only 20 images and shares the problem of duplicate scenes with Spotlight SAR. Our subsequent experiments show that the small size of Mini SAR causes a large standard error in the results. For these reasons, the above datasets are difficult to use as reliable benchmarks for SAR-ATR algorithms. In addition, there is the GOTCHA dataset [17], which contains vehicles and large scenes, but it is a fully polarized circular SAR dataset that differs significantly from the commonly used single-polarization linear SAR. It contains only one scene and is mainly used for the classification of calibrated vehicles in the SAR-ATR field. Because the size of GOTCHA cannot meet the requirements of object detection, it is not included in the table for comparison.
In view of the scarcity of vehicle datasets in the SAR-ATR field, a series of data generation efforts have been conducted around MSTAR, which can be divided into three main methods. The first is based on generative adversarial nets (GANs) [18]. The generating network transforms input noise into generated images that can deceive the discriminative network by fitting the distribution of real images. In theory, GANs [19,20,21] can generate an unlimited number of images (see Figure 1a), thereby mitigating the scarcity of real samples. However, unlike optical imaging, SAR imaging is strictly governed by radar scattering mechanisms, and the black-box nature of neural networks cannot prove that the generated samples comply with these mechanisms. Moreover, due to the limitations of real samples, it is difficult to generate large-scene images. The second method is based on computer-aided design (CAD) 3D modeling and electromagnetic calculation simulation [22,23,24,25]. Among them, the SAMPLE dataset [25], released by Lewis et al. from the same institution as MSTAR, has advantages in terms of model errors, as shown in Figure 1b. The advantage of this method is that the imaging of synthetic samples is based on physical mechanisms, and imaging under different conditions can easily be obtained by changing the simulation environment parameters. Compared with the original images, the simulated images can also remove the correlation between the targets and the background by setting random background noise, which prevents overfitting of the detection model. However, both of these methods are limited in terms of background, and it is currently difficult to simulate vehicles located in complex large-scale backgrounds. The third method is background transfer [26,27,28]. Chen et al. argue that since the acquisition conditions of the chip images (Chips for short) and the clutter images (Clutters for short) in MSTAR are similar, Chips can be embedded in Clutters to generate vehicle images with large scenes, as shown in Figure 1c. Like the first method, the synthetic images cannot strictly comply with SAR imaging mechanisms; moreover, current implementations of this method directly paste Chips together with their backgrounds onto Clutters, which looks visually abrupt and preserves the association between the target and the background.
To generate large-scene SAR images with complex backgrounds, we constructed Mix MSTAR using an improved background transfer method. Unlike previous works, we overcame the abrupt visual appearance of synthetic images and demonstrated the fidelity and effectiveness of the synthetic data. Our key contributions are as follows:
We improved the background transfer method and generated realistic synthetic data by linearly fusing vehicle masks from Chips into Clutters, embedding 20 types of vehicles (5,392 in total) into 100 large background images. The dataset adopts rotated bounding box annotation and includes one Standard Operating Condition (SOC) and two Extended Operating Condition (EOC) partitioning strategies, making it a challenging and diverse dataset.
Based on Mix MSTAR, we evaluated nine benchmark models for general remote sensing object detection and analyzed their strengths and weaknesses for SAR-ATR.
To address potential artificial traces and data variance in the synthetic images, we designed two experiments that demonstrate the fidelity and effectiveness of Mix MSTAR in terms of SAR image features, showing that Mix MSTAR can serve as a benchmark dataset for evaluating deep learning-based SAR-ATR algorithms.
The remainder of this article is organized into four sections. Section 2 presents the detailed methodology used to construct the synthetic dataset, along with an extensive analysis of the dataset itself. In Section 3, we introduce and evaluate nine rotated object detectors using the synthetic dataset as the benchmark, and then conduct a comprehensive analysis of the results. Section 4 focuses on the analysis and validation of two vital problems related to the dataset, namely artificial traces and data variance, and provides an outlook on the potential future of the synthetic dataset. Section 5 concludes our work.
3. Results
To further evaluate Mix MSTAR after its construction, in this section we select nine benchmark models and assess the performance of mainstream rotated object detection algorithms on the dataset.
3.1. Models Selected
In the field of deep learning, the types of detectors can be roughly divided into single-stage, refinement stage, two-stage, and anchor-free algorithms.
The single-stage algorithm directly predicts the class and bounding box coordinates for objects from the feature maps. It tends to be computationally more efficient albeit at the potential cost of less precise localization.
The refinement-stage algorithm adds a supplementary step to the detection pipeline that refines the initially proposed bounding box coordinates. A series of regressors learns to make small iterative corrections towards the ground truth box, thereby improving the performance of object localization.
The two-stage algorithm operates on the principle of segregation between object localization and its classification. First, it generates region proposals through its region proposal network (RPN) stage based on the input images. Then, these proposals are run through the second stage where the actual object detection takes place, discerning the object class and refining the bounding boxes. Due to this two-step process, these algorithms tend to be more accurate but slower.
Unlike traditional algorithms which leverage anchor boxes as prior knowledge for object detection, anchor-free algorithms operate by directly predicting the object’s bounding box without relying on predetermined anchor boxes. They circumvent drawbacks such as choosing the optimal scale, ratio, and number of anchor boxes for different datasets and tasks. Furthermore, they simplify the pipeline of object detection models and have been successful in certain contexts on both efficiency and accuracy fronts.
To make the evaluation results more convincing, the nine algorithms cover the four kinds of algorithms mentioned above.
3.1.1. Rotated Retinanet
Figure 7.
The architecture of Rotated Retinanet
Retinanet [41] argues that the core reason single-stage detectors underperform two-stage models is the extreme foreground-background imbalance during training. To address this, Focal Loss was proposed: it adds two weights to the binary cross-entropy loss to balance the importance of positive and negative samples and to reduce the emphasis on easy samples, so that training focuses on hard examples. Retinanet was the first single-stage model whose accuracy surpassed that of two-stage models. Based on it, Rotated Retinanet predicts an additional angle alongside (x, y, w, h) in the regression branch, without other modifications.
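For illustration, the focal loss described above can be sketched as follows; this is a minimal binary formulation in PyTorch, not the exact MMRotate implementation, and the function name and default weights (alpha = 0.25, gamma = 2) are ours.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Minimal binary focal loss on raw logits (illustrative sketch only).
    `targets` are 0/1 float labels with the same shape as `logits`."""
    p = torch.sigmoid(logits)
    # Per-element binary cross-entropy, kept unreduced so it can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # positive/negative balancing weight
    # (1 - p_t)^gamma down-weights easy samples so training focuses on hard ones.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```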
3.1.2. S2A-Net
Figure 8.
The architecture of S2A-Net
S2A-Net [42] is a refinement-stage model that proposes the Feature Alignment Module (FAM) built on an improved deformable convolution (DCN) [43]. In the refinement stage, each horizontal anchor is refined into a rotated anchor by the Anchor Refinement Network (ARN), a learnable offset-field module directly supervised by box annotations. Next, the feature map within the anchor is aligned and then convolved by the Alignment Convolution. This method eliminates low-quality, heuristically defined anchors and addresses the misalignment between rotated anchor boxes and the axis-aligned features extracted for them.
3.1.3. R3Det
Figure 9.
The architecture of R3Det
R3Det [44] is a refinement-stage model that proposes the Feature Refinement Module (FRM), which reconstructs the feature map according to the refined bounding box. Each point in the reconstructed feature map is obtained by summing, after interpolation, the five feature vectors at five points of the refined bounding box (the four corner points and the center point). FRM alleviates the feature misalignment problem of refined single-stage detectors and can be applied multiple times for better performance. Additionally, an approximate SkewIoU loss is proposed, which better reflects the true SkewIoU loss while remaining differentiable.
3.1.4. ROI Transformer
Figure 10.
The architecture of ROI Transformer
ROI Transformer [45] is a two-stage model that adds a learnable module converting horizontal RoIs (HRoIs) into rotated RoIs (RRoIs). It generates HRoIs from a small number of horizontal anchors and proposes RRoIs via the offsets of the rotated ground truth relative to the HRoIs. This operation eliminates the need to preset a large number of rotated anchors at different angles for directly generating RRoIs. Next, the proposed Rotated Position Sensitive RoI Align extracts rotation-invariant features from the feature map and the RRoIs to enhance subsequent classification and regression. The study also examines the benefit of retaining appropriate context in RRoIs for improving detector performance.
3.1.5. Oriented RCNN
Figure 11.
The architecture of Oriented RCNN
Oriented RCNN [46] is built upon Faster RCNN [2] and proposes an efficient oriented RPN. The oriented RPN uses a novel six-parameter midpoint offset representation to express the offsets of the rotated ground truth relative to the horizontal anchor box and to generate a quadrilateral proposal. Compared with RRPN [47], it avoids the huge computation caused by presetting a large number of rotated anchor boxes. Compared with ROI Transformer, it converts horizontal anchor boxes into oriented proposals in a single step, greatly reducing the parameter count of the RPN. An efficient, high-quality oriented proposal network makes Oriented RCNN both accurate and fast.
3.1.6. Gliding Vertex
Figure 12.
The architecture of Gliding Vertex
Gliding Vertex [48] introduces a robust OBB representation that addresses the limitations of predicting vertices and angles. Specifically, on the regression branch of the RCNN head, four extra length-ratio parameters slide the corresponding vertex along each side of the horizontal bounding box. This avoids the order-confusion problem of directly predicting the positions of the four vertices and mitigates the high sensitivity caused by predicting the angle. Additionally, following the idea of divide and conquer, an area-ratio parameter r predicts the obliquity of the bounding box; this parameter guides the regression toward either the horizontal bounding box or the OBB representation, resolving the confusion of nearly horizontal objects.
3.1.7. ReDet
Figure 13.
The architecture of ReDet
ReDet [49] argues that regular CNNs are not equivariant to rotation and that rotation data augmentation or RRoI Align can only approximate rotation invariance. To address this, ReDet uses e2cnn theory [50] to design a new rotation-equivariant backbone, ReResNet, based on ResNet [1]. The new backbone features a higher degree of rotation weight sharing, allowing it to extract rotation-equivariant features. The paper also proposes Rotation-Invariant RoI Align, which warps the spatial dimensions and then circularly switches channels to interpolate and align along the orientation dimension, producing completely rotation-invariant features.
3.1.8. Rotated FCOS
Figure 14.
The architecture of Rotated FCOS
FCOS [51] is an anchor-free, one-stage detector with a fully convolutional design. Unlike traditional detectors, FCOS eliminates the need for preset anchors, thereby avoiding complex anchor operations, sensitive heuristic hyperparameter settings, and the large number of parameters and computations associated with anchors. FCOS uses the four distances (l, r, t, b) between a feature point and the four sides of the bounding box as its prediction format. The distance between the feature point and the box center is used to measure the bounding box's center-ness, which is then multiplied by the classification score to obtain the final confidence. Multi-level prediction based on FPN [52] alleviates the influence of overlapping ambiguous samples. Rotated FCOS is a re-implementation of FCOS for rotated object detection that adds an angle branch parallel to the regression branch.
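For a location with regression targets (l, r, t, b), the center-ness defined in the FCOS paper is

$$ \mathrm{centerness} = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}. $$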
3.1.9. Oriented RepPoints
Figure 15.
The architecture of Oriented RepPoints
Based on RepPoints [53], Oriented RepPoints [54] summarizes three ways of converting a point set into an OBB, making it suitable for detecting aerial objects. Inherited from RepPoints, it combines DCN [43] with anchor-free keypoint detection, enabling the model to extract non-axis-aligned features from an aerial perspective. To constrain the spatial distribution of point sets, the proposed spatially constrained loss keeps vulnerable outlier points within their owning instance and uses GIoU [55] to quantify the localization loss. Additionally, the proposed Adaptive Points Assessment and Assignment adopts four metrics to evaluate the quality of the learned point sets and uses them to select positive samples.
3.2. Evaluation Metrics
In rotated object detection, the ground truth of the object’s position and the bounding box predicted by the model are oriented bounding boxes. Similar to generic target detection, rotated target detection uses Intersection over Union (IoU) to measure the quality of the predicted bounding box:
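$$ \mathrm{IoU} = \frac{\mathrm{area}\left(B_{p} \cap B_{gt}\right)}{\mathrm{area}\left(B_{p} \cup B_{gt}\right)}, $$
where $B_{p}$ is the predicted oriented bounding box and $B_{gt}$ is the ground truth oriented bounding box.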
In the classification stage, comparing the predicted bounding boxes with the ground truth yields four outcomes: True Positives (TP), True Negatives (TN), False Negatives (FN), and False Positives (FP). Precision and recall are formulated as follows:
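$$ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}. $$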
Based on precision and recall, AP is defined as the area under the precision-recall (P-R) curve, while Mean Average Precision (mAP) is defined as the mean of AP values across all classes:
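$$ AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i}, $$
where $P(R)$ denotes the precision at recall $R$ and $N$ is the number of classes.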
F1 score is the harmonic mean of precision and recall, which is defined as:
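$$ F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. $$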
3.3. Environment and Details
All experiments were implemented on Ubuntu 20.04.4 with Python 3.8.10, PyTorch 1.11.0, and CUDA 11.3. The CPU is an Intel i9-11900K @ 3.5 GHz with 32 GB RAM, and the GPU is an Nvidia GeForce RTX 3090 (24 GB) with driver version 470.103.01.
All models involved in this article are implemented with the MMRotate [56] framework. For fair comparison, the backbone of each detector is by default ResNet50 [1] pretrained on ImageNet [57], and the neck is FPN [52]. Each image in the train set was cropped into four 1024*1024 pieces, and the four large-scene images in the test set were split into a series of 1024*1024 patches with a stride of 824. To display the performance of each detector on Mix MSTAR as fairly as possible, we simply follow these settings without additional embellishment: data augmentation used random flipping with a probability of 0.25 along the horizontal, vertical, or diagonal axes. Each model was trained for 180 epochs with 2 images per batch. The optimizer was SGD with an initial learning rate of 0.005, momentum of 0.9, and weight decay of 1e-4. The L2 norm was adopted for gradient clipping, with the maximum gradient set to 35. The learning rate was decayed by a factor of 10 at the 160th and 220th epochs. Linear warm-up was used for the first 500 iterations, with the initial warm-up learning rate set to 1/3 of the initial learning rate. Additionally, the IoU threshold in the experiments was set to 0.5 and the confidence threshold to 0.3. The mAP and its standard error for all models in this article were obtained by training each network with three different random seeds. The final result is obtained by mapping the predictions on each small patch back to the large image and applying NMS. More details can be found in our log files.
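The shared schedule above corresponds roughly to the following MMRotate/MMDetection-style configuration fragment; this is a hedged sketch using standard mmcv config keys, and the exact configuration files may differ slightly.

```python
# Sketch of the common training schedule (mmcv/MMRotate config style).
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=1e-4)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))  # L2 gradient clipping
lr_config = dict(
    policy='step',
    warmup='linear',        # linear warm-up for the first 500 iterations
    warmup_iters=500,
    warmup_ratio=1.0 / 3,   # warm-up starts at 1/3 of the initial learning rate
    step=[160, 220])        # decay the learning rate by 10x at these epochs
data = dict(samples_per_gpu=2, workers_per_gpu=2)  # 2 images per batch per GPU
```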
3.4. Result Analysis on Mix MSTAR
The evaluation results of the nine models on Mix MSTAR are shown in Table 6, and the class-wise AP results are shown in Table 7. It is important to note that in Table 6, Precision, Recall, and F1-score are calculated from the statistics of TP, FP, and FN over all categories.
Combining the results and the previous analysis of the model and the dataset, we can draw the following conclusions:
In terms of the mAP metric, Oriented RepPoints achieved the best accuracy, which we attribute to its unique proposal approach based on sampling points. This approach successfully combines deformable convolution with non-axis-aligned feature extraction. Additionally, its points are generated and then refined in two stages, so its feature extraction is more accurate, and compared with other refinement-based models it uses more sampling points (up to nine), which makes the extracted features more comprehensive. However, the heavy use of deformable convolution makes its training slow.

The two-stage models perform better than the single-stage networks thanks to the initial screening by the RPN. However, the performance of Gliding Vertex is only average, which may be due to its lack of oriented proposals in the first stage, resulting in inaccurate feature extraction. ReDet performs poorly, possibly because its rotation-invariant network is not suitable for SAR images with a low depression angle. Mix MSTAR is simulated at a depression angle of 15°, so the shadow area is quite large, leading to significant imaging differences for the same object under different azimuth angles. For example, rotating a vehicle image captured at azimuth θ by α degrees produces an image that differs significantly from the image of the same vehicle captured at (θ+α) degrees, which may cause ReResNet to extract incorrect rotation-invariant features.

Compared with the single-stage models, the refinement-stage models show a significant performance improvement, suggesting that they extract the non-axis-aligned features of rotated objects more accurately, which narrows the gap between refinement-stage and two-stage models. The performance of R3Det is slightly inferior and similar to that of ReDet; the reason may lie in the sampling points of its refinement stage, which are fixed at the four corners and the center point. In low-depression-angle SAR images, the vertex far from the radar sensor is necessarily shadowed, so the features sampled at that point interfere with the overall feature expression. S2A-Net uses deformable convolution with learnable sampling positions; although there is still a probability of sampling the shadowed vertices, its nine sampling points dilute the influence of features from the shadowed areas.
In terms of speed, Rotated FCOS performs best, benefiting from its anchor-free design and fully convolutional structure; both its parameter count and computation are lower than those of Rotated Retinanet. In contrast, the other models use deformable convolution, non-conventional feature alignment convolution, or non-fully-convolutional structures, making them relatively slow. Due to its special rotation-equivariant convolution, ReDet has the slowest inference speed even though its parameter count and computation are the lowest. In terms of parameters, the two anchor-free models and the single-stage model have fewer parameters than the other models. The RPN of ROI Transformer requires two steps to extract rotated RoIs, so it has the most parameters. In terms of computation, the multi-head design of the single-stage model makes its detection head cumbersome, so its computation is not significantly lower than that of the two-stage models. However, Mix MSTAR is a small-target dataset, with most ground truth widths below 32 pixels; after five rounds of downsampling, the localization information is lost. A better balance might be obtained by optimizing the regression subnetworks of the feature levels whose downsampling stride exceeds 32.
In terms of precision and recall, all networks tend to maintain high recall. Because inter-class NMS limits the recall integration range of mAP, it is disabled here, as in DOTA; however, this results in lower precision. Among the models, ROI Transformer achieves the best balance between precision and recall and obtains the highest F1 score.
From the results presented in Table 7, it is evident that the fine-grained classification of the T72 tank is poor and affects all detectors significantly. Figure 16a further illustrates this point: the confusion matrix of Oriented RepPoints shows a considerable number of FPs assigned to the wrong T72 subtypes, and similar confusion appears across categories such as BTR70-BTR60, 2S1-T62, and T72-T62. Another notable observation is the poor detection of BMP2 under EOC, as indicated in the confusion matrix: many BMP2 subtypes that did not appear in the train set are mistaken for other vehicles at test time. Figure 16b depicts the P-R curves of all detectors.

Figure 17 presents the detection results of three detectors on the same image. The results show that the localization of the vehicles is accurate, but the recognition accuracy is not high, with a small number of false positives and missed detections. Additionally, we discovered two unknown vehicles in the scene, which were hidden among the clutter and did not belong to the Chips. One was recognized as a T62 by all three models, while the other was classified as background, possibly because its area is significantly larger than that of the vehicles in Mix MSTAR. This indicates that models trained on Mix MSTAR have the ability to recognize real vehicles.
4. Discussion
For a synthetic dataset that aims to become a detection benchmark, both fidelity and effectiveness are essential. However, producing Mix MSTAR requires manually extracting vehicles from Chips and fusing radar data collected under different modes before generating the final images. Two potential problems therefore arise in this process, which could affect the fidelity and effectiveness of the synthetic images:
Artificial traces: the manually extracted vehicle masks can alter the contour features of the vehicles and leave artificial traces in the synthetic images. Even though Gaussian smoothing was applied to reduce this effect on the vehicle edges, these traces could, in theory, still be exploited by overfitting models to identify targets.
Data variance: the vehicle and background data in Mix MSTAR were collected under different operating modes. Although we harmonized the data amplitudes based on reasonable assumptions, the Chips were collected in spotlight mode while the Clutters were collected in stripmap mode. The two scanning modes can cause differences in image style (particularly spatial distribution) between the foreground and background of the synthetic images. This could lead detection models to find cheating shortcuts arising from the non-realistic effects of the synthetic images instead of extracting common image features.
To address these concerns, we designed two separate experiments to demonstrate the reliability of the synthetic dataset.
4.1. The Artificial Traces Problem
To address the potential problem of artificial traces and to prove the fidelity of the synthetic dataset, we used a model trained on Mix MSTAR to detect intact vehicle images. We randomly selected 25 images from the Chips and expanded them to 204x204 while maintaining their original scale. These images were then stitched into a 1024x1024 image, which was fed to the ROI Transformer trained on Mix MSTAR. As shown in Figure 18a, all of these intact vehicles were accurately localized, with a classification accuracy of 80%.
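A minimal sketch of how such a mosaic can be assembled is given below; the function and argument names are hypothetical, and we assume chips no larger than 204x204 pixels.

```python
import numpy as np

def stitch_chips(chips, pad_to=204, grid=5, canvas=1024):
    """Center-pad each chip to pad_to x pad_to, tile the chips on a grid x grid
    layout, and embed the mosaic in a canvas x canvas image (illustrative sketch)."""
    out = np.zeros((canvas, canvas), dtype=chips[0].dtype)
    for idx, chip in enumerate(chips[:grid * grid]):
        h, w = chip.shape
        tile = np.zeros((pad_to, pad_to), dtype=chip.dtype)
        top, left = (pad_to - h) // 2, (pad_to - w) // 2
        tile[top:top + h, left:left + w] = chip          # keep the chip at its original scale
        r, c = divmod(idx, grid)
        out[r * pad_to:(r + 1) * pad_to, c * pad_to:(c + 1) * pad_to] = tile
    return out
```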
However, 80% accuracy is not an ideal result, given that the background in the Chips is quite simple; the five misidentified vehicles were all T72 subtypes. As a comparison experiment, we trained and tested a ResNet18 classifier on the 20-class Chips of MSTAR, following the same partition strategy as Mix MSTAR, and the classifier easily achieved 92.22% accuracy. However, using class activation maps [58], we found that because each type of vehicle in MSTAR was captured at different angles but at the same location, the high correlation between the backgrounds of the Chips causes the classifier to focus more on the terrain than on the vehicles themselves. As shown in Figure 19, two T72 subtypes were identified based on their tracks and unusual vegetation, with recognition rates of 98.77% and 100%, whereas the accuracies of the two T72 subtypes that did not benefit from background correlation were only 73.17% and 66.67%, respectively. This phenomenon also exists for other vehicle types, indicating that training results obtained on the background-correlated Chips are actually unreliable.
Through the detection of intact vehicles in real images, we have shown that the artificial traces generated during mask extraction did not affect the models. On the contrary, benefiting from mask extraction and background transfer, Mix MSTAR eliminates background correlation, allowing models trained on the high-fidelity synthetic images to focus on vehicle features such as shadows and bright spots, as shown in Figure 18b.
4.2. The Data Variance Problem
To address the potential data variance problem and demonstrate the authentic detection capability of models obtained from Mix MSTAR, we designed the following experiment to prove the effectiveness of the dataset. The real dataset, Mini SAR, was used to train and evaluate models pretrained on Mix MSTAR and models without such pretraining. For the pretrained models, we froze the weights of the first stage of the backbone, forcing the network to extract features in the same way as it does on the synthetic images. The non-pretrained models were loaded with ImageNet weights, as in the regular setting. We selected nine images containing vehicles as the dataset, with seven used for training and two for validation. The images were divided into 1024x1024 patches with a stride of 824. Since the dataset is very small, the training process of each network was unstable; therefore, we extended training to 240 epochs, recorded the mAP on the validation set after each epoch, and reduced the learning rate by a factor of 10 at epochs 160 and 220, with all other settings consistent with those in the Mix MSTAR experiments. It is worth noting that there is no single unified training setting that fits all detectors, owing to their different feature extraction capabilities and their propensity to overfit on the small dataset. Thus, we report the best results on the validation set during training in Table 8.
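In MMRotate, this fine-tuning setup can be expressed roughly as the following config fragment; the checkpoint path is a placeholder, and frozen_stages=1 freezes the stem and the first ResNet stage.

```python
# Sketch of the Mini SAR fine-tuning setup (mmcv/MMRotate config style).
load_from = 'work_dirs/mix_mstar_pretrain/latest.pth'  # hypothetical path to the Mix MSTAR checkpoint
model = dict(
    backbone=dict(
        frozen_stages=1))  # keep the low-level SAR feature extractor fixed during fine-tuning
```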
Firstly, as shown in Table 8, all models obtained an improvement after being pretrained on Mix MSTAR. Since the weights of the first stage are frozen after pretraining, this indicates that the models effectively learn how to extract general low-level features from SAR images. Secondly, since the validation set contains only two images, the results of the non-pretrained models are very unstable, but the standard errors of all models are significantly reduced after pretraining on Mix MSTAR. Additionally, as shown in Figure 20, the pretrained models show very rapid loss reduction during training. As seen in Figure 21, after a few epochs their accuracy on the validation set increases significantly and ultimately reaches a relatively stable result, whereas the loss and mAP of the non-pretrained models remain unstable.
We noticed that Rotated Retinanet and Rotated FCOS are very sensitive to the random seed, making them prone to training failure. This may be due to the weaker feature extraction ability of single-stage detectors, which makes it difficult for them to learn effective feature extraction from a small quantity of data. Therefore, we conducted a comparison experiment in which we added the Mix MSTAR train set to the Mini SAR train set to increase the data size when training the non-pretrained models. As shown in Table 9, both single-stage models obtained significant improvements after mixed training on the two datasets. As seen in Figure 22, pretraining on Mix MSTAR or mixed training with Mix MSTAR both increased the recall and precision of the models and produced more accurate bounding box regression.
Based on the above comparison experiments with real data, we have demonstrated that synthetic data can help networks learn how to extract features from real SAR images, proving the effectiveness and transferability of Mix MSTAR. In addition, the experiments show that the unstable Mini SAR is not suitable as a benchmark dataset for algorithm comparison, especially for single-stage models, and verify that Mix MSTAR effectively addresses the problem of insufficient real data for SAR vehicle detection.
4.3. Potential Application
As more and more creative work leverages synthetic data to advance human understanding of the real world, Mix MSTAR, as the first public SAR vehicle multi-class detection dataset, has many potential applications. Here, we envision two potential use cases:
SAR image generation. While the mutual conversion between optical and SAR imagery is no longer a groundbreaking achievement, current style transfer methods between visible light and SAR are primarily used for low-resolution terrain classification [59]. Given the scarcity of high-resolution SAR images and the abundance of high-resolution labeled visible-light images, a promising avenue is to combine the two to generate more synthetic SAR images, addressing the lack of labeled SAR data and ultimately improving real SAR object detection. Although synthetic images obtained in this way cannot be used for model evaluation, they can help detection models obtain stronger localization ability on real SAR objects through pretraining or mixed training. Figure 23 shows an example of using CycleGAN [60] to transfer vehicle images from the DOTA domain to the Mix MSTAR domain.
Out-of-distribution detection. Out-of-distribution (OOD) detection aims to detect test samples drawn from a distribution different from the training distribution [61]. Using models trained on synthetic images to classify real images was regarded as a challenging problem in SAMPLE [25]. Unlike visible-light imagery, SAR imaging is heavily influenced by the sensor's operating parameters, resulting in significant stylistic differences between images captured under different conditions. Our experiments found that current models generalize poorly across different SAR datasets. If re-annotation and re-training are required for every new dataset, the cost will increase significantly, exacerbating the scarcity of SAR imagery and limiting the application scenarios of SAR-ATR. Therefore, using limited labeled datasets to detect more unlabeled data is an important research direction. We used the ReDet model trained on Mix MSTAR to detect real vehicles in an image from FARAD KA BAND; due to resolution differences, three vehicles were detected after applying multi-scale test techniques, as shown in Figure 24.
Author Contributions
Conceptualization, Z.L. and S.L.; methodology, Z.L.; software, S.L.; validation, Z.L., S.L. and Y.W.; formal analysis, Z.L.; investigation, Z.L.; resources, Z.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, Z.L.; visualization, Z.L.; supervision, Z.L.; project administration, Z.L.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Three data generation methods around MSTAR. (a) Some sample pictures from [20] based on GANs; (b) some sample pictures from [25] based on CAD 3D modeling and electromagnetic calculation simulation; (c) a sample picture from [26] based on background transfer.
Figure 2.
The pipeline of constructing the synthetic dataset.
Figure 3.
Vehicle segmentation label, containing a mask of the vehicle and its shadow and a rotated bounding box of its visually salient part. (a) The label of the vehicle when the boundary is relatively clear; (b) The label of the vehicle when the boundary is blurred.
Figure 4.
(a) The pipeline of extracting grass and calculating the cosine similarity; (b) The histogram of the grass in Chips and Clutters.
Figure 5.
A picture from the test set with 10346*1784 pixels. (a) Densely arranged vehicles; (b) Sparsely arranged vehicles; (c) Town scene; (d) Field scene.
Figure 6.
Data statistics of Mix MSTAR. (a) The area distribution of different categories of vehicles; (b) Histogram of the number of annotated instances per image; (c) The number of vehicles in different azimuths; (d) The length-width distribution and aspect ratio distribution of vehicles.
Figure 16.
(a) Confusion matrix of Oriented RepPoints on Mix MSTAR; (b) The P-R curves of models on Mix MSTAR.
Figure 17.
Detection results of different models on Mix MSTAR. (a) Ground truth; (b) Result of S2A-Net; (c) Result of ROI Transformer; (d) Result of Oriented RepPoints.
Figure 18.
(a) The result of ROI Transformer on concatenated Chips; (b) Class activation map of concatenated Chips.
Figure 19.
(a) (c) T72 A05 Chips; (e) (g) T72 A07 Chips; (b) (d) Class activation maps of T72 A05 Chips; (f) (h) Class activation maps of T72 A07 Chips.
Figure 20.
The loss of pretrained/unpretrained models during training on Mini SAR. (a) Rotated Retinanet; (b) S2A-Net; (c) R3Det; (d) ROI Transformer; (e) Oriented RCNN; (f) ReDet; (g) Gliding Vertex; (h) Rotated FCOS; (i) Oriented RepPoints.
Figure 21.
The mAP of pretrained/unpretrained models during training on Mini SAR. (a) Rotated Retinanet; (b) S2A-Net; (c) R3Det; (d) ROI Transformer; (e) Oriented RCNN; (f) ReDet; (g) Gliding Vertex; (h) Rotated FCOS; (i) Oriented RepPoints.
Figure 22.
Detection results of Rotated Retinanet on Mini SAR. (a) Ground truth; (b) Rotated Retinanet trained on Mini SAR only; (c) Rotated Retinanet pretrained on Mix MSTAR; (d) Rotated Retinanet trained on Mini SAR and Mix MSTAR.
Figure 23.
Style transfer between optical and SAR imagery using CycleGAN. (a) An optical car image with its label from the DOTA domain; (b) The transferred image in the Mix MSTAR domain.
Figure 24.
Detection result of ReDet on FARAD KA BAND. (a) Ground truth; (b) Result.
Table 1.
Detailed information of existing public SAR vehicle datasets with large scenes.
Datasets | Resolution (m) | Image Size (pixel) | Images (n) | Vehicle Quantity | Noise Interference
FARAD X BAND [13] | 0.1016*0.1016 | 1682*3334 - 5736*4028 | 30 | Large | √
FARAD KA BAND [14] | 0.1016*0.1016 | 436*1288 - 1624*4080 | 175 | Large | √
Spotlight SAR [15] | 0.1000*0.1000 | 3000*1754 | 64 | Small | ×
Mini SAR [16] | 0.1016*0.1016 | 2510*1638 - 2510*3274 | 20 | Large | ×
MSTAR [10] Clutters | 0.3047*0.3047 | 1472*1784 - 1478*1784 | 100 | 0 | ×
MSTAR [10] Chips | 0.3047*0.3047 | 128*128 - 192*193 | 20000 | Large | ×
Table 2.
Basic radar parameter of Chips and Clutters in MSTAR.
Collection Parameters | Chips | Clutters
Center Frequency | 9.60 GHz | 9.60 GHz
Bandwidth | 0.591 GHz | 0.591 GHz
Polarization | HH | HH
Depression | 15° | 15°
Resolution (m) | 0.3047*0.3047 | 0.3047*0.3047
Pixel Spacing (m) | 0.202148*0.203125 | 0.202148*0.203125
Platform | airborne | airborne
Radar Mode | spotlight | stripmap
Data type | float32 | uint16
Table 3.
Analysis of grassland data from Chips and Clutters in the same period.
Grassland | Collection Date | Mean | Std | CV (std/mean) | CSIM
BMP2 SN9563 | 1995.09.01 - 1995.09.02 | 0.049305962 | 0.030280159 | 0.614127740 | 0.99826
BMP2 SN9566 | 1995.09.01 - 1995.09.02 | 0.046989963 | 0.028360445 | 0.603542612 | 0.99966
BMP2 SN C21 | 1995.09.01 - 1995.09.02 | 0.046560729 | 0.02830699 | 0.607958479 | 0.99958
BTR70 SN C71 | 1995.09.01 - 1995.09.02 | 0.046856523 | 0.028143257 | 0.600626235 | 0.99970
T72 SN132 | 1995.09.01 - 1995.09.02 | 0.045960505 | 0.028047173 | 0.610245101 | 0.99935
T72 SN812 | 1995.09.01 - 1995.09.02 | 0.04546104 | 0.027559057 | 0.606212638 | 0.99911
T72 SNS7 | 1995.09.01 - 1995.09.02 | 0.041791245 | 0.025319219 | 0.605849838 | 0.99260
Clutters | 1995.09.05 | 63.2881255 | 37.850263 | 0.598062633 | 1
Table 4.
Original imaging algorithm and improved imaging algorithm.
Original imaging algorithm
Input: Amplitude in MSTAR data a > 0, enhance = T or F
Output: uint8 image img
1: fmin ← min(a), fmax ← max(a)
2: frange ← fmax - fmin, fscale ← 255/frange
3: a ← (a - fmin) * fscale
4: img ← uint8(a)
5: if enhance then
6:     hist8 ← hist(img)
7:     maxPixelCountBin ← index[max(hist8)]
8:     minPixelCountBin ← index[min(hist8)]
9:     if minPixelCountBin > maxPixelCountBin then
10:        thresh ← minPixelCountBin - maxPixelCountBin
11:        scaleAdj ← 255/thresh
12:        img ← img * scaleAdj
13:    else
14:        img ← img * 3
15:    img ← uint8(img)
16: Return img

Improved imaging algorithm
Input: Amplitude in MSTAR data a > 0, threshold thresh
Output: uint8 image img
1: for pixel ∈ a do
2:     if pixel > thresh then
3:         pixel ← thresh
4: scale ← 255/thresh
5: img ← uint8(a * scale)
6: Return img
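The improved algorithm amounts to a clip-and-stretch mapping; a minimal NumPy sketch of this intent is shown below (the function name is ours, and the threshold is supplied by the caller).

```python
import numpy as np

def mstar_amplitude_to_uint8(a, thresh):
    """Clip the MSTAR amplitude data at thresh, then map [0, thresh] linearly
    to [0, 255] and cast to uint8 (illustrative sketch of Table 4, right column)."""
    a = np.minimum(np.asarray(a, dtype=np.float64), thresh)  # saturate strong scatterers
    return (a * (255.0 / thresh)).astype(np.uint8)           # linear stretch to the 8-bit range
```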
Table 5.
The division of Mix MSTAR.
Class | Train | Test | Total
2S1 | 192 | 82 | 274
BMP2 | 195 | 392 | 587
BRDM2 | 192 | 82 | 274
BTR60 | 136 | 59 | 195
BTR70 | 137 | 59 | 196
D7 | 192 | 82 | 274
T62 | 191 | 82 | 273
T72 A04 | 192 | 82 | 274
T72 A05 | 192 | 82 | 274
T72 A07 | 192 | 82 | 274
T72 A10 | 190 | 81 | 271
T72 A32 | 192 | 82 | 274
T72 A62 | 192 | 82 | 274
T72 A63 | 192 | 82 | 274
T72 A64 | 192 | 82 | 274
T72 SN132 | 137 | 59 | 196
T72 SN812 | 136 | 59 | 195
T72 SNS7 | 134 | 57 | 191
ZIL131 | 192 | 82 | 274
ZSU234 | 192 | 82 | 274
Total | 3560 | 1832 | 5392
Table 6.
Performance Evaluation of models on Mix MSTAR.
Category | Model | Params (M) | FLOPs (G) | FPS | mAP | Precision | Recall | F1-score
One-stage | Rotated Retinanet | 36.74 | 218.18 | 29.2 | 61.03±0.75 | 30.98 | 89.36 | 46.01
Refine-stage | S2A-Net | 38.85 | 198.12 | 26.3 | 72.41±0.10 | 31.57 | 95.74 | 47.48
Refine-stage | R3Det | 37.52 | 235.19 | 26.1 | 70.87±0.31 | 22.28 | 97.11 | 36.24
Two-stage | ROI Transformer | 55.39 | 225.32 | 25.3 | 75.17±0.24 | 46.90 | 93.27 | 62.42
Two-stage | Oriented RCNN | 41.37 | 211.44 | 26.5 | 73.72±0.45 | 38.24 | 93.56 | 54.29
Two-stage | ReDet | 31.7 | 54.48 | 18.4 | 70.27±0.75 | 45.83 | 89.99 | 60.73
Two-stage | Gliding Vertex | 41.37 | 211.31 | 28.5 | 71.81±0.19 | 44.17 | 91.78 | 59.64
Anchor-free | Rotated FCOS | 32.16 | 207.16 | 29.7 | 72.27±1.27 | 27.52 | 96.47 | 42.82
Anchor-free | Oriented RepPoints | 36.83 | 194.35 | 26.8 | 75.37±0.80 | 34.73 | 95.25 | 50.90
Table 7.
AP50 of each category on Mix MSTAR.
Class | Rotated Retinanet | S2A-Net | R3Det | ROI Transformer | Oriented RCNN | ReDet | Gliding Vertex | Rotated FCOS | Oriented RepPoints | Mean
2S1 | 87.95 | 98.02 | 95.16 | 99.48 | 97.52 | 95.48 | 95.38 | 97.22 | 98.16 | 96.0
BMP2 | 88.15 | 90.69 | 90.62 | 90.82 | 90.73 | 90.67 | 90.65 | 90.66 | 90.80 | 90.4
BRDM2 | 90.86 | 99.65 | 98.83 | 99.62 | 99.03 | 98.14 | 98.39 | 99.72 | 99.22 | 98.2
BTR60 | 71.86 | 88.52 | 88.07 | 88.02 | 85.67 | 88.84 | 86.18 | 86.55 | 86.54 | 85.6
BTR70 | 89.03 | 98.06 | 95.36 | 97.57 | 97.68 | 92.53 | 97.02 | 96.67 | 95.06 | 95.4
D7 | 89.76 | 90.75 | 93.38 | 98.02 | 95.70 | 93.42 | 95.51 | 95.52 | 96.21 | 94.3
T62 | 78.46 | 88.66 | 91.20 | 90.29 | 92.39 | 86.53 | 89.70 | 89.99 | 90.20 | 88.6
T72 A04 | 37.43 | 56.71 | 50.23 | 55.97 | 55.42 | 50.44 | 50.01 | 50.46 | 53.40 | 51.1
T72 A05 | 31.09 | 40.71 | 43.10 | 46.17 | 48.27 | 44.56 | 45.93 | 42.09 | 50.04 | 43.5
T72 A07 | 29.50 | 40.28 | 40.13 | 37.13 | 38.22 | 33.40 | 38.49 | 33.61 | 44.37 | 37.2
T72 A10 | 27.99 | 39.82 | 36.00 | 40.71 | 36.81 | 34.04 | 34.44 | 40.67 | 47.57 | 37.6
T72 A32 | 69.24 | 79.96 | 83.05 | 82.57 | 80.77 | 77.48 | 77.02 | 74.56 | 78.65 | 78.1
T72 A62 | 41.06 | 49.49 | 50.05 | 54.32 | 47.31 | 41.97 | 45.71 | 53.77 | 54.00 | 48.6
T72 A63 | 38.10 | 51.07 | 46.45 | 53.63 | 50.06 | 43.79 | 49.27 | 49.44 | 53.05 | 48.3
T72 A64 | 35.51 | 58.28 | 57.54 | 67.37 | 66.65 | 57.95 | 63.38 | 58.47 | 66.14 | 59.0
T72 SN132 | 34.18 | 54.95 | 45.71 | 59.85 | 58.16 | 56.80 | 52.23 | 54.35 | 65.38 | 53.5
T72 SN812 | 49.27 | 72.01 | 61.86 | 77.42 | 74.23 | 65.13 | 71.34 | 73.33 | 72.14 | 68.5
T72 SNS7 | 43.61 | 59.37 | 56.77 | 66.43 | 65.70 | 57.43 | 64.56 | 64.43 | 67.62 | 60.7
ZIL131 | 96.06 | 96.24 | 97.76 | 99.00 | 97.88 | 98.78 | 96.16 | 95.24 | 99.58 | 97.4
ZSU234 | 91.55 | 95.03 | 96.17 | 99.09 | 96.15 | 98.04 | 94.90 | 98.65 | 99.26 | 96.5
mAP | 61.03 | 72.41 | 70.87 | 75.17 | 73.72 | 70.27 | 71.81 | 72.27 | 75.37 | 71.4
Table 8.
Best mAP of pretrained/unpretrained models on Mini SAR validation set.
Model | Unpretrained | Pretrained
Rotated Retinanet | 38.00±15.52 | 71.40±0.75
S2A-Net | 65.63±1.94 | 69.81±0.89
R3Det | 66.30±2.66 | 70.35±0.18
ROI Transformer | 79.42±0.61 | 80.12±0.01
Oriented RCNN | 70.49±0.47 | 80.07±0.24
ReDet | 79.47±0.58 | 79.64±0.31
Gliding Vertex | 70.71±0.20 | 77.64±0.49
Rotated FCOS | 10.82±3.94 | 74.93±2.60
Oriented RepPoints | 72.72±2.04 | 79.02±0.39
Table 9.
mAP of pretrained/unpretrained/mixed trained models on Mini SAR.
Model | Trained on Mini SAR only | Pretrained on Mix MSTAR | Add Mix MSTAR
Rotated Retinanet | 38.00±15.52 | 71.40±0.75 | 78.62±0.42
Rotated FCOS | 10.82±3.94 | 74.93±2.60 | 77.70±0.10