We adopt YOLOX with CSPDarknet as a strong baseline and substantially modify its structure, forming a new high-performance detector, YOLOAX. In the following subsections, we introduce the architecture design of YOLOAX step by step. An overview of the YOLOAX architecture is shown in Figure 2.
3.1. Cross-Stage Partial Attention Module
To exploit the potential of the attention mechanism, we present a cross-stage partial attention module, called CSPAM, which serves as the base module of our backbone, as illustrated in Figure 3.
Given a feature map $X$ as input, CSPAM is a computational unit that maps $X$ to an output feature map. The representations of $X$ are first learned through our proposed channel self-attention module (CSAM), which enables the model to selectively suppress less useful features and emphasize informative ones by learning to use global information. CSAM can be embedded in any CNN module to enhance its ability to learn representations useful for detection. The architecture of CSAM is illustrated in Figure 4, and the results of evaluating the proposed CSPAM are reported in Section 4.2.
Because each learned filter has a local receptive field, it has difficulty exploiting contextual information outside that region. We therefore first apply a global average pooling operation to squeeze global information into one descriptor per channel. The statistic $Z$ is generated by shrinking the spatial dimensions $H \times W$ of the input $X$, so that its $c$-th element $z_c$ is defined as:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j),$$

where $x_c$ denotes the $c$-th channel of $X$.
Subsequently, to model the role of each channel and focus on the most discriminative features, we apply a 1×1 base convolutional operation twice in succession. In our view, each value the model learns here is compressed from the entire 2D feature map of its channel, so it can be regarded, from an information-theoretic perspective, as having a global receptive field. The output is denoted as $S$, which is calculated by:

$$S = \mathrm{BaseConv}(\mathrm{BaseConv}(Z)),$$

where $\mathrm{BaseConv}$ represents a standard 2D convolutional layer followed by batch normalization and SiLU.
For the non-linear activation, which mitigates negative activations and normalizes $S$, we opt for the smoother SiLU instead of ReLU, since SiLU tends to outperform ReLU when training deep models and processing feature maps that carry more complex information. Using $S$ as normalized per-channel weighting factors, the input $X$ is rescaled across its height and width in each channel to obtain the final feature map $U$, which adaptively highlights the more informative representations:

$$U = S \otimes X,$$

where $\otimes$ represents channel-wise multiplication.
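The squeeze, excitation, and rescaling steps of CSAM can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `silu` and `csam` are hypothetical, the two 1×1 base convolutions are stood in for by plain weight matrices `w1` and `w2` that keep the channel count unchanged, and batch normalization is omitted.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def csam(x, w1, w2):
    """Channel self-attention sketch for a feature map x of shape (C, H, W).

    w1 and w2 are (C, C) matrices standing in for the two 1x1 base
    convolutions (assumption: batch normalization and biases omitted).
    """
    # Squeeze: global average pooling over H x W, one value per channel.
    z = x.mean(axis=(1, 2))              # shape (C,)
    # Two successive 1x1 convolutions, each followed by SiLU. On a 1x1
    # spatial map a 1x1 convolution reduces to a matrix-vector product.
    s = silu(w2 @ silu(w1 @ z))          # shape (C,)
    # Rescale: broadcast each channel weight over that channel's H x W map.
    u = x * s[:, None, None]             # shape (C, H, W)
    return z, s, u

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))       # toy input: C=8, H=W=4
w1 = rng.standard_normal((8, 8)) * 0.1   # hypothetical weights
w2 = rng.standard_normal((8, 8)) * 0.1
z, s, u = csam(x, w1, w2)
print(z.shape, s.shape, u.shape)
```

The output `u` has the same shape as the input, which is what allows such a module to be dropped into an existing CNN block without changing the surrounding layers.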
Given the outstanding performance of the residual learning framework, CSPAM adopts the bottleneck architecture of CSPDarknet to mitigate vanishing gradients and to ease the network degradation problem and training difficulties. To reduce the computational density and the number of parameters, the input channels are first expanded to twice their original number by a 1×1 base convolution and then split in half, which requires one fewer convolutional operation than the previous module. The two halves are processed as separate branches: one passes successively through three bottleneck structures, each using a 3×3 base convolution with a stride of two, and its channels are again divided into two halves each time, while the other is left untouched. After each pass through a bottleneck structure, one half of the feature map's channels is retained for stacking at the end, which can be regarded as a dense residual structure or an information extractor.
After initial feature extraction through multiple CSPAMs, the outputs of the third and the penultimate CSPAM, which have different shapes, are saved and fed into the FPN structure for enhanced feature extraction. We keep the original FPN structure with as few modifications as possible and design a pooling module based on spatial attention to extract deterministic feature information from the output of the last CSPAM. Please refer to the next section for more details.
3.2. Spatial Feature Pooling Module
It is crucial for the model to capture the spatial representations and information of features. We therefore propose a Spatial Feature Pooling Module (SFPM), shown in Figure 5, to replace the SPPBottleneck of CSPDarknet. Let the output of the last CSPAM be the input $X$, which is first fed into $g$ and $c$ separately for different feature extraction, where $g$ and $c$ denote feature extractors. To preserve the original features as much as possible, $c$ contains only a single 1×1 base convolution, which reduces the channels of the input map to half the original number and tentatively generates the statistics $X_c$. In contrast, $g$ contains a 3×3 base convolution, a 1×1 base convolution, and a global pooling layer. To explore the importance of each channel, the first two layers perform an initial extraction and yield a feature tensor denoted $X_g$. We then apply global max pooling operations to $X_g$ separately. Specifically, to learn useful spatial information, the global pooling layer adopts three parallel pooling kernels of different sizes $k_1$, $k_2$, and $k_3$ to extract feature representations of different shapes, and the outputs $P_i$ are obtained as:

$$P_i = \mathrm{MaxPool}_{k_i}(X_g), \quad i = 1, 2, 3.$$

Then we concatenate the $P_i$ along the channel dimension and aggregate them with the following equation:

$$P = \mathrm{Concat}(P_1, P_2, P_3).$$

Finally, after concatenating $P$ with $X_c$ and applying a 1×1 base convolutional operation $\mathrm{BaseConv}$, we obtain the most informative output $Y$:

$$Y = \mathrm{BaseConv}(\mathrm{Concat}(P, X_c)).$$
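The data flow through SFPM can be illustrated with a small NumPy sketch. It is a simplified illustration rather than the authors' implementation, and several details are assumptions: the kernel sizes (5, 9, 13) are borrowed from the default SPPBottleneck of YOLOX because the text does not state them, the 3×3 and 1×1 convolutions of the $g$ branch are collapsed into a single channel-mixing product, batch normalization and SiLU are omitted, and all function names and channel counts are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_same(x, k):
    """Stride-1 max pooling with an odd k x k window; keeps the (C, H, W) shape."""
    p = k // 2
    # Pad with -inf so padded cells never win the max.
    padded = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))  # (C, H, W, k, k)
    return windows.max(axis=(3, 4))

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing product; w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def sfpm(x, w_c, w_g, w_out, kernel_sizes=(5, 9, 13)):
    """Sketch of the two-branch pooling module (kernel sizes are an assumption)."""
    x_c = conv1x1(x, w_c)                        # branch c: halve the channels
    x_g = conv1x1(x, w_g)                        # branch g: simplified extraction
    pooled = [max_pool_same(x_g, k) for k in kernel_sizes]
    p = np.concatenate(pooled, axis=0)           # channel-wise concat of the pools
    fused = np.concatenate([p, x_c], axis=0)     # concat with branch c
    return conv1x1(fused, w_out)                 # final 1x1 convolution

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))             # toy input: C=8, H=W=16
w_c = rng.standard_normal((4, 8))                # 8 -> 4 channels
w_g = rng.standard_normal((4, 8))                # 8 -> 4 channels (assumed)
w_out = rng.standard_normal((8, 16))             # (3*4 + 4) -> 8 channels
y = sfpm(x, w_c, w_g, w_out)
print(y.shape)
```

Using stride-1 pooling with same-size padding keeps every pooled map at the input resolution, so the branches can be concatenated along the channel axis without any resizing.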
The output, enriched with information from both branches, is fed into the FPN structure, where it is fused with the other features, each of which has a distinct shape and carries different information. By extracting and combining these features, the FPN structure generates three enhanced features that capture a more comprehensive representation of the input, and these enhanced features are ultimately used for prediction.
3.3. Anchor-Free
Anchor-based detectors for object detection suffer from some known conflicts between the classification and regression tasks. In terms of overall inference complexity and latency, the anchor mechanism may also become a bottleneck, especially when a large number of predictions per image must be moved between devices on several special AI systems. Following the exploration of DenseBox [31] and YOLO, the performance of anchor-free detectors has been widely acknowledged over the past five years: they can attain the same AP on object detection as anchor-based ones while reducing the number of parameters.
Based on the anchor-free manner of YOLOX, we propose a new decoupled head, shown in Figure 6, that omits the IoU branch to lighten the model in terms of both speed and size. First, we obtain three enhanced features of different shapes from the FPN structure and take them as the input feature maps. Each input is then processed separately by two parallel convolutional lines, one for classification (Cls) and one for regression (Reg), which are composed of two 3×3 base convolutional layers and two 1×1 convolutional layers. We take $tr$ to be an operator denoting this series of convolutional operations in each line, which yields the final tensor from which each branch computes its loss:
We still use BCE loss to train the classification branch, but for the regression branch we propose a new loss function, the Generalized Efficient IOU loss (GEIOU loss), because we infer that the original IOU loss may harm performance. Compared with previous losses, GEIOU introduces Generalized Focal Loss (GFL) [24] to achieve a better balance between easy and hard samples in the regression stage and to address the problem of learning distributed and qualified representations for dense detection.
In GEIOU, given the predicted box $B$ and the ground truth box $B^{gt}$, we divide the calculation of the Reg loss into two parts: the first part uses the EIOU loss [21] to measure the IOU between $B$ and $B^{gt}$; the second part uses GFL to locate boxes accurately under arbitrary distributions and to depict the flexible distributions found in real data. The EIOU loss is defined as follows:

$$\mathcal{L}_{EIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{C_w^2 + C_h^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2},$$

where $b$ and $b^{gt}$ are the center points of $B$ and $B^{gt}$, $\rho(\cdot)$ is the Euclidean distance, $h$, $w$ and $h^{gt}$, $w^{gt}$ indicate the height and width of $B$ and $B^{gt}$, respectively, and $C_h$ and $C_w$ represent the height and width of the smallest enclosing box that covers both boxes.
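As a concrete illustration of the quantities just defined, the following pure-Python sketch evaluates the standard EIOU loss [21] for a single pair of boxes: one minus the IOU, plus the squared center distance, width difference, and height difference, each normalized by the size of the smallest enclosing box. The corner format `(x1, y1, x2, y2)`, the function name `eiou_loss`, and the small `eps` stabilizer are assumptions made for the example; the GFL reweighting that turns EIOU into GEIOU is not included.

```python
def eiou_loss(box, box_gt, eps=1e-9):
    """EIOU loss for two boxes given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    wg, hg = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)

    # Squared distance between the two box centers.
    dx = (x1 + x2) / 2.0 - (gx1 + gx2) / 2.0
    dy = (y1 + y2) / 2.0 - (gy1 + gy2) / 2.0

    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)
    width_term = (w - wg) ** 2 / (cw * cw + eps)
    height_term = (h - hg) ** 2 / (ch * ch + eps)
    return 1.0 - iou + center_term + width_term + height_term

print(eiou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes: close to zero
print(eiou_loss((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes: greater than one
```

Unlike the plain IOU loss, the center, width, and height terms remain informative for disjoint boxes, where the IOU itself is zero and provides no gradient.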
A remaining problem in bounding box regression is that, when high-quality anchors are far fewer than poor samples with large regression errors, the poor samples may produce large gradients that disturb training. To tackle this problem, we reweight the EIOU loss by the value of GFL and obtain $\mathcal{L}_{GEIOU}$ as follows:
where the specific formulation and derivation of GFL can be found in [24], and $\beta$ can be seen as a controlling parameter that inhibits outliers. We also conduct ablation studies and observe significant improvements over counterparts that use other IOU losses. More details are given in Section 4.2.
3.4. Simple Task Assigner
Because advanced label assignment has become an important advance in object detection, we take SimOTA [9] as our starting point and optimize it slightly. Considering the strength of dynamic label assignment, we combine it with TaskAlignedAssigner [32] in a weighted way and propose a simple task assigner, named STA, for the optimal transport problem.
To reduce the cost between a ground truth box and a prediction, we first add a new element $E$, which denotes the samples excluded from the intersection part of the two, to the cost function of SimOTA, and use a coefficient $\mu$ to control the degree of scaling of $E$. The new cost $c$ we propose is calculated as:

$$c = \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{reg} + \mu E,$$

where $\mathcal{L}_{cls}$ and $\mathcal{L}_{reg}$ are the Cls loss and Reg loss between the ground truth box and the prediction. $\lambda_1$ and $\lambda_2$ are balancing coefficients, and $\mu$ is a controlling coefficient, an extremely large constant that we set to $10^5$, which forces the detector to prefer samples within the intersection when matching during cost minimization.
Then, inspired by the assignment strategy of TaskAlignedAssigner, which selects positive samples based on scores weighted by the Cls and Reg scores, we modify the cost as in the following equation:
For each ground truth, the degree of alignment is measured by multiplying the two weighted losses, and finally the top $k$ predictions with the least cost are selected as its positive samples. As shown in Section 4.2, STA proves effective, raising the performance of our detector from 50.7% AP to 51.3% AP.
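The matching procedure can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it uses an additive weighted cost with a large penalty of $10^5$ on samples outside the intersection, followed by top-$k$ selection of the lowest-cost predictions for each ground truth. The function names, the coefficient values `lam_cls=1.0` and `lam_reg=3.0`, and the toy loss values are all assumptions, and the multiplicative alignment form of the final cost is not reproduced here.

```python
import numpy as np

def sta_cost(cls_loss, reg_loss, outside, lam_cls=1.0, lam_reg=3.0, mu=1e5):
    """Weighted matching cost; all inputs have shape (num_gt, num_pred).

    outside is 1 where a prediction lies outside the intersection region
    and 0 otherwise. lam_cls and lam_reg are hypothetical balancing values.
    """
    return lam_cls * cls_loss + lam_reg * reg_loss + mu * outside

def select_topk(cost, k):
    """For each ground truth (row), indices of the k lowest-cost predictions."""
    return np.argsort(cost, axis=1)[:, :k]

# Toy example: one ground truth and four candidate predictions.
cls_loss = np.array([[0.2, 0.9, 0.4, 0.1]])
reg_loss = np.array([[0.3, 0.2, 0.5, 0.1]])
outside = np.array([[0.0, 0.0, 0.0, 1.0]])   # last prediction is outside

cost = sta_cost(cls_loss, reg_loss, outside)
positives = select_topk(cost, k=2)
print(positives)
```

In this toy example the last prediction has the smallest classification and regression losses, yet the large penalty keeps it out of the selected positives, which is the intended effect of the controlling coefficient.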