3.1. Preliminaries
Problem Definition In this work, we focus on processing the task of few-shot object detection (FSOD). Given two sets of classes, a base set Cbase and a novel set Cnovel, where Cbase ∩ Cnovel = ∅. Defining two datasets, a base dataset Dbase with sufficient annotated objects of Cbase and a novel dataset Dnovel with few annotated objects of Cnovel. A few-shot object detector aims at classifying and localizing objects of Cbase ∪ Cnovel by learning from Dbase ∪ Dnovel. In a task of Nn-way K-shot object detection with Nn = |Cnovel|, there are exactly K annotated instances for each novel class in Dnovel. The goal of this work is to train a model that can detect novel classes in Cnovel by only providing K-shot labeled samples for Cnovel and abundant images from Cbase. Basically, images from Cbase are split into support image set Sb containing support images with a close-up of the target object, and query image set Qb containing query images which potentially contains objects belonging to the support class. Given all support images Sb, our model learns to detect objects in Qb. For convenience, we denote Cbase, Cnovel, Dbase and Dnovel as Cb, Cn, Db and Dn in the following sections.
Rethink Region-Based Object Detectors The majority of current few-shot object detection methods rely heavily on the Faster R-CNN framework [
16], which leverages a region proposal network (RPN) to generate potentially relevant bounding boxes to facilitate subsequent detection tasks. The RPN plays a pivotal role, as it must not only distinguish between objects and the background but also filter out negative objects belonging to non-matching support categories. However, under the few-shot detection setting, where support image information is extremely limited, the RPN often struggles. It tends to indiscriminately focus on every potential object with a high objectness score, regardless of whether they belong to the support category or not. This behavior can hamper the generalization of knowledge from base classes to novel classes, and it also places a significant burden on the subsequent classification task of the detector, as it has to deal with a large number of irrelevant objects. Previous studies [
1,
3,
4,
16,
21,
30] have attempted to address this challenge by generating more accurate region proposals. Nevertheless, the issue persists, stemming from the inherent limitations of region-based object detection frameworks within the few-shot learning context. To truly address this challenge, it is essential to develop novel strategies that can effectively leverage the limited support image information, enhancing the discriminative capabilities of the RPN and ensuring that it focuses only on relevant objects, thus improving the overall performance of few-shot object detection systems.
Rethink Transformer-Based Detection Frameworks Transformer [
31] emerged as a revolutionary self-attention-based building block specifically tailored for machine translation tasks. This architecture revolutionizes the way sequences are processed, updating each element by scanning through the entire sequence and subsequently aggregating information from it. Seeking to harness the immense potential of the Transformer, DETRs [
1] introduced an innovative approach by integrating a transformer encoder-decoder architecture into an object detector. This integration enabled the system to tackle the intricate challenge of attending to multiple support classes within a single forward pass. Nevertheless, a notable issue persists: the vision transformers employed in DETRs were randomly initialized, limiting their capabilities to solely processing feature representations extracted by the backbone network. This constraint underscores the need for further advancements to fully unlock the potential of vision transformers in object detection tasks.
3.2. Architecture
Correlation-Aware Region Proposal Network In generic object detection, RPN is useful to provide region proposals and generate object-relevant anchors while suffer from performance drop when in few-shot object detection, since the low-quality region proposals for novel classes and the fatigue to capture inter-class correlation among different classes. Take inspiration of success of RPN-based FSOD framework [
21], we propose a novel network structure based on general RPN which learns the matching correlation between the support set
Sb and queries
Qb.
Figure 2 shows the overall architecture of our proposed Correlation-RPN. The Correlation-RPN can make use of the support information to sensitively aware the similarities and difference between
Sb and
Qb, which is able to provide high-quality region proposals with objects of target or novel classes, while relatively depressing proposals in background or in other categories.
Specifically, we compute the correlational metric between the feature map of
Sb and
Qb in a depth-wise manner. The similarity map then is utilized to build the region proposal generation. In particular, we denote the support features of
Sb as
and feature map of the query
as
, the similarity is defined as:
where
F is the resultant correlation feature map and α, β are control coefficients to prevent overly favoring features of either side. Here the
X is used as the kernel to slide on the query feature map [
32,
33] in a depth-wise cross correlation way [
34]. Our work adopts the top architecture of the Attention RPN [
21]. We empirically find that a kernel size of
K = 1 performs well in our case, since we argue that global feature can provide a great object prior for objectness classification, consistent with [
16]. In our case, the kernel is calculated by averaging on the support feature map
X. The correlation map is processed simultaneously by a 1 × 1 convolution and a 3 × 3 convolution followed by the objectiveness branch and regression branch. The Correlation-RPN is trained jointly with the network and elaborated as in the section 4.3.
Feature Metric Matching Based on the idea of [
1,
29], we integrally migrate the transformer encoder-decoder as the pillars of correlational metric aggregation module into our object detector. The feature metric matching is accomplished in transformer encoder by multi-head attention mechanism. Specifically, given the support features of
Sb, denoted as
, and the query image
, of which feature map is denoted as
, the matching coefficients
M can be obtained by:
where
HW is the feature spatial size,
d is the feature dimensionality,
C is the number of support categories, and
S is a cosine similarity shared by
and
, to ensure they are embedded into the same linear feature projection space. To calculate cosine similarity as the correlational metric of each pair of feature representation of
and
, which is calculated via:
where
and
denote the single feature representation of
and
. Finally, for each
that may contain multiple complex instances, we ensure a fact that the choice of the m potential support cases for each of these instances is same. Therefore, the average correlation score of the m same potential support objects can be considered as the similarity or shared feature representation between
and the m potential
Sb, with which we prefer the
containing the most similar support instance as the powerful support patch of
in training. And this process has been experimentally demonstrated to be helpful for our training. The effectiveness of support-query feature similarity metric mining, i.e., distinguishing support objects similar to the query, is discussed in
Section 4.3.
Encoding Matching In order to achieve class-agnostic object prediction, we propose the utilization of a carefully crafted set of predefined task encodings, which serve as a bridge between the given support classes and the abstract task encodings space. By mapping the support classes to these encodings, we ensure that the final object predictions are constrained within the task encodings space, rather than being limited to predicting specific classes on the surface level. Drawing inspiration from the positional encodings employed in the Transformer architecture, we implement task encodings utilizing sinusoidal functions. This allows us to capture both local and global patterns within the task encodings space, enhancing the representational power of our approach.
Furthermore, encoding matching and feature metric matching share the same matching coefficients. This ensures consistency across different matching processes and simplifies the overall pipeline. The matched encodings
QE are simply obtained through a straightforward process, further streamlining the prediction framework:
where
denote sinusoidal functional multiplication. In essence, our approach offers a more flexible and generalizable framework for object prediction, enabling us to transcend the limitations of traditional class-specific prediction methods and move towards a more abstract and powerful representation of objects.
Modeling Background for Object Prediction Generally, under a few-shot object detection setup, background does not belong to any target classes and usually takes up a lot of space in a support or query image. Those images that objects only account for a small proportion and most of the area is complex background, which we also called hard samples, as shown in
Figure 3. Taking the consideration of this reason, we propose a learnable prototype BG-P and a corresponding task encoding BG-E (fixed to zeros), for explicitly modeling the background class. This can significantly eliminate the matching ambiguity when query is very hard to match any of the given support classes. And we additionally introduce a background-suppression (BS) regularization as an auxiliary branch to help addressing this problem, which will be described in detail in the next section. The final output of the feature metric matching module can be obtained via the following equation:
where
denotes Hadamard product,
denotes background-suppression operation and
denotes sigmoid function. By applying the matching coefficients
M, we filter out features not matched to
Sb, producing a feature map
QF that inhibits the negative impact of hard samples and highlights class-related objects from query set
Qb for each individual support class.
3.3. Training Procedure
Two-Stage Training strategy Our training procedure consists of two stages: the base-class training stage on samples from Cb (Ctrain = Cb), followed by K-shot few-shot fine-tuning stage on a balanced set of samples from both Cb and Cn (Ctrain = Cb ∪ Cn). More precisely, in the second stage where only K labeled samples are individually available for each class in Cn, K samples are randomly selected for each class in Cb to balance the training iterations between Cb and Cn.
Generally, a naive training strategy is matching the objects of same class by constructing a training pair
where the
and
are both in the same
c-th class. However, a powerful model should not only can perform query-support feature similarity mining but also allow capturing the inter-class correlation among different categories. For this reason, according to the different matching results in
Figure 4, we present a novel 4-way tuple-contrast training strategy to match the same category while distinguishing different categories. We randomly choose a query image
, a support image
and a hard sample
containing the same c-th category object and one other support image
containing a different
n-th category object, to construct the training pair
, where c ≠ n. In the training pair
, only the objects of
c-th category in the
are needed and annotated as foreground, while all other objects are neglected and treated as background.
During training, our model learns to match every proposal generated by the Correlation-RPN in the with the object of . Thus, the model needs to not only match the same category objects from and, but also distinguish objects in different categories from . Nevertheless, there are a massive amount of background proposals, especially with , which usually dominate the training. Taking the consideration of this reason, we adjust these training pairs to balance the ratio of proposals between queries and supports. The ratio of is kept as 2:1:1 for , and . According to their matching scores, we pick all N and select top 2N and top N respectively and calculate the matching loss on the selected training pairs.
Detection Loss Function During the training process, we use the multi-task loss on each sampled proposal. It is worth mentioning that we choose the different loss function during optimizing the network in two stages. In the first stage, for each bounding box
, we predict a 2D classification vector
to represent the probability for target object and background respectively. Inspired by [
35,
36,
37], concretely, for a mini-batch of
N RoI box features
, our first stage loss
is defined as follows with considerations tailored for detection:
where x, y denotes locations and
is the hyper-parameter temperature as in [
36]. The
refers to encoded RoI feature of detector head for
i-th region proposal generated by Correlation-RPN,
denotes the IOU score of
with matched ground truth bounding box
, and
denotes the truth annotation.
The second stage output also includes the vector
for distinguishing between background and target object classes. Different from the first stage, following the parameterization in [
14], we present a new regression vector
, to specify a scale-invariant translation and height/width shift of log-space relative to a region proposal. In the second stage, we adopt binary cross entropy (BCE) loss for classification and smooth L1 loss for regression. In combination:
where
refers to class label for target object and
denotes a balancing factor, which we empirically set to 2.
Our total loss function is the combination of the first and second stage loss:
where
= 3 is a balancing factor for second stage loss.
Background-Suppression (BS) Regularization In our proposed structure of CRTED, feature metric matching is developed with the encoder architecture design in transformer by multi-head attention mechanism. It is sure that this design can moderate the training stress for objects with various sizes, but it may still disturb the detector performance when for localization in the scenario of hard samples, especially when in the few-shot condition. For this reason, we propose a novel background-suppression (BS) regularization, by utilizing object knowledge in the domain of ground-truth bounding boxes for each training pair
. Specifically, for the
in
, we first obtain the middle-level
of target domain generating from Correlation-RPN. Then, we adopt a masking method that enables the ground-truth labels of target objects in the image
to be mapped to the convolutional cube. Consequently, we can identify the feature regions corresponding to background, namely
. To minimize the adverse effects of background disturbances, we choose L2 regularization to penalize the activation of
:
With this
, CRTED can depress regions of indifference while pay more attention to where we interest, which is especially important for training in few-shot learning. More details and visualization results of the experiment are shown in Sec. 4.3.
Proposal Consistency Control One of the differences of image classification and object detection is that the former extract semantics from the entire image while the classification signals for the latter come from region proposals. We adopt a settled IoU threshold
to assure the consistency of proposals, with the consideration that low IoU proposals may result in excessive deviation to the center of regressed objects, therefore might include irrelevant semantic information. In the following formula,
is responsible for controlling the consistency of proposals, defined with proposal consistency threshold φ:
where
can re-weight for object proposals with different level of IoU scores. We experimentally find that φ = 0.75 is a good cutoff point which the detector head can be trained according to most centered object recommendations.