3.3.1. Optimization-based class representation learning
The existing optimization-based class representation learning methods can be divided into two categories: (1) learning-a-model methods, which design network architectures that can adapt to target tasks efficiently through only a few gradient descent steps; and (2) learning-to-fine-tune methods, which adapt or directly set the parameters of a trained classifier for novel classes.
Learning a model. In [75], an optimization-based FSFGIC method was proposed which includes two modules: a bilinear feature learning module and a classifier mapping module. Class-level feature representations are obtained from the bilinear feature learning module [76]. Furthermore, a novel “piecewise mappings” strategy is proposed which maps each sub-vector of the class-level feature representations into a corresponding sub-classifier and then combines these sub-classifiers to generate a class-level classifier. The Meta Variance Transfer method [34] learns to augment one class from others by observing variations of real data (e.g., geometric deformation, background changes, simple noise) that can hint at unseen variations in other classes. The model then learns to transfer this variance in the feature space by selecting the variations that are helpful for simulating the unseen test examples of the target class. In order to combine distribution-level and instance-level relations, Yang et al. [77] proposed the distribution propagation graph network (DPGN). The features of support and query images are fed into a dual complete graph network: a point-to-distribution aggregation strategy aggregates instance similarities to construct distribution representations, and a distribution-to-point aggregation strategy then calculates similarities using both distribution-level and instance-level relations. Inspired by the compositional representation of objects in humans, [
78] proposed a neural network architecture that explicitly represents objects as a dictionary of shared components and their spatial composition.
Learning to fine-tune. A weight imprinting strategy was proposed in [79] which aims to directly set the weights of a ConvNet classifier for new categories. A normalization layer with a scaling factor is applied in the classifier so that the features of new-category samples are transformed into normalized activation vectors that can be used directly as the classifier weights for those categories.
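The imprinting step itself is simple. The following is a minimal sketch of the idea under common assumptions (a PyTorch cosine classifier whose rows are L2-normalized; the function names and the scale value are illustrative, not taken from [79]):

```python
import torch
import torch.nn.functional as F

def imprint_weights(base_weight, embeddings, labels, num_new_classes):
    """Append imprinted weight rows for new categories to a cosine classifier.

    base_weight: (C_base, D) existing normalized classifier weights
    embeddings:  (N, D) embedding-network outputs of the novel support samples
    labels:      (N,) labels in [0, num_new_classes)
    """
    new_rows = []
    for c in range(num_new_classes):
        # average the normalized embeddings of class c, then re-normalize;
        # the result is used directly as the classifier weight of class c
        mean_emb = F.normalize(embeddings[labels == c], dim=1).mean(dim=0)
        new_rows.append(F.normalize(mean_emb, dim=0))
    return torch.cat([base_weight, torch.stack(new_rows)], dim=0)

def cosine_logits(features, weight, scale=10.0):
    # scaled cosine similarity between normalized features and weights
    return scale * F.normalize(features, dim=1) @ weight.t()
```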
In [80], a transfer-based method was proposed to generate class representations. A power transform is applied to preprocess the support features so that their distribution becomes closer to Gaussian. Under this Gaussian-like assumption, maximum a posteriori estimation is used to find the estimate of each class center, which is shown to be similar to minimizing a Wasserstein distance. An iterative algorithm based on the Wasserstein distance then estimates the optimal transport from the initial feature distribution to the Gaussian distribution and updates the centers accordingly. In [81], an Adaptive Distribution Calibration (ADC) method was proposed for FSL, which adaptively transfers distribution information from related base classes to calibrate the biased distributions of novel classes.
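For illustration, the power-transform preprocessing used in [80] can be sketched as follows (NumPy; the exponent and epsilon values are common choices in the transfer-based literature, assumed here rather than quoted from the paper):

```python
import numpy as np

def power_transform(features, beta=0.5, eps=1e-6):
    """Map non-negative features toward a more Gaussian-like distribution.

    features: (N, D) support or query features (e.g., post-ReLU, so >= 0)
    """
    f = np.power(features + eps, beta)                    # element-wise power
    return f / np.linalg.norm(f, axis=1, keepdims=True)   # unit-norm rows
```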
3.3.2. Metric-based class representation learning
Many techniques have been put forward for effective metric-based class representations, which can be broadly divided into five categories: feature distribution, attention mechanism, metric strategy for improving class representation, semantic alignment, and multimodal learning.
Feature distribution. It was demonstrated in [82] that GAN-based feature generators [83] suffer from the issue of mode collapse. To address this problem, a variational autoencoder (VAE) [84] and a GAN are combined to form a conditional feature generation model [32] which learns the conditional distribution of image features on the labeled class data and the marginal distribution of image features on the unlabeled class data. Alternatively, a multi-mixed feature distribution can be learned to represent each category, as in RepMet [
85] and used to perform FSFGIC tasks. Davis et al. [27] extended the DeepEMD method [59] by reconstructing each query sample as a weighted sum of components from the same class to obtain a class-level feature distribution. [
86] proposed a re-abstraction and perturbing support pair network (RaPSPNet) for FSFGIC. Specifically, a feature re-abstraction embedding (FRaE) module is designed, which not only effectively amplifies the differences between feature information from different categories but also better extracts the feature information of each image. Furthermore, a novel perturbing support pair (PSP) based similarity measure module evaluates the relationships among a query image and two different categories of support images (a support pair) at the same time. This guides the FRaE module to find salient feature information shared by query and support images of the same category and distinguishable feature information between query and support images of different categories.
Afrasiyabi et al. [25] proposed two distribution alignment strategies that align the novel categories to related base categories in order to obtain better class representations: a centroid alignment strategy and an adversarial alignment strategy based on the Wasserstein distance, both designed to enforce intra-class compactness. Das et al. proposed a non-parametric approach [87] to address the problem that only base-class prototypes are available. They assumed that all class prototype distributions are arranged on a manifold, and first estimate the novel-class prototypes by calculating the mean of the prototypes near the novel samples. A graph is then constructed over all the class prototypes, and an induced absorbing Markov chain is applied to complete the classification task. Inspired by the fact that humans can use learned concepts or components to help them recognize novel classes, [
88] proposed Compositional Prototypical Networks (CPN) to learn a transferable prototype for each human-annotated attribute, termed a component prototype. In order to learn fine-grained structure in the feature space, Luo et al. proposed a two-path network that adaptively learns views of the data [89]. One path performs label-guided classification: the support features belonging to the same class are aggregated into a prototype, and similarities are calculated between the prototypes and the query images (the standard prototypical recipe is sketched below). The other path performs instance-level classification, which produces different views of an image and maps them into the feature space to construct a better fine-grained semantic structure.
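The label-guided path above follows the standard prototypical recipe [48]: a class prototype is the mean of its support embeddings, and queries are scored by their distance to each prototype. A minimal sketch (PyTorch; negative squared Euclidean distance as the score is one common choice):

```python
import torch

def class_prototypes(support_feat, support_labels, n_classes):
    """Mean of the support embeddings of each class."""
    return torch.stack([support_feat[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])

def prototype_scores(query_feat, prototypes):
    # negative squared Euclidean distance to each prototype as the logit
    return -torch.cdist(query_feat, prototypes).pow(2)
```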
[90] proposed to combine frequency features with regular features. In addition to a standard CNN module, a discrete cosine transform is applied to generate frequency feature representations; the two kinds of features are then concatenated to form the final features.
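A frequency branch of this kind can be approximated in a few lines; the sketch below (SciPy/NumPy) is illustrative and not the exact module of [90]:

```python
import numpy as np
from scipy.fft import dctn

def frequency_features(images):
    """Per-channel 2D DCT coefficients as a frequency representation.

    images: (N, C, H, W) array; the DCT is taken over the spatial axes.
    """
    return dctn(images, axes=(-2, -1), norm="ortho")

def fuse_features(cnn_feat, freq_feat):
    # flatten both views and concatenate them into the final feature vector
    n = cnn_feat.shape[0]
    return np.concatenate([cnn_feat.reshape(n, -1),
                           freq_feat.reshape(n, -1)], axis=1)
```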
In order to explore intra-class distribution information, [91] proposed improved prototypical networks (IPN). Compared to the prototypical network, IPN adds a weight distribution module that weights the different samples belonging to the same category, and a distance scaling module that maximizes the inter-class difference while minimizing the intra-class difference via distance measurement at different scales. To gain Gaussian-like distributions, [92] proposed a transfer-based method to process the features belonging to the same class: transforms are introduced to adjust the feature distribution, and a Wasserstein distance based iterative algorithm calculates a prototype for each class. Similarly, [93] proposed an optimal-transport algorithm to transform features into Gaussian-like distributions and estimate the best class centers.
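One common way to realize such optimal-transport center estimation is an entropy-regularized Sinkhorn iteration that softly assigns features to class centers and then updates each center as the weighted mean of its assigned features. The sketch below is illustrative, not the exact algorithm of [92] or [93]:

```python
import numpy as np

def sinkhorn_plan(cost, row_marginals, col_marginals, reg=0.1, n_iters=50):
    """Entropy-regularized optimal transport (Sinkhorn-Knopp).

    cost: (N, K) squared distances from N features to K class centers
    Returns a transport plan P (N, K) whose marginals match the inputs.
    """
    P = np.exp(-cost / reg)
    for _ in range(n_iters):
        P *= (row_marginals / P.sum(axis=1))[:, None]  # fix row sums
        P *= (col_marginals / P.sum(axis=0))[None, :]  # fix column sums
    return P

def update_centers(features, plan):
    # barycentric update: each center is the plan-weighted mean of features
    return (plan.T @ features) / plan.sum(axis=0)[:, None]
```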
Attention mechanism. Attention strategies aim to select discriminative features or regions from the extracted feature space for effective class-level feature representation. In [50], an attention mechanism [94] was applied to locate and reweight semantically relevant local region pairs between query and support samples, which strengthens discriminative objects and suppresses the background. He et al. [95] indicated that object localization (using local discriminative regions) can provide great help for FSFGIC; a self-attention based complementary module which utilizes channel attention and spatial attention was then designed for performing weakly supervised object localization and finding the corresponding discriminative regions. Similarly, [96] utilizes channel attention and spatial attention to find discriminative regions in query and support samples for improving the classification performance of FSFGIC. Alternatively, a novel transformer-based neural network architecture called CrossTransformers [
97] was designed which applies a cross-attention mechanism to find coarse spatial correspondence between the query and the labeled support samples of a class. In [24], an attention mechanism was proposed to mix two modalities (i.e., semantic and visual) and ensure that the representations of attributes lie in the same space as the visual representations.
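As a concrete reference for the channel-plus-spatial attention used by several of the methods above, the following is a CBAM-style sketch in the spirit of [105]; the reduction ratio and kernel size are typical values, assumed here:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel and spatial attention over a feature map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                       # x: (N, C, H, W)
        # channel attention from average- and max-pooled channel statistics
        avg, mx = x.mean(dim=(2, 3)), x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        x = x * ca[:, :, None, None]
        # spatial attention from channel-wise average and max maps
        stats = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], 1)
        return x * torch.sigmoid(self.spatial(stats))
```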
Single prototype-based methods may fail to capture the subtle information of a class. To address this problem, Huang et al. [
98] proposed a descriptor-based multi-prototype network (LMPNet) to learn multiple prototypes per class. An attention mechanism adaptively weights all channels at each spatial position of all samples to obtain local descriptors, and multiple prototypes are constructed from these descriptors, which contain more complete information about a class.
Metric strategy for improving class representation. To obtain discriminative class representations for FSFGIC, image-to-class metric strategies have been proposed. The deep nearest neighbor neural network (DN4) [7] learns optimal class-level local deep feature representations of a class space based on a designed image-to-class similarity measure, under the condition of extremely limited training samples. A discriminative deep nearest neighbor neural network (D2N4) [99] extended DN4 [7] by adding a center loss function [100]; class-level local and global feature representations are then learned to improve the discriminability of features within the DN4 framework [7].
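The image-to-class measure at the heart of DN4 can be stated compactly: each local descriptor of the query finds its k nearest neighbors among all local descriptors of a class's support images, and the cosine similarities are summed. A minimal sketch (PyTorch; variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def image_to_class_score(query_desc, class_desc, k=3):
    """DN4-style image-to-class similarity.

    query_desc: (M, D) local descriptors of one query image
    class_desc: (S, D) local descriptors pooled from all support images
                of one class
    """
    q = F.normalize(query_desc, dim=1)
    s = F.normalize(class_desc, dim=1)
    sim = q @ s.t()                          # (M, S) cosine similarities
    # for each query descriptor, keep its k nearest class descriptors
    return sim.topk(k, dim=1).values.sum()   # higher score = closer class
```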
In [26], Li et al. argued that if samples can be well classified by two different similarity measures at the same time, then the samples of a class are more compactly distributed in a smaller feature space and a more discriminative feature map can be obtained. To this end, a bi-similarity network (BSNet) [26] was proposed in which the class-level feature representations of query and support samples are learned using two different similarity measures. In [
38], Zhu et al. argued that a large amount of unlabeled data has high potential to improve classification performance in FSFGIC tasks, and presented a progressive point-to-set metric learning (PPSML) method [38] for semi-supervised FSFGIC. A self-training strategy selects local and global feature representations from a mixture of labeled and unlabeled samples, and a point-to-set similarity measure is then applied to obtain class-level local or global feature representations for performing FSFGIC tasks. To avoid overfitting and to calculate a robust class representation under the condition of extremely limited training samples, a deep subspace network (DSN) [101] was introduced which transforms each class representation into an adaptive subspace and generates a corresponding classifier. Testing samples are then classified by the nearest subspace classifier, i.e., each query sample is assigned to the class with the shortest Euclidean distance between the query and its projection onto the class-specific subspace.
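The nearest-subspace rule can be sketched directly: each class subspace is spanned by the top singular vectors of its mean-centered support features, and a query is assigned to the class with the smallest residual after projection. An illustrative sketch (PyTorch; the subspace dimension is an assumed hyperparameter):

```python
import torch

def class_subspace(support_feat, dim=4):
    """Truncated SVD basis of the mean-centered support features of a class."""
    mu = support_feat.mean(dim=0)
    U, _, _ = torch.linalg.svd((support_feat - mu).t(), full_matrices=False)
    return mu, U[:, :dim]                        # mean (D,), basis (D, dim)

def nearest_subspace_predict(query_feat, subspaces):
    """Assign each query to the class whose subspace reconstructs it best."""
    residuals = []
    for mu, basis in subspaces:                  # one (mean, basis) per class
        r = query_feat - mu
        proj = r @ basis @ basis.t()             # projection onto the subspace
        residuals.append(((r - proj) ** 2).sum(dim=1))
    return torch.stack(residuals, dim=1).argmin(dim=1)
```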
Triantafillou et al. proposed a mean average precision (m-AP) objective [102] which learns a similarity metric from an information-retrieval perspective. They extended prior work that optimizes for AP so as to account for all possible choices of query among the batch points, and then used the frameworks of SSVM (Structural Support Vector Machine) and DLM (Direct Loss Minimization) to optimize m-AP.
Liu et al. [103] introduced a negative margin loss to reduce inter-class variance and generate more efficient decision boundaries. They integrate a margin parameter into the softmax loss and apply a negative margin, which aims to improve both the discriminability on training classes and the transferability to novel classes of the learned metrics.
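Concretely, a cosine softmax with an additive margin on the target logit captures the idea; setting the margin negative relaxes the decision boundary. The sketch below is illustrative (PyTorch); the margin and scale values are typical choices, not necessarily those of [103]:

```python
import torch
import torch.nn.functional as F

def margin_softmax_loss(features, weight, labels, margin=-0.3, scale=30.0):
    """Cosine softmax loss with an additive margin on the target logit.

    A positive margin tightens class boundaries; a negative margin (as
    argued in [103]) relaxes them, which can transfer better to novel classes.
    """
    cos = F.normalize(features, dim=1) @ F.normalize(weight, dim=1).t()
    target = F.one_hot(labels, num_classes=cos.size(1)).float()
    logits = scale * (cos - margin * target)   # subtract margin on true class
    return F.cross_entropy(logits, labels)
```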
Hilliard et al. [28] proposed a metric-agnostic conditional embeddings (MACO) network. MACO contains four stages: the feature stage extracts features; the relational stage produces a single vector as the representation of each class; the conditioning stage connects the class representations to the query image features, aiming to learn class representations that are more relevant to the query image; and the classifier stage makes the final prediction.
Semantic alignment. It was indicated in [104] that people tend to compare similar objects thoroughly in a pairwise manner, e.g., comparing the heads of two birds first, then their wings and feet. It is therefore natural to enhance feature information during the comparison process. A low-rank pairwise bilinear pooling network [104] was designed to obtain class-level deep feature representations between query and support samples in the way that people compare similar objects. It was indicated in [50] that the dominant object can be located anywhere in an image, so directly calculating the distance between query and support samples may cause ambiguity. To address this problem, semantic alignment metric learning (SAML) [50] was proposed to align the semantically related local regions of samples by a “collect and select” strategy: the similarities of all local region pairs between a query sample and a support class are first collected in a relation matrix, and an attention mechanism [
94] was applied to “select” the semantically relevant pairs. Li et al. [
96] extended the method in [
50], and a convolutional block attention module [
105] was applied to capture discriminative regions. To eliminate the influence of noise and improve the efficiency of the similarity measure, query-relevant regions from support samples are selected for semantic alignment, and multi-scale class-level feature representations are then utilized to represent the discriminative regions of the query and support samples of a class and to perform FSFGIC tasks. In [
25], a centroid associative alignment strategy was proposed to enforce intra-class compactness and obtain better class representations.
Alternatively, an end-to-end graph-based approach called explicit class knowledge propagation network (ECKPN) [
17] was proposed which learns and propagates class representations explicitly. First, a comparison module explores the relationships between paired samples to learn sample representations in instance-level graphs. Second, a squeeze strategy distills the instance-level graph into a class-level graph, which helps obtain class-level visual representations. Third, the class-level visual representations are combined with the instance-level sample representations for performing FSFGIC tasks.
Multimodal learning. Inspired by the prototypical network [
48], a multimodal prototypical network [
106] was designed for mapping text data into the visual feature space by using GANs. In [
24], Huang et al. indicated that some methods which incorporate auxiliary semantic modalities into a metric learning framework only augment the feature representations of samples with available semantics and ignore the query samples, which can forfeit potential improvements in classification performance and lead to a shift between the combined-modality representation and the pure-visual representation. To address this issue, an attributes-guided attention module (AGAM) is proposed which makes more effective use of human-annotated attributes to learn more discriminative class-level feature representations. An attention alignment mechanism distills knowledge from the attribute-guided branch into the pure-visual feature selection process, so that the latter learns to attend to more semantic features without requiring attribute annotations. To better align the visual and language feature distributions that describe the same object class, [
107] proposed a cross-modal distribution alignment module, in which they introduce a vision-language prototype for each class to align the distributions, and adopt the Earth Mover’s Distance (EMD) to optimize the prototypes.
Gu et al. [
108] proposed a two-stream neural network (TSNN), which not only learns features from RGB images but also focuses on steganalysis features via a steganalysis rich model filter layer. The RGB stream distinguishes the differences between support and query images based on global-level features and calculates the representation of each support class, while the steganalysis stream extracts steganalysis features to locate critical regions. An extractor-and-fusion module fuses the two-stream features with a general convolutional block, and an image-to-class deep metric is applied to produce the similarity scores.
Zhang et al. [
109] introduced fine-grained attributes into the prototypical network and proposed a prototype completion network (ProtoComNet). In the meta-training stage, ProtoComNet extracts representative attribute features as priors, and an attention-based aggregator combines the attribute features with the prototype to obtain a completed prototype. In addition, a Gaussian-based prototype fusion strategy learns mean-based prototypes from unlabeled samples and applies Bayesian estimation to fuse the two kinds of prototypes, producing more representative prototypes.
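One way to read the Gaussian-based fusion in [109] is as precision-weighted averaging of two prototype estimates, which is what Bayesian fusion of two Gaussian means reduces to. The sketch below is a simplified interpretation, not the authors' exact procedure:

```python
import numpy as np

def fuse_prototypes(mu_a, var_a, mu_b, var_b):
    """Precision-weighted fusion of two Gaussian prototype estimates.

    mu_a/mu_b: (D,) prototype means (e.g., attribute-completed vs. mean-based)
    var_a/var_b: scalar or (D,) variances of the two estimates
    """
    w_a = var_b / (var_a + var_b)   # trust the lower-variance estimate more
    return w_a * mu_a + (1.0 - w_a) * mu_b
```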