Self-Attention and Adversary Guided Hashing Network for Cross-Modal Retrieval

Recently, deep cross-modal hashing networks have received increasing interest due to their superior query efficiency and low storage cost. However, most existing methods pay little attention to the hash representation learning part, so the semantic information of the data cannot be fully exploited, and they may also neglect the high-ranking relevance and consistency of hash codes. To solve these problems, we propose a Self-Attention and Adversary Guided Hashing Network (SAAGHN). Specifically, it employs a self-attention mechanism in the hash representation learning part to extract rich semantic relevance information. Meanwhile, to keep hash codes modality-invariant, adversarial learning is adopted in the hash code learning part. In addition, to generate hash codes with higher-ranking relevance and to avoid bad local minima early in training, a new batch semi-hard cosine triplet loss and a cosine quantization loss are proposed. Extensive experiments on two benchmark datasets show that SAAGHN outperforms other baselines and achieves state-of-the-art performance.


Introduction
With the explosive growth of multimedia data in social networks and search engines, including texts, audio, videos and images, efficient information retrieval across different modality types has become a hot spot. For example, given an image, one may want to obtain all semantically related captions from the database. However, due to the inconsistent distributions and representations of different modalities, unifying them and bridging their semantic gaps efficiently and effectively poses new challenges.
Existing cross-modal retrieval methods can be roughly divided into two categories: latent embedding learning methods and hashing-based methods. Many real-valued latent embedding methods have been proposed, including topic models [1,2], subspace learning [3][4][5] and deep models [6][7][8][9], but their high computational complexity and low search efficiency remain problematic. The other universal solution to cross-modal retrieval is hashing-based methods, which map high-dimensional features into compact binary codes. Many cross-modal hashing methods have been proposed to bridge the "heterogeneity gap" by finding a common latent representation space (i.e., Hamming space) in which the similarities between features can be directly estimated with the same distance metric.
To date, many cross-modal hashing methods have been proposed, in both unsupervised and supervised manners. Unsupervised cross-modal hashing methods utilize the original multi-modal features to explore the latent data structure and distribution; examples include Collective Matrix Factorization Hashing (CMFH) [10], Inter-Media Hashing (IMH) [11], Unsupervised Generative Adversarial Cross-modal Hashing (UGACH) [12], and Latent Semantic Sparse Hashing (LSSH) [13]. Supervised methods, on the other hand, leverage semantic labels as supervised information to learn hash coding functions. In this way, supervised cross-modal hashing methods can usually achieve better retrieval accuracy than unsupervised ones.

Related Work
In this section, we briefly review some typical related work on hashing based cross-modal retrieval.

Deep Cross-Modal Hashing
Since previous cross-modal hashing methods were built on shallow architectures, they cannot explore the complex nonlinear correlations across different modalities; deep cross-modal hashing retrieval has therefore attracted increasing attention. One of the most representative works is deep cross-modal hashing (DCMH) [23], which simultaneously learns features and hash codes in an end-to-end deep learning framework. Correlation hashing network (CHN) [24] adopts a hybrid deep architecture to jointly learn good data representations tailored to hash coding and to formally control the quantization error. Pairwise relation guided deep hashing (PRDH) [25] jointly imposes inter-modality and intra-modality pairwise constraints to enhance semantic preservation. Although these methods have achieved appealing performance, they all lose the rich information from the intermediate layers and the higher-ranking correlations among inter-modality pairs.

Generative Adversarial Networks (GAN)
Generative adversarial networks (GAN) [30] have achieved remarkable progress in image generation, representation learning, super resolution and other fields since they were proposed, and many GAN-based deep cross-modal hashing methods have followed. Adversarial cross-modal retrieval (ACMR) [31] adds discriminators to distinguish the features of different modalities and is the first method to apply GANs to cross-modal retrieval. Self-supervised adversarial hashing (SSAH) [27] leverages two adversarial networks to maximize the semantic relevance and consistency across modalities. Compared with SSAH, our method replaces the label network with the direct use of label information to bridge the modality gap, and adds an adversarial module at the hash code learning step to ensure the consistency of hash codes across modalities.

Attention Mechanism
Recently, attention-based deep networks have drawn increasing interest because of their appealing performance in various tasks, such as image classification [32], visual question answering [33] and person re-identification [34]. Attention-aware deep adversarial hashing (DAH) [35] adaptively divides the hash representations into attended and unattended parts. Adversarial guided asymmetric hashing (AGAH) [28] adopts a multi-label attention map to preserve semantic information. In spite of this progress, self-attention has not yet been explored in the context of cross-modal hashing. Thus, we propose a self-attention constraint module to enhance the preservation of global, long-range dependencies within the internal representations of different modalities.

Problem Definition
We use uppercase letters to represent matrices, such as X, and lowercase letters to represent vectors, such as y. $G^T$ denotes the transpose of G, and $\|\cdot\|_F$ denotes the Frobenius norm. $\mathrm{sign}(\cdot)$ is an element-wise sign function defined as follows:

$$\mathrm{sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases} \qquad (1)$$

Without loss of generality, assume there are N instances of image-text feature pairs. Let $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_i\}_{i=1}^{N}$ denote the feature vectors of the image modality and the text modality, where $x_i \in \mathbb{R}^{d_x}$ is the $d_x$-dimensional feature vector of the image modality and $y_i \in \mathbb{R}^{d_y}$ is the $d_y$-dimensional feature vector of the text modality. The similarity matrix $S = \{S^{xx}, S^{yy}\}$ measures the label relevance of two intra-modal instances, where $S^{xx}_{ij} = 1$ or $S^{yy}_{ij} = 1$ if the two intra-modal instances share at least one label, and $S^{xx}_{ij} = 0$ or $S^{yy}_{ij} = 0$ otherwise. Given the training data X, Y and S, the goal of SAAGHN is to learn two modality-specific hashing functions $h^{(x)}(x)$ and $h^{(y)}(y)$ that project the feature vectors of the image and text modalities into binary codes $B^X \in \{-1,+1\}^c$ and $B^Y \in \{-1,+1\}^c$, where c is the length of the hash code. The framework of cross-modal hashing can be divided into a hash representation learning part and a hash function learning part. $F = \{f^x_i \mid i = 1, 2, \cdots, N\} \in \mathbb{R}^{N \times c}$ and $G = \{g^y_i \mid i = 1, 2, \cdots, N\} \in \mathbb{R}^{N \times c}$ denote the learned hash representations of the image and text modalities. Finally, the hash codes are obtained by applying the sign function to the hash representations F and G: $B^X = \mathrm{sign}(F)$, $B^Y = \mathrm{sign}(G)$.
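As a purely illustrative companion to this definition, the following PyTorch sketch computes binary codes from learned hash representations with an element-wise sign; the tensor names and sizes are hypothetical, not the authors' code:

```python
import torch

def to_hash_codes(reps: torch.Tensor) -> torch.Tensor:
    """Map real-valued hash representations (N x c) to binary codes in {-1, +1}^c.

    torch.sign maps 0 to 0, so zeros are pushed to +1 to match the convention
    sign(x) = +1 for x >= 0 used above.
    """
    codes = torch.sign(reps)
    codes[codes == 0] = 1
    return codes

# Hypothetical example: N = 4 instances, c = 8-bit codes.
F = torch.randn(4, 8)   # learned image-modality hash representations
G = torch.randn(4, 8)   # learned text-modality hash representations
B_X, B_Y = to_hash_codes(F), to_hash_codes(G)
```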

Network Architecture of SAAGHN
The feature extractor is composed of two deep neural networks, one for image-modality feature learning and one for text-modality feature learning. Inspired by SCAHN, a self-attention constraint module is proposed to guide the intermediate layers as additional supervision. The framework of the self-attention constraint module is shown in Fig. 2: it fuses the features extracted from the intermediate layers into a features pool, which is then combined with a self-attention module to obtain the weighted features.
The whole SAAGHN framework is shown in Fig. 1; it learns hash representations and hash codes in an end-to-end architecture. The neural network adopted for both ImageNet and TextNet is ResNet, and image-text pairs are used as the input data for the whole architecture. Inspired by SSAH [27], text instances are represented as bag-of-words (BoW) vectors, from which multi-scale features are generated by multiple average pooling layers and fused by an up-sampling operation. In this way, multi-scale semantic relevance is preserved for the text modality, and the same module can be applied to both the text and image modalities.
Inspired by SCAHN [36], we create side branches from the intermediate layers to build robust features. Without loss of generality, assume there are K blocks in ResNet, corresponding to K side branches. As shown in Fig. 2, a global average pooling layer and a 1×1 convolution layer are added to each branch. All blocks are fused into a features pool, which is transformed into the final hash representation by a self-attention module. Finally, the hash representations of the image and text modalities are denoted $f^x_i$ and $g^y_j$.
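The exact wiring of this module is not specified above, so the following is only a minimal sketch under the assumption that the K side-branch features share a common width, are stacked into a features pool, and are re-weighted by scaled dot-product self-attention; the class name, dimensions and projection layers are our own illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionFusion(nn.Module):
    """Fuse K side-branch features (one per ResNet block) with self-attention."""
    def __init__(self, dim: int, code_len: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, code_len)  # project fused feature to hash representation

    def forward(self, pool: torch.Tensor) -> torch.Tensor:
        # pool: (batch, K, dim) -- features pool built from the K side branches,
        # each produced by global average pooling and a 1x1 convolution.
        q, k, v = self.query(pool), self.key(pool), self.value(pool)
        attn = F.softmax(q @ k.transpose(-2, -1) / pool.size(-1) ** 0.5, dim=-1)
        fused = (attn @ v).mean(dim=1)       # (batch, dim) weighted fusion
        return torch.tanh(self.out(fused))   # hash representation in (-1, 1)

# Hypothetical usage: K = 4 ResNet blocks, 256-d branch features, 32-bit codes.
module = SelfAttentionFusion(dim=256, code_len=32)
f_x = module(torch.randn(8, 4, 256))         # (8, 32) image hash representations
```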

Hash Function Learning
In order to generate high-quality and high-ranking relevance hash codes, several loss functions are proposed by exploring the semantic relevance in intra-modality and inter-modality.
Adversarial Loss. To generate modality-invariant hash codes across modalities, adversarial learning is introduced in the hash function learning part. We design two discriminators ($D_x$ and $D_y$) for the image and text modalities, each a two-layer feed-forward neural network with parameters $\theta_{D_x}$ and $\theta_{D_y}$. For the text modality discriminator $D_y$, the image network is regarded as a generator of text hash codes: hash codes generated from the text modality are real text hash codes, while hash codes from the image network are considered fake. This discriminator is trained to distinguish whether a text hash code is real or fake. The image modality discriminator works the same way.
In the process of adversarial learning, the text network attempts to generate image hash codes to cheat the image discriminator, and the image network tries to generate text hash codes to cheat the text discriminator. The adversarial loss in the hash function learning part, $L_{adv}$, is a cross-entropy loss:

$$L_{adv} = L^x_{adv} + L^y_{adv}, \qquad L^*_{adv} = -\sum_{i=1}^{n} \left( \log D_*\left(b^*_i\right) + \log\left(1 - D_*\left(\hat{b}^*_i\right)\right) \right), \qquad (2)$$

where $L^*_{adv}$, $* \in \{x, y\}$, denotes the adversarial loss of the image or text modality, $b^*_i$ is the real hash code of the i-th image or text instance, and $\hat{b}^*_i$ is the corresponding fake code generated from the other modality.
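A minimal sketch of this scheme, assuming a standard binary cross-entropy GAN objective (the exact discriminator architecture and loss weighting of SAAGHN are not fully specified here); the class and function names are illustrative:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Two-layer feed-forward discriminator over c-bit hash representations."""
    def __init__(self, code_len: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_len, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),   # logit: real (own-modality) vs. fake code
        )

    def forward(self, b: torch.Tensor) -> torch.Tensor:
        return self.net(b)

def adversarial_losses(disc: Discriminator, real: torch.Tensor, fake: torch.Tensor):
    """Cross-entropy GAN losses for one modality's discriminator."""
    bce = nn.functional.binary_cross_entropy_with_logits
    # Discriminator objective: label own-modality codes 1, cross-modality codes 0.
    # fake is detached so this term does not update the generating network.
    d_loss = bce(disc(real), torch.ones(real.size(0), 1)) + \
             bce(disc(fake.detach()), torch.zeros(fake.size(0), 1))
    # Generator objective: make the discriminator label fake codes as real.
    g_loss = bce(disc(fake), torch.ones(fake.size(0), 1))
    return d_loss, g_loss

# Hypothetical usage for the text discriminator D_y: text-network codes are
# "real", and image-network codes are the "fake" ones trying to fool it.
D_y = Discriminator(code_len=32)
d_loss, g_loss = adversarial_losses(D_y, real=torch.randn(8, 32), fake=torch.randn(8, 32))
```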
Batch Semi-hard Cosine Triplet Loss. To preserve higher-ranking relationships between modalities, a pairwise loss is usually adopted, as in [25]. However, a pairwise loss cannot fully explore the discriminability between instances, so we propose a batch semi-hard cosine triplet loss inspired by the cosine triplet loss used in AGAH [28].
Take the image modality as an example: a triplet in our method takes the form $(b^x_i, b^y_j, b^y_k)$, where $b^x_i$ is an anchor image hash code, $b^y_j$ a positive text hash code, and $b^y_k$ a negative one; triplets for the text modality are constructed the other way around. "Cosine" triplet loss means that the cosine distance is used as a surrogate of the Hamming distance, because of the gap between Hamming distance and inner product for continuous representations, as noted in CHN [24]. The cosine triplet loss for the image modality can be formulated as

$$L^x_{tri} = \sum_{(i,j,k)} \max\left(0,\; m + d\left(b^x_i, b^y_j\right) - d\left(b^x_i, b^y_k\right)\right), \qquad (3)$$

where $d(\cdot,\cdot)$ denotes the cosine distance and m is the margin parameter.
In the same way, the cosine triplet loss for the text modality can be formulated as

$$L^y_{tri} = \sum_{(i,j,k)} \max\left(0,\; m + d\left(b^y_i, b^x_j\right) - d\left(b^y_i, b^x_k\right)\right). \qquad (4)$$

Although the cosine triplet loss in [28] has achieved appealing results, it still suffers from some disadvantages. On the one hand, as the dataset grows, the number of triplets grows rapidly, which makes training inefficient. On the other hand, in hard triplet loss learning, selecting the hardest negatives in practice leads to bad local minima early in training; specifically, it can result in a collapsed model.
To solve the problems mentioned above, inspired by FaceNet [37] and the Batch Hard triplet loss, semi-hard triplet selection and a batch-hard loss form are used (a code sketch of this loss is given at the end of this section). The principle of semi-hard selection is

$$d\left(b^x_i, b^y_j\right) < d\left(b^x_i, b^y_k\right) < d\left(b^x_i, b^y_j\right) + m, \qquad (5)$$

where $d(\cdot,\cdot)$ denotes the distance between two instances. In this way, the negative is farther from the anchor than the positive example, but still hard enough to explore the relevance between instances. Moreover, based on semi-hard selection, the loss can be reformulated in a batch-hard manner, taking the hardest positive and the hardest semi-hard negative for each anchor in the batch:

$$L^x_{tri} = \sum_{i=1}^{n} \max\Big(0,\; m + \max_{j \in \mathcal{P}_i} d\big(b^x_i, b^y_j\big) - \min_{k \in \mathcal{N}^{semi}_i} d\big(b^x_i, b^y_k\big)\Big), \qquad (6)$$

$$L^y_{tri} = \sum_{i=1}^{n} \max\Big(0,\; m + \max_{j \in \mathcal{P}_i} d\big(b^y_i, b^x_j\big) - \min_{k \in \mathcal{N}^{semi}_i} d\big(b^y_i, b^x_k\big)\Big), \qquad (7)$$

where $\mathcal{P}_i$ and $\mathcal{N}^{semi}_i$ denote the positive set and the semi-hard negative set of the i-th anchor. Finally, combining (6) and (7), the total triplet loss is

$$L_{tri} = L^x_{tri} + L^y_{tri}. \qquad (8)$$

Hamming Distance Intra-Loss. Although the adversarial loss generates modality-invariant hash codes and the batch semi-hard cosine triplet loss preserves higher-ranking relationships, both concentrate on inter-modality relations and neglect the similarity between intra-modality instances. The Hamming distance intra-loss is introduced to solve this problem. The similarity of the learned hash representations $f^x_i$ and $g^y_j$ is represented by $S = \{S^{xx}, S^{yy}\} \in \{0, 1\}$ and evaluated by their inner product: $S^*_{ij} = 0$ means the two intra-modality instances are dissimilar, so the inner product should be small, and $S^*_{ij} = 1$ otherwise. The likelihood function can be defined as

$$p\left(S^*_{ij} \mid M_{*i}, M_{*j}\right) = \begin{cases} \sigma\left(\theta_{ij}\right), & S^*_{ij} = 1 \\ 1 - \sigma\left(\theta_{ij}\right), & S^*_{ij} = 0 \end{cases} \qquad (9)$$

where $\theta_{ij} = \alpha M_{*i} M_{*j}^T$, $M_* = F_*$ or $G_*$ ($F_*$ and $G_*$ denote the learned hash representations of the image and text modalities), $\alpha$ is a control parameter for different hash-bit lengths, set to $\alpha = 2 - \log_2(c/32)$, and $\sigma\left(\theta_{ij}\right) = \frac{1}{1 + e^{-\theta_{ij}}}$. We use the negative log-likelihood as the Hamming distance loss:

$$L_{intra\text{-}image} = -\sum_{i,j} \left( S^{xx}_{ij}\,\theta_{ij} - \log\left(1 + e^{\theta_{ij}}\right) \right), \qquad (10)$$

with $L_{intra\text{-}text}$ defined analogously on $S^{yy}$. The intra-modality Hamming distance loss is then

$$L_{intra} = L_{intra\text{-}image} + L_{intra\text{-}text}. \qquad (11)$$

Quantization Loss. Although the cosine triplet loss preserves high-ranking correlations, this surrogate may suffer from a disagreement between a small cosine distance and a large Hamming distance: as shown in Fig. 3, two representations can have a small cosine distance while lying on different sides of the hyper-plane, so their Hamming distance remains large. Thus, we design the cosine quantization loss

$$L_{cos} = L^x_{cos} + L^y_{cos} = \sum_i \frac{1}{1 + \log\left(\exp\left(\cos\left(\left|u_i\right|, \mathbf{1}\right)\right)\right)} + \sum_i \frac{1}{1 + \log\left(\exp\left(\cos\left(\left|v_i\right|, \mathbf{1}\right)\right)\right)}, \qquad (12)$$

where $u_i$ and $v_i$ are the hash representations of the i-th image and text instances and $\mathbf{1}$ is the all-ones vector.

Modality Invariance Loss. To obtain a lower quantization loss when projecting the hash representations $f^x_i \in \mathbb{R}^c$ and $g^y_i \in \mathbb{R}^c$ to hash codes $b^x_i = \mathrm{sign}\left(f^x_i\right) \in \{-1, 1\}^c$ and $b^y_i = \mathrm{sign}\left(g^y_i\right) \in \{-1, 1\}^c$, especially for unbalanced outlier codes, we propose a modality invariance loss that minimizes the distance between the hash representations of image-text pairs:

$$L_{inv} = \sum_{i=1}^{N} \left\| f^x_i - g^y_i \right\|_2^2. \qquad (13)$$
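As referenced above, the following is a minimal sketch of how a batch semi-hard cosine triplet loss of this kind might be implemented for the image-to-text direction; the mining details (hardest positive per anchor, hardest semi-hard negative) are our reading of the loss, not the authors' code:

```python
import torch
import torch.nn.functional as F

def batch_semi_hard_cosine_triplet(f_x, g_y, sim, margin=0.5):
    """Sketch of a batch semi-hard cosine triplet loss for one query direction.

    f_x: (n, c) image hash representations (anchors)
    g_y: (n, c) text hash representations
    sim: (n, n) binary cross-modal similarity (1 = share a label)
    """
    # Cosine distance d(a, b) = 1 - cos(a, b) as a surrogate for Hamming distance.
    dist = 1.0 - F.normalize(f_x) @ F.normalize(g_y).t()          # (n, n)

    pos = dist.clone(); pos[sim == 0] = -float('inf')
    d_ap = pos.max(dim=1, keepdim=True).values                    # hardest positive per anchor

    # Semi-hard negatives: farther than the positive but within the margin band,
    # i.e. d_ap < d_an < d_ap + margin.
    semi_hard = (sim == 0) & (dist > d_ap) & (dist < d_ap + margin)
    neg = dist.clone(); neg[~semi_hard] = float('inf')
    d_an = neg.min(dim=1).values                                  # hardest semi-hard negative

    valid = torch.isfinite(d_an) & torch.isfinite(d_ap.squeeze(1))
    loss = F.relu(margin + d_ap.squeeze(1) - d_an)[valid]
    return loss.mean() if valid.any() else dist.new_zeros(())

# Hypothetical usage: 128 pairs, 32-bit representations, binary similarity.
f_x, g_y = torch.randn(128, 32), torch.randn(128, 32)
sim = (torch.rand(128, 128) > 0.5).long()
loss_x2y = batch_semi_hard_cosine_triplet(f_x, g_y, sim)
```

The total triplet loss would sum this term with its text-to-image counterpart, mirroring (8).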

Optimization
By assembling the above five loss functions together, the final overall loss function is given as follows:

$$\min_{\theta_x, \theta_y} L = L_{adv} + L_{tri} + L_{intra} + L_{cos} + L_{inv}, \qquad (14)$$

where $\theta_x$, $\theta_y$ are the parameters of the image and text networks except the discriminators in the adversarial learning part. Note that $\theta_{G_x}$, $\theta_{G_y}$ are part of $\theta_x$, $\theta_y$. An alternating optimization strategy is employed to optimize Equation (14): at the end of each epoch, some parameters are optimized while the others are fixed. The whole optimization algorithm for SAAGHN is outlined in Algorithm 1.
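A toy sketch of this alternating strategy (stand-in linear networks, random data, and a single stand-in invariance term in place of the full SAAGHN objective; not the actual training code) might look like this:

```python
import torch
import torch.nn as nn

# Minimal stand-ins for the image/text networks and discriminators; the real
# SAAGHN networks are ResNet-based, so everything here is illustrative only.
image_net, text_net = nn.Linear(512, 32), nn.Linear(1386, 32)
D_x = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 1))
D_y = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_net = torch.optim.Adam(list(image_net.parameters()) + list(text_net.parameters()), lr=3e-4)
opt_disc = torch.optim.Adam(list(D_x.parameters()) + list(D_y.parameters()), lr=3e-4)

ones, zeros = torch.ones(128, 1), torch.zeros(128, 1)
for step in range(100):                                     # toy training loop
    x, y = torch.randn(128, 512), torch.randn(128, 1386)    # fake image/text features
    f, g = torch.tanh(image_net(x)), torch.tanh(text_net(y))

    # Step 1: update the discriminators with the hashing networks fixed (detach).
    d_loss = (bce(D_y(g.detach()), ones) + bce(D_y(f.detach()), zeros) +
              bce(D_x(f.detach()), ones) + bce(D_x(g.detach()), zeros))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Step 2: update the hashing networks; the adversarial term tries to fool
    # the (now fixed) discriminators, and a stand-in invariance term replaces
    # the remaining losses to keep the sketch self-contained.
    adv = bce(D_y(f), ones) + bce(D_x(g), ones)
    loss = adv + (f - g).pow(2).sum(dim=1).mean()
    opt_net.zero_grad(); loss.backward(); opt_net.step()
```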

Experiment and Discussion
In the experiments, SAAGHN was evaluated on two popular cross-modal retrieval benchmark datasets, MIRFLICKR-25K [38] and NUS-WIDE [39]. The experimental results show the superiority of our method.

MIRFLICKR-25K
MIRFLICKR-25K [38] is a commonly used dataset consisting of 25,000 image-text pairs collected from Flickr, where each image is annotated with its associated textual tags. Our experiments follow the protocol of DCMH [23]. All image-text pairs have at least 24 unique labels, and the text of each pair is represented as a 1,386-dimensional bag-of-words (BoW) vector.

NUS-WIDE
The NUS-WIDE [39] dataset contains 269,468 image-text pairs, each annotated with one or more of 81 concept labels. The text of each instance is represented as a 1,000-dimensional BoW vector. All instances without labels are removed, and we select 190,421 image-text pairs labeled with the 21 most frequent concepts.
Besides, 10,000 and 10,500 instances are randomly selected as the training sets for MIRFLICKR-25K and NUS-WIDE, respectively, and 2,000 and 2,100 additional instances are randomly chosen as the query sets. The remaining data after removing the query data are used as the retrieval sets. Table 1 summarizes the statistics of the datasets.

Algorithm 1: Optimization of SAAGHN
Input: Training set $\{x_i, y_i, l_i\}_{i=1}^{N}$, cross-modal similarity matrix S.
Output: Optimized network parameters $\theta_x$, $\theta_y$, $\theta_{D_x}$, $\theta_{D_y}$, and binary codes B.
1: Initialization: initialize the network parameters, set the mini-batch size to $n_x = n_y = 128$, initialize the hash representations F and G of each modality, and set the iteration number iter and other hyper-parameters.
2: for t = 1 to iter do
3:    Update $\theta_{G_*}$ by ascending their stochastic gradients;
4:    Update $\theta_{D_*}$ by descending their stochastic gradients;
5:    Update $\theta_x$, $\theta_y$ by the back-propagation (BP) algorithm;
6: end for

Implementation Details
The image network is initialized with a ResNet50 pre-trained on the ImageNet [40] dataset, while the ResNet50 in the text network is initialized from a Gaussian distribution $N(\mu, \sigma^2)$ with $\mu = 0$ and $\sigma = 0.1$. We implemented SAAGHN in PyTorch [41] on two NVIDIA TITAN Xp GPUs. The mini-batch size is set to 128 and the learning rate to 0.0003.

Evaluation criteria
Hamming ranking is widely used as the retrieval procedure of hashing-based retrieval methods: retrieved results are sorted by their Hamming distance to the query, and Mean Average Precision (MAP) [42] is a commonly used metric over this ranking. In our experiments, MAP is adopted as the evaluation criterion for SAAGHN.
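For reference, a small NumPy sketch of MAP over a Hamming ranking, assuming codes in {-1, +1} and a binary relevance matrix (the function and variable names are ours, not a standard API):

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance, k=None):
    """MAP over a Hamming ranking (illustrative implementation).

    query_codes: (m, c) in {-1, +1};  db_codes: (n, c) in {-1, +1}
    relevance:   (m, n) binary ground truth (1 = shares a label with the query)
    """
    aps = []
    for q, rel in zip(query_codes, relevance):
        # For +/-1 codes, Hamming distance = (c - inner product) / 2.
        hamming = 0.5 * (q.shape[0] - db_codes @ q)
        order = np.argsort(hamming)[:k]          # rank the database by distance
        rel_sorted = rel[order]
        if rel_sorted.sum() == 0:
            continue                             # queries with no relevant items are skipped
        ranks = np.arange(1, len(rel_sorted) + 1)
        precision_at_hit = np.cumsum(rel_sorted) / ranks
        aps.append((precision_at_hit * rel_sorted).sum() / rel_sorted.sum())
    return float(np.mean(aps))
```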

Performance Evaluation
The MAP results of SAAGHN and the other baselines for different hash code lengths are shown in Table 2. From the table, we have the following observations.
• SAAGHN outperforms all the other state-of-the-art methods, which demonstrates the superiority of our method in cross-modal retrieval. The superiority comes partly from the rich fused hash representations SAAGHN explores and partly from the high-ranking correlations preserved by the cosine triplet loss.
• In some baselines, the Image-query-Text results greatly outperform Text-query-Image, while SAAGHN achieves almost equal performance in both directions. This may be because the adversarial learning part and the modality invariance loss fully preserve modality invariability.
• Deep cross-modal hashing methods (DCMH, CMHH, SSAH, PRDH, CHN and SCAHN) perform better than the hand-crafted cross-modal hashing methods (SCM and SePH). This suggests that deep hash representations learned from the data are more robust than hand-crafted representations.

Ablation Study
The impact of the different modules of SAAGHN is verified in this section. To test the effectiveness of our contributions, three ablation experiments are designed as follows:
• SAAGHN-COS: replaces the batch semi-hard cosine triplet loss with the cosine triplet loss used in AGAH.
• SAAGHN-ATT: SAAGHN without the self-attention fusion module, i.e., the hash representations are generated from the last layers of ImageNet and TextNet.
• SAAGHN-ADV: SAAGHN without the adversarial learning part, i.e., the discriminators and the adversarial loss are removed.
Table 3 shows the results of the ablation study. First, the designed batch semi-hard cosine triplet loss achieves better results than the plain cosine triplet loss, likely in part because it avoids early bad local minima. Second, performance improves considerably with the self-attention fusion module, probably because it explores more semantically relevant information and global dependencies. Finally, the adversarial learning part also yields a competitive improvement.

Conclusion
In this paper, a novel deep hashing method for cross-modal retrieval (SAAGHN) is proposed, which not only learns rich modality-invariant hash representations but also generates discriminative hash codes in an end-to-end framework. To enhance the feature learning part, a self-attention mechanism is introduced to explore the global, long-range dependencies of the hash representations within each modality. Furthermore, adversarial learning and a batch semi-hard cosine triplet loss are adopted to learn consistent, high-ranking hash codes, and a modality invariance loss is added to minimize the gap between the hash codes of the image and text modalities. With these contributions, optimal hash codes can be learned at the end of the learning process. Extensive experiments demonstrate that SAAGHN outperforms current hashing-based cross-modal retrieval methods.