Adversarial Learning-Based Semantic Correlation Representation for Cross-Modal Retrieval

Cross-modal retrieval has become a hot issue in the past years. Many existing works focus on correlation learning to generate a common subspace for cross-modal correlation measurement, and others use adversarial learning techniques to abate the heterogeneity of multimodal data. However, very few works combine correlation learning and adversarial learning to bridge the intermodal semantic gap and diminish cross-modal heterogeneity. This article proposes a novel cross-modal retrieval method, named Adversarial Learning based Semantic COrrelation Representation (ALSCOR), which is an end-to-end framework that integrates cross-modal representation learning, correlation learning, and adversarial learning. A canonical correlation analysis model, combined with VisNet and TxtNet, is proposed to capture cross-modal nonlinear correlation. Besides, an intramodal classifier and a modality classifier are used to learn intramodal discrimination and minimize intermodal heterogeneity. Comprehensive experiments are conducted on three benchmark datasets. The results demonstrate that the proposed ALSCOR performs better than the state-of-the-art methods.

INTRODUCTION

With the rapid development of the Internet and the widespread use of smart devices, huge amounts of multimedia data of various modalities, such as images, texts, videos, and audios, are generated, collected, stored, and shared on the Internet and online social networks, as shown in Figure 1. These multimodal data usually describe the same events, scenes, or objects in our daily life, and users often need to search multimedia data by multimodal queries. This retrieval paradigm is called cross-modal retrieval, and it attracts more and more attention in the multimedia community. In the past decade, many approaches have been proposed to address the cross-modal retrieval problem. The main challenge is to learn a common subspace in which the representations or embeddings of different modalities can be measured via a distance function. Canonical correlation analysis (CCA) 9 is a widely used statistical method to find the correlations between cross-modal representations. Following the work by Rasiwasia et al., 1 several CCA-based research works have been presented to support cross-modal retrieval. Inspired by deep learning, a number of researchers exploit DNNs to improve retrieval performance by learning the nonlinear correlations between modalities.
Motivation. The previous approaches aim to learn a common cross-modal semantic subspace. However, very few works combine correlation learning and adversarial learning to bridge the intermodal semantic gap and diminish cross-modal heterogeneity together. To overcome this challenge, for the first time, this article develops an end-to-end framework to bridge the semantic gap and diminish the cross-modal heterogeneity. Different from the existing studies, 5,6 we combine deep CCA-based cross-modal correlation learning and adversarial learning to not only learn the semantic correlations that bridge the semantic gap between different modalities, but also implement a better cross-modal distribution alignment that diminishes the cross-modal heterogeneity.
Our Method. We propose a novel cross-modal retrieval approach, termed Adversarial Learning-based Semantic COrrelation Representation (ALSCOR). Inspired by deep CCA, we design a cross-modal deep representation CCA model that consists of a two-branch network. The CCA model, accompanied by this two-branch network, is used to learn the intermodal correlation. Besides, an intramodal classifier is used to learn the intramodal discriminative information. In addition, a modality classifier is utilized to diminish the cross-modal heterogeneity, which is realized by discriminating between representations of different modalities.
Contributions. The main contributions of this article can be summarized as follows.
We propose a novel framework that is a combination of cross-modal correlation learning and adversarial learning.
To learn the intermodal nonlinear correlation, a two-branch cross-modal correlation model is developed, which is an integration of VisNet, TxtNet, and CCA. An intramodal classifier is utilized to learn the intramodal discrimination, and a modality classifier acts as the discriminator to diminish the cross-modal heterogeneity in an adversarial manner.
Comprehensive experiments on three benchmark datasets are conducted. We compare the proposed method with eight state-of-the-art methods. The results demonstrate that our method performs well for cross-modal retrieval.

Cross-Modal Retrieval
Cross-modal retrieval is a significant problem in the area of multimedia computing, 7,8 which aims to retrieve sufficiently similar objects of one modality from a multimedia database given a query of another modality. Due to the exponential growth of multimedia data, this task has attracted considerable attention in recent years. CCA 9 is an important statistical method to seek the linear correlation between two sets of variables, which is utilized by many studies for cross-modal retrieval. For example, the work by Rasiwasia et al. 1 is the first to use CCA to address cross-modal retrieval. Wang et al. 2 proposed a method called unsupervised discriminant CCA. Different from the existing research works, we combine deep CCA and adversarial learning to learn better representations with modality invariance, which can implement feature distribution alignment between different modalities.

Deep Learning
As a very powerful technique, deep learning 10-12 is used to overcome several challenges of multimedia analysis and retrieval, such as image classification, object recognition, video retrieval, multimodal/cross-modal retrieval, etc. Wei et al. 13 proposed to use deep CNN to learn visual representations for cross-modal retrieval, which performs much better than traditional hand-crafted features.
Inspired by Goodfellow et al., 14 several studies utilized adversarial learning to improve cross-modal representation learning. Wang et al. 5 are the first to use an adversarial learning method to address the cross-modal retrieval problem. Wen et al. 6 introduced a new cross-modal similarity transferring method based on adversarial learning. Unlike the existing studies, we combine the deep CCA method and adversarial learning to not only learn the semantic correlations between different modalities, but also implement a better cross-modal distribution alignment.

PRELIMINARY
In this section, we formalize the definition of cross-modal retrieval and the related notions. Besides, the cross-modal correlation measurement is defined.

Problem Definition
Definition 1 (Cross-Modal Retrieval). Without loss of generality, consider a multimedia dataset containing multimodal data, denoted as $D = \{X_1, X_2, \ldots, X_n; Y_1, Y_2, \ldots, Y_n\}$, where $X$ and $Y$ denote two different modalities. Given a query $X_Q$ of modality $X$, cross-modal retrieval aims to return a set $R$ of data of modality $Y$ that are correlative enough to the query, namely,

$$R = \{Y \mid Y \in D,\ \mathrm{Corr}(X_Q, Y) \geq \mathrm{Corr}(X_Q, Y')\ \ \forall\, Y' \in D \setminus R\}$$

where $\mathrm{Corr}(\cdot, \cdot)$ is a cross-modal correlation measurement that measures the correlation between two objects of different modalities.
This article focuses on the two most common modalities on the Internet, i.e., image $I$ and text $T$. Two corresponding retrieval tasks are studied: (1) image-to-text retrieval, which finds correlative texts given an image query, and (2) text-to-image retrieval, which finds correlative images given a text query. According to Definition 1, these two tasks can be formalized as

$$R_T = \{T \mid T \in D,\ \mathrm{Corr}(I_Q, T) \geq \mathrm{Corr}(I_Q, T')\ \ \forall\, T' \in D \setminus R_T\}$$
$$R_I = \{I \mid I \in D,\ \mathrm{Corr}(T_Q, I) \geq \mathrm{Corr}(T_Q, I')\ \ \forall\, I' \in D \setminus R_I\}.$$

Suppose that the multimedia dataset $D = \{\langle I_1, T_1\rangle, \langle I_2, T_2\rangle, \ldots, \langle I_n, T_n\rangle\}$ contains $n$ pairs of image and text. Each pair has a classification label vector denoted as $L_i$. The raw image and text features have $g_I$ and $g_T$ dimensions, respectively, and generally $g_I \neq g_T$. For the multimedia dataset $D$, the image representation matrix, the text representation matrix, and the classification label matrix are obtained by stacking these vectors.

The main challenge of implementing cross-modal retrieval is the heterogeneity between different modalities, which is manifested in two aspects: (1) the difference of feature distributions and (2) the semantic gap between different modalities. That is, the dimensions of the feature representations are different, so they are hard to represent in the same distribution; besides, the semantic concepts of the representations are hard to align. These two limitations hinder the correlation measurement of cross-modal data. Thus, two cross-modal mappings need to be learnt to project the images $(I_1, I_2, \ldots, I_n)$ and texts $(T_1, T_2, \ldots, T_n)$ into a common semantic subspace:

$$V_I((I_1, I_2, \ldots, I_n)) : (z_1, z_2, \ldots, z_n)$$
$$V_T((T_1, T_2, \ldots, T_n)) : (\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_n)$$

where $(z_1, z_2, \ldots, z_n)$ and $(\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_n)$ are the representations of images and texts in the common semantic subspace, in which the representations of images and texts have similar distributions and the semantic concepts can be aligned. Therefore, the correlations between multimodal data can be measured via a distance function in this space.
Inspired by Pearson correlation, we propose the cross-modal correlation measurement as follows.

Definition 2 (Cross-Modal Correlation Measurement).
Given an image $I_i$ and a text $T_j$, let $z_i$ and $\hat{z}_j$ be their representations in the common semantic subspace. The cross-modal correlation between $I_i$ and $T_j$ is measured by the following equation:

$$\mathrm{Corr}(I_i, T_j) = \frac{\sum (z_i - m_{z_i})(\hat{z}_j - m_{\hat{z}_j})}{\sqrt{\sum (z_i - m_{z_i})^2}\,\sqrt{\sum (\hat{z}_j - m_{\hat{z}_j})^2}}$$

where $m_{z_i}$ and $m_{\hat{z}_j}$ are the averages of $z_i$ and $\hat{z}_j$, respectively.
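Definition 2 can be sketched in a few lines of NumPy. This is an illustrative implementation of the Pearson-style measurement above, not the authors' code; the function name is our own.

```python
import numpy as np

def cross_modal_corr(z_i, z_j):
    """Pearson-style cross-modal correlation between an image representation
    z_i and a text representation z_j in the common semantic subspace."""
    z_i = np.asarray(z_i, dtype=float)
    z_j = np.asarray(z_j, dtype=float)
    ci = z_i - z_i.mean()  # center by the average m_{z_i}
    cj = z_j - z_j.mean()  # center by the average m_{z_j}
    return float(np.sum(ci * cj) /
                 (np.sqrt(np.sum(ci ** 2)) * np.sqrt(np.sum(cj ** 2))))
```

As with the ordinary Pearson coefficient, the value lies in [-1, 1]: linearly aligned representations score 1, and anti-correlated ones score -1.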

METHOD
To learn a common subspace and bridge the semantic gap between different modalities, we propose an effective end-to-end framework, named ALSCOR, which is a combination of cross-modal correlation learning and adversarial learning.

Architecture of ALSCOR
Overview. The general architecture of the proposed ALSCOR is illustrated in Figure 2, which is a combination of deep CCA and the adversarial learning technique. However, different from traditional deep CCA, which uses two deep fully connected networks, ALSCOR utilizes a deep CNN for image representations and an integration of a BiLSTM and a convolutional network for text representations. Specifically, this is an end-to-end framework for cross-modal retrieval. It has three layers: (1) the input layer, (2) the cross-modal representation layer, and (3) the loss layer. The input layer contains a multimedia collection and feeds image-text pairs into the next layer, namely the cross-modal representation layer. The cross-modal representation layer is a two-way deep neural network structure, which includes two models to learn deep representations of image and text, respectively. One is VisNet, which is implemented by a deep convolutional network to learn the deep representations of images. The other is TxtNet, which generates text representations via a combination of a word2vec model, a BiLSTM network, and a text convolutional network. The loss layer contains three models. The CCA model is used to learn the correlations between the representations of image and text via the correlation loss. Compared with deep CCA, this framework can learn much more high-level semantic features. The intramodal classifier model learns the intramodal discriminative representations via the discrimination loss. The modality discriminator is used to learn modality-invariant representations via the adversarial loss.
VisNet. VisNet is a deep CNN that learns the visual representation $z_i = \mathrm{VisNet}(I_i; \theta_I)$ of an image $I_i$, where $\theta_I$ is the model parameter vector. Compared with the deep fully connected neural networks in the deep CCA method, a CNN is more powerful for capturing high-level visual semantic information from images. In our approach, it has five convolutional layers and two fully connected layers. The input images are resized to 224 × 224 × 3 and fed into the first convolutional layer, which has 96 kernels of size 11 × 11 × 3. The second convolutional layer has 256 kernels of size 5 × 5 × 96. The third convolutional layer has 384 kernels of size 3 × 3 × 256. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192. The fifth convolutional layer has 256 kernels of size 3 × 3 × 192. Following the last convolutional layer, there are two fully connected layers with 4096 neurons each. The second fully connected layer outputs 4096-dimensional feature representations, namely $g_I = 4096$.
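The kernel counts and sizes above match the classic AlexNet layout, but the text does not specify strides or padding. Assuming standard AlexNet hyperparameters (conv1 stride 4, padding 2; three 3 × 3 stride-2 max-pools; padding 1 on conv3-5), the spatial dimensions can be checked with the usual output-size formula:

```python
import math

def conv_out(n, k, s=1, p=0):
    """Spatial output size of a conv/pool layer: floor((n - k + 2p)/s) + 1."""
    return math.floor((n - k + 2 * p) / s) + 1

size = 224                           # input 224 x 224 x 3
size = conv_out(size, 11, s=4, p=2)  # conv1: 96 kernels 11x11  -> 55
size = conv_out(size, 3, s=2)        # max-pool                 -> 27
size = conv_out(size, 5, p=2)        # conv2: 256 kernels 5x5   -> 27
size = conv_out(size, 3, s=2)        # max-pool                 -> 13
size = conv_out(size, 3, p=1)        # conv3: 384 kernels 3x3   -> 13
size = conv_out(size, 3, p=1)        # conv4: 384 kernels 3x3   -> 13
size = conv_out(size, 3, p=1)        # conv5: 256 kernels 3x3   -> 13
size = conv_out(size, 3, s=2)        # max-pool                 -> 6
flat = 256 * size * size             # 9216 features into the first 4096-d FC layer
```

Under these assumed hyperparameters, the last pooled feature map is 6 × 6 × 256, i.e., 9216 values feeding the first 4096-neuron fully connected layer.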
TxtNet. TxtNet is a combination of a word2vec model, a BiLSTM network, and a CNN model, which generates the $g_T$-dimensional text representation $\hat{z}_i = \mathrm{TxtNet}(T_i; \theta_T)$, where $\theta_T$ is the model parameter vector. word2vec can capture both the semantic and syntactic information of text. Following the word2vec model, a BiLSTM model is used to encode the contextual information from both the previous and future context. For each direction, the LSTM has three gates: the input gate $i$, the forget gate $f$, and the output gate $o$. At time $t$, the states of the LSTM are

$$i(t) = \sigma(W_i[h(t-1), x(t)] + b_i)$$
$$f(t) = \sigma(W_f[h(t-1), x(t)] + b_f)$$
$$o(t) = \sigma(W_o[h(t-1), x(t)] + b_o)$$
$$\tilde{C}(t) = \tanh(W_C[h(t-1), x(t)] + b_C)$$
$$C(t) = i(t) \times \tilde{C}(t) + f(t) \times C(t-1)$$
$$h(t) = o(t) \times \tanh(C(t))$$

where $h(t)$ is the hidden vector and $\sigma$ is the sigmoid function. After the BiLSTM model, a convolutional network is used to capture the local semantic information of the BiLSTM output. This network has one convolutional layer with $k$ convolutional kernels of size $m \times m$, i.e., $K = (K^{(1)}, K^{(2)}, \ldots, K^{(k)})$. For the $j$th location of the input map, the calculation can be formalized as

$$c_j^{(i)} = g(K^{(i)} \ast x_j + b)$$

where $g(\cdot): \mathbb{R} \mapsto \mathbb{R}$ is a nonlinear activation function, $b$ is a bias, and $\ast$ is the convolutional operator. The $i$th kernel $K^{(i)}$ slides across the input feature map step by step and generates a feature map, over which max-pooling selects the maximal element of each feature vector. To restrain overfitting, before the fully connected layers a dropout operation randomly discards a part of the max-pooling outputs:

$$\hat{z} = W_{fc}(M \odot x) + b$$

where $W_{fc}$ and $b$ are the parameters of the fully connected layers, $\odot$ is the elementwise multiplication operator, and $M$ is a masking vector of Bernoulli random variables.

Loss
In the loss layer, three modules are used to learn the common semantic subspace. The CCA module aims to learn the correlation between images and texts by the correlation loss, and it receives the deep representations from VisNet and TxtNet. The intramodal classifier learns the intramodal discrimination by using the classification labels of image-text pairs via the discrimination loss. The modality classifier plays the role of the discriminator in a GAN, which diminishes the heterogeneity between representations of different modalities via the adversarial loss.
Correlation Loss. The CCA module integrated with VisNet and TxtNet forms an end-to-end nonlinear correlation learning model that maximizes the cross-modal correlation. According to deep CCA, the correlation loss is formalized as

$$L_{corr}(I_i, T_i; \theta_I, \theta_T) = \mathrm{Corr}(\mathrm{VisNet}(I_i; \theta_I), \mathrm{TxtNet}(T_i; \theta_T)).$$

Therefore, correlation learning optimizes the following objective function:

$$(\theta_I^*, \theta_T^*) = \arg\max_{(\theta_I, \theta_T)} \sum_{i=1}^{n} L_{corr}(I_i, T_i; \theta_I, \theta_T).$$

The optimal parameters $(\theta_I^*, \theta_T^*)$ are calculated using the gradient of the correlation objective on the training set $D$.
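As a concrete sketch, the correlation objective can be turned into a minimizable loss by negating the mean per-pair correlation of a minibatch. Note this simplified per-pair surrogate is for illustration only; full deep CCA maximizes the canonical correlation of the whole batch via covariance whitening, which this snippet does not do.

```python
import numpy as np

def correlation_loss(Z_img, Z_txt, eps=1e-8):
    """Negative mean per-pair Pearson correlation between paired image
    representations Z_img and text representations Z_txt (one pair per row);
    minimizing this loss maximizes the cross-modal correlation."""
    Zi = Z_img - Z_img.mean(axis=1, keepdims=True)  # center each row
    Zt = Z_txt - Z_txt.mean(axis=1, keepdims=True)
    num = np.sum(Zi * Zt, axis=1)
    den = np.sqrt(np.sum(Zi ** 2, axis=1) * np.sum(Zt ** 2, axis=1)) + eps
    return float(-np.mean(num / den))
```

Perfectly aligned pairs give a loss near -1, and anti-correlated pairs give a loss near +1.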
Discrimination Loss. The intramodal classifier maintains the discrimination of multimodal data after the cross-modal nonlinear mapping. It is realized by predicting the category labels of the cross-modal data in the common semantic subspace. Specifically, this model is a feed-forward neural network followed by a softmax layer, which receives the representations of different modalities in the common subspace and outputs a probability distribution over categories. In our scheme, the cross-entropy loss is used to implement the discrimination loss:

$$L_{disc}(I_i, T_i; \theta_I, \theta_T, \theta_D) = -\frac{1}{m} \sum_{i=1}^{m} L_i \cdot \big(\log P(z_i) + \log P(\hat{z}_i)\big)$$

where $P(\cdot)$ is the probability distribution over categories, $\theta_D$ is the parameter vector of the classifier, and $m$ is the number of samples in each minibatch during training.

Adversarial Loss. Inspired by GANs, the adversarial learning in our method is realized by a modality classifier $D$, which works as the discriminator to identify whether a representation is generated from an image or a text. According to Wang et al., 5 this model is implemented by a three-layer feed-forward neural network with parameter vector $\theta_A$. The representations generated from the image modality are assigned the label 01, and the representations generated from the text modality are assigned the label 10. The loss function can be formalized as

$$L_{adv}(I_i, T_i; \theta_I, \theta_T, \theta_A) = -\frac{1}{m} \sum_{i=1}^{m} m_i \cdot \big(\log D(z_i; \theta_A) + \log D(\hat{z}_i; \theta_A)\big)$$

where $m_i$ is the ground-truth modality label of each representation.
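Both losses reduce to softmax cross-entropy, differing only in the labels used: category labels for the discrimination loss and two-way modality labels (image → 01, text → 10) for the adversarial loss. A minimal NumPy sketch, with illustrative rather than trained logits:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels_onehot, eps=1e-12):
    """Mean cross-entropy over a minibatch; used for both the intramodal
    discrimination loss (category labels) and the adversarial loss
    (modality labels 01 / 10)."""
    p = softmax(logits)
    return float(-np.mean(np.sum(labels_onehot * np.log(p + eps), axis=1)))

# Modality labels as in the paper: image -> [0, 1], text -> [1, 0].
img_labels = np.tile([0.0, 1.0], (4, 1))
txt_labels = np.tile([1.0, 0.0], (4, 1))
```

A classifier whose logits strongly favor the correct class drives this loss toward zero; the representation networks are trained to push it in the opposite direction.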
Adversarial Learning. For the adversarial learning process, we incorporate the correlation loss, the discrimination loss, and the adversarial loss, and optimize them as a min-max game:

$$(\theta_I^*, \theta_T^*, \theta_D^*) = \arg\min_{(\theta_I, \theta_T, \theta_D)} \big(\alpha L_{corr}(I_i, T_i; \theta_I, \theta_T) + \delta L_{disc}(I_i, T_i; \theta_I, \theta_T, \theta_D) - \varepsilon L_{adv}(I_i, T_i; \theta_I, \theta_T, \theta_A^*)\big)$$

where $\alpha$, $\delta$, and $\varepsilon$ are the weight parameters for the three loss terms. The training can be realized using a stochastic gradient descent algorithm.
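The combined objective is a simple weighted sum, which can be sketched as below; the weight values shown are placeholders, not the paper's settings, and the alternating schedule in the comments is the usual GAN-style training pattern rather than a detail confirmed by the text.

```python
def total_loss(l_corr, l_disc, l_adv, alpha=1.0, delta=1.0, eps_w=1.0):
    """Objective for the representation networks and intramodal classifier:
    alpha*L_corr + delta*L_disc - eps*L_adv. The modality classifier
    (theta_A) is trained separately to minimize L_adv, giving the
    min-max game."""
    return alpha * l_corr + delta * l_disc - eps_w * l_adv

# Typical alternating SGD schedule (sketch):
# for each minibatch:
#   1) update theta_A to minimize L_adv (discriminator step)
#   2) update theta_I, theta_T, theta_D to minimize total_loss(...)
```

Because the adversarial term enters with a negative sign, improving the modality classifier and improving the representation networks pull $L_{adv}$ in opposite directions, which is exactly the min-max dynamic.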

EXPERIMENTS
In this section, we present the performance evaluation of the proposed method and the comparison with several state-of-the-art methods on three multimedia datasets.

Experimental Setup
Datasets. Our experiments are conducted on three benchmark multimedia datasets: Wikipedia, NUS-WIDE, and Pascal Sentence, which are widely used in multimodal/cross-modal retrieval tasks. Some samples of these datasets are shown in Figure 3.
Performance Metrics. In our experiments, two cross-modal retrieval tasks are considered: image-to-text (I2T) retrieval and text-to-image (T2I) retrieval. The I2T retrieval is to retrieve similar texts from an image query, and the T2I retrieval is to find similar images from a text query. To evaluate cross-modal correlation learning, the Pearson correlation coefficient is used to measure the correlation between different representations. To comprehensively evaluate the retrieval performance, precision-recall (PR) curves and mean average precision (mAP) scores are used in these experiments.
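For reference, mAP is the mean over queries of the average precision (AP), where AP averages precision@k at each rank k where a relevant item appears. A minimal sketch (our own helper names, not the paper's evaluation code):

```python
import numpy as np

def average_precision(relevant, ranked_ids):
    """AP for one query: mean of precision@k over the ranks k at which a
    relevant item appears in the ranked result list."""
    hits, precisions = 0, []
    for k, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(queries):
    """mAP over a list of (relevant_set, ranked_list) pairs."""
    return float(np.mean([average_precision(r, ranked)
                          for r, ranked in queries]))
```

For example, a ranking ['a', 'b', 'c'] with relevant items {'a', 'c'} yields AP = (1/1 + 2/3)/2 = 5/6.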
Implementation Details. For VisNet, all the images are resized to 256 × 256 × 3 without cropping, and 224 × 224 × 3 patches (and their horizontal reflections) are extracted randomly. We set different learning rates for different layers: the learning rates of the first five convolutional layers and the FC layers are set to 0.001 and 0.01, respectively. For TxtNet, the word2vec model is pretrained with the Skip-gram model, which generates 300-dimensional word vectors from texts. The filter size of the CNN is set to 3 × 300, and the dropout rate is set to 0.5.
Experimental Environment. All the experiments are run on a workstation with an Intel(R) Xeon CPU at 2.60 GHz, 16 GB RAM, and an NVIDIA GeForce GTX 1080 GPU with 8 GB memory, running the Ubuntu 16.04 LTS operating system. All algorithms in the experiments are implemented in Python.

Performance Evaluation
In this section, we show the performance evaluation of the proposed method on the Wikipedia, NUS-WIDE, and Pascal Sentence datasets, and compare it with eight cross-modal retrieval approaches. The experimental results are reported in Table 1 and Figure 4.
Experiments on Wikipedia Dataset. We compare the proposed ALSCOR with CCA, DCCA, TVKCCA, SM, Deep-SM, RE-DNN, Corr-AE, and ACMR on the Wikipedia dataset. From Table 1, we can see that for both the Img2Txt and Txt2Img tasks, our method (Img2Txt=51.05%, Txt2Img=63.58%) not only defeats the traditional approaches such as CCA (Img2Txt=21.01%, Txt2Img=17.84%), DCCA (Img2Txt=29.58%, Txt2Img=28.12%), TVKCCA (Img2Txt=20.13%, Txt2Img=22.06%), and SM (Img2Txt=23.34%, Txt2Img=28.51%), but also outperforms the adversarial learning-based method ACMR (Img2Txt=50.33%, Txt2Img=61.71%). This is mainly because the proposed ALSCOR has more powerful semantic representation, which means our method can capture more abstract concept information to bridge the semantic gap. On the other hand, the performance of the adversarial learning-based methods, i.e., ACMR and ALSCOR, is much better than that of the others, which verifies that adversarial learning can strongly support the common semantic subspace learning. From Figure 4(a), it is obvious that with the increase in recall, the precision of ALSCOR and ACMR rises gradually and then drops rapidly near recall=1.0. Unsurprisingly, they perform much better than the others over all values of recall, which verifies the improvement brought by adversarial learning once again. As discussed earlier, the precision of ALSCOR is a bit higher than that of ACMR for both Img2Txt retrieval and Txt2Img retrieval, which is due to the advanced cross-modal representation learning model (the combination of VisNet and TxtNet).
Experiments on NUS-WIDE Dataset. We evaluate the performance of CCA, DCCA, TVKCCA, SM, Deep-SM, RE-DNN, Corr-AE, ACMR, and the proposed ALSCOR on the NUS-WIDE dataset, shown in the middle columns of Table 1. Similar to the experiment on Wikipedia, our method outperforms all the opponents, achieving 60.88% for Img2Txt retrieval and 65.26% for Txt2Img retrieval. ACMR is the second best method, whose performance (Img2Txt=58.41%, Txt2Img=57.85%) is close to that of our method. Different from the experiment on Wikipedia, Deep-SM performs better, achieving Img2Txt=57.80% and Txt2Img=62.55%. However, it still cannot defeat ALSCOR. Meanwhile, the performance of the other methods is far behind the proposed method. On the other hand, the mAPs of all these methods on NUS-WIDE are higher than on Wikipedia. That is mainly because the categories in Wikipedia are more abstract, such as "art," "literature," "royalty," etc., which means the discriminant features are hard to learn. Figure 4(b) and (e) shows the comparison of ALSCOR and the other eight approaches on the NUS-WIDE dataset for Img2Txt and Txt2Img, respectively. Figure 4(b) tells us that for Img2Txt retrieval, the precision of ALSCOR, ACMR, and Deep-SM are close, and obviously higher than those of the other five approaches. Specifically, in the recall interval [0.1, 0.8], these top-three approaches are not sensitive to changes in recall. As on Wikipedia, the performance of ALSCOR is better than that of ACMR. On the other hand, for Txt2Img retrieval, the performance gap between the top-three methods and the others is not so obvious, and the precision of Deep-SM is a little bit higher than that of ACMR. However, ALSCOR is still the best for all values of recall. This verifies the effectiveness of the proposed ALSCOR.
Experiments on Pascal Sentence Dataset. The experimental results of the proposed ALSCOR and CCA, DCCA, TVKCCA, SM, Deep-SM, RE-DNN, Corr-AE, and ACMR on the Pascal Sentence dataset are shown in the right columns of Table 1. On this multimedia dataset, our method ALSCOR is still the best. It achieves mAP = 63.92% for the Img2Txt task and mAP = 68.41% for the Txt2Img task, which are obviously higher than those of ACMR, the most competitive opponent. The precision of the hand-crafted-feature-based methods, i.e., CCA, TVKCCA, SM, etc., is much lower than that of the former two. On the other hand, similar to the previous experiments, the precision of ALSCOR for Txt2Img is better than for Img2Txt. This is mainly because textual semantics is easier to learn than visual semantics, and the combination of the BiLSTM and CNN model can capture more high-level concept information from texts. Figure 4(c) and (f) demonstrates the trend of the precision of ALSCOR and the compared methods as recall varies on the Pascal Sentence dataset. For Img2Txt retrieval, we can see from Figure 4(c) that the precision of the proposed ALSCOR declines gradually as recall rises, and it is higher than that of ACMR. Similar to the experiments on Wikipedia and NUS-WIDE, these two adversarial learning-based approaches are much more powerful than the others for Img2Txt retrieval. On the other hand, in Figure 4(f), their superiority is not so obvious for Txt2Img retrieval, but they still defeat the other methods. Compared with ACMR, the proposed ALSCOR performs better.

CONCLUSION
In this article, a novel cross-modal approach, ALSCOR, is proposed. This approach is a combination of adversarial learning and cross-modal correlation learning. To bridge the semantic gap, a deep-learning-based cross-modal correlation learning model is developed. Besides, a modality classifier is utilized to implement adversarial learning, which learns modality-invariant representations. In addition, an intramodal classifier is used to capture the intramodal discriminant information. We conduct comprehensive experiments on three benchmark datasets to evaluate the performance of ALSCOR and eight state-of-the-art methods. Experimental results show that ALSCOR outperforms the state-of-the-art methods. In our future studies, we will consider both intermodal and intramodal semantic correlations, which can further reduce the cross-modal semantic gap.