Hierarchical semantic interaction-based deep hashing network for cross-modal retrieval
 Academic Editor
 Thippa Reddy Gadekallu
 Subject Areas
 Artificial Intelligence, Computer Vision, Data Mining and Machine Learning, Multimedia, Visual Analytics
 Keywords
 Bidirectional Bilinear Interaction, Dual-Similarity Measurement, Cross-Modal Hashing, Deep Neural Network
 Copyright
 © 2021 Chen et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 Chen et al. 2021. Hierarchical semantic interaction-based deep hashing network for cross-modal retrieval. PeerJ Computer Science 7:e552 https://doi.org/10.7717/peerj-cs.552
Abstract
Due to the high efficiency of hashing technology and the high abstraction of deep networks, deep hashing has achieved appealing effectiveness and efficiency for large-scale cross-modal retrieval. However, how to efficiently measure the similarity of fine-grained multi-labels for multi-modal data and how to thoroughly explore the layer-specific information of intermediate network layers remain two challenges for high-performance cross-modal hashing retrieval. Thus, in this paper, we propose a novel Hierarchical Semantic Interaction-based Deep Hashing Network (HSIDHN) for large-scale cross-modal retrieval. In the proposed HSIDHN, multi-scale and fusion operations are first applied to each layer of the network. A Bidirectional Bilinear Interaction (BBI) policy is then designed to achieve the hierarchical semantic interaction among different layers, such that the capability of hash representations can be enhanced. Moreover, a dual-similarity measurement ("hard" similarity and "soft" similarity) is designed to calculate the semantic similarity of different modality data, aiming to better preserve the semantic correlation of multi-labels. Extensive experimental results on two large-scale public datasets show that the performance of our HSIDHN is competitive with state-of-the-art deep cross-modal hashing methods.
Introduction
The recent exponential growth of multimedia data (e.g., images, videos, audio, and text) increases the interest in these different modality data. These different modality data, also named multi-modal data, may share similar semantic content or topics. Therefore, cross-modal retrieval, which uses a query from one modality to retrieve all semantically relevant data from another modality, has attracted increasing attention. Because of the potential heterogeneity gaps among these multi-modal data, which may lie in inconsistent feature spaces, it is challenging to efficiently and effectively retrieve the related data across modalities.
Specifically, cross-modal retrieval aims to learn common latent representations for different modality data, so that embeddings of different modalities can be compared in the trained latent space (Kaur, Pannu & Malhi, 2021). Many cross-modal retrieval methods are based on real-valued latent representations for modality-irrelevant data, such as Wang et al. (2014), Jia, Salzmann & Darrell (2011), Mao et al. (2013), Gong et al. (2014), Karpathy, Joulin & Fei-Fei (2014), and Wang et al. (2015). However, measuring real-valued latent representations suffers from low search efficiency and high computational complexity. To reduce the search time and storage cost of cross-modal retrieval, hashing-based cross-modal (CMH) retrieval methods have been proposed to map the data into compact, modality-specific hash codes in a Hamming space, and they have shown their superiority in cross-modal retrieval tasks, such as Ling et al. (2019) and Qin (2020).
So far, plentiful CMH algorithms in unsupervised, supervised, and semi-supervised learning manners have been proposed to learn robust hash functions as well as high-quality hash representations. Unsupervised CMH algorithms explore underlying correlations and model the inter- and intra-modality similarity among unlabeled data. In contrast, both semi-supervised and supervised methods employ supervised information, e.g., labels/tags, to learn hash functions and binary hash codes, and they perform better than the unsupervised manner. However, these CMH algorithms heavily depend on shallow frameworks, where feature extraction and hash code projection are two separate steps. Thus, they may limit the robustness of the final learned hash functions and hash representations.
With the remarkable development in the field of artificial neural networks (ANN), deep neural networks (DNN) have shown high performance on various multimedia tasks, such as Han, Laga & Bennamoun (2019), Girshick (2015), Wu et al. (2017b, 2017a), Guo et al. (2016a, 2016b), Mohammad, Muhammad & Shaikh (2019), Swarna et al. (2020), Muhammad et al. (2021), and Sarkar et al. (2021). Because of the significant capability of DNNs in fitting nonlinear correlations, they have been widely utilized for cross-modal hashing retrieval, simultaneously learning robust hash functions and hash representations in an end-to-end deep architecture. Moreover, DNN-based models have shown great advantages over other hand-crafted shallow models. To name a few: Deep Cross-Modal Hashing (DCMH) (Jiang & Li, 2017), Self-Supervised Adversarial Hashing (SSAH) (Li et al., 2018), Correlation Hashing Network (CHN) (Cao et al., 2016), Self-Constraint and Attention-based Hashing Network (SCAHN) (Wang et al., 2020a; Wang et al., 2020b), Triplet-based Deep Hashing (TDH) (Deng et al., 2018), Pairwise Relationship Guided Deep Hashing (PRDH) (Yang et al., 2017), and Multi-Label Semantics Preserving Hashing (MLSPH) (Zou et al., 2021). However, these DNN-based models still suffer from the following disadvantages. Firstly, single-class label-based supervised information is adopted to measure the semantic similarity between inter- and intra-modality instances. This over-simple measurement cannot fully exploit the fine-grained relevance, as pairwise data from inter- and intra-modality may share more than one label. Secondly, the abstract semantic features produced by the top layer of a DNN are adopted to represent the semantic information of different modalities, while the representations from intermediate layers, which carry layer-specific information, are neglected. Moreover, this manner cannot fully make use of the multi-scale local and global representations, resulting in sub-optimal hash representations.
In this paper, we propose a novel Hierarchical Semantic Interaction-based Deep Hashing Network (HSIDHN) to address the above-mentioned problems. As demonstrated in Fig. 1, the proposed HSIDHN consists of two essential components. One component is the backbone network used to extract hierarchical hash representations from different modality data (e.g., images and text). The other is the Bidirectional Bilinear Interaction (BBI) module used to capture the hierarchical semantic correlation of each modality's data at different levels. In the BBI module, a multi-scale and fusion process is first applied to each layer of the backbone network. A bidirectional interaction policy, consisting of a bottom-up interaction and a top-down interaction, is then designed to exploit the specific semantic information from different layers. Finally, each interaction is aggregated by a bilinear pooling operation, and the bottom-up and top-down interactions are concatenated together to enhance the capability of hash representations. Moreover, a dual-similarity measurement ("hard" similarity and "soft" similarity) is designed to calculate the semantic similarity of different modality data, aiming to better preserve the semantic correlation of multi-labels. The "hard" similarity indicates whether two instances share at least one label, while the "soft" similarity measures the distribution difference between two label vectors via Maximum Mean Discrepancy (MMD).
The main contributions of HSIDHN are summarized as follows:

Firstly, a novel bidirectional bilinear interaction module is designed to achieve hierarchical semantic interaction for different modality data. The bidirectional bilinear interaction policy can effectively aggregate the hash representations from multiple layers and explore pairwise semantic correlation, highlighting salient parts of different layers from a macro view. Therefore, it enhances the discrimination of the final hash representations.

Secondly, a dual-similarity measurement using both a single-class label constraint and Maximum Mean Discrepancy is proposed to map label vectors into a Reproducing Kernel Hilbert Space (RKHS). Thus, the semantic relationships of different modalities, especially for instances with multi-labels, can be thoroughly explored.

Thirdly, we apply the HSIDHN model to two large-scale benchmark datasets with image and text modalities. The experimental results illustrate that HSIDHN surpasses other baseline models on the task of hashing-based cross-modal retrieval.
The rest of this paper is organized as follows. The related work is summarized in "Related Work". The detailed description of HSIDHN for cross-modal retrieval is presented in "Proposed HSIDHN". The experimental results and evaluations are illustrated in "Experiment". Finally, we conclude this paper in "Conclusion".
Related Work
Deep cross-modal hashing
In recent years, deep learning has been widely used in cross-modal retrieval tasks due to its appealing performance in various computer vision applications, such as Vasan et al. (2020), Dwivedi et al. (2021), Bhattacharya et al. (2020), Gadekallu et al. (2020), Jalil Piran et al. (2020), Jalil Piran, Islam & Suh (2018), and Joshi et al. (2018). It unifies hash representation and hash function learning in an end-to-end framework, which also improves robustness. One of the most representative methods is deep cross-modal hashing (DCMH) (Jiang & Li, 2017), which first applied a deep learning architecture to cross-modal hashing retrieval. The self-constraint and attention-based hashing network (SCAHN) (Wang et al., 2020a) explores the hash representations of intermediate layers through an adaptive attention matrix. The correlation hashing network (CHN) (Cao et al., 2016) adopts a triplet loss measured by cosine distance to reveal the semantic relationship between instances and acquires high-ranking hash codes. Pairwise relationship guided deep hashing (PRDH) (Yang et al., 2017) leverages pairwise instances as input for each modality, where supervised information is fully explored to measure intra- and inter-modality distances, respectively. Cross-modal Hamming hashing (CMHH) (Cao et al., 2018) learns high-quality hash representations and hash codes with a well-designed focal loss and a quantization loss. Although the algorithms mentioned above have obtained high performance on CMH tasks, they ignore the rich spatial information from intermediate layers, which is essential to learning modality-invariant hash representations.
Multi-label similarity learning
In real-world scenarios and benchmark datasets, instances are often associated with multiple labels, so multi-label learning has attracted increasing attention in various applications. However, most existing CMH methods adopt single-label constraints to measure the similarity among intra- and inter-modality instances. Self-supervised adversarial hashing (SSAH) (Li et al., 2018) uses an independent network to learn multi-label representations, and thus the semantic correlations are preserved. However, it only uses the multi-label information to supervise the label-network training, while the original images and text are still measured by single labels. Improved deep hashing network (IDHN) (Zhang et al., 2019) introduces pairwise similarity metrics to fit multi-label applications, but it concentrates on single-modality hashing retrieval. Different from these methods that apply multi-label information, our HSIDHN employs both single-label and multi-label constraints to learn more robust hash representations. Notably, the Maximum Mean Discrepancy (MMD) is adopted as the multi-label similarity criterion. To our knowledge, HSIDHN is the first method to use MMD in a deep CMH framework.
Proposed HSIDHN
In this section, the problem definition and the details of the Hierarchical Semantic Interaction-based Deep Hashing Network (HSIDHN), including the feature extraction architecture, are presented one by one. Without loss of generality, we assume each instance has both an image modality and a text modality; however, the method can easily be extended to other modalities such as video, audio, and graphics.
Problem definition
We use uppercase letters to represent matrices, such as X, and lowercase letters to represent vectors, such as y. The transpose of G is denoted as G^{T}, and the sign function sign(·) is defined as:
(1) $$\mathrm{sign}(x)=\begin{cases}1, & x\ge 0\\ -1, & x<0\end{cases}$$
We assume a training dataset $O={\left\{{o}_{i}\right\}}_{i=1}^{N}$ with N instances, each of which has label information and image–text modality feature vectors. The i-th training instance is denoted as ${o}_{i}=\left({v}_{i},{t}_{i}\right)$, where v_{i} ∈ ${R}^{{d}_{v}}$ and t_{i} ∈ ${R}^{{d}_{t}}$ denote the d_{v}- and d_{t}-dimensional feature vectors of image and text, respectively. Moreover, the label-based semantic similarity matrix is defined as ${S}_{\ast}=\left\{{S}_{\ast}^{vt},{S}_{\ast}^{vv},{S}_{\ast}^{tt}\right\}$, where ${S}_{\ast}^{vv}=\left\{{S}_{ij}^{vv}\mid i,j=1,2,\dots ,N\right\}\in {R}^{N\times N}$ and ${S}_{\ast}^{tt}=\left\{{S}_{ij}^{tt}\mid i,j=1,2,\dots ,N\right\}\in {R}^{N\times N}$ denote the intra-modality similarity matrices of image and text, and ${S}_{\ast}^{vt}=\left\{{S}_{ij}^{vt}\mid i,j=1,2,\dots ,N\right\}\in {R}^{N\times N}$ denotes the inter-modality similarity matrix between image and text. S_{*} denotes the "hard" similarity when * = h and the "soft" similarity when * = r.
Given the training dataset O and the similarity matrices S, the main objective of our proposed HSIDHN is to learn two modality-specific discriminative hash functions h^{(v)}(v) and h^{(t)}(t) for the image and text modalities, which map feature vectors into a compact binary space while preserving the relationships and correlations among instances. The learning framework can be roughly divided into two parts: a hash representation learning part and a hash function learning part. Therefore, $F=\left\{{f}_{{v}_{i}}\mid i=1,2,\cdots ,N\right\}\in {R}^{N\times c}$ and $G=\left\{{g}_{{t}_{i}}\mid i=1,2,\cdots ,N\right\}\in {R}^{N\times c}$ are used to denote the learned hash representations of the image modality and text modality. Besides, $B=\left\{{B}_{i}\mid i=1,2,\cdots ,N\right\}\in {R}^{N\times c}$ denotes the final hash codes, obtained from F and G by simply applying a sign function: B = sign(F + G).
Figure 1: The architecture of the proposed HSIDHN, which consists of two parts. One component is the backbone network used to extract hash representations. The other is the Bidirectional Bilinear Interaction (BBI) module used to capture the hierarchical semantic correlation of each modality's data at different levels.
Network framework of HSIDHN
For most cross-modal hashing retrieval methods, the multi-level and multi-scale information cannot be fully explored, which may limit the invariance and discrimination of the final learned hash representations. In this paper, we propose a novel Hierarchical Semantic Interaction-based Deep Hashing Network (HSIDHN) for large-scale cross-modal retrieval, where a multi-level and multi-scale interaction-based network and a bidirectional bilinear interaction module are used to explicitly exploit layer-specific spatial and semantic information. The general architecture of our proposed HSIDHN is shown in Fig. 1.
In terms of multi-level and multi-scale hash representation generation, HSIDHN contains two end-to-end networks to learn hash functions and hash representations for the text and image modalities. Deep feature extraction is conducted with ResNet (He et al., 2016), and paired images and text are used as input for the Image Network and Text Network. For the Text Network, the bag-of-words (BoW) vector policy has been widely adopted to extract text features since Jiang & Li (2017). However, it is inappropriate for learning the rich features demanded by the hash function learning procedure because of the sparsity of BoW vectors. To solve this issue, a multi-scale operation is applied via multiple pooling layers, and the vectors are resized by bilinear interpolation. Finally, these vectors are concatenated together and fed to the Text Network, which is consequently helpful for constructing semantic correlation for the text. Both the image and text networks generate multi-level feature information from mid-layers by applying adaptive average pooling. Motivated by SPP-Net (Purkait, Zhao & Zach, 2017), a multi-scale fusion structure is also applied to the hash representations from each layer to obtain rich spatial information. Therefore, the semantic relevance and correlation from different layers can be fully explored to enhance the invariance of hash representations for both the image and text modalities. The whole architecture is shown in Fig. 2.
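The multi-scale text-feature construction described above (pooling a sparse BoW vector at several window sizes, resizing back by interpolation, and concatenating) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; the scale set `(1, 2, 4)` and average pooling are illustrative assumptions.

```python
import numpy as np

def multiscale_bow(bow, scales=(1, 2, 4)):
    """Pool a sparse BoW vector at several window sizes (assumed average
    pooling), resize each pooled vector back to the original length by
    linear interpolation, and concatenate the results."""
    d = len(bow)
    outputs = []
    for s in scales:
        pad = (-d) % s                                  # pad so length divides by s
        pooled = np.pad(bow, (0, pad)).reshape(-1, s).mean(axis=1)
        # resize back to length d by 1-D linear interpolation
        xs = np.linspace(0, len(pooled) - 1, d)
        outputs.append(np.interp(xs, np.arange(len(pooled)), pooled))
    return np.concatenate(outputs)

# a sparse 1,386-dimensional BoW vector, as used for MIRFlickr-25K text
bow = np.zeros(1386)
bow[[3, 57, 901]] = 1.0
feat = multiscale_bow(bow)
assert feat.shape == (3 * 1386,)
```

The concatenated vector is what would then be fed to the Text Network in place of the raw sparse BoW input.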
Different layers of the network produce complementary hash representations; thus, interaction among different layers may help to learn discriminative hash representations. A bidirectional bilinear integration module integrates the multi-scale, multi-level hash representations from intermediate layers to learn more robust hash representations. The bidirectional bilinear integration policy has two main procedures: a bottom-up pass and a top-down pass. The bottom-up procedure allows shallow activations, which are usually overshadowed by the top layers, to accumulate gradually. The top-down procedure takes advantage of contextual and spatial information. Therefore, the combination of bottom-up and top-down interactions can generate better hash representations. We assume there are K multi-scale, multi-level hash representations generated from ResNet (He et al., 2016), and f^{(k)} and g^{(k)} are the outputs of the k-th block.
When an instance is fed to the network, the output feature is x ∈ R^{c}, where c is the dimension of x, and z ∈ R^{o} is the bilinear representation with dimension o. For each z_{i} in $z=\left[{z}_{1},{z}_{2},\cdots ,{z}_{o}\right]$, the bilinear pooling interaction can be defined as:
(2) $${z}_{i}=I(\mathrm{x},\mathrm{x})={\mathrm{x}}^{T}{W}_{i}\mathrm{x}$$where W_{i} is the weighted projection matrix that needs to be learned, and I(x, x) is the interaction function. According to Rendle (2010), the weighted projection matrix in Eq. (2) can be rewritten by factorizing as:
(3) $${z}_{i}=I(\mathrm{x},\mathrm{x})={\mathrm{x}}^{T}{U}_{i}{V}_{i}^{T}\mathrm{x}={U}_{i}^{T}\mathrm{x}\circ {V}_{i}^{T}\mathrm{x}$$where U_{i} ∈ R^{c} and V_{i} ∈ R^{c}. Consequently, the output feature vector z is calculated as:
(4) $$\mathbf{z}={P}^{T}\left({U}^{T}\mathbf{x}\circ {V}^{T}\mathbf{x}\right)$$where U, V and P are projection matrices.
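Equations (2)–(4) can be sketched in a few lines of numpy. This is a minimal illustration with random, untrained projection matrices; the dimensions c, d (factor size), and o are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
c, d, o = 128, 64, 32              # feature dim, factor dim, output dim (illustrative)
x = rng.standard_normal(c)         # feature vector from one layer

U = rng.standard_normal((c, d))    # projection matrices to be learned
V = rng.standard_normal((c, d))
P = rng.standard_normal((d, o))

# Eq. (3): factorized interaction, Hadamard product of the two projections
inter = (U.T @ x) * (V.T @ x)      # shape (d,)

# Eq. (4): project the interaction down to the o-dimensional bilinear feature
z = P.T @ inter                    # shape (o,)
assert z.shape == (o,)
```

Each output element z_i equals x^T W_i x with the implicit rank-d matrix W_i = Σ_k P_{ki} U_k V_k^T, which is exactly the factorization of Eq. (3).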
The proposed bidirectional bilinear integration policy aims to explore the interaction between intermediate layers. Taking the image modality as an example, we firstly select two layers f_{i} and f_{j} from multiscaled multilevel hash representations. Hence, inspired by Yu et al. (2018), the Eq. (4) can be rewritten as:
(5) $$\mathbf{z}={P}^{T}\left({U}^{T}{\mathbf{f}}_{\mathbf{i}}\circ {V}^{T}{\mathbf{f}}_{\mathbf{j}}\right)$$where P, U, and V are projection matrices. To reduce the number of parameters, the bilinear pooling is divided into two stages, which can be formulated as:
(6) $$\mathbf{P}\mathbf{B}={U}^{T}{f}_{i}\circ {V}^{T}{f}_{j}$$
(7) $$\mathbf{P}\mathbf{L}={P}^{T}PB$$
Thus, the interaction between two layers can be defined as:
(8) $$Z={P}^{T}\mathrm{pool}\left(\mathrm{PB}\left({X}_{1}\right)\circ \mathrm{PB}\left({X}_{2}\right)\right)$$
(9) $$={P}^{T}\mathrm{pool}\left(I\left({X}_{1},{X}_{2}\right)\right)$$
In this paper, the interaction is applied to multiple layers, and the aggregated representation is defined as:
(10) $${Z}_{v}=BI\left({f}_{1},{f}_{2},{f}_{3}\right)={P}^{T}\mathrm{c}\mathrm{o}\mathrm{n}\mathrm{c}\mathrm{a}\mathrm{t}\left[I\left({f}_{1},{f}_{2}\right),I\left({f}_{1},{f}_{3}\right),I\left({f}_{2},{f}_{3}\right)\right]$$
(11) $${Z}_{t}=BI\left({g}_{1},{g}_{2},{g}_{3}\right)={P}^{T}\mathrm{concat}\left[I\left({g}_{1},{g}_{2}\right),I\left({g}_{1},{g}_{3}\right),I\left({g}_{2},{g}_{3}\right)\right]$$where f_{1}, f_{2}, f_{3} and g_{1}, g_{2}, g_{3} are hash representations from different layers of the image and text modalities, and concat denotes the concatenation operation. However, bilinear pooling in only one direction may lead to a vanishing-gradient problem, because the parameters of intermediate layers update faster than those at the end. Thus, the bidirectional bilinear integration policy can be written as:
(12) $${Z}_{v}=BBI\left({f}_{1},{f}_{2},{f}_{3}\right)={P}^{T}\text{concat}\left[{Z}_{1},{Z}_{2}\right]={P}^{T}\text{concat}[\underset{\text{bottom-up}}{\underbrace{I\left(I\left({f}_{1},{f}_{2}\right),{f}_{3}\right)}},\ \underset{\text{top-down}}{\underbrace{I\left(I\left({f}_{3},{f}_{2}\right),{f}_{1}\right)}}]$$
(13) $${Z}_{t}=BBI\left({g}_{1},{g}_{2},{g}_{3}\right)={P}^{T}\text{concat}\left[{Z}_{1},{Z}_{2}\right]={P}^{T}\text{concat}[\underset{\text{bottom-up}}{\underbrace{I\left(I\left({g}_{1},{g}_{2}\right),{g}_{3}\right)}},\ \underset{\text{top-down}}{\underbrace{I\left(I\left({g}_{3},{g}_{2}\right),{g}_{1}\right)}}]$$where f_{1}, f_{2}, f_{3} and g_{1}, g_{2}, g_{3} are hash representations from different layers of the image and text modalities, concat denotes the concatenation operation, and Z_{1} and Z_{2} are the multi-layer interactions from the bottom-up and top-down procedures.
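The chained bottom-up and top-down interactions of Eqs. (12)–(13) can be sketched as follows. This is a minimal numpy illustration for three layers of one modality; the dimensions, the separate second-stage matrices (U2, V2), and the random initialization are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
c, d = 64, 32                          # layer feature dim and factor dim (illustrative)

def interact(a, b, U, V):
    """Pairwise factorized bilinear interaction I(a, b) = U^T a ∘ V^T b (cf. Eq. 6)."""
    return (U.T @ a) * (V.T @ b)

# hash representations from three layers of one modality (hypothetical values)
f1, f2, f3 = (rng.standard_normal(c) for _ in range(3))

U1, V1 = rng.standard_normal((c, d)), rng.standard_normal((c, d))
U2, V2 = rng.standard_normal((d, d)), rng.standard_normal((c, d))  # second-stage shapes

# bottom-up chain I(I(f1, f2), f3) and top-down chain I(I(f3, f2), f1)
bottom_up = interact(interact(f1, f2, U1, V1), f3, U2, V2)
top_down  = interact(interact(f3, f2, U1, V1), f1, U2, V2)

# concatenate both directions and project (cf. Eq. 12)
P = rng.standard_normal((2 * d, d))
z = P.T @ np.concatenate([bottom_up, top_down])
assert z.shape == (d,)
```

Concatenating the two directions rather than summing them keeps the shallow-to-deep and deep-to-shallow evidence separate until the final projection.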
Dualsimilarity measurement
For most cross-modal retrieval benchmark datasets, it is common for an image or text to have multiple labels. Thus, traditional methods, which only consider whether labels are shared between instances, are not suitable for this situation. Therefore, to enhance the quality of similarity measurement, we propose a dual-similarity evaluation strategy.
“Hard” similarity-based Hamming distance loss
We use ${S}_{h}=\left\{{S}_{h}^{vv},{S}_{h}^{tt},{S}_{h}^{vt}\right\}\in \{0,1\}$ to represent the “hard” similarity matrix. In this scenario, the similarity is defined in the same way as in previous methods. Given the training instances o_{i} and o_{j}, the similarity matrix element ${S}_{i{j}_{h}}=1$ means the instances share at least one label, and thus the inner product of these two instances should be large; ${S}_{i{j}_{h}}=0$ otherwise. In the ImgNet and TxtNet there are K levels of features, and the k-th level hash representations from the networks are denoted as ${f}_{{v}_{i}}^{k}$ and ${g}_{{t}_{j}}^{k}$. The likelihood functions of the image and text inter- and intra-instances are calculated as:
(14) $$p\left({S}_{i{j}_{h}}^{vv}\mid {f}_{{v}_{i}}^{k},{f}_{{v}_{j}}^{k}\right)=\begin{cases}\sigma \left({\theta}_{ij}\right), & {S}_{i{j}_{h}}^{vv}=1\\ 1-\sigma \left({\theta}_{ij}\right), & {S}_{i{j}_{h}}^{vv}=0\end{cases}$$where ${\theta}_{ij}=\alpha {f}_{{v}_{i}}^{k}{f}_{{v}_{j}}^{kT}$.
(15) $$p\left({S}_{i{j}_{h}}^{tt}\mid {g}_{{t}_{i}}^{k},{g}_{{t}_{j}}^{k}\right)=\begin{cases}\sigma \left({\theta}_{ij}\right), & {S}_{i{j}_{h}}^{tt}=1\\ 1-\sigma \left({\theta}_{ij}\right), & {S}_{i{j}_{h}}^{tt}=0\end{cases}$$where ${\theta}_{ij}=\alpha {g}_{{t}_{i}}^{k}{g}_{{t}_{j}}^{kT}$.
(16) $$p\left({S}_{i{j}_{h}}^{vt}\mid {f}_{{v}_{i}}^{k},{g}_{{t}_{j}}^{k}\right)=\begin{cases}\sigma \left({\theta}_{ij}\right), & {S}_{i{j}_{h}}^{vt}=1\\ 1-\sigma \left({\theta}_{ij}\right), & {S}_{i{j}_{h}}^{vt}=0\end{cases}$$where ${\theta}_{ij}=\alpha {f}_{{v}_{i}}^{k}{g}_{{t}_{j}}^{kT}$. Here, α is a hyper-parameter that self-adapts to different binary code lengths, set to $\alpha ={2}^{{\mathrm{log}}_{2}(c/64)}=c/64$, and $\sigma \left({\theta}_{ij}\right)={\displaystyle \frac{1}{1+{e}^{-{\theta}_{ij}}}}$. The intra-modality Hamming distance losses of image and text and the inter-modality loss can be defined as:
(17) $${\mathcal{L}}_{\text{intra-image}}=-\sum _{k=1}^{K}\sum _{i,j=1}^{N}\mathrm{log}\,p\left({S}_{i{j}_{h}}^{vv}\mid {f}_{{v}_{i}}^{k},{f}_{{v}_{j}}^{k}\right)=-\sum _{k=1}^{K}\sum _{i,j=1}^{N}\left({S}_{i{j}_{h}}^{vv}\,{\theta}_{ij}^{vv,k}-\mathrm{log}\left(1+{e}^{{\theta}_{ij}^{vv,k}}\right)\right)$$
(18) $${\mathcal{L}}_{\text{intra-text}}=-\sum _{k=1}^{K}\sum _{i,j=1}^{N}\mathrm{log}\,p\left({S}_{i{j}_{h}}^{tt}\mid {g}_{{t}_{i}}^{k},{g}_{{t}_{j}}^{k}\right)=-\sum _{k=1}^{K}\sum _{i,j=1}^{N}\left({S}_{i{j}_{h}}^{tt}\,{\theta}_{ij}^{tt,k}-\mathrm{log}\left(1+{e}^{{\theta}_{ij}^{tt,k}}\right)\right)$$
(19) $${\mathcal{L}}_{\text{inter}}=-\sum _{k=1}^{K}\sum _{i,j=1}^{N}\mathrm{log}\,p\left({S}_{i{j}_{h}}^{vt}\mid {f}_{{v}_{i}}^{k},{g}_{{t}_{j}}^{k}\right)=-\sum _{k=1}^{K}\sum _{i,j=1}^{N}\left({S}_{i{j}_{h}}^{vt}\,{\theta}_{ij}^{vt,k}-\mathrm{log}\left(1+{e}^{{\theta}_{ij}^{vt,k}}\right)\right)$$where ${\theta}_{ij}^{vv,k}=\alpha {f}_{{v}_{i}}^{k}{f}_{{v}_{j}}^{kT}$, ${\theta}_{ij}^{tt,k}=\alpha {g}_{{t}_{i}}^{k}{g}_{{t}_{j}}^{kT}$, and ${\theta}_{ij}^{vt,k}=\alpha {f}_{{v}_{i}}^{k}{g}_{{t}_{j}}^{kT}$.
The overall “hard” similarity-based Hamming distance loss can be written as:
(20) $${\mathcal{L}}_{\mathrm{h}}={\mathcal{L}}_{\text{intra-image}}+{\mathcal{L}}_{\text{intra-text}}+{\mathcal{L}}_{\text{inter}}$$
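The negative log-likelihood of Eqs. (14) and (17) for one modality pair at one feature level can be sketched in numpy. This is an illustrative toy example (random representations and similarity matrix, intra-image case only); the inter-modality term simply replaces `F @ F.T` with `F @ G.T`.

```python
import numpy as np

def hard_similarity_loss(F, S, c):
    """'Hard' similarity negative log-likelihood (cf. Eqs. 14 and 17)
    for one modality and one feature level.
    F: (N, d) hash representations; S: (N, N) binary similarity; c: code length."""
    alpha = 2.0 ** np.log2(c / 64.0)      # self-adapting scale, alpha = c / 64
    theta = alpha * (F @ F.T)             # pairwise inner products theta_ij
    # - sum_ij [ S_ij * theta_ij - log(1 + e^{theta_ij}) ], computed stably
    return -np.sum(S * theta - np.logaddexp(0.0, theta))

# toy inputs (hypothetical)
rng = np.random.default_rng(2)
F = rng.standard_normal((4, 16))
S = (rng.random((4, 4)) > 0.5).astype(float)
loss = hard_similarity_loss(F, S, c=64)
assert np.isfinite(loss)
```

`np.logaddexp(0, theta)` computes log(1 + e^θ) without overflow for large θ, which matters once the inner products are scaled by α.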
“Soft” similarity-based mean squared error loss
We use ${S}_{r}=\left\{{S}_{r}^{vv},{S}_{r}^{tt},{S}_{r}^{vt}\right\}\in [0,1]$ to represent the “soft” similarity matrix. In this scenario, we use the Maximum Mean Discrepancy (MMD) (Borgwardt et al., 2006) to measure the distance between two label vectors by projecting the original vectors into a Reproducing Kernel Hilbert Space (RKHS). The “soft” similarity is defined as:
(21) $${S}_{r}^{vt}=\mathrm{MMD}\left({l}^{v},{l}^{t}\right)={\left\Vert {\displaystyle \frac{1}{n}\sum _{i=1}^{n}\varphi \left({l}_{i}^{v}\right)}-{\displaystyle \frac{1}{n}\sum _{j=1}^{n}\varphi \left({l}_{j}^{t}\right)}\right\Vert}_{\mathcal{H}}^{2}$$where l^{v} and l^{t} denote the label vectors of image and text instances.
It is hard to find a suitable explicit mapping function ϕ(·) in cross-modal retrieval tasks. Thus, the “soft” similarity is expanded as:
(22) $${S}_{r}^{vt}=\mathrm{MMD}\left({l}^{v},{l}^{t}\right)={\displaystyle \frac{1}{{n}^{2}}\sum _{i}^{n}\sum _{{i}^{\prime}}^{n}\left\langle \varphi \left({l}_{i}^{v}\right),\varphi \left({l}_{{i}^{\prime}}^{v}\right)\right\rangle}-{\displaystyle \frac{2}{nm}\sum _{i}^{n}\sum _{j}^{m}\left\langle \varphi \left({l}_{i}^{v}\right),\varphi \left({l}_{j}^{t}\right)\right\rangle}+{\displaystyle \frac{1}{{m}^{2}}\sum _{j}^{m}\sum _{{j}^{\prime}}^{m}\left\langle \varphi \left({l}_{j}^{t}\right),\varphi \left({l}_{{j}^{\prime}}^{t}\right)\right\rangle}$$
We can easily calculate the above formula with a kernel function k(·,·). The final definition of the “soft” similarity is:
(23) $${S}_{r}^{vt}=\mathrm{MMD}\left({l}^{v},{l}^{t}\right)={\displaystyle \frac{1}{{n}^{2}}\sum _{i}^{n}\sum _{{i}^{\prime}}^{n}k\left({l}_{i}^{v},{l}_{{i}^{\prime}}^{v}\right)}-{\displaystyle \frac{2}{nm}\sum _{i}^{n}\sum _{j}^{m}k\left({l}_{i}^{v},{l}_{j}^{t}\right)}+{\displaystyle \frac{1}{{m}^{2}}\sum _{j}^{m}\sum _{{j}^{\prime}}^{m}k\left({l}_{j}^{t},{l}_{{j}^{\prime}}^{t}\right)}$$where l^{v} is the label information of the image modality, l^{t} is the label information of the text modality, and n and m denote the numbers of image and text instances.
Thus, according to Eq. (23), we apply this metric to define the pairwise intra-modality similarity for the image modality and text modality as:
(24) $${S}_{r}^{vv}=\mathrm{MMD}\left({l}^{v},{l}^{v}\right)={\displaystyle \frac{1}{{n}^{2}}\sum _{i}^{n}\sum _{{i}^{\prime}}^{n}k\left({l}_{i}^{v},{l}_{{i}^{\prime}}^{v}\right)}-{\displaystyle \frac{2}{nm}\sum _{i}^{n}\sum _{j}^{m}k\left({l}_{i}^{v},{l}_{j}^{v}\right)}+{\displaystyle \frac{1}{{m}^{2}}\sum _{j}^{m}\sum _{{j}^{\prime}}^{m}k\left({l}_{j}^{v},{l}_{{j}^{\prime}}^{v}\right)}$$
(25) $${S}_{r}^{tt}=\mathrm{MMD}\left({l}^{t},{l}^{t}\right)={\displaystyle \frac{1}{{n}^{2}}\sum _{i}^{n}\sum _{{i}^{\prime}}^{n}k\left({l}_{i}^{t},{l}_{{i}^{\prime}}^{t}\right)}-{\displaystyle \frac{2}{nm}\sum _{i}^{n}\sum _{j}^{m}k\left({l}_{i}^{t},{l}_{j}^{t}\right)}+{\displaystyle \frac{1}{{m}^{2}}\sum _{j}^{m}\sum _{{j}^{\prime}}^{m}k\left({l}_{j}^{t},{l}_{{j}^{\prime}}^{t}\right)}$$where k(·,·) is the Gaussian kernel, since it can map the original label information to an infinite-dimensional space.
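The kernelized MMD of Eq. (23) with a Gaussian kernel can be sketched directly in numpy. This is a minimal illustration on tiny toy label matrices; the bandwidth `sigma=1.0` is an assumed hyper-parameter, not a value from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(Lv, Lt, sigma=1.0):
    """Squared MMD between two sets of label vectors (cf. Eq. 23)."""
    n, m = len(Lv), len(Lt)
    return (gaussian_kernel(Lv, Lv, sigma).sum() / n**2
            - 2 * gaussian_kernel(Lv, Lt, sigma).sum() / (n * m)
            + gaussian_kernel(Lt, Lt, sigma).sum() / m**2)

# identical label sets give MMD ~ 0; differing label sets give MMD > 0
Lv = np.array([[1., 0., 1.], [0., 1., 0.]])
Lt = np.array([[0., 0., 1.], [1., 1., 0.]])
assert abs(mmd2(Lv, Lv)) < 1e-9
assert mmd2(Lv, Lt) > 0
```

Because the kernel trick replaces ϕ(·) entirely, no explicit RKHS mapping ever needs to be constructed.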
As this similarity is continuous, the mean squared error (MSE) loss function is adopted to fit the “soft” similarity. Besides, we apply both inter- and intra-modality constraint losses to bridge the heterogeneity gap and preserve semantic relevance. Thus, the MSE losses are calculated as:
(26) $${\mathcal{L}}_{inte{r}_{r}}=\sum _{i=1,j=1}^{n}{\left\Vert {\displaystyle \frac{\langle {f}_{i},{g}_{j}\rangle +c}{2}}-{s}_{i{j}_{r}}^{vt}\cdot c\right\Vert}^{2}$$
(27) $${\mathcal{L}}_{intra\text{-}imag{e}_{r}}=\sum _{i=1,j=1}^{n}{\left\Vert {\displaystyle \frac{\langle {f}_{i},{f}_{j}\rangle +c}{2}}-{s}_{i{j}_{r}}^{vv}\cdot c\right\Vert}^{2}$$
(28) $${\mathcal{L}}_{intra\text{-}tex{t}_{r}}=\sum _{i=1,j=1}^{n}{\left\Vert {\displaystyle \frac{\langle {g}_{i},{g}_{j}\rangle +c}{2}}-{s}_{i{j}_{r}}^{tt}\cdot c\right\Vert}^{2}$$where f_{i} represents the hash representation of the i-th image instance, g_{j} represents the hash representation of the j-th text instance, and c is the length of the binary codes. Since the inner product $\langle \ast ,\ast \rangle \in [-c,c]$, the value range of $\frac{\langle \ast ,\ast \rangle +c}{2}$ is [0, c], the same as that of ${s}_{i{j}_{r}}^{\ast \ast}\cdot c$.
The overall “soft” similarity-based MSE loss can be written as:
(29) $${\mathcal{L}}_{r}={\mathcal{L}}_{inte{r}_{r}}+{\mathcal{L}}_{intra\text{-}imag{e}_{r}}+{\mathcal{L}}_{intra\text{-}tex{t}_{r}}$$
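The rescaling trick behind Eqs. (26)–(28), which shifts inner products from [−c, c] into [0, c] so they are comparable with s·c, can be sketched as follows. The data here are random toy values, not trained representations.

```python
import numpy as np

def soft_mse_loss(F, G, S_r, c):
    """'Soft' similarity MSE loss (cf. Eq. 26): rescale pairwise inner
    products from [-c, c] to [0, c] and match them against S_r * c."""
    scaled = (F @ G.T + c) / 2.0        # now in [0, c], same range as S_r * c
    return np.sum((scaled - S_r * c) ** 2)

rng = np.random.default_rng(3)
c = 16
F = np.sign(rng.standard_normal((3, c)))   # binary-like representations (toy)
G = np.sign(rng.standard_normal((3, c)))
S_r = rng.random((3, 3))                   # soft similarities in [0, 1]
assert soft_mse_loss(F, G, S_r, c) >= 0
```

When S_r exactly equals the rescaled inner products divided by c, the loss is zero, which is the regression target the MSE formulation encodes.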
Quantization loss
The purpose of the dual-similarity based loss functions is to guarantee that the hash representations F and G preserve similarity, but the similarity of the hash codes B^{(v)} = sign(F) and B^{(t)} = sign(G) would otherwise be neglected. Therefore, we also need to make sure that the binary codes B^{(v)} and B^{(t)} preserve similarity, which is the ultimate goal of cross-modal retrieval. As B^{(v)} and B^{(t)} share the same label information within a mini-batch, the hash codes are set to B^{(v)} = B^{(t)} = B. Accordingly, the quantization loss is defined as:
(30) $${\mathcal{L}}_{q}=\frac{1}{c}\left(\parallel B-F{\parallel}_{F}^{2}+\parallel B-G{\parallel}_{F}^{2}+\parallel F-G{\parallel}_{F}^{2}\right)$$
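Eq. (30) is three Frobenius-norm penalties scaled by the code length; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def quantization_loss(B, F, G, c):
    # Eq. (30): pull the real-valued representations F, G toward the shared
    # binary codes B, and toward each other, normalized by code length c.
    return (np.sum((B - F) ** 2)
            + np.sum((B - G) ** 2)
            + np.sum((F - G) ** 2)) / c
```

The loss vanishes only when both modalities already produce the same binary codes, which is the regime where sign(F) and sign(G) lose no information.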
Optimization
By aggregating Eqs. (20), (29) and (30), we get the general objective function:
(31) $$\underset{B,{\theta}_{x},{\theta}_{y}}{min}\mathcal{L}=\underset{B,{\theta}_{x},{\theta}_{y}}{min}\left({\mathcal{L}}_{h}+\gamma {\mathcal{L}}_{r}+\beta {\mathcal{L}}_{q}\right)$$where θ_{x} and θ_{y} are the network parameters of the image and text networks, and B is the learned binary codes. γ and β are hyper-parameters that control the weight of each part in the general objective function. We adopt an alternating optimization algorithm, in which some parameters are fixed while the others are optimized.
Fix B, optimize θ_{x} and θ_{y}
The back-propagation (BP) algorithm is adopted to update the parameters θ_{x} and θ_{y} by gradient descent:
(32) $$\theta \leftarrow \theta -\eta \cdot {\mathrm{\nabla}}_{\theta}\frac{1}{n}\mathcal{L}$$
Fix θ_{x} and θ_{y}, optimize B
With θ_{x} and θ_{y} fixed, the optimization of the binary codes B can be defined as:
(33) $$\begin{array}{l}\underset{B}{\mathrm{max}}\ tr\left({B}^{T}(\eta (F+G))\right)=\eta {\displaystyle \sum _{i,j}{B}_{ij}}\left({F}_{ij}+{G}_{ij}\right)\\ \text{\hspace{1em}\hspace{1em}\hspace{1em}\hspace{1em}}s.t.\text{\hspace{1em}\hspace{1em}}B\in {\{-1,+1\}}^{c\times N}\end{array}$$
whose solution can be formulated in closed form as:
(34) $$\begin{array}{c}B=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}(\eta (F+G))\hfill \end{array}$$
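The B-step of Eq. (34) amounts to an elementwise sign, as in this sketch (breaking ties at zero toward +1 is an assumption; the paper does not specify it):

```python
import numpy as np

def update_binary_codes(F, G, eta=1.0):
    # Closed-form B-step of Eq. (34): with the network parameters fixed,
    # B = sign(eta * (F + G)) maximizes tr(B^T (eta * (F + G))) elementwise.
    B = np.sign(eta * (F + G))
    B[B == 0] = 1  # tie-break so every code stays in {-1, +1}
    return B
```

Because each entry B_{ij} contributes independently to the trace, choosing the sign of (F + G)_{ij} entry by entry is globally optimal for this subproblem.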
Experiment
To evaluate the proposed algorithm, two large-scale public datasets, MIRFlickr-25K (Huiskes & Lew, 2008) and NUS-WIDE (Chua et al., 2009), are employed as training data to compare with other state-of-the-art cross-modal hashing methods.
Datasets
MIRFlickr-25K (Huiskes & Lew, 2008): This dataset contains 25,000 image instances collected from Flickr, each with several textual descriptions. Following the standard experimental settings proposed in DCMH (Jiang & Li, 2017), 20,015 samples are leveraged, annotated with 24 distinct labels. A 1,386-dimensional BoW vector is generated for each text description.
NUS-WIDE (Chua et al., 2009): This dataset contains 269,468 image-text instance pairs belonging to 81 categories, collected from real-world web data. Each textual description of an image instance is represented by a 1,000-dimensional binary vector. In this paper, the 21 most frequently used categories are chosen, covering 190,421 images and the related texts.
We randomly select 10,000 and 10,500 instances from MIRFlickr-25K and NUS-WIDE, respectively, as the training set to reduce the computational cost. Meanwhile, we randomly choose 2,000 and 2,100 samples as the query sets for MIRFlickr-25K and NUS-WIDE, respectively. After the query set is selected, the remaining data are leveraged as the retrieval set. Images are normalized before being input to the network. The details of the dataset division are summarized in Table 1.
Table 1: Dataset division.
Dataset name  Total number  Training set/test set

MIRFlickr-25K  20,015  10,000/2,000
NUS-WIDE  190,421  10,500/2,100
Implementation details
Our HSIDHN is implemented using the PyTorch (Paszke et al., 2019) framework and run on a server with one TITAN Xp GPU. In the end-to-end framework, ResNet-34 is applied as the backbone network. For the bidirectional bilinear interaction module, the last three parts of the hash representations are integrated to enhance the capability of the hash representations. Moreover, the multi-scale fusion of text is applied with pooling sizes of 1, 5, 10, 15 and 30. The image network is initialized with parameters pre-trained on the ImageNet (Russakovsky et al., 2015) dataset, and the network for the text modality is initialized by the Normal distribution $N\left(\mu ,{\sigma}^{2}\right)$ with μ = 0 and σ = 0.1. The learning rate is initialized to 10^{−1.1} and gradually decays to 10^{−6.1}, and the mini-batch size is 128. Besides, we use SGD as the optimizer for both the image and text networks.
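The paper only states the initial and final learning rates (10^{−1.1} and 10^{−6.1}); one plausible realization of the decay is log-linear interpolation between the two exponents, sketched below. The schedule shape and the epoch count are assumptions, not details from the paper.

```python
def lr_schedule(epoch, total_epochs, start_exp=-1.1, end_exp=-6.1):
    # Log-linearly interpolate the exponent from start_exp to end_exp,
    # so the learning rate decays smoothly from 10**-1.1 to 10**-6.1.
    t = epoch / max(total_epochs - 1, 1)
    return 10 ** (start_exp + (end_exp - start_exp) * t)
```

Any monotone decay between the two stated endpoints would match the description; log-linear decay is a common default because each epoch shrinks the rate by a constant factor.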
Evaluation and baselines
To measure the performance of CMH methods, we adopt Hamming ranking as the retrieval protocol, which sorts instances by their Hamming distance to the query. In this paper, PR curves and Mean Average Precision (MAP) (Liu et al., 2014) are leveraged as the evaluation criteria for HSIDHN.
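As a concrete reading of the Hamming-ranking protocol, the sketch below computes MAP for {−1, +1} codes with multi-label ground truth, treating a database item as relevant to a query when they share at least one label. This is an illustrative implementation, not the paper's evaluation code.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    # query_codes: (Q, c), db_codes: (N, c), entries in {-1, +1}.
    # query_labels: (Q, L), db_labels: (N, L), multi-hot label vectors.
    n_bits = query_codes.shape[1]
    aps = []
    for q, ql in zip(query_codes, query_labels):
        # Hamming distance of {-1, +1} codes via the inner product.
        dist = (n_bits - db_codes @ q) / 2
        order = np.argsort(dist, kind="stable")
        relevant = (db_labels[order] @ ql) > 0
        if not relevant.any():
            continue
        hits = np.cumsum(relevant)
        precision = hits / np.arange(1, len(relevant) + 1)
        aps.append((precision * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```

MAP rewards rankings that place all relevant items ahead of irrelevant ones, which is why it pairs naturally with the PR curves reported below.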
HSIDHN is compared with several baseline methods, including SCM (Zhang & Li, 2014), DCMH (Jiang & Li, 2017), CMHH (Cao et al., 2018), PRDH (Yang et al., 2017), CHN (Cao et al., 2016), SePH (Lin et al., 2015) and SSAH (Li et al., 2018). Tables 2 and 3 report the MAP results of HSIDHN and the other methods at different hash-code lengths. Figs. 3 and 4 show the PR curves for different hash-code lengths on MIRFlickr-25K and NUS-WIDE. From the results, we can draw the following observations and analysis.

HSIDHN dramatically exceeds the other methods at all hash-code lengths in terms of MAP, which reveals the advantages of the multi-scale and multi-level interaction modules. It is worth noting that HSIDHN outperforms DCMH by 3.9%–9.8% and 12.93%–17.77% in MAP for the Image-query-Text and Text-query-Image tasks on MIRFlickr-25K and NUS-WIDE. This is mainly because the multi-scale process can explore different receptive fields of the input data, so that information at different scales is fully used. Additionally, the hierarchical feature interaction can extract useful layer-specific features and integrate them to enhance the capability of the final hash representations.

The high performance of HSIDHN is partly because the semantic relations and correlations among different intermediate layers are explored by the bidirectional bilinear interaction module. Besides, the multi-scale fusion further makes full use of spatial information.

There is an imbalance between the performance of Image-query-Text and Text-query-Image in almost all the other baselines. However, this phenomenon is effectively avoided in HSIDHN. This is mainly due to the dual-similarity measurement, which is sufficient to unify the image modality and the text modality in the latent common space.

All deep CMH methods, including DCMH, CHN, PRDH, CMHH and SSAH, obtain higher performance than shallow hashing methods such as SePH and SCM. This demonstrates the effectiveness and efficiency of deep neural networks in hash representation and hash function learning, which are more robust than non-deep methods. Thus, deep neural network based hashing methods can obtain better performance.
Table 2: MAP results on MIRFlickr-25K.

Method  Image-query-text  Text-query-image
16 bits  32 bits  64 bits  16 bits  32 bits  64 bits
SCM Zhang & Li (2014)  0.6354  0.5618  0.5634  0.6340  0.6458  0.6541 
SePH Lin et al. (2015)  0.6740  0.6813  0.6830  0.7139  0.7258  0.7294 
DCMH Jiang & Li (2017)  0.7316  0.7343  0.7446  0.7607  0.7737  0.7805 
CHN Cao et al. (2016)  0.7504  0.7495  0.7461  0.7776  0.7775  0.7798 
PRDH Yang et al. (2017)  0.6952  0.7072  0.7108  0.7626  0.7718  0.7755 
SSAH Li et al. (2018)  0.7745  0.7882  0.7990  0.7860  0.7974  0.7910 
CMHH Cao et al. (2018)  0.7334  0.7281  0.7444  0.7320  0.7183  0.7279 
HSIDHN  0.7978  0.8097  0.8179  0.7802  0.7946  0.8115 
Table 3: MAP results on NUS-WIDE.

Method  Image-query-text  Text-query-image
16 bits  32 bits  64 bits  16 bits  32 bits  64 bits
SCM Zhang & Li (2014)  0.3121  0.3111  0.3121  0.4261  0.4372  0.4478 
SePH Lin et al. (2015)  0.4797  0.4859  0.4906  0.6072  0.6280  0.6291 
DCMH Jiang & Li (2017)  0.5445  0.5597  0.5803  0.5793  0.5922  0.6014 
CHN Cao et al. (2016)  0.5754  0.5966  0.6015  0.5816  0.5967  0.5992 
PRDH Yang et al. (2017)  0.5919  0.6059  0.6116  0.6155  0.6286  0.6349 
SSAH Li et al. (2018)  0.6163  0.6278  0.6140  0.6204  0.6251  0.6215 
CMHH Cao et al. (2018)  0.5530  0.5698  0.5924  0.5739  0.5786  0.5889 
HSIDHN  0.6498  0.6787  0.6834  0.6396  0.6529  0.6792 
Ablation study
In this section, the importance of each component of HSIDHN is validated. To evaluate the effect of the different modules, the experimental settings are defined as:

HSIDHN-SIM is designed by replacing the dual-similarity with a single Hamming-distance-based measurement.

HSIDHN-BBI is designed by removing the interaction between layers, so that the final hash representations are generated only from the last layer of the network.
The results of the ablation study are shown in Table 4. Firstly, there is no doubt that the dual-similarity measurement is better than the Hamming-based distance. This is mainly because the fine-grained dual-similarity can better preserve the semantic relationships. Moreover, the performance drops significantly when the BBI policy is removed. This may be partly because the BBI policy can extract more robust hash representations from the intermediate layers of the networks.
Table 4: Results of the ablation study.
Method  MIRFlickr-25K  NUS-WIDE

Image-query-text  Text-query-image  Image-query-text  Text-query-image
HSIDHN-SIM  0.8140  0.8097  0.6432  0.6401 
HSIDHN-BBI  0.8034  0.8004  0.6316  0.6275 
HSIDHN  0.8179  0.8115  0.6834  0.6792 
Time complexity
Eq. (31) is taken as the final loss function for training. Each term of Eq. (31) is an MSE loss or a maximum log-likelihood loss, both of which are common in cross-modal retrieval applications. A server with a TITAN Xp card is used for training. For the whole HSIDHN, the training and validation procedure needs around 28 h for MIRFlickr-25K and 53 h for NUS-WIDE. The proposed HSIDHN has a faster convergence rate than other deep hashing methods, owing to the introduction of the bidirectional bilinear interaction and the dual-similarity measurement.
Limitation of HSIDHN and future work
Although appealing performance has been obtained with the HSIDHN framework, there are still some limitations. Firstly, the network architecture, especially the multi-scale and multi-level feature extraction process, requires a large amount of GPU memory to train. Model compression might be a possible solution. Secondly, the performance of Text-query-Image is not as strong as that of Image-query-Text. This is partly because of the sparsity of the features learned from the text modality. Pre-trained models are a possible way to learn higher-quality features from the original text.
Conclusion
In this paper, an efficient and effective framework called HSIDHN is proposed for cross-modal hashing retrieval tasks. HSIDHN has three main benefits over existing methods in the CMH community. Firstly, a multi-scale fusion and a Bidirectional Bilinear Interaction (BBI) module are designed in our framework, with the goal of learning modality-specific hash representations and discriminative hash codes. Additionally, a dual-similarity measurement strategy is proposed to calculate the fine-grained semantic similarity for both intra- and inter-modality pairwise labels. Finally, experimental results on two large-scale benchmark datasets illustrate the superiority of HSIDHN compared with the baseline methods.