Deep Multi-View Enhancement Hashing For Image Retrieval
arXiv:2002.00169v2 [cs.CV] 16 Jun 2020
Abstract—Hashing is an efficient method for nearest neighbor search in large-scale data spaces: it embeds high-dimensional feature descriptors into a low-dimensional, similarity-preserving Hamming space. However, large-scale high-speed retrieval through binary codes loses some retrieval accuracy compared to traditional retrieval methods. We have noticed that multi-view methods preserve the diverse characteristics of data well. Therefore, we introduce the multi-view deep neural network into the hash learning field and design an efficient and innovative retrieval model, which achieves a significant improvement in retrieval performance. In this paper, we propose a supervised multi-view hash model which can enhance the multi-view information through neural networks. This is a completely new hash learning method that combines multi-view and deep learning methods. The proposed method utilizes an effective view stability evaluation method to actively explore the relationship among views, which affects the optimization direction of the entire network. We have also designed a variety of multi-data fusion methods in the Hamming space to preserve the advantages of both convolution and multi-view. To avoid spending excessive computing resources on the enhancement procedure during retrieval, we set up a separate structure called the memory network, which participates in training together. The proposed method is systematically evaluated on the CIFAR-10, NUS-WIDE and MS-COCO datasets, and the results show that our method significantly outperforms the state-of-the-art single-view and multi-view hashing methods.
1 INTRODUCTION
most prominent contribution for these multi-view methods. Recently, an increasing number of studies explicitly or implicitly capture the relations among different views [19], [20] and modalities [21], [22] to enhance multi/cross-view/modal approaches. Compared to these methods, the view-relation matrix we propose can be employed directly in the original feature view spaces. It avoids losing information during subspace mapping, resulting in a better capture of relations. Furthermore, our approach optimizes the feature extraction network through both the objective function and the view-relation calculation process to obtain the best multi-view features, which are used to calculate the matrix.

Our motivation for designing deep multi-view enhancement hashing (D-MVE-Hash) arises from two aspects: (1) By splicing and normalizing the fluctuations of different image features under multiple feature views, we obtain the volatility matrix, which regulates view relevance based on fluctuation strength to produce a quality view-relation matrix. View independence and robustness are the special properties of this view-relation matrix. We have conducted detailed ablation experiments in Sec. 4.1 to support this conclusion. (2) Typical multi-view hash image retrieval algorithms still need to manually extract image features [15], which makes the multi-view information and the deep network less impactful in learning-to-hash and image retrieval. For this reason, we want to design an end-to-end architecture with data fusion methods which are carried out in the Hamming space.

In this work, a supervised multi-view hashing method called D-MVE-Hash is proposed for accurate and efficient image retrieval based on multiple visual features, multi-view hashing and deep learning. Inspired by multi-modal and cross-modal hashing methods [21], [23], we excavate multiple view associations and make this potential knowledge dominant to explore the deep integration of the multi-view information. To this end, we design a view-independent and robust view-relation matrix, which is calculated from the continuously optimized features of each view to better capture the relation among views. Fig. 1 and Fig. 2 illustrate the key idea and the model framework. To cope with learning-to-hash, we retain relaxation and constraint methods similar to [8], [24], [25]. D-MVE-Hash automatically learns the optimal hash functions through iterative training. After that, we can achieve high-speed image retrieval using the image hash codes generated by the trained D-MVE-Hash. In summary, the main contributions of our work are two-fold:

Extensive experiments demonstrate the superiority of the proposed D-MVE-Hash, as compared to several state-of-the-art hash methods in the image retrieval task.

2 RELATED WORK
2.1 Approximate Nearest Searching with Hashing
The image retrieval mentioned in this paper refers to content-based visual information retrieval [26], [27], [28]. This process can be simply expressed as: feeding the unlabeled original images to a deep net architecture or other retrieval methods to get images which are similar to, or belong to the same category as, the inputs. It is the typical similarity searching problem. The similarity searching of multimedia data such as collections of images (e.g., views) or 3D point clouds [29], [30] usually requires compression processing. Similarity searching (or proximity search) is achieved by means of nearest neighbor finding [1], [31], [32]. Hashing is an efficient method for nearest neighbor search in large-scale data spaces by embedding high-dimensional feature descriptors into a similarity preserving Hamming space with a low dimension.

In this paper, we do research on the supervised hash learning algorithm [33], [34], which uses tagged datasets. Compared with unsupervised hashing algorithms [35] such as LSH [36], KLSH [37] and ITQ [38], supervised methods can obtain more compact binary representations and generally achieve better retrieval performance [8], [9]. LSH [36] implements hash retrieval by generating some hash functions (called the family H of functions) with locality-sensitive properties and applying some random choice methods to transform the input image and construct a hash table. KLSH's [37] main technical contribution is to formulate the random projections necessary for LSH [36] in kernel space. These locality-sensitive hashing methods show that the local clustering distance of the constrained binary code plays an important role in the overall optimization. ITQ [38] finds a rotation of zero-centered data so as to minimize the quantization error of mapping this data to the vertices of a zero-centered binary hypercube. KSH [5] maps the data to compact binary codes whose Hamming distances are minimized on similar pairs and simultaneously maximized on dissimilar pairs. All these traditional methods reveal the value of hash learning in retrieval. However, since deep neural networks have shown strong usability in various fields, researchers began to consider introducing the advantages of neural networks and convolution operations into hash learning and proposed deep hash learning.
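As a rough illustration of the hashing pipeline described in this subsection, the following Python sketch encodes real-valued features into {−1, +1} codes with random hyperplanes and ranks a database by Hamming distance. It is a minimal sketch and not the method of any cited paper: the random-hyperplane (sign random projection) family is one common LSH variant chosen here for brevity, and the function names, the feature dimension and the code length are all assumptions made for the example.

```python
import numpy as np


def fit_random_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes; each one acts as a single-bit hash function."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim, n_bits))


def encode(features, planes):
    """Map features of shape (n, dim) to binary codes in {-1, +1} of shape (n, n_bits)."""
    return np.where(features @ planes >= 0, 1, -1).astype(np.int8)


def hamming_distance(query_code, database_codes):
    """Hamming distance between one code and every database code.

    For codes in {-1, +1}^q the identity d_H = (q - <b_i, b_j>) / 2 holds,
    so an integer dot product is enough.
    """
    q = query_code.shape[-1]
    inner = database_codes.astype(np.int32) @ query_code.astype(np.int32)
    return (q - inner) // 2


def hamming_rank(query_code, database_codes):
    """Database indices sorted by increasing Hamming distance to the query."""
    return np.argsort(hamming_distance(query_code, database_codes), kind="stable")


# Usage with synthetic data (sizes are arbitrary assumptions).
rng = np.random.default_rng(1)
database = rng.standard_normal((1000, 64))
query = database[42] + 0.01 * rng.standard_normal(64)

planes = fit_random_hyperplanes(dim=64, n_bits=32)
db_codes = encode(database, planes)
query_code = encode(query[None, :], planes)[0]
top10 = hamming_rank(query_code, db_codes)[:10]
```

Because the codes live in {−1, +1}, ranking reduces to integer arithmetic on short vectors, which is the source of the speed advantage that the learned methods discussed next try to keep while recovering accuracy.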
deep convolutional network to learn the hash functions. Xia et al. propose a deep convolutional network tailored to the learned hash codes in H and optionally the discrete class labels of the images. The reconstruction error is minimized during training. In [11], Lai et al. make the CNNH [10] become a "one-stage" supervised hashing method with a Triplet Ranking Loss. Zhu et al. [24] proposed the DHN, which uses a pairwise cross-entropy loss L for similarity-preserving learning and a pairwise quantization loss Q for controlling hashing quality. HashNet [8] attacks the ill-posed gradient problem in optimizing deep networks with non-smooth binary activations by a continuation method. DMDH [25] transforms the original binary optimization into a differentiable optimization problem over hash functions through series expansion to deal with the objective discrepancy caused by relaxation.

2.3 Multi-view Hashing
The poor interpretability of deep learning makes it difficult to further optimize the model in a targeted manner. We have noticed that some multi-view-based hashing methods [12], [13], [14], [15], [40], [41] have emerged in recent years. This type of method processes images by traditional means and extracts various image features for enhancing the deep learning hashing method. The dispersion of hash codes is one of the reasons why such operations are effective. Since the data form among features is very different, it can be seen as viewing images from different views.

The observation of an object should be multi-angled and multi-faceted. However, many multi-view hashing methods emphasize the importance of the multi-view space while ignoring the role of convolution and the relationship among views. Using kernel functions and integrated nonlinear kernel features is a common way to achieve that, and we can use a weighting vector to constrain different kernels. However, the weighting vector which can gather relationships between views is often pre-set or automatically optimized as a parameter, which is not reasonable enough. In [41], Zhang et al. used graphs of multi-view features to learn the hash codes, and each view is assigned a weight for combination. Kim et al. proposed Multi-view anchor graph hashing [42], which concentrates on a low-rank form of the averaged similarity matrix induced by the multi-view anchor graph. In [12], Shen et al. set $\mu_m$ to measure the weight of the m-th view in the learning process. Multiview Alignment Hashing (MAH) [13] seeks a matrix factorization to effectively fuse the multiple information sources while discarding the feature redundancy. Xie et al. proposed Dynamic Multi-View Hashing (DMVH) [15], which augments hash codes according to dynamic changes of the image, and each view is assigned a weight. Based on the above, we conducted in-depth research.

3 DEEP MULTI-VIEW ENHANCEMENT HASHING
In this section, the details of our proposed D-MVE-Hash are introduced. We first briefly introduce the definition of multi-view problems and give the detailed description of MV-Hash. Following that, we discuss the enhancement process and propose three different data fusion methods. In the last part of the section, we introduce the joint learning and the overall architecture of D-MVE-Hash.

3.1 Problem Definition and MV-Hash
Suppose $O = \{o_i\}_{i=1}^{N}$ is a set of objects, and the corresponding features are $\{X^{(m)} = [x_1^{(m)}, \cdots, x_N^{(m)}] \in \mathbb{R}^{d_m \times N}\}_{m=1}^{M}$, where $d_m$ is the dimension of the m-th view, M is the number of views, and N is the number of objects. We also denote an integrated binary code matrix $B = \{b_i\}_{i=1}^{N} \in \{-1, 1\}^{q \times N}$, where $b_i$ is the binary code associated with $o_i$, and q is the code length. We formulate a mapping function $\mathcal{F}(O) = [\mathcal{F}_1(X^{(1)}), \cdots, \mathcal{F}_M(X^{(M)})]$, where the function $\mathcal{F}_m$ can convert a bunch of similar objects into classification scores in different views. Then, we define the composition of the potential expectation hash function $\phi : X \to B$ as follows:

$$\phi(X) = [\phi_1(\mathcal{E}(\mathcal{F}(X_1, \cdots, X_N)), X^{(1)}), \cdots, \phi_m(\mathcal{E}(\mathcal{F}(X_1, \cdots, X_N)), X^{(m)})]. \tag{1}$$

Before starting stability evaluation, we pre-train each view network on the tagged dataset for a classification task, using the following loss function:

$$\mathcal{L}_p(x, i) = -\log \frac{\exp(x[i])}{\sum_j \exp(x[j])}. \tag{2}$$

The criterion of Eq. 2 expects a class index in the range [0, class-numbers − 1] as the target for each value of a 1D tensor of size mini-batch. The input x contains raw, unnormalized scores for each class. For instance, x[0] is outputted by the classifier to measure the probability that the input image belongs to the first class. x[1] is the prediction score of the second class. x[i] is the prediction score of the ground truth.

Specific to image data, given N images $I = \{i_1, ..., i_N\}$, set $Q = \mathcal{F}(I)$, in which $\mathcal{F}$ means the testing process. The dimension of Q is M × N × C, where M is the number of views, N is the number of images, and C is the number of classes. $Q_{mc}$, which is actually $Q_{mc\cdot}$, omits the third dimension represented by · and stands for a one-dimensional vector rather than a number. $\mathcal{E}(\mathcal{F})$ is as follows:

$$\mathcal{E}_m(Q) = \sum_{m=1}^{M} \left( \max_{c} \sqrt{V(Q_{mc})} - \frac{1}{N} \sum_{c=1}^{C} \sqrt{V(Q_{mc})} \right), \tag{3}$$

and $V(Q) = \frac{1}{N} \sum_{n=1}^{N} (Q - \mu)$, where µ is the arithmetic mean. E is expressed as $[\mathcal{E}_1, \cdots, \mathcal{E}_M]$. Then we do a simple normalization of E:

$$\log\{\mathcal{E}_m(Q)\} = \frac{\log(\mathcal{E}_m(Q) + |\min(\mathcal{E}(Q))| + 1)}{\log(\max(|\mathcal{E}(Q)|) + |\min(\mathcal{E}(Q))| + 1)}. \tag{4}$$

Then we consider training the multi-view binary code generation network with view-relation information. At the beginning, we consider the case of a pair of images $i_1, i_2$ and the corresponding binary network outputs $b_1, b_2 \in B$, which can be relaxed from $\{-1, +1\}^q$ to $[-1, +1]^q$ [7]. We define $y = 1$
if they are similar, and $y = -1$ otherwise. The following formula is the loss function of the m-th view:

$$\begin{aligned} \mathcal{L}_m(b_1, b_2, y) = {} & -(y - 1)\max(a - \|b_1 - b_2\|_2^2, 0) \\ & + (y + 1)\|b_1 - b_2\|_2^2 \\ & + \alpha(\||b_1| - 1\|_1 + \||b_2| - 1\|_1), \end{aligned} \tag{5}$$

where $\|\cdot\|_1$ is the L1-norm, $|\cdot|$ is the absolute value operation, α > 0 is a margin threshold parameter, and the third term in Equation 5 is a regularizer term which is used to avoid gradient vanishing.

More generally, we have the image input $I = \{i_1, ..., i_N\}$ and output $B^{(m)} = \{b_1^{(m)}, ..., b_N^{(m)}\}$ in the multi-view space. In order to get the equation representation in matrix form, we substitute B into the $\mathcal{L}_m$ formula, then complement the regular term and similarity matrix to get the following global objective function:

$$\mathcal{L}(I, Y) = Y_{ij} \cdot \log\{\mathcal{E}_m(\mathcal{F}(I))\} \cdot \left(\rho(I) + \alpha\|\phi(I)' - \mathbb{1}_{N \ast q}\|\right). \tag{6}$$

The matrix merged by $\phi(I)$ is denoted as $\rho(I)$; refer to the form of the second term of the multiplication in Equation 7. The δ(·) in Equation 7 is the maximum operation. The ⊕ is the inner product.

$$\begin{bmatrix} y(b_i, b_j) = +1 \\ y(b_i, b_j) = -1 \end{bmatrix} \cdot \begin{bmatrix} \delta((B \oplus B) \cdot (B \oplus B)') \\ (B \oplus B) \cdot (B \oplus B)' \end{bmatrix}. \tag{7}$$

In order to show the effect of view stability evaluation in Equation 6 intuitively, we rewrite the overall loss function as Equation 8. That is, we want to highlight the position of the view-relation matrix E, which is the output of $\mathcal{E}(\mathcal{F}(I))$, in the overall loss function.

$$\mathcal{L}(E, B) = \sum_{n=1}^{N} \sum_{m=1}^{M} \sum_{i=1}^{N} E_{nm}\, \mathcal{L}_m(B_n^{(m)}, B_i^{(m)}, y). \tag{8}$$

With this objective function, the network is trained using the back-propagation algorithm with the mini-batch gradient descent method. Meanwhile, since the view-relation matrix E is directly multiplied by $\mathcal{L}_m$ in Equation 8, the view relationship information can affect the direction of gradient descent optimization.

Fig. 2. The proposed deep multi-view enhancement hashing (D-MVE-Hash) framework. (1) In the preprocessing stage, the model picks up similar images from the dataset by labels, and attaches random noise to these images as part of the input which is used to get the view-relation matrix. The output of the model is a q-bit binary code ($b_i \in [-1, 1]^q$ while training). (2) "Basic ∼" means the basic binary code which is generated by the basic CNN-based hash learning model. (3) View stability evaluation $\mathcal{E}_m(Q)$ calculates the view-relation matrix. This part is replaced by the memory network during the testing process (step 2). Diagram labels: Input (without noise); Input with salt-and-pepper noise; view1 to view4; Stability evaluation; View-relation matrix E; CNN; Basic ∼; Concat; Multi-view binary code; Fusion-R, Fusion-C, Fusion-P; Fusion binary code.

3.2 Enhancement and Fusion
The next stage is integrating multi-view and view-relation information into the traditional global feature in the Hamming space. Fig. 3 illustrates the data enhancement process, which is divided into two parts. Firstly, we use some hash mapping constraint rules which occur simultaneously in single-view and multi-view spaces to embed features into the identical Hamming space. Secondly, we propose replication fusion, view-code fusion and probability view pooling, which are all carried out in the Hamming space to accomplish the integration of the multi-view data, the view-relation matrix and the traditional global feature.

In Fig. 3, the solid and dotted lines respectively indicate the strong and weak constraints. For example, as for the green points: 1) The distance constraint (i.e., the clustering) is stronger than the ±1 constraint (i.e., the hash dispersion) when points are within the α circle. Therefore, we connect the green points by solid lines, and connect the green dots with the ±1 circle by dotted lines; 2) The distance constraint is weaker than the ±1 constraint when points are between the α circle and the ±1 circle. Therefore, we connect the green points by dotted lines, and connect the green dots with the ±1 circle by solid lines; 3) The distance constraint and the ±1 constraint are both strong constraints when points are outside the ±1 circle. Similar operations also occur on the red points. Compared to the green points, the clustering direction of the red points is opposite.

3.2.1 Replication Fusion
Replication fusion is a relatively simple solution which relies on parameters. We sort E to find important views, and strengthen the importance of a view by repeating the binary code of the corresponding view in the multi-view binary code. Specifically, the basic binary code is denoted as B. The intermediate code (multi-view binary code) is denoted as H. We set fusion vector v to guide the repetition of multi-view
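The text available here does not spell out how the fusion vector v is computed, so the following Python sketch only illustrates the general idea of replication fusion under stated assumptions: each view's code in H is repeated a given number of times and appended to the basic code B. The rank-based rule in `repeats_from_scores` and all function names are hypothetical stand-ins, not the authors' definition of v.

```python
import numpy as np


def repeats_from_scores(view_scores, max_repeat=3):
    """Hypothetical rule: sort views by score and give higher-ranked views more repeats.

    The best view is repeated max_repeat times, the next one max_repeat - 1
    times, and so on, never dropping below one repeat.
    """
    order = np.argsort(-np.asarray(view_scores, dtype=float), kind="stable")
    repeats = np.empty(len(order), dtype=int)
    for rank, view_index in enumerate(order):
        repeats[view_index] = max(max_repeat - rank, 1)
    return repeats


def replication_fusion(basic_code, view_codes, repeats):
    """Concatenate the basic code with each view code repeated repeats[m] times.

    basic_code: 1D array in {-1, +1} (the code B in the text)
    view_codes: list of M 1D arrays in {-1, +1} (the per-view parts of H)
    repeats:    M non-negative integers (assumed role of the fusion vector v)
    """
    if len(view_codes) != len(repeats):
        raise ValueError("need exactly one repeat count per view")
    parts = [np.asarray(basic_code)]
    for code, count in zip(view_codes, repeats):
        parts.append(np.tile(np.asarray(code), int(count)))
    return np.concatenate(parts)


# Usage with toy codes; the scores stand in for one row of the view-relation matrix E.
basic = np.array([1, -1])
views = [np.array([1, 1]), np.array([-1, -1]), np.array([1, -1])]
repeats = repeats_from_scores([0.2, 0.9, 0.5], max_repeat=3)
fused = replication_fusion(basic, views, repeats)
```

Repeating a view's bits makes that view contribute more to any Hamming distance computed on the fused code, which is one way to read "strengthening the importance of a view" without changing the distance function itself.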
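For readers who want to check Equation 5 numerically, the pairwise loss of one view and the view-weighted sum of Equation 8 can be sketched in Python as follows. This is an illustrative reading rather than the authors' implementation: `margin` stands for the symbol a inside the max term, `alpha` for the weight of the regularizer, and the label y in Equation 8 is assumed to be the pairwise label `Y[n, i]`; the function names are hypothetical.

```python
import numpy as np


def pairwise_hash_loss(b1, b2, y, margin, alpha):
    """Equation 5 for one pair of relaxed codes in [-1, 1]^q with label y in {+1, -1}."""
    b1 = np.asarray(b1, dtype=float)
    b2 = np.asarray(b2, dtype=float)
    sq_dist = float(np.sum((b1 - b2) ** 2))
    # Active only for dissimilar pairs (y = -1): push codes at least `margin` apart.
    dissimilar_term = -(y - 1) * max(margin - sq_dist, 0.0)
    # Active only for similar pairs (y = +1): pull codes together.
    similar_term = (y + 1) * sq_dist
    # Pull every relaxed entry towards -1 or +1.
    regularizer = alpha * float(
        np.sum(np.abs(np.abs(b1) - 1.0)) + np.sum(np.abs(np.abs(b2) - 1.0))
    )
    return dissimilar_term + similar_term + regularizer


def view_weighted_loss(E, codes, Y, margin, alpha):
    """Equation 8: sum of E[n, m] * L_m over samples n, views m and partners i.

    E:     (N, M) view-relation weights
    codes: (M, N, q) relaxed codes, one set per view
    Y:     (N, N) pairwise labels in {+1, -1} (assumed source of y)
    """
    n_views, n_samples, _ = codes.shape
    total = 0.0
    for n in range(n_samples):
        for m in range(n_views):
            for i in range(n_samples):
                total += E[n, m] * pairwise_hash_loss(
                    codes[m, n], codes[m, i], Y[n, i], margin, alpha
                )
    return total


# Usage: one view, two samples with opposite codes that are labelled similar.
codes = np.array([[[1.0, 1.0], [-1.0, -1.0]]])
Y = np.ones((2, 2), dtype=int)
E = np.array([[0.5], [1.0]])
loss = view_weighted_loss(E, codes, Y, margin=4.0, alpha=0.0)
```

The triple loop mirrors the three sums of Equation 8 for readability; a practical implementation would vectorize it over a mini-batch. Because each pair's loss is scaled by `E[n, m]`, views with larger weights dominate the gradient, which matches the remark that the view-relation matrix steers the optimization direction.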
Fig. 4. (a) Retrieval results of the single-view hashing method and the proposed multi-view hashing method with the same Hamming radius. (b) The mean average precision (mAP) for different code lengths and Hamming radii on the CIFAR-10 dataset. Panels: (a) Same Hamming radius; (b) Different Hamming radius.

Fig. 5. (a) The D-MVE-Hash (+R) training loss with 16 bits, 128 bits and 256 bits hash codes. (b) The D-MVE-Hash (+C) training loss with 16 bits, 128 bits and 256 bits hash codes. Panels: (a) Replication Fusion; (b) View-Code Fusion.

Fig. 6. The mean average precision (mAP) for the single-view hashing method (the outputs are the basic binary codes), MV-Hash (the outputs are the multi-view binary codes) and D-MVE-Hash (the outputs are the fusion binary codes) using 96 bits hash code. Panels: (a) On CIFAR-10 dataset; (b) On NUS-WIDE dataset.

Fig. 7. The receiver operating characteristic curves (ROC) for the fusion binary codes using 48 bits, 64 bits and 96 bits hash codes. Panels: (a) On CIFAR-10 dataset; (b) On NUS-WIDE dataset.

TABLE 2
The mean average precision (mAP) of re-ranking for different bits on the CIFAR-10 (left), NUS-WIDE (middle) and MS-COCO (right) datasets. D-MVE-Hash (+R) means the proposed model D-MVE-Hash with the replication fusion enhancement method. D-MVE-Hash (+C) means D-MVE-Hash with the view-code fusion enhancement method. D-MVE-Hash (+P) means D-MVE-Hash with the probability view pooling enhancement method.
[3] P. Zhang, W. Zhang, W.-J. Li, and M. Guo, “Supervised hashing with latent factor models,” in Proc. of the 37th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. ACM, 2014, pp. 173–182.
[4] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 769–790, 2018.
[5] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2012, pp. 2074–2081.
[6] Y. Zhen, Y. Gao, D.-Y. Yeung, H. Zha, and X. Li, “Spectral multimodal hashing and its application to multimedia retrieval,” IEEE Trans. on cybernetics, vol. 46, no. 1, pp. 27–38, 2016.
[7] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in Neural Information Processing Systems, 2009, pp. 1753–1760.
[8] Z. Cao, M. Long, J. Wang, and S. Y. Philip, “Hashnet: Deep learning to hash by continuation,” in Proc. of the IEEE conf. on Int. Conf. on Computer Vision, 2017, pp. 5609–5618.
[9] Y. Cao, M. Long, B. Liu, J. Wang, and M. KLiss, “Deep cauchy hashing for hamming space retrieval,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2018, pp. 1229–1237.
[10] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” in Proc. of the Twenty-Eighth AAAI Conf. on Artificial Intelligence, 2014, pp. 2156–2162.
[11] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2015, pp. 3270–3278.
[12] X. Shen, F. Shen, Q.-S. Sun, and Y.-H. Yuan, “Multi-view latent hashing for efficient multimedia search,” in Proc. of the 23rd ACM Int. conf. on Multimedia. ACM, 2015, pp. 831–834.
[13] L. Liu, M. Yu, and L. Shao, “Multiview alignment hashing for efficient image search,” IEEE Trans. on Image Processing, vol. 24, no. 3, pp. 956–966, 2015.
[14] C. Zhang and W.-S. Zheng, “Semi-supervised multi-view discrete hashing for fast image search,” IEEE Trans. on Image Processing, vol. 26, no. 6, pp. 2604–2617, 2017.
[15] L. Xie, J. Shen, J. Han, L. Zhu, and L. Shao, “Dynamic multi-view hashing for online image retrieval,” in Proc. of the Int. Joint Conf. on Artificial Intelligence, 2017, pp. 3133–3139.
[16] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “GVCNN: Group-view convolutional neural networks for 3d shape recognition,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2018, pp. 264–272.
[17] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, “Effective multiple feature hashing for large-scale near-duplicate video retrieval,” IEEE Trans. on Multimedia, vol. 15, no. 8, pp. 1997–2008, 2013.
[18] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai, “Stat: Spatial-temporal attention mechanism for video captioning,” IEEE Trans. on Multimedia, vol. 22, no. 1, pp. 229–241, 2020.
[19] X. Shen, F. Shen, Q.-S. Sun, Y. Yang, Y.-H. Yuan, and H. T. Shen, “Semi-paired discrete hashing: Learning latent hash codes for semi-paired cross-view retrieval,” IEEE Trans. on cybernetics, vol. 47, no. 12, pp. 4275–4288, 2016.
[20] K. Jia, J. Lin, M. Tan, and D. Tao, “Deep multi-view learning using neuron-wise correlation-maximizing regularizers,” IEEE Trans. on Image Processing, 2019.
[21] M. Hu, Y. Yang, F. Shen, N. Xie, R. Hong, and H. T. Shen, “Collective reconstructive embeddings for cross-modal hashing,” IEEE Trans. on Image Processing, vol. 28, no. 6, pp. 2770–2784, 2018.
[22] X. Xu, F. Shen, Y. Yang, H. T. Shen, and X. Li, “Learning discriminative binary codes for large-scale cross-modal retrieval,” IEEE Trans. on Image Processing, vol. 26, no. 5, pp. 2494–2507, 2017.
[23] Q.-Y. Jiang and W.-J. Li, “Deep cross-modal hashing,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2017.
[24] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in Proc. of the Thirtieth AAAI Conf. on Artificial Intelligence, 2016, pp. 2415–2421.
[25] Z. Chen, X. Yuan, J. Lu, Q. Tian, and J. Zhou, “Deep hashing via discrepancy minimization,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2018, pp. 6838–6847.
[26] S. J. Banerjee, M. Azharuddin, D. Sen, S. Savale, H. Datta, A. K. Dasgupta, and S. Roy, “Using complex networks towards information retrieval and diagnostics in multidimensional imaging,” Scientific Reports, vol. 5, no. 1, pp. 17 271–17 271, 2015.
[27] C. Yan, L. Li, C. Zhang, B. Liu, Y. Zhang, and Q. Dai, “Cross-modality bridging and knowledge transferring for image understanding,” IEEE Trans. on Multimedia, vol. 21, no. 10, pp. 2675–2685, 2019.
[28] C. Yan, H. Xie, J. Chen, Z. Zha, X. Hao, Y. Zhang, and Q. Dai, “A fast uyghur text detector for complex background images,” IEEE Trans. on Multimedia, vol. 20, no. 12, pp. 3389–3398, 2018.
[29] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2017.
[30] C. Yan, B. Shao, H. Zhao, R. Ning, Y. Zhang, and F. Xu, “3d room layout estimation from a single rgb image,” IEEE Trans. on Multimedia, vol. PP, pp. 1–1, 2020.
[31] X. Zhu, J. Yang, C. Zhang, and S. Zhang, “Efficient utilization of missing data in cost-sensitive learning,” IEEE Trans. on Knowledge and Data Engineering, p. 10.1109/TKDE.2019.2956530, 2019.
[32] X. Zhu, Y. Zhu, and W. Zheng, “Spectral rotation for deep one-step clustering,” Pattern Recognition, p. 10.1016/j.patcog.2019.107175, 2019.
[33] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, 2015, pp. 37–45.
[34] F. Shen, C. Shen, Q. Shi, A. van den Hengel, Z. Tang, and H. T. Shen, “Hashing on nonlinear manifolds,” IEEE Trans. on Image Processing, vol. 24, no. 6, pp. 1839–1851, 2015.
[35] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen, “Unsupervised deep hashing with similarity-adaptive and discrete optimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 3034–3044, 2018.
[36] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in Proc. of the 25th Int. Conf. on Very Large Data Bases, vol. 99, no. 6, 1999, pp. 518–529.
[37] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proc. of the IEEE Conf. on Int. Conf. on Computer Vision, vol. 9, 2009, pp. 2130–2137.
[38] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, 2012.
[39] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[40] Z. Chen and J. Zhou, “Collaborative multiview hashing,” Pattern Recognition, vol. 75, pp. 149–160, 2018.
[41] D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in Proc. of the 34th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. ACM, 2011, pp. 225–234.
[42] S. Kim and S. Choi, “Multi-view anchor graph hashing,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2013, pp. 3123–3127.
[43] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, 2002.
[44] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. of the IEEE Conf. on Comput. Vis. Pattern Recognit, vol. 1, 2005, pp. 886–893.
[45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[46] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[47] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nus-wide: a real-world web image database from national university of singapore,” in Proc. of the ACM Int. conf. on Image and Video Retrieval. ACM, 2009, p. 48.
[48] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conf. on computer vision. Springer, 2014, pp. 740–755.
[49] X. Shen, F. Shen, L. Liu, Y.-H. Yuan, W. Liu, and Q.-S. Sun, “Multi-view discrete hashing for scalable multimedia search,” ACM Trans. on Intelligent Systems and Technology, vol. 9, no. 5, p. 53, 2018.
[50] W. Gu, X. Gu, J. Gu, B. Li, Z. Xiong, and W. Wang, “Adversary guided asymmetric hashing for cross-modal retrieval,” in Proc. of the 2019 on Int. Conf. on Multimedia Retrieval. ACM, 2019, pp. 159–167.