

Learning to Hash for Indexing Big Data - A Survey


Jun Wang, Member, IEEE, Wei Liu, Member, IEEE, Sanjiv Kumar, Member, IEEE,
and Shih-Fu Chang, Fellow, IEEE

Abstract—The explosive growth in big data has recently attracted much attention to the design of efficient indexing and search methods. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution of exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although they have elegant theoretic guarantees on the search quality in certain metric spaces, the performance of randomized hashing has been shown to be insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in the development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with a systematic understanding of the insights, pros, and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing deep learning models. Finally, we discuss future directions and trends of research in this area.

Index Terms—Learning to hash, approximate nearest neighbor search, unsupervised learning, semi-supervised learning, supervised learning, deep learning.

Jun Wang is with the Institute of Data Science and Technology at Alibaba Group, Seattle, WA 98101, USA. Phone: +1 917 945-2619, e-mail: j.wang@alibaba-inc.com.
Wei Liu is with IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA. Phone: +1 917 945-1274, e-mail: weiliu@us.ibm.com.
Sanjiv Kumar is with Google Research, New York, NY 10011, USA. Phone: +1 212 865-2214, e-mail: sanjivk@google.com.
Shih-Fu Chang is with the Departments of Electrical Engineering and Computer Science, Columbia University, New York, NY 10027, USA. Phone: +1 212 854-6894, fax: +1 212 932-9421, e-mail: sfchang@ee.columbia.edu.

I. Introduction

The advent of the Internet has resulted in massive information overload in recent decades. Nowadays, the World Wide Web has over 366 million accessible websites, containing more than 1 trillion webpages¹. For instance, Twitter receives over 100 million tweets per day, and Yahoo! exchanges over 3 billion messages per day. Besides the overwhelming textual data, the photo sharing website Flickr has more than 5 billion images available, and images are still being uploaded at a rate of over 3,000 images per minute. Another rich media sharing website, YouTube, receives more than 100 hours of video uploaded per minute. Due to this dramatic increase in the size of the data, modern information technology infrastructure has to deal with gigantic databases. In fact, compared to the cost of storage, searching for relevant content in massive databases turns out to be an even more challenging task. In particular, searching for rich media data, such as audio, images, and videos, remains a major challenge since there exist major gaps between available solutions and practical needs in both accuracy and computational cost. Besides the widely used text-based commercial search engines such as Google and Bing, content-based image retrieval (CBIR) has attracted substantial attention in the past decade [1]. Instead of relying on textual keyword based indexing structures, CBIR requires efficiently indexing media content in order to directly respond to visual queries.

Searching for similar data samples in a given database essentially relates to the fundamental problem of nearest neighbor search [2]. Exhaustively comparing a query point q with each sample in a database X is infeasible because the linear time complexity O(|X|) tends to be expensive in realistic large-scale settings. Besides the scalability issue, most practical large-scale applications also suffer from the curse of dimensionality [3], since data under modern analytics usually contain thousands or even tens of thousands of dimensions, e.g., in documents and images. Therefore, beyond the infeasibility of the computational cost for exhaustive search, the storage constraint originating from loading the original data into memory also becomes a critical bottleneck. Note that retrieving a set of Approximate Nearest Neighbors (ANNs) is often sufficient for many practical applications. Hence, a fast and effective indexing method achieving sublinear (o(|X|)), logarithmic (O(log |X|)), or even constant (O(1)) query time is desired for ANN search. Tree-based indexing approaches, such as the KD tree [4], ball tree [5], metric tree [6], and vantage point tree [7], have been popular during the past several decades. However, tree-based approaches require significant storage (sometimes more than the data itself). In addition, the performance of tree-based indexing methods dramatically degrades when handling high-dimensional data [8]. More recently, product quantization techniques have been proposed to encode high-dimensional data vectors via subspace decomposition for efficient ANN search [9][10].

Unlike the recursive partitioning used by tree-based indexing methods, hashing methods repeatedly partition the entire dataset and derive a single hash "bit"² from each partitioning. In binary partitioning based hashing, input data is mapped to a discrete code space called the Hamming space, where each sample is represented by a binary code. Specifically, given N D-dimensional vectors X ∈ R^{D×N}, the goal of hashing is to derive suitable K-bit binary codes Y ∈ B^{K×N}. To generate Y, K binary hash functions {h_k : R^D → B}_{k=1}^{K} are needed. Note that hashing-based ANN search techniques can lead to substantially reduced storage as they usually store only the compact binary codes. For instance, 80 million tiny images (32 × 32 pixels, double type) cost around 600 GB [11], but can be compressed into 64-bit binary codes requiring only 600 MB! In many cases, the hash codes are organized into a hash table for inverse table lookup, as shown in Figure 1. One advantage of hashing-based indexing is that hash table lookup takes only constant query time. In fact, in many cases, the alternative way of finding the nearest neighbors in the code space by explicitly computing Hamming distances to all the database items can be done very efficiently as well.

Hashing methods have been intensively studied and widely used in many different fields, including computer graphics, computational geometry, telecommunication, computer vision, etc., for several decades [12]. Among these methods, the randomized scheme of Locality-Sensitive Hashing (LSH) is one of the most popular choices [13]. A key ingredient in the LSH family of techniques is a hash function that, with high probability, returns the same bit for nearby data points in the original metric space. LSH provides interesting asymptotic theoretical properties leading to performance guarantees. However, LSH based randomized techniques suffer from several crucial drawbacks. First, to achieve the desired search precision, LSH often needs to use long hash codes, which reduces the recall. Multiple hash tables are used to alleviate this issue, but this dramatically increases the storage cost as well as the query time. Second, the theoretical guarantees of LSH only apply to certain metrics such as ℓ_p (p ∈ (0, 2]) and Jaccard [14]. However, returning ANNs in such metric spaces may not lead to good search performance when semantic similarity is represented in a complex way instead of a simple distance or similarity metric. This discrepancy between semantic and metric spaces has been recognized in the computer vision and machine learning communities as the semantic gap [15].

To tackle the aforementioned issues, many hashing methods have been proposed recently that leverage machine learning techniques to produce more effective hash codes [16]. The goal of learning to hash is to learn data-dependent and task-specific hash functions that yield compact binary codes achieving good search accuracy [17]. In order to achieve this goal, sophisticated machine learning tools and algorithms have been adapted to the procedure of hash function design, including boosting [18], distance metric learning [19], asymmetric binary embedding [20], kernel methods [21][22], compressed sensing [23], maximum margin learning [24], sequential learning [25], clustering analysis [26], semi-supervised learning [14], supervised learning [27][22], graph learning [28], and so on. For instance, in the specific application of image search, the similarity (or distance) between image pairs is usually not defined via a simple metric. Ideally, one would like to provide pairs of images that contain "similar" or "dissimilar" images. From such pairwise labeled information, a good hashing mechanism should be able to generate hash codes which preserve the semantic consistency, i.e., semantically similar images should have similar codes. Both the supervised and semi-supervised learning paradigms have been explored using such pairwise semantic relationships to learn semantically relevant hash functions [11][29][30][31]. In this paper, we will survey important representative hashing approaches and also discuss future research directions.

The remainder of this article is organized as follows. In Section II, we present the necessary background information, prior randomized hashing methods, and the motivation for studying hashing. Section III gives a high-level overview of emerging learning-based hashing methods. In Section IV, we survey several popular methods that fall into the learning to hash framework. In addition, Section V describes the recent development of using neural networks to perform deep learning of hash codes. Section VI discusses advanced hashing techniques and large-scale applications of hashing. Several open issues and future directions are described in Section VII.

II. Notations and Background

In this section, we first present the notations, as summarized in Table I. Then we briefly introduce the conceptual paradigm of hashing-based ANN search. Finally, we present some background information on hashing methods, including the introduction of two well-known randomized hashing techniques.

A. Notations

Given a sample point x ∈ R^D, one can employ a set of hash functions H = {h_1, ..., h_K} to compute a K-bit binary code y = {y_1, ..., y_K} for x as

y = {h_1(x), ..., h_k(x), ..., h_K(x)},   (1)

where the k-th bit is computed as y_k = h_k(x). Each hash function performs the mapping h_k : R^D → B. Such a binary encoding process can also be viewed as mapping the original data point to a binary valued space, namely the Hamming space:

H : x → {h_1(x), ..., h_K(x)}.   (2)

Given a set of hash functions, we can map all the items in the database X = {x_n}_{n=1}^{N} ∈ R^{D×N} to the corresponding binary codes as

Y = H(X) = {h_1(X), h_2(X), ..., h_K(X)},

where the hash codes of the data X are Y ∈ B^{K×N}.

¹ The number of webpages is estimated based on the number of links indexed by Google in 2008.
² Depending on the type of hash function used, each hash may return either an integer or simply a binary bit. In this survey we primarily focus on binary hashing techniques as they are used most commonly due to their computational and storage efficiency.

TABLE I
Summary of Notations

Symbol | Definition
N | number of data points
D | dimensionality of data points
K | number of hash bits
i, j | indices of data points
k | index of a hash function
x_i, x_j ∈ R^D | the i-th and j-th data points
S_i, S_j | the i-th and j-th sets
q_i ∈ R^D | a query point
X = [x_1, ..., x_N] ∈ R^{D×N} | data matrix with points as columns
y_i ∈ {1, -1}^K or y_i ∈ {0, 1}^K | hash code of data point x_i
y_{k·} ∈ B^{N×1} | the k-th hash bit of the N data points
Y = [y_1, ..., y_N] ∈ B^{K×N} | hash codes of the data X
θ_ij | angle between data points x_i and x_j
h_k : R^D → {1, -1} | the k-th hash function
H = [h_1, ..., h_K] : R^D → {1, -1}^K | K hash functions
J(S_i, S_j) | Jaccard similarity between sets S_i and S_j
J(x_i, x_j) | Jaccard similarity between vectors x_i and x_j
d_H(y_i, y_j) | Hamming distance between y_i and y_j
d_WH(y_i, y_j) | weighted Hamming distance between y_i and y_j
(x_i, x_j) ∈ M | a pair of similar points
(x_i, x_j) ∈ C | a pair of dissimilar points
(q_i, x_i^+, x_i^-) | a ranking triplet
S_ij = sim(x_i, x_j) | similarity between data points x_i and x_j
S ∈ R^{N×N} | similarity matrix of the data X
P_w | a hyperplane with normal vector w

After computing the binary codes, one can perform ANN search in the Hamming space with significantly reduced computation. The Hamming distance between two binary codes y_i and y_j is defined as

d_H(y_i, y_j) = |y_i - y_j| = Σ_{k=1}^{K} |h_k(x_i) - h_k(x_j)|,   (3)

where y_i = [h_1(x_i), ..., h_k(x_i), ..., h_K(x_i)] and y_j = [h_1(x_j), ..., h_k(x_j), ..., h_K(x_j)]. Note that the Hamming distance can be calculated efficiently as a bitwise logic operation. Thus, even conducting an exhaustive search in the Hamming space can be significantly faster than doing the same in the original space. Furthermore, by designing a suitable indexing structure, ANN search with hashing methods can be made even more efficient. Below we describe the pipeline of a typical hashing-based ANN search system.

B. Pipeline of Hashing-based ANN Search

There are three basic steps in ANN search using hashing techniques: designing the hash functions, generating the hash codes and indexing the database items, and online querying using the hash codes. These steps are described in detail below.

B.1 Designing Hash Functions

There exist a number of ways of designing hash functions. Randomized hashing approaches often use random projections or permutations. The emerging learning to hash framework exploits the data distribution and often various levels of supervised information to determine the optimal parameters of the hash functions. The supervised information includes pointwise labels, pairwise relationships, and ranking orders. Due to their efficiency, the most commonly used hash functions take the form of a generalized linear projection:

h_k(x) = sgn( f(w_k^T x + b_k) ).   (4)

Here f(·) is a prespecified function which can possibly be nonlinear. The parameters to be determined are {w_k, b_k}_{k=1}^{K}, representing the projection vector w_k and the corresponding intercept b_k. During the training procedure, the data X, sometimes along with supervised information, is used to estimate these parameters. In addition, different choices of f(·) yield different properties of the hash functions, leading to a wide range of hashing approaches. For example, LSH keeps f(·) as an identity function, while shift-invariant kernel-based hashing and spectral hashing choose f(·) to be a shifted cosine or sinusoidal function [32][33].
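To make Eq. (4) concrete, the following minimal sketch shows how a K-bit code can be produced by generalized linear hash functions for two common choices of f(·): the identity (as in LSH) and a shifted cosine (as in shift-invariant kernel hashing and, with PCA projections, spectral hashing). The data, the Gaussian projections, and the frequency parameter t are arbitrary illustrative assumptions, not the settings of any particular published method.

```python
import numpy as np

def linear_hash(X, W, b):
    # f = identity: h_k(x) = sgn(w_k^T x + b_k), values in {-1, +1}
    return np.where(X @ W + b >= 0, 1, -1).astype(np.int8)

def shifted_cosine_hash(X, W, b, t=1.0):
    # f = shifted cosine: the nonlinear choice used by shift-invariant
    # kernel hashing and (along PCA directions) spectral hashing
    return np.where(np.cos(t * (X @ W + b)) >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
N, D, K = 1000, 64, 16
X = rng.standard_normal((N, D))          # toy data
W = rng.standard_normal((D, K))          # random Gaussian projections w_k
b = rng.uniform(-1.0, 1.0, size=K)       # random intercepts b_k

Y_linear = linear_hash(X, W, b)          # (N, K) codes in {-1, +1}
Y_cosine = shifted_cosine_hash(X, W, b)  # nonlinear variant of Eq. (4)
print(Y_linear.shape, Y_cosine.shape)
```

In a learning to hash method, W and b would be estimated from the data (and possibly labels) instead of being drawn at random as above.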

Note that the hash functions given by (4) generate codes h_k(x) ∈ {-1, 1}. One can easily convert them into binary codes in {0, 1} as

y_k = (1 + h_k(x)) / 2.   (5)

Without loss of generality, in this survey we will use the term hash codes to refer to either the {0, 1} or the {-1, 1} form, which should be clear from the context.

B.2 Indexing Using Hash Tables

With a learned hash function, one can compute the binary codes Y for all the items in a database. For K hash functions, the codes for the entire database cost only NK/8 bytes. Assuming the original data is stored in double-precision floating-point format, the original storage costs 8ND bytes. Since massive datasets are often associated with thousands of dimensions, the computed hash codes reduce the storage cost by hundreds or even thousands of times.

In practice, the hash codes of the database are organized as an inverse lookup, resulting in a hash table or a hash map. For a set of K binary hash functions, one can have at most 2^K entries in the hash table. Each entry, called a hash bucket, is indexed by a K-bit hash code. In the hash table, one keeps only those buckets that contain at least one database item. Figure 1 shows an example of using binary hash functions to index the data and construct a hash table. Thus, a hash table can be seen as an inverse lookup table, which can return all the database items corresponding to a certain code in constant time. This procedure is key to achieving the speedup offered by many hashing based ANN search techniques. Since most of the 2^K possible buckets are typically empty, creating an inverse lookup can also be a very efficient way of storing the codes when multiple database items end up with the same code.

Fig. 1. An illustration of linear projection based binary hashing, indexing, and hash table construction for fast inverse lookup.

B.3 Online Querying with Hashing

During the querying procedure, the goal is to find the nearest database items to a given query. The query is first converted into a code using the same hash functions that mapped the database items to codes. One way to find the nearest neighbors of the query is to compute the Hamming distance between the query code and all the database codes. Note that the Hamming distance can be rapidly computed using a logical xor operation between binary codes as

d_H(y_i, y_j) = y_i ⊕ y_j.   (6)

On modern computer architectures, this is achieved efficiently by running an xor instruction followed by popcount. With the computed Hamming distance between the query and each database item, one can perform an exhaustive scan to extract the approximate nearest neighbors of the query. Although this is much faster than exhaustive search in the original feature space, the time complexity is still linear. An alternative way of searching for the neighbors is to use the inverse lookup in the hash table and return the data points within a small Hamming distance r of the query. Specifically, given a query point q and its corresponding hash code y_q = H(q), all the database points ỹ whose hash codes fall within the Hamming ball of radius r centered at y_q, i.e., d_H(ỹ, H(q)) ≤ r, are returned. As shown in Figure 2, for a K-bit binary code, a total of Σ_{l=0}^{r} C(K, l) possible codes fall within Hamming radius r. Thus one needs to probe O(K^r) buckets in the hash table. The union of all the items falling into the corresponding hash buckets is returned as the search result. The inverse lookup in a hash table has constant time complexity, independent of the database size N. In practice, a small value of r (r = 1, 2 is commonly used) is chosen to avoid the exponential growth in the number of possible code combinations that need to be searched.

Fig. 2. The procedure of inverse lookup in a hash table, where the query q is mapped to the 4-bit hash code "0100" and the returned approximate nearest neighbors within Hamming radius 1 are x_1, x_2, x_n.
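The indexing and querying steps of Sections B.2 and B.3 can be sketched in a few lines. The snippet below is a simplified illustration rather than an optimized implementation: it packs the {0, 1} codes of Eq. (5) into integers, builds the inverse-lookup table, and answers a query either by an exhaustive xor/popcount scan (Eq. (6)) or by probing all buckets within Hamming radius r. The dataset, code length, and radius are arbitrary choices for this sketch.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(1)
N, D, K, r = 5000, 32, 16, 1
X = rng.standard_normal((N, D))
W, b = rng.standard_normal((D, K)), rng.standard_normal(K)

def pack(bit_rows):
    """Pack {0,1} bit rows of shape (..., K) into unsigned 64-bit codes."""
    out = np.zeros(bit_rows.shape[:-1], dtype=np.uint64)
    for k in range(bit_rows.shape[-1]):
        out |= bit_rows[..., k].astype(np.uint64) << np.uint64(k)
    return out

codes = pack(X @ W + b >= 0)             # one K-bit code per database item

# Inverse-lookup table: only non-empty buckets are stored.
table = defaultdict(list)
for i, c in enumerate(codes):
    table[int(c)].append(i)

def hamming_scan(query_code, topk=10):
    """Exhaustive scan: xor + popcount against all database codes."""
    d = np.array([bin(int(query_code) ^ int(c)).count("1") for c in codes])
    return np.argsort(d)[:topk]

def table_lookup(query_code, radius=r):
    """Probe every bucket within Hamming radius `radius` of the query code."""
    hits = list(table.get(int(query_code), []))
    for l in range(1, radius + 1):
        for flip in combinations(range(K), l):   # C(K, l) neighboring codes
            neighbor = int(query_code)
            for k in flip:
                neighbor ^= (1 << k)
            hits.extend(table.get(neighbor, []))
    return hits

q = rng.standard_normal(D)
q_code = pack((q @ W + b >= 0)[None, :])[0]
print(hamming_scan(q_code)[:5], len(table_lookup(q_code)))
```

The exhaustive scan is linear in N but extremely cheap per item, while the table lookup touches only the buckets within the chosen radius, mirroring the trade-off discussed above.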

C. Randomized Hashing Methods

Randomized hashing, e.g., the locality sensitive hash family, has been a popular choice due to its simplicity. In addition, it has interesting proximity preserving properties. A binary hash function h(·) from the LSH family is chosen such that the probability of two points having the same bit is proportional to their (normalized) similarity, i.e.,

P{h(x_i) = h(x_j)} = sim(x_i, x_j).   (7)

Here sim(·,·) represents the similarity between a pair of points in the input space, e.g., cosine similarity or Jaccard similarity [34]. In this section, we briefly review two categories of randomized hashing methods, i.e., random projection based and random permutation based approaches.

C.1 Random Projection Based Hashing

As a representative member of the LSH family, random projection based hash (RPH) functions have been widely used in different applications. The key ingredient of RPH is to map nearby points in the original space to the same hash bucket with high probability. This equivalently preserves the locality of the original space in the Hamming space. Typical examples of RPH functions consist of a random projection w and a random shift b, as

h_k(x) = sgn(w_k^T x + b_k).   (8)

For the cosine distance, the random vector w_k is constructed by sampling each of its components from a standard Gaussian distribution [34].

It is easy to show that the probability of two samples x_i, x_j falling into the same hash bucket is determined by the angle θ_ij between the two sample vectors, as shown in Figure 3. One can show that

Pr[h_k(x_i) = h_k(x_j)] = 1 - θ_ij/π = 1 - (1/π) cos^{-1}( x_i^T x_j / (||x_i|| ||x_j||) ).   (9)

Fig. 3. An illustration of random hyperplane partitioning based hashing.

This collision probability gives the asymptotic theoretical guarantee for approximating the cosine similarity defined in the original space. However, long hash codes are required to achieve sufficient discrimination for high precision. This significantly reduces the recall if hash table based inverse lookup is used for search. In order to balance the tradeoff between precision and recall, one has to construct multiple hash tables with long hash codes, which increases both storage and computation costs. In particular, with hash codes of length K, it is required to construct a sufficient number of hash tables to ensure the desired performance bound [35]. Given l tables with K-bit codes, the collision probability is given as

P{H(x_i) = H(x_j)} ∝ l · [ 1 - (1/π) cos^{-1}( x_i^T x_j / (||x_i|| ||x_j||) ) ]^K.   (10)

To balance search precision and recall, the hash codes should be long enough to reduce false collisions (i.e., non-neighbor samples falling into the same bucket). Meanwhile, the number of hash tables l should be sufficiently large to increase the recall. However, this is inefficient due to the extra storage cost and longer query time.

To overcome these drawbacks, many practical systems adopt various strategies to reduce the storage overhead and to improve the efficiency. For instance, a self-tuning indexing technique called LSH forest was proposed in [36], which aims at improving the performance without additional storage and query overhead. In [37][38], a technique called MultiProbe LSH was developed to reduce the number of required hash tables by intelligently probing multiple buckets in each hash table. In [39], nonlinear randomized Hadamard transforms were explored to speed up LSH based ANN search for the Euclidean distance. In [40], BayesLSH was proposed to combine Bayesian inference with LSH in a principled manner, giving probabilistic guarantees on the quality of the search results in terms of accuracy and recall. However, random projection based hash functions ignore the specific properties of a given data set, and thus the generated binary codes are data-independent, which leads to less effective performance compared to the learning based methods to be discussed later.

In the machine learning and data mining communities, recent methods tend to leverage data-dependent and task-specific information to improve the efficiency of random projection based hash functions [16]. For example, incorporating kernel learning with LSH can help generalize ANN search from a standard metric space to a wide range of similarity functions [41][42]. Furthermore, metric learning has been combined with randomized LSH functions to exploit a set of pairwise similarity and dissimilarity constraints [19]. Other variants of locality sensitive hashing techniques include super-bit LSH [43], boosted LSH [18], as well as non-metric LSH [44].

C.2 Random Permutation Based Hashing

Another well-known paradigm from the LSH family is min-wise independent permutation hashing (MinHash), which has been widely used for approximating the Jaccard similarity between sets or vectors. Jaccard is a popular choice for measuring the similarity between documents or images. A typical application is to index documents and then identify near-duplicate samples from a corpus of documents [45][46]. The Jaccard similarity between two sets S_i and S_j is defined as J(S_i, S_j) = |S_i ∩ S_j| / |S_i ∪ S_j|. A collection of sets {S_i}_{i=1}^{N} can be represented as a characteristic matrix C ∈ B^{M×N}, where M is the cardinality of the universal set S_1 ∪ ... ∪ S_N. The rows of C represent the elements of the universal set and the columns correspond to the sets. The element c_{di} = 1 indicates that the d-th element is a member of the i-th set, and c_{di} = 0 otherwise. Assume a random permutation π_k(·) that assigns the d-th element the index π_k(d) ∈ {1, ..., D}. It is easy to see that the random permutation satisfies two properties: π_k(d) ≠ π_k(l) and Pr[π_k(d) > π_k(l)] = 0.5. A random permutation based min-hash signature of a set S_i is defined as the minimum index of the nonzero elements after applying the permutation π_k:

h_k(S_i) = min_{d ∈ {1,...,D}, c_{π_k(d) i} = 1} π_k(d).   (11)

Such a hash function has the property that the chance of two sets having the same MinHash value equals the Jaccard similarity between them [47]:

Pr[h_k(S_i) = h_k(S_j)] = J(S_i, S_j).   (12)

The definition of the Jaccard similarity can be extended to two vectors x_i = {x_{i1}, ..., x_{id}, ..., x_{iD}} and x_j = {x_{j1}, ..., x_{jd}, ..., x_{jD}} as

J(x_i, x_j) = Σ_{d=1}^{D} min(x_{id}, x_{jd}) / Σ_{d=1}^{D} max(x_{id}, x_{jd}).

Similar min-hash functions can be defined for such vectors, and the collision probability property shown in Eq. 12 still holds [48]. Compared to the random projection based LSH family, the min-hash functions generate non-binary hash values that can potentially be extended to continuous cases. In practice, the min-hash scheme has shown powerful performance for high-dimensional and sparse vectors like the bag-of-words representation of documents or feature histograms of images. In a large scale evaluation conducted by Google Inc., the min-hash approach outperformed other competing methods for the application of webpage duplicate detection [49]. In addition, the min-hash scheme has also been applied to Google News personalization [50] and near-duplicate image detection [51][52]. Some recent efforts have been made to further improve the min-hash technique, including b-bit minwise hashing [53][54], the one permutation approach [55], geometric min-hashing [56], and a fast computing technique for image data [57].
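As an illustration of the two randomized families reviewed above, the short experiment below empirically checks the collision probabilities of Eqs. (9) and (12): signed random projections collide with probability 1 - θ/π, and MinHash signatures collide with probability equal to the Jaccard similarity. It is a toy sanity check on arbitrary synthetic data, not a benchmark of any published system.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 50, 2000          # K independent hash functions to estimate probabilities

# --- Signed random projections (Eq. 9) ---
x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
W = rng.standard_normal((D, K))                       # one column per hash function
collisions = np.mean((x1 @ W >= 0) == (x2 @ W >= 0))
theta = np.arccos(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
print("SRP empirical:", collisions, " theory:", 1 - theta / np.pi)

# --- MinHash (Eqs. 11-12) on two sets over a universe of M elements ---
M = 1000
S1 = set(rng.choice(M, size=300, replace=False))
S2 = set(rng.choice(M, size=300, replace=False)) | set(list(S1)[:150])
jaccard = len(S1 & S2) / len(S1 | S2)

def minhash(S, perm):
    # signature = smallest permuted index among the set's members (Eq. 11)
    return min(perm[d] for d in S)

agree = 0
for _ in range(K):
    perm = rng.permutation(M)
    agree += minhash(S1, perm) == minhash(S2, perm)
print("MinHash empirical:", agree / K, " theory (Jaccard):", jaccard)
```

With a few thousand hash functions the empirical collision rates match the theoretical values closely, which is exactly the property that makes these data-independent schemes attractive despite their limitations discussed next.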

III. Categories of Learning Based Hashing Methods

Among the three key steps in hashing-based ANN search, the design of improved data-dependent hash functions has been the focus of the learning to hash paradigm. Since the proposal of LSH in [58], many new hashing techniques have been developed. Note that most of the emerging hashing methods focus on improving the search performance using a single hash table. The reason is that these techniques expect to learn compact discriminative codes such that searching within a small Hamming ball of the query, or even an exhaustive scan in Hamming space, is both fast and accurate. Hence, in the following, we primarily focus on various techniques and algorithms for designing a single hash table. In particular, we provide different perspectives, such as the learning paradigm and the hash function characteristics, to categorize the hashing approaches developed recently. It is worth mentioning that a few recent studies have shown that exploiting the power of multiple hash tables can sometimes generate superior performance. In order to improve precision as well as recall, Xu et al. developed multiple complementary hash tables that are sequentially learned using a boosting-style algorithm [31]. Also, in cases when the code length is not very large and the number of database points is large, exhaustive scan in Hamming space can be done much faster by using multi-table indexing, as shown by Norouzi et al. [59].

A. Data-Dependent vs. Data-Independent

Based on whether the design of the hash functions requires analysis of a given dataset, there are two high-level categories of hashing techniques: data-independent and data-dependent. As one of the most popular data-independent approaches, random projection has been used widely for designing data-independent hashing techniques such as the LSH and SIKH methods mentioned earlier. LSH is arguably the most popular hashing method and has been applied to a variety of problem domains, including information retrieval and computer vision. In both LSH and SIKH, the projection vector w and intercept b, as defined in Eq. 4, are randomly sampled from certain distributions. Although these methods have strict performance guarantees, they are less efficient since the hash functions are not specifically designed for a certain dataset or search task. Based on the random projection scheme, there have been several efforts to improve the performance of the LSH method [37][39][41].

Realizing the limitation of data-independent hashing approaches, many recent methods use data and possibly some form of supervision to design more efficient hash functions. Based on the level of supervision, the data-dependent methods can be further categorized into three subgroups, as described below.

B. Unsupervised, Supervised, and Semi-Supervised

Many emerging hashing techniques are designed by exploiting various machine learning paradigms, ranging from unsupervised and supervised to semi-supervised settings. For instance, unsupervised hashing methods attempt to integrate data properties, such as distributions and manifold structures, to design compact hash codes with improved accuracy. Representative unsupervised methods include spectral hashing [32], graph hashing [28], manifold hashing [60], iterative quantization hashing [61], kernelized locality sensitive hashing [41][21], isotropic hashing [62], angular quantization hashing [63], and spherical hashing [64]. Among these approaches, spectral hashing explores the data distribution, and graph hashing utilizes the underlying manifold structure of the data captured by a graph representation. In addition, supervised learning paradigms ranging from kernel learning to metric learning to deep learning have been exploited to learn binary codes, and many supervised hashing methods have been proposed recently [19][22][65][66][67][68]. Finally, the semi-supervised learning paradigm has been employed to design hash functions using both labeled and unlabeled data. For instance, Wang et al. proposed a regularized objective to achieve accurate yet balanced hash codes and avoid overfitting [14]. In [69][70], the authors proposed to exploit metric learning and locality sensitive hashing to achieve fast similarity based search. Since the labeled data is used for deriving an optimal metric distance while the hash function design uses no supervision, the proposed hashing technique can be regarded as a semi-supervised approach.

C. Pointwise, Pairwise, Triplet-wise and Listwise

Based on the level of supervision, the supervised or semi-supervised hashing methods can be further grouped into several subcategories, including pointwise, pairwise, triplet-wise, and listwise approaches. For example, a few existing approaches utilize instance-level semantic attributes or labels to design the hash functions [71][18][72]. Additionally, learning methods based on pairwise supervision have been extensively studied, and many hashing techniques have been proposed [19][22][65][66][14][69][70][73]. As demonstrated in Figure 4(a), the pair (x_2, x_3) contains similar points and the other two pairs (x_1, x_2) and (x_1, x_3) contain dissimilar points. Such relations are considered in the learning procedure to preserve the pairwise label information in the learned Hamming space. Since the ranking information is not fully utilized, the performance of pairwise supervision based methods can be suboptimal for nearest neighbor search. More recently, triplet ranking, which encodes the pairwise proximity comparison among three data points, has been exploited to design hash codes [74][65][75]. As shown in Figure 4(b), the point x^+ is more similar to the query point q than the point x^-. Such triplet ranking information, i.e., sim(q, x^+) > sim(q, x^-), is expected to be encoded in the learned binary hash codes. Finally, listwise information indicates the rank order of a set of points with respect to the query point. In Figure 4(c), for the query point q, the rank list (x_4, x_1, x_2, x_3) shows the ranking order of their similarities to the query point q, where x_4 is the nearest point and x_3 is the farthest one. By converting rank lists to a triplet tensor matrix, listwise hashing is designed to preserve the ranking in the Hamming space [76].

Fig. 4. An illustration of different levels of supervised information: a) pairwise labels; b) a triplet sim(q, x^+) > sim(q, x^-); and c) a distance based rank list (x_4, x_1, x_2, x_3) for a query point q.

D. Linear vs. Nonlinear

Based on the form of the function f(·) in Eq. 4, hash functions can also be categorized into two groups: linear and nonlinear. Due to their computational efficiency, linear functions tend to be more popular; these include the random projection based LSH methods. The learning based methods derive optimal projections by optimizing different types of objectives. For instance, PCA hashing performs principal component analysis on the data to derive large variance projections [63][77][78], as shown in Figure 5(a). In the same league, supervised methods have used Linear Discriminant Analysis to design more discriminative hash codes [79][80]. Semi-supervised hashing methods estimate the projections that have minimum empirical loss on pairwise labels while partitioning the unlabeled data in a balanced way [14]. Techniques that use the variance of the projections as the underlying objective also tend to use orthogonality constraints for computational ease. However, these constraints lead to a significant drawback since the variance of most real-world data decays rapidly, with most of the variance contained in only the top few directions. Thus, in order to generate more bits in the code, one is forced to use progressively lower-variance directions due to the orthogonality constraints. The binary codes derived from these low-variance projections tend to have significantly lower performance. Two types of solutions, based on relaxation of the orthogonality constraints or a random/learned rotation of the data, have been proposed in the literature to address these issues [14][61]. Isotropic hashing is proposed to derive projections with equal variances and is shown to be superior to anisotropic variance based projections [62]. Instead of performing one-shot learning, sequential projection learning derives correlated projections with the goal of correcting the errors from previous hash bits [25]. Finally, to reduce the computational complexity of a full projection, circulant binary embedding was recently proposed to significantly speed up the encoding process using the circulant convolution [81].

Fig. 5. Comparison of hash bits generated using a) PCA hashing and b) spectral hashing.

Despite its simplicity, linear hashing often suffers from insufficient discriminative power. Thus, nonlinear methods have been developed to overcome such limitations. For instance, spectral hashing first extracts the principal projections of the data, and then partitions the projected data by a sinusoidal function (nonlinear) with a specific angular frequency. Essentially, it prefers to partition projections with large spread and small spatial frequency so that the large variance projections can be reused. As illustrated in Figure 5(b), the first principal component can be reused in spectral hashing to divide the data into four parts while being encoded with only one bit. In addition, shift-invariant kernel-based hashing chooses f(·) to be a shifted cosine function and samples the projection vector in the same way as standard LSH does [33]. Another category of nonlinear hashing techniques employs kernel functions [21][22][28][82]. Anchor graph hashing, proposed by Liu et al. [28], uses a kernel function to measure the similarity of each point with a set of anchors, resulting in nonlinear hashing. Kernelized LSH uses a sparse set of data points to compute a kernel matrix and performs random projection in the kernel space to compute binary codes [21]. Based on a similar kernel representation, Kulis and Darrell propose learning hash functions by explicitly minimizing the reconstruction error between the kernel space and the Hamming space [27]. Liu et al. also apply a kernel representation but optimize the hash functions by exploiting the equivalence between optimizing code inner products and Hamming distances to achieve scale invariance [22].

E. Single-Shot Learning vs. Multiple-Shot Learning

For learning based hashing methods, one first formulates an objective function reflecting the desired characteristics of the hash codes. In a single-shot learning paradigm, the optimal solution is derived by optimizing the objective function in a single shot; the K hash functions are learned simultaneously. In contrast, the multiple-shot learning procedure considers a global objective, but optimizes each hash function considering the bias generated by the previous hash functions. Such a procedure sequentially trains hash functions one bit at a time [83][25][84]. Multiple-shot hash function learning is often used in supervised or semi-supervised settings, since the given label information can be used to assess the quality of the hash functions learned in previous steps. For instance, sequential projection based hashing aims to incorporate the bit correlations by iteratively updating the pairwise label matrix, where higher weights are imposed on point pairs violated by the previous hash functions [25]. In the complementary projection learning approach [84], the authors present a sequential learning procedure to obtain a series of hash functions that cross the sparse data regions, as well as generate balanced hash buckets. Column generation hashing learns the best hash function during each iteration and updates the weights of the hash functions accordingly. Other interesting learning ideas include two-step learning methods which treat hash bit learning and hash function learning separately [85][86].

F. Non-Weighted vs. Weighted Hashing

Given the Hamming embedding defined in Eq. 2, traditional hashing based indexing schemes map the original data into a non-weighted Hamming space, where each bit contributes equally. Given such a mapping, the Hamming distance is calculated by counting the number of differing bits. However, it is easy to observe that different bits often behave differently [14][32]. In general, for linear projection based hashing methods, the binary code generated from a large variance projection tends to perform better due to its superior discriminative power. Hence, to improve the discrimination among hash codes, techniques have been designed to learn a weighted Hamming embedding

H : X → {α_1 h_1(x), ..., α_K h_K(x)}.   (13)

Hence the conventional Hamming distance is replaced by a weighted version

d_WH(y_i, y_j) = Σ_{k=1}^{K} α_k |h_k(x_i) - h_k(x_j)|.   (14)

One of the representative approaches is Boosted Similarity Sensitive Coding (BSSC) [18]. By learning the hash functions and the corresponding weights {α_1, ..., α_K} jointly, the objective is to lower the collision probability of a non-neighbor pair (x_i, x_j) ∈ C while improving the collision probability of a neighboring pair (x_i, x_j) ∈ M. If one treats each hash function as a decision stump, the straightforward way of learning the weights is to directly apply the adaptive boosting algorithm [87], as described in [18]. In [88], a boosting-style method called BoostMAP is proposed to map data points to weighted binary vectors that can leverage both metric and semantic similarity measures. Other weighted hashing methods include designing specific bit-level weighting schemes to improve the search accuracy [74][89][90][91][92]. In addition, a recent work on designing a unified bit selection framework can be regarded as a special case of the weighted hashing approach, where the weights of the hash bits are binary [93]. Another effective hash code ranking method is query-sensitive hashing, which explores the raw feature of the query sample and learns query-specific weights of the hash bits to achieve accurate ε-nearest neighbor search [94].
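To illustrate the weighted Hamming embedding of Eqs. (13)-(14), the sketch below ranks database items under both the plain and the weighted Hamming distance. The bit weights here are simply random numbers standing in for learned weights (e.g., as produced by a boosting procedure such as BSSC); nothing in this snippet reproduces a particular published method.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, K = 2000, 32, 12
X = rng.standard_normal((N, D))
W, b = rng.standard_normal((D, K)), rng.standard_normal(K)

H = np.where(X @ W + b >= 0, 1, -1)        # database codes in {-1, +1}^(N x K)
q = rng.standard_normal(D)
hq = np.where(q @ W + b >= 0, 1, -1)       # query code

diff = (H != hq).astype(float)             # per-bit disagreement indicators

d_plain = diff.sum(axis=1)                 # conventional Hamming distance
alpha = rng.uniform(0.0, 1.0, size=K)      # hypothetical learned bit weights
d_weighted = diff @ alpha                  # weighted Hamming distance (Eq. 14)

print("plain top-5:   ", np.argsort(d_plain)[:5])
print("weighted top-5:", np.argsort(d_weighted)[:5])
```

Because the weights rescale each bit's contribution, the weighted ranking can break the coarse ties produced by the integer-valued plain Hamming distance, which is the main practical benefit of bit-level weighting.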

IV. Methodology Review and Analysis

In this section, we focus on reviewing several representative hashing methods that explore various machine learning techniques to design data-specific indexing schemes. The techniques consist of unsupervised, semi-supervised, as well as supervised approaches, including spectral hashing, anchor graph hashing, angular quantization, binary reconstructive embedding based hashing, metric learning based hashing, semi-supervised hashing, column generation hashing, and ranking supervised hashing. Table II summarizes the surveyed hashing techniques, as well as their technical merits.

Note that this section mainly focuses on describing the intuition and formulation of each method, as well as discussing their pros and cons. The performance of each individual method highly depends on the practical setting, including the learning parameters and the dataset itself. In general, the nonlinear and supervised techniques tend to generate better performance than linear and unsupervised methods, while being more computationally costly [14][19][21][22][27].

TABLE II
A summary of the surveyed hashing techniques in this article.

Method | Hash Function/Objective Function | Parameters | Learning Paradigm | Supervision
Spectral Hashing | sgn(cos(α w^T x)) | w, α | unsupervised | NA
Anchor Graph Hashing | sgn(w^T x) | w | unsupervised | NA
Angular Quantization | (b, w) = arg max Σ_i (b_i^T / ||b_i||_2) w^T x_i | b, w | unsupervised | NA
Binary Reconstructive Embedding | sgn(w^T K(x)) | w | unsupervised/supervised | pairwise distance
Metric Learning Hashing | sgn(w^T G x) | G, w | supervised | pairwise similarity
Semi-Supervised Hashing | sgn(w^T x) | w | semi-supervised | pairwise similarity
Column Generation Hashing | sgn(w^T x + b) | w | supervised | triplet
Listwise Hashing | sgn(w^T x + b) | w | supervised | ranking list
Circulant Binary Embedding | sgn(circ(r) · x) | r | unsupervised/supervised | pairwise similarity

A. Spectral Hashing

In the formulation of spectral hashing, the desired properties include keeping neighbors in the input space as neighbors in the Hamming space, and requiring the codes to be balanced and uncorrelated [32]. Hence, the objective of spectral hashing is formulated as:

min Σ_{ij} (1/2) A_{ij} ||y_i - y_j||^2 = (1/2) tr(Y^T L Y)   (15)
subject to: Y ∈ {-1, 1}^{N×K}
            1^T y_{k·} = 0, k = 1, ..., K
            Y^T Y = n I_{K×K},

where A = {A_{ij}}_{i,j=1}^{N} is a pairwise similarity matrix and the Laplacian matrix is calculated as L = diag(A1) - A. The constraint 1^T y_{k·} = 0 ensures that each hash bit reaches a balanced partitioning of the data, and the constraint Y^T Y = n I_{K×K} imposes orthogonality between hash bits to minimize the redundancy.

The direct solution of the above optimization is nontrivial even for a single bit, since it is essentially a balanced graph partition problem, which is NP hard. The orthogonality constraints for K-bit balanced partitioning make the above problem even harder. Motivated by well-known spectral graph analysis [95], the authors suggest minimizing the cost function with relaxed constraints. In particular, under the assumption of a uniform data distribution, the spectral solution can be efficiently computed using 1D Laplacian eigenfunctions [32]. The final solution of spectral hashing amounts to applying a sinusoidal function with a pre-computed angular frequency to partition the data along PCA directions. Note that the projections are computed from the data but learned in an unsupervised manner. Like most orthogonal projection based hashing methods, spectral hashing suffers from the low quality of binary codes produced by low-variance projections. Hence, a "kernel trick" is used to alleviate the degraded performance when using long hash bits [96]. Moreover, the assumption of a uniform data distribution usually hardly holds for real-world data.

B. Anchor Graph Hashing

Following a similar objective as spectral hashing, anchor graph hashing was designed to solve the problem from a different perspective, without the assumption of a uniform distribution [28]. Note that the critical bottleneck in solving Eq. 15 is the cost of building a pairwise similarity graph A, the computation of the associated graph Laplacian, as well as solving the corresponding eigen-system, which has at least quadratic complexity. The key idea is to use a small set of M (M ≪ N) anchor points to approximate the graph structure represented by the matrix A, such that the similarity between any pair of points can be approximated using point-to-anchor similarities [97]. In particular, the truncated point-to-anchor similarity Z ∈ R^{N×M} gives the similarities between the N database points and the M anchor points. Thus, the approximated similarity matrix Â can be calculated as Â = Z Λ Z^T, where Λ = diag(Z^T 1) is the degree matrix of the anchor graph Z. Based on such an approximation, instead of solving the eigen-system of the matrix Â = Z Λ Z^T, one can alternatively solve a much smaller eigen-system with the M × M matrix Λ^{1/2} Z^T Z Λ^{1/2}. The final binary codes can be obtained by applying the sign function to the spectral embedding as

Y = sgn(Z Λ^{1/2} V Σ^{1/2}),   (16)

where V = [v_1, ..., v_k, ..., v_K] ∈ R^{M×K} and Σ = diag(σ_1, ..., σ_k, ..., σ_K) ∈ R^{K×K}, and {v_k, σ_k} are the eigenvector-eigenvalue pairs [28]. Figure 6 shows the two-bit partitioning of a synthetic dataset with nonlinear structure using different hashing methods, including spectral hashing, exact graph hashing, and anchor graph hashing. Note that since spectral hashing computes the two smoothest pseudo graph Laplacian eigenfunctions instead of performing a real spectral embedding, it cannot handle this type of nonlinear data structure. The exact graph hashing method first constructs an exact neighborhood graph, e.g., a kNN graph, and then performs partitioning with spectral techniques to solve the optimization problem in Eq. 15. The anchor graph hashing achieves a good separation (by the first bit) of the nonlinear manifold and a balanced partitioning, and even performs better than exact graph hashing, which loses the property of balanced partitioning for the second bit. The anchor graph hashing approach was recently further improved by leveraging a discrete optimization technique to directly solve for the binary hash codes without any relaxation [98].

Fig. 6. Comparison of partitioning a two-moon dataset by the first two hash bits using different methods: a) the first bit using spectral hashing; b) the first bit using exact graph hashing; c) the first bit using anchor graph hashing; d) the second bit using spectral hashing; e) the second bit using exact graph hashing; f) the second bit using anchor graph hashing.

C. Angular Quantization Based Hashing

Since similarity is often measured by the cosine of the angle between pairs of samples, angular quantization has been proposed to map non-negative feature vectors onto the vertex of the binary hypercube with the smallest angle [63]. In such a setting, the vertices of the hypercube are treated as quantization landmarks, whose number grows exponentially with the data dimensionality D. As shown in Figure 7, the nearest binary vertex b in the hypercube to a data point x is given by

b* = arg max_b (b^T x) / ||b||_2   (17)
subject to: b ∈ {0, 1}^K.

Fig. 7. Illustration of the angular quantization based hashing method [63]. The binary code of a data point x is assigned as the nearest binary vertex in the hypercube, which is b_4 = [0 1 1]^T in the illustrated example [63].

Although this is an integer programming problem, its global maximum can be found with a complexity of O(D log D). The optimal binary vertices are used as the binary hash codes for the data points, i.e., y = b*. Based on this angular quantization framework, a data-dependent extension is designed to learn a rotation matrix R ∈ R^{D×D} that aligns the projected data R^T x to the binary vertices without changing the similarity between point pairs. The objective is formulated as

(b_i*, R*) = arg max_{b_i, R} Σ_i (b_i^T / ||b_i||_2) R^T x_i   (18)
subject to: b_i ∈ {0, 1}^K
            R^T R = I_{D×D}.

Note that the above formulation still generates a D-bit binary code for each data point, while compact codes are often desired in many real-world applications [14]. To generate a K-bit code, a projection matrix S ∈ R^{D×K} with orthogonal columns can be used to replace the rotation matrix R in the above objective, with additional normalization, as discussed in [63]. Finally, the optimal binary codes and the projection/rotation matrix are learned using an alternating optimization scheme.

D. Binary Reconstructive Embedding

Instead of using data-independent random projections as in LSH or principal components as in SH, Kulis and Darrell [27] proposed data-dependent and bit-correlated hash functions of the form

h_k(x) = sgn( Σ_{q=1}^{s} W_{kq} κ(x_{kq}, x) ).   (19)

The sample set {x_{kq}}, q = 1, ..., s, is the training data for learning the hash function h_k, κ(·,·) is a kernel function, and W is a weight matrix.

Based on the above formulation, a method called Binary Reconstructive Embedding (BRE) was designed to minimize a cost function measuring the difference between the original metric distance and the reconstructed distance in the Hamming space. The Euclidean metric d_M and the binary reconstruction distance d_R are defined as:

d_M(x_i, x_j) = (1/2) ||x_i - x_j||^2   (20)
d_R(x_i, x_j) = (1/K) Σ_{k=1}^{K} ( h_k(x_i) - h_k(x_j) )^2.
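Before turning to how W is learned, the toy sketch below computes the two quantities of Eq. (20) for randomly initialized kernel hash functions of the form of Eq. (19); the gap between d_M and d_R is exactly the reconstruction error that BRE minimizes. The Gaussian kernel, its bandwidth, the anchor selection, the random W, and the mapping of the hash outputs to {0, 1} (which keeps d_R within [0, 1]) are illustrative assumptions of this sketch, not the settings of [27].

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, K, s = 500, 16, 8, 30
X = rng.standard_normal((N, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # normalize to the unit hypersphere

anchors = X[rng.choice(N, size=s, replace=False)]   # the x_kq training/anchor points
W = rng.standard_normal((K, s))                     # random initial weight matrix

def kernel(A, B, gamma=1.0):
    # Gaussian kernel, one illustrative choice of kappa(., .)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

Kmat = kernel(X, anchors)                   # (N, s) kernel values
H = np.where(Kmat @ W.T >= 0, 1.0, 0.0)     # hash bits in {0, 1}, following Eq. (19)

i, j = 0, 1
d_M = 0.5 * np.sum((X[i] - X[j]) ** 2)                  # metric distance, Eq. (20)
d_R = np.mean((H[i] - H[j]) ** 2)                       # reconstruction distance, Eq. (20)
print(f"d_M = {d_M:.3f}, d_R = {d_R:.3f}, squared error = {(d_M - d_R) ** 2:.3f}")
```

BRE adjusts W so that this squared error, summed over the training pairs, becomes small; the corresponding objective is given next.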

The objective is to minimize the following reconstruction error to derive the optimal W:

W* = arg min_W Σ_{(x_i, x_j) ∈ N} [ d_M(x_i, x_j) - d_R(x_i, x_j) ]^2,   (21)

where the set of sample pairs N is the training data. Optimizing the above objective function is difficult due to the non-differentiability of the sgn(·) function. Instead, a coordinate-descent algorithm was applied to iteratively update the hash functions to a local optimum. This hashing method can easily be extended to a supervised scenario by setting pairs with the same labels to have zero distance and pairs with different labels to have a large distance. However, since the binary reconstruction distance d_R is bounded in [0, 1] while the metric distance d_M has no upper bound, the minimization problem in Eq. (21) is only meaningful when the input data is appropriately normalized. In practice, the original data point x is often mapped to a hypersphere with unit length so that 0 ≤ d_M ≤ 1. This normalization removes the scale of the data points, which is often not negligible for practical applications of nearest neighbor search. In addition, the Hamming distance based objective is hard to optimize due to its nonconvex and nonsmooth properties. Hence, Liu et al. proposed to utilize the equivalence between code inner products and Hamming distances to design supervised, kernel-based hash functions [22]. The objective is to make the inner products of the hash codes consistent with the given pairwise supervision. Such a strategy of optimizing the hash code inner product in KSH, rather than the Hamming distance as done in BRE, pays off nicely and leads to major performance gains in similarity-based retrieval, consistently confirmed in the extensive experiments reported in [22] and recent studies [99].

Fig. 8. Illustration of the hashing method based on metric learning. The left shows the partitioning using the standard LSH method and the right shows the partitioning of the metric learning based LSH method (adapted from the original figure in [19]).

E. Metric Learning Based Hashing

The key idea of metric learning based hashing is to learn a parameterized Mahalanobis metric using pairwise label information. Such learned metrics are then employed in standard random projection based hash functions [19]. The goal is to preserve the pairwise relationships in the binary code space, where similar data pairs are more likely to collide in the same hash bucket and dissimilar pairs are less likely to share the same hash codes, as illustrated in Figure 8.

The parameterized inner product is defined as

sim(x_i, x_j) = x_i^T M x_j,

where M is a positive-definite d × d matrix to be learned from the labeled data. Note that this similarity measure corresponds to a parameterized squared Mahalanobis distance d_M. Assume that M can be factorized as M = G^T G. Then the parameterized squared Mahalanobis distance can be written as

d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j) = (G x_i - G x_j)^T (G x_i - G x_j).   (22)

Based on the above equation, the distance d_M(x_i, x_j) can be interpreted as the Euclidean distance between the projected data points G x_i and G x_j. Note that the matrix M can be learned through various metric learning methods such as information-theoretic metric learning [100]. To accommodate the learned distance metric, the randomized hash function is given as

h_k(x) = sgn(w_k^T G x).   (23)

It is easy to see that the above hash function generates hash codes which preserve the parameterized similarity measure in the Hamming space. Figure 8 demonstrates the difference between standard random projection based LSH and metric learning based LSH, where it is easy to see that the learned metric helps assign the same hash bit to similar sample pairs. Accordingly, the collision probability is given as

Pr[h_k(x_i) = h_k(x_j)] = 1 - (1/π) cos^{-1}( x_i^T G^T G x_j / (||G x_i|| ||G x_j||) ).   (24)

Realizing that pairwise constraints often become available incrementally, Jain et al. developed an efficient online locality-sensitive hashing method with gradually learned distance metrics [70].
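A minimal sketch of the hash function in Eq. (23) is given below. The matrix G would normally be obtained by factorizing a metric M learned from pairwise constraints (e.g., via ITML [100]); here, purely for illustration, G is taken to be a whitening-style transform of the data, which is an assumption of this sketch rather than the method of [19].

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, K = 1000, 10, 16
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))   # correlated toy data

# Stand-in for a learned Mahalanobis factor with M = G^T G: here G whitens the
# data covariance; a real system would learn M from labeled similar/dissimilar pairs.
cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(D)
evals, evecs = np.linalg.eigh(cov)
G = np.diag(evals ** -0.5) @ evecs.T

W = rng.standard_normal((K, D))            # random projections w_k, as in LSH

def metric_lsh(X, G, W):
    # h_k(x) = sgn(w_k^T G x): random projection in the learned metric space (Eq. 23)
    return np.where((X @ G.T) @ W.T >= 0, 1, -1)

def plain_lsh(X, W):
    # standard random projection LSH on the raw features
    return np.where(X @ W.T >= 0, 1, -1)

# Bit agreement of a nearby pair under both schemes (an illustrative check of Eq. 24).
x_a = X[0]
x_b = x_a + 0.05 * rng.standard_normal(D)
ha, hb = metric_lsh(np.stack([x_a, x_b]), G, W)
pa, pb = plain_lsh(np.stack([x_a, x_b]), W)
print("metric-LSH agreement:", np.mean(ha == hb), " plain-LSH agreement:", np.mean(pa == pb))
```

The learned (here, whitening) transform changes which pairs collide, which is precisely how the pairwise supervision is injected into an otherwise randomized hashing scheme.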

F. Semi-Supervised Hashing

Supervised hashing techniques have been shown to be superior to unsupervised approaches since they leverage the supervision information to design task-specific hash codes. However, in a typical large scale setting, the human annotation process is often costly and the labels can be noisy and sparse, which could easily lead to overfitting.

Considering a small set of pairwise labels and a large amount of unlabeled data, semi-supervised hashing aims at designing hash functions with minimum empirical loss while maintaining maximum entropy over the entire dataset. Specifically, assume the pairwise labels are given as two types of sets, M and C. A pair of data points (x_i, x_j) ∈ M indicates that x_i and x_j are similar, while (x_i, x_j) ∈ C means that x_i and x_j are dissimilar. Hence, the empirical accuracy on the labeled data for a family of hash functions H = [h_1, · · · , h_K] is given as

J(H) = Σ_k [ Σ_{(x_i,x_j)∈M} h_k(x_i) h_k(x_j) − Σ_{(x_i,x_j)∈C} h_k(x_i) h_k(x_j) ].   (25)

If we define a matrix S ∈ R^{l×l} incorporating the pairwise label information from X_l as

S_ij = 1 if (x_i, x_j) ∈ M;  S_ij = −1 if (x_i, x_j) ∈ C;  S_ij = 0 otherwise,   (26)

the above empirical accuracy can be written in a compact matrix form after dropping the sgn(·) function:

J(H) = (1/2) tr( W^T X_l S X_l^T W ).   (27)

However, only considering the empirical accuracy during the design of the hash functions can lead to undesired results. As illustrated in Figure 9(b), although such a hash bit partitions the data with zero error over the pairwise labeled data, it results in an imbalanced separation of the unlabeled data and is thus less informative. Therefore, an information theoretic regularization is suggested to maximize the entropy of each hash bit. After relaxation, the final objective is formed as

W* = arg max_W (1/2) tr( W^T X_l S X_l^T W ) + (η/2) tr( W^T X X^T W ),   (28)

where the first part represents the empirical accuracy and the second component encourages partitioning along large variance projections. The coefficient η weighs the contribution of these two components. The above objective can be solved using various optimization strategies, resulting in orthogonal or correlated binary codes, as described in [14][25]. Figure 9 illustrates the comparison of one-bit linear partitions obtained with different learning paradigms, where the semi-supervised method tends to produce a balanced yet accurate data separation. Finally, Xu et al. employ a similar semi-supervised formulation to sequentially learn multiple complementary hash tables to further improve the performance [31].
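For the orthogonal-code variant, the relaxed objective in Eq. (28) is maximized by the top eigenvectors of the adjusted covariance matrix X_l S X_l^T + η X X^T. The short NumPy sketch below illustrates only this step under stated assumptions: the data are zero-centered columns, the label matrix S is a toy example, and all names and sizes are hypothetical rather than the notation of [14][25].

import numpy as np

rng = np.random.default_rng(1)
d, n, l, K, eta = 32, 1000, 100, 8, 0.5          # dimension, points, labeled points, bits, trade-off

X = rng.normal(size=(d, n))                      # zero-centered data, one column per point
Xl = X[:, :l]                                    # the labeled subset
S = np.zeros((l, l))                             # pairwise labels: +1 similar, -1 dissimilar, 0 unknown
S[0, 1] = S[1, 0] = 1.0
S[0, 2] = S[2, 0] = -1.0

# Adjusted covariance combining the empirical term and the variance (entropy) regularizer of Eq. (28)
M = Xl @ S @ Xl.T + eta * (X @ X.T)

# Under the orthogonality constraint W^T W = I, the maximizer is given by the top-K eigenvectors of M
eigvals, eigvecs = np.linalg.eigh(M)
W = eigvecs[:, np.argsort(eigvals)[::-1][:K]]    # d x K projection matrix

codes = (W.T @ X > 0).astype(np.uint8)           # K-bit codes for all points
print(codes.shape)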
G. Column Generation Hashing

Beyond pairwise relationships, complex supervision such as ranking triplets and ranking lists has been exploited to learn hash functions with a ranking preserving property. In many real applications such as image retrieval and recommendation systems, it is often easier to obtain relative comparisons than instance-wise or pair-wise labels. In general, such relative comparison information is given in triplet form. Formally, a set of triplets is represented as

E = { (q_i, x_i^+, x_i^-) | sim(q_i, x_i^+) > sim(q_i, x_i^-) },

where the function sim(·) could be an unknown similarity measure. Hence, the triplet (q_i, x_i^+, x_i^-) indicates that the sample point x_i^+ is more semantically similar, or closer, to the query point q_i than the point x_i^-, as demonstrated in Figure 4(b).

As one of the representative methods falling into this category, column generation hashing explores the large-margin framework to leverage such proximity comparison information to design weighted hash functions [74]. In particular, the relative comparison sim(q_i, x_i^+) > sim(q_i, x_i^-) will be preserved in a weighted Hamming space as d_WH(q_i, x_i^+) < d_WH(q_i, x_i^-), where d_WH is the weighted Hamming distance defined in Eq. 14. To impose a large margin, the constraint d_WH(q_i, x_i^+) < d_WH(q_i, x_i^-) should be satisfied as well as possible. Thus, a typical large-margin objective with ℓ1 norm regularization can be formulated as

arg min_{w,ξ} Σ_{i=1}^{|E|} ξ_i + C ‖w‖_1   (29)
subject to: w ⪰ 0, ξ ⪰ 0;
d_WH(q_i, x_i^-) − d_WH(q_i, x_i^+) ≥ 1 − ξ_i, ∀i,

where w is the vector of weights associated with the individual hash bits. To solve the above optimization problem, the authors proposed using the column generation technique to learn the hash functions and the associated bit weights iteratively. In each iteration, the best hash function is generated and the weight vector is updated accordingly. In addition, different loss functions with other regularization terms such as ℓ∞ are also suggested as alternatives to the above formulation.

H. Ranking Supervised Hashing

Different from other methods that explore the triplet relationship [74][75][65], the ranking supervised hashing method attempts to preserve the ranking order of a set of database points with respect to the query point [76]. Assume that the training dataset X = {x_n} has N points with x_n ∈ R^D. In addition, a query set is given as Q = {q_m}, with q_m ∈ R^D, m = 1, · · · , M. For any specific query point q_m, we can derive a ranking list over X, which can be written as a vector r(q_m, X) = (r_1^m, · · · , r_n^m, · · · , r_N^m). Each element r_n^m falls into the integer range [1, N] and no two elements share the same value in the exact ranking case. If r_i^m < r_j^m (i, j = 1, · · · , N), sample x_i has a higher rank than x_j, which means x_i is more relevant or similar to q_m than x_j. To represent such a discrete ranking list, a ranking triplet matrix S(q_m) ∈ R^{N×N} is defined as

S(q_m; x_i, x_j) = 1 if r_i^m < r_j^m;  −1 if r_i^m > r_j^m;  0 if r_i^m = r_j^m.   (30)

Hence, for a set of query points Q = {q_m}, we can derive a triplet tensor, i.e., a set of triplet matrices,

S = { S(q_m) } ∈ R^{M×N×N}.

In particular, the elements of the triplet tensor are defined as S_mij = S(q_m)(i, j) = S(q_m; x_i, x_j), m = 1, · · · , M, i, j = 1, · · · , N. The objective is to preserve the ranking lists in the mapped Hamming space. In other words, if S(q_m; x_i, x_j) = 1, we tend to ensure d_H(q_m, x_i) < d_H(q_m, x_j), and otherwise d_H(q_m, x_i) > d_H(q_m, x_j). Assuming the hash codes take values in {−1, 1}, such a ranking order is equivalent to a similarity measurement using the inner products of the binary codes, i.e.,

d_H(q_m, x_i) < d_H(q_m, x_j) ⟺ H(q_m)^T H(x_i) > H(q_m)^T H(x_j).

Then the empirical loss function L_H over the ranking lists can be represented as

L_H = − Σ_m Σ_{i,j} H(q_m)^T [ H(x_i) − H(x_j) ] S_mij.

Assuming we utilize linear hash functions, the final objective is formed as the following constrained quadratic problem

W* = arg max_W L_H = arg max_W tr( W W^T B ),   (31)
s.t. W^T W = I,

where the constant matrix B is computed as B = Σ_m p_m q_m^T with p_m = Σ_{i,j} (x_i − x_j) S_mij. The orthogonality constraint is utilized to minimize the redundancy between different hash bits. Intuitively, the above formulation aims to preserve the ranking lists in the Hamming space, as shown in the conceptual diagram in Figure 10. The augmented Lagrangian multiplier method was introduced to derive feasible solutions for this constrained problem, as discussed in [76].

Fig. 10. The conceptual diagram of the Rank Supervised Hashing method. The top left component demonstrates the procedure of deriving the ground truth ranking list r using the semantic relevance or feature similarity/distance, and then converting it into a triplet matrix S(q) for a given query q. The bottom left component describes the estimation of the relaxed ranking triplet matrix S̃(q) from the binary hash codes. The right component shows the objective of minimizing the inconsistency between the two ranking triplet matrices.

I. Circulant Binary Embedding

Realizing that most of the current hashing techniques rely on linear projections, which could suffer from very high computational and storage costs for high-dimensional data, circulant binary embedding was recently developed to handle such a challenge using the circulant projection [81]. Briefly, given a vector r = (r_0, · · · , r_{d−1}), we can generate its corresponding circulant matrix R = circ(r) [101]. The binary embedding with the circulant projection is then defined as

h(x) = sgn(Rx) = sgn(circ(r) · x).   (32)

Since the circulant projection circ(r) x is equivalent to the circular convolution r ⊛ x, the linear projection can be computed using the fast Fourier transform as

circ(r) x = r ⊛ x = F^{-1}( F(r) ∘ F(x) ),   (33)

which reduces the time complexity from d^2 to d log d. Finally, one could randomly select the circulant vector r or design a specific one using supervised learning methods.
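To make the FFT shortcut of Eqs. (32) and (33) concrete, the sketch below hashes a vector with a random circulant projection and verifies the O(d log d) FFT path against the explicit O(d^2) circulant matrix. It is a minimal illustration under our own variable names, not the learned variant mentioned in [81].

import numpy as np

rng = np.random.default_rng(2)
d = 128
r = rng.normal(size=d)                 # circulant parameter vector (random here; could be learned)
x = rng.normal(size=d)                 # input point

# Eq. (33): circ(r) x = r (circular convolution) x = F^{-1}(F(r) o F(x)), computed in O(d log d)
proj_fft = np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real

# Correctness check against the explicit circulant matrix (column i is r cyclically shifted by i)
C = np.empty((d, d))
for i in range(d):
    C[:, i] = np.roll(r, i)
assert np.allclose(proj_fft, C @ x)

bits = (proj_fft > 0).astype(np.uint8) # Eq. (32): h(x) = sgn(circ(r) x), encoded as 0/1
print(bits[:16])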
V. Deep Learning for Hashing

During the past decade (since around 2006), Deep Learning [102], also known as Deep Neural Networks, has drawn increasing attention and research effort in a variety of artificial intelligence areas including speech recognition, computer vision, machine learning, text mining, etc. Since one main purpose of deep learning is to learn robust and powerful feature representations for complex data, it is very natural to leverage deep learning for exploring compact hash codes, which can be regarded as binary representations of data. In this section, we briefly introduce several recently proposed hashing methods that employ deep learning. In Table III, we compare eight deep learning based hashing methods in terms of four key characteristics that can be used to differentiate the approaches.

The earliest work in deep learning based hashing may be Semantic Hashing [103]. This method builds a deep generative model to discover hidden binary units (i.e., latent topic features) which can model input text data (i.e.,
word-count vectors). Such a deep model is made as a stack of Restricted Boltzmann Machines (RBMs) [104]. After learning a multi-layer RBM through pre-training and fine-tuning on a collection of documents, the hash code of any document is acquired by simply thresholding the output of the deepest layer. Such hash codes provided by the deep RBM were shown to preserve semantically similar relationships among input documents in the code space, in which each hash code (or hash key) is used as a memory address to locate the corresponding documents. In this way, semantically similar documents are mapped to adjacent memory addresses, thereby enabling efficient search via hash table lookup. To enhance the performance of deep RBMs, a supervised version was proposed in [66], which borrows the idea of nonlinear Neighbourhood Component Analysis (NCA) embedding [105]. The supervised information stems from given neighbor/nonneighbor relationships between training examples. The objective function of NCA is then optimized on top of a deep RBM, making the deep RBM yield discriminative hash codes. Note that supervised deep RBMs can be applied to data domains other than text. In [66], supervised deep RBMs using a Gaussian distribution to model the visible units in the first layer were successfully applied to handle massive image data.

A recent work named Sparse Similarity-Preserving Hashing [99] tried to address the low recall issue pertaining to relatively long hash codes, which affects most previous hashing techniques. The idea is to enforce sparsity in the hash codes learned from training examples with pairwise supervised information, that is, similar and dissimilar pairs of examples (also known as side information in the machine learning literature). The relaxed hash functions, actually nonlinear embedding functions, are learned by training a Tailored Feed-Forward Neural Network. Within this architecture, two ISTA-type networks [110] that share the same set of parameters and conduct fast approximations of sparse coding are coupled in the training phase. Since each output of the neural network is continuous albeit sparse, a hyperbolic tangent function is applied to the output followed by a thresholding operation, leading to the final binary hash code. In [99], an extension to hashing multimodal data, e.g., web images with textual tags, was also presented.

Another work named Deep Hashing [106] developed a deep neural network to learn multiple hierarchical nonlinear transformations which map original images to compact binary hash codes, and hence supports large-scale image retrieval with the learned binary image representation. The deep hashing model is established under three constraints which are imposed on the top layer of the deep neural network: 1) the reconstruction error between an original real-valued image feature vector and the resulting binary code is minimized, 2) each bit of the binary codes is balanced, and 3) all bits are independent of each other. Similar constraints have been adopted in prior unsupervised hashing or binary coding methods such as Iterative Quantization (ITQ) [111]. A supervised version called Supervised Deep Hashing (which is essentially semi-supervised, as abundant unlabeled examples are also used for training the deep neural network) was presented in [106] as well, where a discriminative term incorporating pairwise supervised information is added to the objective function of the deep hashing model. The authors of [106] showed the superiority of the supervised deep hashing model over its unsupervised counterpart. Both of them produce hash codes through thresholding the output of the top layer of the neural network, where all activation functions are hyperbolic tangent functions.

It is worthwhile to point out that the above methods, including Sparse Similarity-Preserving Hashing, Deep Hashing and Supervised Deep Hashing, did not include a pre-training stage during the training of the deep neural networks. Instead, the hash codes are learned from scratch using a set of training data. However, the absence of pre-training may make the generated hash codes less effective. Specifically, the Sparse Similarity-Preserving Hashing method is found to be inferior to the state-of-the-art supervised hashing method Kernel-Based Supervised Hashing (KSH) [22] in terms of search accuracy on some image datasets [99]; the Deep Hashing method and its supervised version are only slightly better than ITQ and its supervised version CCA+ITQ, respectively [111][106]. Note that KSH, ITQ and CCA+ITQ exploit relatively shallow learning frameworks.

Almost all existing hashing techniques, including the aforementioned ones relying on deep neural networks, take a vector of hand-crafted visual features extracted from an image as input. Therefore, the quality of the produced hash codes heavily depends on the quality of the hand-crafted features. To remove this barrier, a recent method called Convolutional Neural Network Hashing [107] was developed to integrate image feature learning and hash value learning into a joint learning model. Given pairwise supervised information, this model consists of a stage of learning approximate hash codes and a stage of training a deep Convolutional Neural Network (CNN) [112] that outputs continuous hash values. Such hash values can be generated by activation functions like sigmoid, hyperbolic tangent or softmax, and then quantized into binary hash codes through appropriate thresholding. Thanks to the power of CNNs, the joint model is capable of simultaneously learning image features and hash values, directly working on raw image pixels. The deployed CNN is composed of three convolution-pooling layers that involve rectified linear activation, max pooling, and local contrast normalization, a standard fully-connected layer, and an output layer with softmax activation functions.

Also based on CNNs, a more recent method called Deep Semantic Ranking Hashing [108] was presented to learn hash values such that multilevel semantic similarities among multi-labeled images are preserved. Like the Convolutional Neural Network Hashing method, this method takes image pixels as input and trains a deep CNN, by which image feature representations and hash values are jointly learned. The deployed CNN consists of five convolution-pooling layers, two fully-connected layers, and a hash layer (i.e., the output layer). The key hash layer is connected to both fully-connected layers, with the functional expression

h(x) = 2σ( w^T [f1(x); f2(x)] ) − 1,

in which x represents an input image, h(x) represents the vector of hash values for image x, f1(x) and f2(x) respectively denote the feature representations from the outputs of the first and second fully-connected layers, w is the weight vector, and σ(·) is the logistic function. The Deep Semantic Ranking Hashing method leverages listwise supervised information to train the CNN, which stems from a collection of image triplets that encode the multilevel similarities, i.e., the first image in each triplet is more similar to the second one than to the third one. The hash code of image x is finally obtained by thresholding the output h(x) of the hash layer at zero.
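The hash layer expression above can be reproduced in a few lines. The sketch below assumes the two fully-connected outputs f1(x) and f2(x) are given and uses a weight matrix with one row per hash unit (the paper writes a weight vector w); the weights here are random placeholders, so this only illustrates the mapping and the final thresholding at zero, not a trained model.

import numpy as np

rng = np.random.default_rng(3)
d1, d2, n_bits = 64, 64, 48                 # sizes of the two fully-connected outputs and the code length
W = rng.normal(size=(n_bits, d1 + d2))      # one weight row per hash unit (placeholder values)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hash_layer(f1, f2):
    # h(x) = 2 * sigma(W [f1(x); f2(x)]) - 1, continuous values in (-1, 1)
    return 2.0 * sigmoid(W @ np.concatenate([f1, f2])) - 1.0

def hash_code(f1, f2):
    # final binary code: threshold the hash-layer output at zero
    return (hash_layer(f1, f2) > 0).astype(np.uint8)

f1, f2 = rng.normal(size=d1), rng.normal(size=d2)
print(hash_code(f1, f2))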
TABLE III
The characteristics of eight recently proposed deep learning based hashing methods.

Deep Learning based Hashing Methods        | Data Domain    | Learning Paradigm | Learning Features? | Hierarchy of Deep Neural Networks
Semantic Hashing [103]                     | text           | unsupervised      | no                 | 4
Restricted Boltzmann Machine [66]          | text and image | supervised        | no                 | 4 and 5
Tailored Feed-Forward Neural Network [99]  | text and image | supervised        | no                 | 6
Deep Hashing [106]                         | image          | unsupervised      | no                 | 3
Supervised Deep Hashing [106]              | image          | supervised        | no                 | 3
Convolutional Neural Network Hashing [107] | image          | supervised        | yes                | 5
Deep Semantic Ranking Hashing [108]        | image          | supervised        | yes                | 8
Deep Neural Network Hashing [109]          | image          | supervised        | yes                | 10

The above Convolutional Neural Network Hashing method [107] requires separately learning approximate hash codes to guide the subsequent learning of the image representation and finer hash values. A more recent method, Deep Neural Network Hashing [109], goes beyond this, in that the image representation and the hash values are learned in one stage, so that representation learning and hash learning are tightly coupled and benefit each other. Similar to the Deep Semantic Ranking Hashing method [108], the Deep Neural Network Hashing method incorporates listwise supervised information to train a deep CNN, giving rise to the currently deepest architecture for supervised hashing. The pipeline of the deep hashing architecture includes three building blocks: 1) a triplet of images (the first image is more similar to the second one than to the third one) which is fed to the CNN, and upon which a triplet ranking loss is designed to characterize the listwise supervised information; 2) a shared sub-network with a stack of eight convolution layers to generate the intermediate image features; and 3) a divide-and-encode module that divides the intermediate image features into multiple channels, each of which is encoded into a single hash bit. Within the divide-and-encode module, there are one fully-connected layer and one hash layer. The former uses sigmoid activation, while the latter uses a piecewise thresholding scheme to produce nearly discrete hash values. Eventually, the hash code of any image is yielded by thresholding the output of the hash layer at 0.5. In [109], the Deep Neural Network Hashing method was shown to surpass the Convolutional Neural Network Hashing method as well as several shallow learning based supervised hashing methods in terms of image search accuracy.

Last, a few observations about the deep learning based hashing methods introduced in this section are worth mentioning.
1. The majority of these methods did not report the time of hash code generation. In real-world search scenarios, the speed of generating hashes should be substantially fast. There might be concern about the hashing speed of these deep neural network driven approaches, especially the approaches involving image feature learning, which may take a much longer time to hash an image compared to shallow learning driven approaches like ITQ and KSH.
2. Instead of employing deep neural networks to seek hash codes, another interesting problem is to design a proper hashing technique to accelerate deep neural network training or to save memory space. The recent work [113] presented a hashing trick named HashedNets, which shrinks the storage costs of neural networks significantly while mostly preserving the generalization performance in image classification tasks.
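The storage-sharing idea behind such a hashing trick can be sketched as follows: the entries of a dense layer's weight matrix are grouped by a hash of their index, so only a small pool of shared parameters is stored. This is a schematic illustration only, not the exact HashedNets formulation of [113] (which includes further components such as sign hashing and training through the shared parameters); all sizes and names are hypothetical.

import numpy as np

rng = np.random.default_rng(7)
n_in, n_out, n_buckets = 256, 128, 1000          # a 256 x 128 layer stored with only 1000 shared weights
shared = rng.normal(scale=0.05, size=n_buckets)  # the only parameters actually kept in memory

# Deterministic toy hash of each weight position (i, j) into a bucket
bucket_idx = np.array([[hash((i, j)) % n_buckets for j in range(n_out)] for i in range(n_in)])

def layer_forward(x):
    # Expand the "virtual" dense matrix on the fly: entry (i, j) is read from its hashed bucket
    W = shared[bucket_idx]
    return x @ W

print(layer_forward(rng.normal(size=n_in)).shape)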
VI. Advanced Methods and Related Applications

In this section, we further extend the survey scope to cover a few more advanced hashing methods that are developed for specific settings and applications, such as point-to-hyperplane hashing, subspace hashing, and multimodality hashing.

A. Hyperplane Hashing

Distinct from the previously surveyed conventional hashing techniques, all of which address the problem of fast point-to-point nearest neighbor search (see Figure 12(a)), a new scenario, "point-to-hyperplane" hashing, emerges to tackle fast point-to-hyperplane nearest neighbor search (see Figure 12(b)), where the query is a hyperplane instead of a data point. Such a new scenario requires hashing the hyperplane query to near database points, which is difficult to accomplish because point-to-hyperplane distances are quite different from routine point-to-point distances in terms of the computation mechanism. Despite the bulk of research on point-to-point hashing, this special hashing paradigm is rarely touched. For convenience, we refer to point-to-hyperplane hashing as Hyperplane Hashing.

Fig. 12. Two distinct nearest neighbor search problems. (a) Point-to-point search: the blue solid circle represents a point query and the red circle represents the found nearest neighbor point. (b) Point-to-hyperplane search: the blue plane denotes a hyperplane query P_w with w being its normal vector, and the red circle denotes the found nearest neighbor point.

Hyperplane hashing is actually fairly important for many machine learning applications such as large-scale active learning with SVMs [114]. In SVM-based active learning [115], the well proven sample selection strategy is to search the unlabeled sample pool to identify the sample closest to the current hyperplane decision boundary, thus providing the most useful information for improving the learning model. When making such active learning scalable to gigantic databases, exhaustive search for the point nearest to the hyperplane cannot meet the online sample selection requirement. Hence, novel hashing methods that can principally handle hyperplane queries are called for. A conceptual diagram of using hyperplane hashing to scale up the active learning process is shown in Figure 11.

Fig. 11. Active learning framework with a hashing-based fast query selection strategy (the illustration is from http://vision.cs.utexas.edu/projects/activehash/).

We demonstrate the geometric relationship between a data point x and a hyperplane P_w with normal vector w in Figure 13(a). Given a hyperplane query P_w and a set of points X, the target nearest neighbor is

x* = arg min_{x∈X} D(x, P_w),

where D(x, P_w) = |w^T x| / ‖w‖ is the point-to-hyperplane distance. The existing hyperplane hashing methods [116][117] all attempt to minimize a slightly modified "distance" |w^T x| / (‖w‖ ‖x‖), i.e., the sine of the point-to-hyperplane angle α_{x,w} = |θ_{x,w} − π/2|. Note that θ_{x,w} ∈ [0, π] is the angle between x and w. The angle measure α_{x,w} ∈ [0, π/2] between a database point and a hyperplane query is thus what needs to be reflected in the design of the hash functions.

Fig. 13. The hyperplane hashing problem. (a) Point-to-hyperplane distance D(x, P_w) and point-to-hyperplane angle α_{x,w}. (b) Neighbors (x1, x2) and nonneighbors (x3, x4) of the hyperplane query P_w; the ideal neighbors are the points ⊥ w.

As shown in Figure 13(b), the goal of hyperplane hashing is to hash a hyperplane query P_w and the desired neighbors (e.g., x1, x2) with narrow α_{x,w} into the same or nearby hash buckets, while avoiding returning the undesired nonneighbors (e.g., x3, x4) with wide α_{x,w}. Because α_{x,w} = |θ_{x,w} − π/2|, the point-to-hyperplane search problem can be equivalently transformed into a specific point-to-point search problem in which the query is the hyperplane normal w and the desired nearest neighbor to the raw query P_w is the point whose angle θ_{x,w} from w is closest to π/2, i.e., most closely perpendicular to w (we write "perpendicular to w" as ⊥ w for brevity). This is very different from traditional point-to-point nearest neighbor search, which returns the point most similar to the query point. In the following, several existing hyperplane hashing methods are briefly discussed.

Jain et al. [117] devised two different families of randomized hash functions to attack the hyperplane hashing problem. The first one is the Angle-Hyperplane Hash (AH-Hash) family A, of which one instance function is

h_A(z) = [sgn(u^T z), sgn(v^T z)] if z is a database point, and [sgn(u^T z), sgn(−v^T z)] if z is a hyperplane normal,   (34)

where z ∈ R^d represents an input vector, and u and v are both drawn independently from a standard d-variate Gaussian, i.e., u, v ∼ N(0, I_{d×d}). Note that h_A is a two-bit hash function, which leads to the following probability of collision for a hyperplane normal w and a database point x:

Pr[ h_A(w) = h_A(x) ] = 1/4 − α²_{x,w} / π².   (35)

This probability monotonically decreases as the point-to-hyperplane angle α_{x,w} increases, ensuring angle-sensitive hashing.
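Eqs. (34) and (35) are easy to check empirically: draw many independent (u, v) pairs, hash the hyperplane normal and a database point, and compare the observed collision rate with 1/4 − α²/π². The NumPy sketch below does exactly that; the variable names are ours and the sample sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
d = 32
w = rng.normal(size=d)                         # hyperplane normal (the query)
x = rng.normal(size=d)                         # a database point

def ah_hash(z, u, v, is_hyperplane_normal=False):
    # two-bit Angle-Hyperplane Hash of Eq. (34)
    second = -(v @ z) if is_hyperplane_normal else (v @ z)
    return (np.sign(u @ z), np.sign(second))

def empirical_collision_rate(trials=20000):
    hits = 0
    for _ in range(trials):
        u, v = rng.normal(size=d), rng.normal(size=d)
        hits += ah_hash(w, u, v, True) == ah_hash(x, u, v)
    return hits / trials

theta = np.arccos(np.clip(w @ x / (np.linalg.norm(w) * np.linalg.norm(x)), -1.0, 1.0))
alpha = abs(theta - np.pi / 2)
print(empirical_collision_rate(), 0.25 - alpha**2 / np.pi**2)   # empirical rate vs. Eq. (35)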
The second family proposed by Jain et al. is the Embedding-Hyperplane Hash (EH-Hash) function family E, of which one instance is

h_E(z) = sgn( U^T V(zz^T) ) if z is a database point, and sgn( −U^T V(zz^T) ) if z is a hyperplane normal,   (36)

where V(A) returns the vectorial concatenation of matrix A, and U ∼ N(0, I_{d²×d²}). The EH hash function h_E yields hash bits in an embedded space R^{d²} resulting from vectorizing the rank-one matrices zz^T and −zz^T. Compared with h_A, h_E gives a higher probability of collision:

Pr[ h_E(w) = h_E(x) ] = cos^{-1}( sin²(α_{x,w}) ) / π,   (37)

which also bears the angle-sensitive hashing property. However, it is much more expensive to compute than AH-Hash.

More recently, Liu et al. [117] designed a randomized function family with bilinear forms, Bilinear-Hyperplane Hash (BH-Hash):

B = { h_B(z) = sgn(u^T z z^T v), i.i.d. u, v ∼ N(0, I_{d×d}) }.   (38)

As a core finding, Liu et al. proved in [117] that the probability of collision for a hyperplane query P_w and a database point x under h_B is

Pr[ h_B(P_w) = h_B(x) ] = 1/2 − 2 α²_{x,w} / π².   (39)

Specifically, h_B(P_w) is prescribed to be −h_B(w). Eq. (39) endows h_B with the angle-sensitive hashing property. Importantly, the collision probability given by the BH hash function h_B is always twice the collision probability of the AH hash function h_A, and is also greater than the collision probability of the EH hash function h_E. As illustrated in Figure 14, for any fixed r, BH-Hash achieves the highest probability of collision, which indicates that BH-Hash has a better angle-sensitive property.

Fig. 14. Comparison of the collision probabilities of the three randomized hyperplane hashing schemes, plotted as p1 (probability of collision) versus r (squared point-to-hyperplane angle).

In terms of formulation, the bilinear hash function h_B is correlated with, yet different from, the linear hash functions h_A and h_E: (1) h_B produces a single hash bit which is the product of the two hash bits produced by h_A; (2) h_B may be viewed as a rank-one special case of h_E in algebra if we write u^T zz^T v = tr(zz^T vu^T) and U^T V(zz^T) = tr(zz^T U); (3) h_B appears in a universal form, while both h_A and h_E treat a query and a database item in a distinct manner. The computation time of h_B is Θ(2d), which is the same as that of h_A and one order of magnitude faster than the Θ(2d²) of h_E. Liu et al. further improved the performance of h_B by learning the bilinear projection directions u, v in h_B from the data. Gong et al. extended the bilinear formulation to the conventional point-to-point hashing scheme by designing compact binary codes for high-dimensional visual descriptors [118].
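A similar Monte Carlo check applies to the bilinear hash of Eq. (38): each draw of (u, v) yields one bit sgn(u^T z z^T v), the query side uses h_B(P_w) = −h_B(w), and the observed collision rate should approach 1/2 − 2α²/π² from Eq. (39). The sketch below is illustrative only.

import numpy as np

rng = np.random.default_rng(6)
d = 32
w = rng.normal(size=d)                         # hyperplane normal (the query)
x = rng.normal(size=d)                         # a database point

def bh_bit(u, v, z):
    # single-bit Bilinear-Hyperplane Hash: sgn(u^T z z^T v) = sgn((u^T z)(z^T v)), Eq. (38)
    return np.sign((u @ z) * (z @ v))

def empirical_collision_rate(trials=20000):
    hits = 0
    for _ in range(trials):
        u, v = rng.normal(size=d), rng.normal(size=d)
        hits += (-bh_bit(u, v, w)) == bh_bit(u, v, x)   # h_B(P_w) is prescribed to be -h_B(w)
    return hits / trials

theta = np.arccos(np.clip(w @ x / (np.linalg.norm(w) * np.linalg.norm(x)), -1.0, 1.0))
alpha = abs(theta - np.pi / 2)
print(empirical_collision_rate(), 0.5 - 2 * alpha**2 / np.pi**2)   # empirical rate vs. Eq. (39)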
B. Subspace Hashing

Beyond the aforementioned conventional hashing, which tackles searching in a database of vectors, subspace hashing [119], which has been rarely explored in the literature, attempts to efficiently search through a large database of subspaces. Subspace representations are very common in many computer vision, pattern recognition, and statistical learning problems, such as subspace representations of image patches, image sets, video clips, etc. For example, face images of the same subject with fixed poses but different illuminations are often assumed to reside near linear subspaces. A common use scenario is to use a single face image to find the subspace (and the corresponding subject ID) closest to the query image [120]. Given a query in the form of a vector or a subspace, searching for the nearest subspace in a subspace database is frequently encountered in a variety of practical applications including example-based image synthesis, scene classification, speaker recognition, face recognition, and motion-based action recognition [120].

However, hashing and searching for subspaces are both different from the schemes used in traditional vector hashing and the latest hyperplane hashing. [119] presented a general framework for the problem of Approximate Nearest Subspace (ANS) search, which uniformly deals with the cases where the query is a vector or a subspace, where query and database elements are subspaces of fixed dimension, where query and database elements are subspaces of different dimension, and where database elements are subspaces of varying dimension. The critical technique exploited by [119] is two-step: 1) a simple mapping that maps both query and database elements to "points" in a new vector space, and 2) approximate nearest neighbor search using conventional vector hashing algorithms in the new space. Consequently, the main contribution of [119] is reducing the difficult subspace hashing problem to a regular vector hashing task, for which [119] used LSH. While simple, the hashing technique (mapping + LSH) of [119] perhaps suffers from the high dimensionality of the constructed new vector space.

More recently, [120] exclusively addressed the point-to-subspace query, where the query is a vector and the database items are subspaces of arbitrary dimension. [120] proposed a rigorously faster hashing technique than that of [119]. Its hash function can hash D-dimensional vectors (D is the ambient dimension of the query) or D × r-dimensional subspaces (r is arbitrary) in linear time complexity O(D), which is computationally more efficient than the hash functions devised in [119]. [120] further proved the search time under the O(D) hashes to be sublinear in the database size.

Based on the nice finding of [120], we would like to achieve faster hashing for the subspace-to-subspace query by crafting novel hash functions to handle subspaces of varying dimension. Both theoretical and practical explorations in this direction will be beneficial to the hashing area.

C. MultiModality Hashing

Note that the majority of the hash learning methods are designed for constructing the Hamming embedding of a single modality or representation. Some recent advanced methods have been proposed to design hash functions for more complex settings, such as when the data are represented by multimodal features or the data are formed in a heterogeneous way [121]. Such hashing methods are closely related to applications in social networks, where multimodality and heterogeneity are often observed. Below we survey several representative methods that have been proposed recently.

Realizing that data items like webpages can be described by multiple information sources, composite hashing was recently proposed to design hashing schemes using several information sources [122]. Besides the intuitive way of concatenating multiple features to derive hash functions, the authors also presented an iterative weighting scheme and formulated a convex combination of multiple features. The objective is to ensure the consistency between the semantic similarity and the Hamming similarity of the data. Finally, a joint optimization strategy is employed to learn the importance of each individual type of feature and the hash functions. Co-regularized hashing was proposed to investigate hash learning across multi-modal data in a supervised setting, where similar and dissimilar pairs of intra-modality points are given as supervision information [123]. One typical setting is to index images and text jointly to preserve the semantic relations between image and text. The authors formulate their objective as a boosted co-regularization framework whose cost component is a weighted sum of the intra-modality and inter-modality losses. The learning of the hash functions is performed via a boosting procedure so that the bias introduced by the previous hash functions can be sequentially minimized. Dual-view hashing attempts to derive a hidden common Hamming embedding of data from two views, while maintaining the predictability of the binary codes [124]. A probabilistic model called multimodal latent binary embedding was recently presented to derive binary latent factors in a common Hamming space for indexing multimodal data [125]. Other closely related hashing methods include the design of multiple feature hashing for near-duplicate detection [126], submodular hashing for video indexing [127], and Probabilistic Attributed Hashing for integrating low-level features and semantic attributes [128].

D. Applications with Hashing

Indexing massive multimedia data, such as images and video, is the natural application of learning based hashing. Especially, due to the well-known semantic gap, supervised and semi-supervised hashing methods have been extensively studied for image search and retrieval [69][29][41][129][130][131] and mobile product search [132]. Other closely related computer vision applications include image patch matching [133], image classification [118], face recognition [134][135], pose estimation [71], object tracking [136], and duplicate detection [126][137][52][51][138]. In addition, this emerging hash learning framework can be exploited for general machine learning and data mining tasks, including cross-modality data fusion [139], large scale optimization [140], large scale classification and regression [141], collaborative filtering [142], and recommendation [143]. For indexing video sequences, a straightforward method is to independently compute binary codes for each key frame and use the set of hash codes as the video index. More recently, Ye et al. proposed a structure learning framework to derive a video hashing technique that incorporates both temporal and spatial structure information [144]. In addition, advanced hashing methods have also been developed for document search and retrieval. For instance, Wang et al. proposed to leverage both tag information and semantic topic modeling to achieve more accurate hash codes [145]. Li et al. designed a two-stage unsupervised hashing framework for fast document retrieval [146].

Hashing techniques have also been applied to the active learning framework to cope with big data applications. Without performing an exhaustive test on all the data points, hyperplane hashing can significantly speed up the interactive training sample selection procedure [114][117][116]. In addition, a two-stage hashing scheme has been developed to achieve fast query pair selection for large scale active learning to rank [147].
VII. Open Issues and Future Directions

Despite the tremendous progress in developing a large array of hashing techniques, several major issues remain open. First, unlike the locality sensitive hashing family, most of the learning based hashing techniques lack theoretical guarantees on the quality of the returned neighbors. Although several recent techniques have presented theoretical analyses of the collision probability, they are mostly based on randomized hash functions [116][117][19]. Hence, it is highly desirable to further investigate such theoretical properties. Second, compact hash codes have been mostly studied for large scale retrieval problems. Due to their compact form, hash codes also have great potential in many other large scale data modeling tasks such as efficient nonlinear kernel SVM classifiers [148] and rapid kernel approximation [149]. A bigger question is: instead of using the original data, can one directly use compact codes to do generic unsupervised or supervised learning without affecting the accuracy? To achieve this, theoretically sound practical methods need to be devised. This will make efficient large-scale learning possible with limited resources, for instance on mobile devices. Third, most of the current hashing techniques are designed for given feature representations that tend to suffer from the semantic gap. One possible future direction is to integrate representation learning with binary code learning using advanced learning schemes such as deep neural networks. Finally, since heterogeneity has been an important characteristic of big data applications, one of the future trends will be to design efficient hashing approaches that can leverage heterogeneous features and multi-modal data to improve the overall indexing quality. Along those lines, developing new hashing techniques for composite distance measures, i.e., those based on combinations of different distances acting on different types of features, will be of great interest.

References
[10] T. Ge, K. He, Q. Ke, and J. Sun, "Optimized product quantization for approximate nearest neighbor search," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 2946–2953.
[11] A. Torralba, R. Fergus, and W. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
[12] D. Knuth, Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Professional, 1997.
[13] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proc. of 25th International Conference on Very Large Data Bases, 1999, pp. 518–529.
[14] J. Wang, S. Kumar, and S.-F. Chang, "Semi-supervised hashing for large scale search," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[15] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.
[16] L. Cayton and S. Dasgupta, "A learning framework for nearest neighbor search," in Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. Cambridge, MA: MIT Press, 2008, pp. 233–240.
[17] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer, "Compact hashing with joint optimization of search accuracy and time," in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 2011, pp. 753–760.
[18] G. Shakhnarovich, "Learning task-specific similarity," Ph.D. dissertation, Massachusetts Institute of Technology, 2005.
[19] B. Kulis, P. Jain, and K. Grauman, "Fast similarity search for learned metrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143–2157, 2009.
[20] A. Gordo and F. Perronnin, "Asymmetric distances for binary embeddings," in IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 2011, pp. 729–736.
[21] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[22] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 2074–2081.
[23] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li, "Compressed hashing," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Portland, Oregon, USA, 2013, pp. 446–451.
[1] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: [24] A. Joly and O. Buisson, “Random maximum margin hashing,”
Ideas, influences, and trends of the new age,” ACM Computing in IEEE Conference on Computer Vision and Pattern Recog-
Surveys, vol. 40, no. 2, pp. 1–60, 2008. nition, Colorado Springs, CO, USA, 2011, pp. 873–880.
[2] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest-neighbor [25] J. Wang, S. Kumar, and S.-F. Chang, “Sequential projection
methods in learning and vision: theory and practice. MIT learning for hashing with compact codes,” in Proceedings of the
Press, 2006. 27th International Conference on Machine Learning, 2010, pp.
[3] R. Bellman, Dynamic programming, ser. Rand Corporation re- 1127–1134.
search study. Princeton University Press, 1957. [26] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-
[4] J. Bentley, “Multidimensional binary search trees used for asso- preserving quantization method for learning binary compact
ciative searching,” Communications of the ACM, vol. 18, no. 9, codes,” in Proceedings of the IEEE International Conference
p. 517, 1975. on Computer Vision and Pattern Recognition, Portland, Ore-
[5] S. Omohundro, “Efficient algorithms with neural network be- gon, USA, 2013, pp. 2938–2945.
havior,” Complex Systems, vol. 1, no. 2, pp. 273–347, 1987. [27] B. Kulis and T. Darrell, “Learning to Hash with Binary Recon-
[6] J. Uhlmann, “Satisfying general proximity/similarity queries structive Embeddings,” in Proc. of Advances in Neural Infor-
with metric trees,” Information Processing Letters, vol. 40, mation Processing Systems, Y. Bengio, D. Schuurmans, J. Laf-
no. 4, pp. 175–179, 1991. ferty, C. K. I. Williams, and A. Culotta, Eds., 2009, vol. 20,
[7] P. Yianilos, “Data structures and algorithms for nearest neigh- pp. 1042–1050.
bor search in general metric spaces,” in Proc. of the fourth [28] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, “Hashing with
annual ACM-SIAM Symposium on Discrete algorithms, 1993, graphs,” in Proc. of ICML, Bellevue, Washington, USA, 2011.
pp. 311–321. [29] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hash-
[8] P. Indyk, “Nearest-neighbor searching in high dimensions,” in ing for scalable image retrieval,” in IEEE Computer Society
Handbook of discrete and computational geometry, J. E. Good- Conference on Computer Vision and Pattern Recognition, San
man and J. O’Rourke, Eds. Boca Raton, FL: CRC Press LLC, Francisco, USA, 2010, pp. 3424–3431.
2004. [30] M. Norouzi and D. Fleet, “Minimal loss hashing for compact
[9] H. Jegou, M. Douze, and C. Schmid, “Product quantization binary codes,” in Proceedings of the 27th International Con-
for nearest neighbor search,” IEEE Transactions on Pattern ference on Machine Learning, 2011.
Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, [31] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Comple-
2011. mentary hashing for approximate nearest neighbor search,” in
Proceedings of the IEEE International Conference on Com- [53] P. Li and C. König, “b-bit minwise hashing,” in Proceedings of
puter Vision, 2011, pp. 1631–1638. the 19th International Conference on World Wide Web, 2010,
[32] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in pp. 671–680.
Proc. of Advances in Neural Information Processing Systems, [54] P. Li, A. Konig, and W. Gui, “b-bit minwise hashing for es-
D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. timating three-way similarities,” in Advances in Neural Infor-
MIT Press, 2008, vol. 21, pp. 1753–1760. mation Processing Systems 23, J. Lafferty, C. K. I. Williams,
[33] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp.
from shift-invariant kernels,” in Advances in Neural Informa- 1387–1395.
tion Processing Systems 22, Y. Bengio, D. Schuurmans, J. Laf- [55] P. Li, A. Owen, and C.-H. Zhang, “One permutation hash-
ferty, C. K. I. Williams, and A. Culotta, Eds. MIT Press, 2009, ing,” in Advances in Neural Information Processing Systems
pp. 1509–1517. 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Wein-
[34] M. S. Charikar, “Similarity estimation techniques from round- berger, Eds., 2012, pp. 3122–3130.
ing algorithms,” in Proceedings of the thiry-fourth annual ACM [56] O. Chum, M. Perdoch, and J. Matas, “Geometric min-hashing:
symposium on Theory of computing. ACM, 2002, pp. 380–388. Finding a (thick) needle in a haystack,” in IEEE Computer
[35] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni, “Locality- Society Conference on Computer Vision and Pattern Recogni-
sensitive hashing scheme based on p-stable distributions,” in tion, Miami, Florida, USA, 2009, pp. 17–24.
Proceedings of the twentieth annual Symposium on Computa- [57] O. Chum and J. Matas, “Fast computation of min-hash signa-
tional Geometry, 2004, pp. 253–262. tures for image collections,” in Proceedings of the IEEE Inter-
[36] M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning national Conference on Computer Vision and Pattern Recog-
indexes for similarity search,” in Proceedings of the 14th inter- nition, Providence, RI, USA, 2012, pp. 3077–3084.
national conference on World Wide Web, Chiba, Japan, 2005, [58] P. Indyk and R. Motwani, “Approximate nearest neighbors:
pp. 651–660. towards removing the curse of dimensionality,” in Proc. of 30th
[37] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi- ACM Symposium on Theory of Computing, 1998, pp. 604–613.
probe LSH: efficient indexing for high-dimensional similarity [59] M. Norouzi, A. Punjani, and D. J. Fleet, “Fast search in ham-
search,” in Proceedings of the 33rd international conference ming space with multi-index hashing,” in IEEE Conference
on Very large data bases, 2007, pp. 950–961. on Computer Vision and Pattern Recognition, Providence, RI,
[38] W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li, USA, 2012, pp. 3108–3115.
“Modeling lsh for performance tuning,” in Proceedings of the [60] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang,
17th ACM Conference on Information and Knowledge Man- “Inductive hashing on manifolds,” in Proceedings of the IEEE
agement, 2008, pp. 669–678. International Conference on Computer Vision and Pattern
[39] A. Dasgupta, R. Kumar, and T. Sarlos, “Fast locality-sensitive Recognition, Portland, Oregon, USA, 2013, pp. 1562–1569.
hashing,” in Proceedings of the 17th ACM SIGKDD Interna- [61] Y. Gong and S. Lazebnik, “Iterative quantization: A pro-
tional Conference on Knowledge Discovery and Data Mining, crustean approach to learning binary codes,” in IEEE Confer-
2011, pp. 1073–1081. ence on Computer Vision and Pattern Recognition, Colorado
[40] V. Satuluri and S. Parthasarathy, “Bayesian locality sensitive Springs, CO, USA, 2011, pp. 817–824.
hashing for fast similarity search,” Proceedings of the VLDB
[62] W. Kong and W.-J. Li, “Isotropic hashing,” in Advances in
Endowment, vol. 5, no. 5, pp. 430–441, 2012.
Neural Information Processing Systems 25, 2012, pp. 1655–
[41] B. Kulis and K. Grauman, “Kernelized locality-sensitive hash- 1663.
ing for scalable image search,” in IEEE International Con-
[63] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, “Angular
ference on Computer Vision, kyoto, Japan, 2009, pp. 2130 –
quantization based binary codes for fast similarity search,” in
2137.
Advances in Neural Information Processing Systems 25, 2012,
[42] Y. Mu, J. Shen, and S. Yan, “Weakly-supervised hashing in
pp. 1205–1213.
kernel space,” in IEEE Computer Society Conference on Com-
puter Vision and Pattern Recognition, San Francisco, USA, [64] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, “Spher-
2010, pp. 3344–3351. ical hashing,” in IEEE Conference on Computer Vision and
Pattern Recognition, Providence, RI, USA, 2012, pp. 2957–
[43] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, “Super-bit locality-
2964.
sensitive hashing,” in Advances in Neural Information Process-
ing Systems 25, 2012, pp. 108–116. [65] M. Norouzi, D. Fleet, and R. Salakhutdinov, “Hamming dis-
[44] Y. Mu and S. Yan, “Non-metric locality-sensitive hashing.” in tance metric learning,” in Advances in Neural Information
Proceedings of the Twenty-Fourth AAAI Conference on Arti- Processing Systems, 2012, pp. 1070–1078.
ficial Intelligence, 2010, pp. 539–544. [66] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large
[45] A. Broder, “On the resemblance and containment of docu- image databases for recognition,” in IEEE Conference on
ments,” in Compression and Complexity of Sequences Proceed- Computer Vision and Pattern Recognition, Anchorage, Alaska,
ings, 1997, pp. 21–29. USA, 2008, pp. 1–8.
[46] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, [67] L. Fan, “Supervise binary hash code learning with jensen shan-
“Min-wise independent permutations,” in Proceedings of the non divergence,” in Proceedings of the IEEE International
Thirtieth Annual ACM Symposium on Theory of Computing, Conference on Computer Vision, 2013.
1998, pp. 327–336. [68] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised dis-
[47] A. Rajaraman and J. D. Ullman, Mining of massive datasets. crete hashing,” in Proceedings of the IEEE International Con-
Cambridge University Press, 2012. ference on Computer Vision and Pattern Recognition, Boston,
[48] S. Ioffe, “Improved consistent sampling, weighted minhash and Massachusetts, USA, 2015.
l1 sketching,” in IEEE International Conference on Data Min- [69] P. Jain, B. Kulis, and K. Grauman, “Fast image search for
ing. IEEE, 2010, pp. 246–255. learned metrics,” in IEEE Conference on Computer Vision
[49] M. Henzinger, “Finding near-duplicate web pages: a large-scale and Pattern Recognition, Anchorage, Alaska, USA, 2008, pp.
evaluation of algorithms,” in Proceedings of the 29th annual 1–8.
international ACM SIGIR conference on Research and devel- [70] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online met-
opment in information retrieval, 2006, pp. 284–291. ric learning and fast similarity search,” in Advances in neural
[50] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news information processing systems, 2008, pp. 761–768.
personalization: scalable online collaborative filtering,” in Pro- [71] G. Shakhnarovich, P. Viola, and T. Darrell, “Fast pose estima-
ceedings of the 16th international conference on World Wide tion with parameter-sensitive hashing,” in IEEE International
Web, 2007, pp. 271–280. Conference on Computer Vision, Nice, France, 2003, pp. 750–
[51] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image 757.
detection: min-hash and tf-idf weighting.” in Proceedings of [72] M. Rastegari, A. Farhadi, and D. A. Forsyth, “Attribute dis-
the the British Machine Vision Conference, vol. 810, 2008, pp. covery via predictable discriminative binary codes,” in Euro-
812–815. pean Conference on Computer Vision, Florence, Italy, 2012,
[52] D. C. Lee, Q. Ke, and M. Isard, “Partition min-hash for partial pp. 876–889.
duplicate image discovery,” in European Conference on Com- [73] M. Ou, P. Cui, F. Wang, J. Wang, and W. Zhu, “Non-transitive
puter Vision, Heraklion, Crete, Greece, 2012, pp. 648–662. hashing with latent similarity components,” in Proceedings of
the 21th ACM SIGKDD International Conference on Knowl- IEEE Conference on Computer Vision and Pattern Recogni-
edge Discovery and Data Mining, 2015, pp. 895–904. tion, Providence, RI, USA, 2012, pp. 2058–2065.
[74] X. Li, G. Lin, C. Shen, A. V. den Hengel, and A. Dick, “Learn- [95] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral
ing hash functions using column generation,” in Proceedings of grouping using the Nyström method,” IEEE Transactions on
the 30th International Conference on Machine Learning, 2013, Pattern Analysis and Machine Intelligence, no. 2, pp. 214–225,
pp. 142–150. 2004.
[75] J. Wang, J. Wang, N. Yu, and S. Li, “Order preserving hashing [96] Y. Weiss, R. Fergus, and A. Torralba, “Multidimensional spec-
for approximate nearest neighbor search,” in Proceedings of the tral hashing,” in European Conference on Computer Vision,
21st ACM international conference on Multimedia, 2013, pp. Florence, Italy, 2012, pp. 340–353.
133–142. [97] W. Liu, J. He, and S.-F. Chang, “Large graph construction for
[76] J. Wang, W. Liu, A. X. Sun, and Y.-G. Jiang, “Learning hash scalable semi-supervised learning,” in Proceedings of the 27th
codes with listwise supervision,” in Proceedings of the IEEE International Conference on Machine Learning, 2010, pp. 679–
International Conference on Computer Vision, 2013. 686.