Online Hashing: Long-Kai Huang, Qiang Yang, and Wei-Shi Zheng

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 14

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

Online Hashing
Long-Kai Huang, Qiang Yang, and Wei-Shi Zheng

Abstract— Although hash function learning algorithms have information of data distribution or class labels would make
achieved great success in recent years, most existing hash models significant improvement in fast search, more efforts are
are off-line, which are not suitable for processing sequen- devoted to the data-dependent approach [10]–[16]. For the
tial or online data. To address this problem, this paper proposes
an online hash model to accommodate data coming in stream data-dependent hashing methods, they are categorized into
for online learning. Specifically, a new loss function is proposed unsupervised-based [17]–[20], supervised-based [21]–[26],
to measure the similarity loss between a pair of data samples and semisupervised-based [27]–[29] hash models. In addition
in hamming space. Then, a structured hash model is derived to these works, multiview hashing [30], [31], multimodal
and optimized in a passive-aggressive way. Theoretical analysis hashing [32]–[36], and active hashing [37], [38] have also been
on the upper bound of the cumulative loss for the proposed
online hash model is provided. Furthermore, we extend our online developed.
hashing (OH) from a single model to a multimodel OH that trains In the development of hash models, a challenge remained
multiple models so as to retain diverse OH models in order to unsolved is that most hash models are learned in an off-line
avoid biased update. The competitive efficiency and effectiveness mode or batch mode, that is to assume all data are available in
of the proposed online hash models are verified through extensive advance for learning the hash function. However, learning hash
experiments on several large-scale data sets as compared with
related hashing methods. functions with such an assumption has the following critical
limitations.
Index Terms— Hashing, online hashing (OH). 1) First, they are hard to be trained on very large-scale
training data sets, since they have to make all learning
I. I NTRODUCTION
data kept in the memory, which is costly for processing.

T HERE have been great interests in representing data using


compact binary codes in recent developments. Compact
binary codes not only facilitate storage of large-scale data but
Even though the memory is enough, the training time
of these methods on large-scale data sets is intolerable.
With the advent of big data, these limitations become
also benefit fast similarity computation, so that they are applied more and more urgent to solve.
to fast nearest neighbor search [1]–[4], as it only takes very 2) Second, they cannot adapt to sequential data or new
short time (generally less than a second) to compare a query coming data. In real life, data samples are usually
with millions of data points [5]. collected sequentially as time passes, and some early
For learning compact binary codes, a number of hash collected data may be outdated. When the differences
function learning algorithms have been developed in the last between the already collected data and the new coming
five years. There are two types of hashing methods: the data are large, current hashing methods usually lose
data-independent ones and the data-dependent ones. Typical their efficiency on the new data samples. Hence, it is
data-independent hash models include locality sensitive hash- important to develop online hash models, which can be
ing (LSH) [6] and its variants like  p -stable hashing [7], efficiently updated to deal with sequential data.
min-hash [8], and kernel LSH (KLSH) [9]. Since using In this paper, we overcome the limitations of batch mode
Manuscript received July 14, 2016; revised November 10, 2016 and hash methods [17]–[19], [21], [23]–[26] by developing an
March 1, 2017; accepted March 24, 2017. This work was supported effective online hashing learning method called online hash-
in part by the National Natural Science Foundation of China under ing (OH). We propose a one-pass online adaptation criterion
Grant 61522115, Grant 61472456, and Grant 61661130157, in part by the
Guangdong Natural Science Funds for Distinguished Young Scholar under in a passive-aggressive way [39], which enables the newly
Grant S2013050014265, in part by the Guangdong Program for Support of learned hash model to embrace information from a new pair
Top-notch Young Professionals under Grant 2014T Q01X779, and in part by of data samples in the current round and meanwhile retain
the RS-Newton Advanced Fellowship under Grant NA150459. (Correspond-
ing Author: Wei-Shi Zheng.) important information learned in the previous rounds. In addi-
L.-K. Huang is with the School of Data and Computer Science, Sun Yat-sen tion, the exact labels of data are not required, and only the
University, Guangzhou 510006, China (e-mail: hlongkai@gmail.com). pairwise similarity is needed. More specifically, a similarity
Q. Yang is with the School of Data and Computer Science, Sun Yat-sen
University, Guangzhou 510006, China, and also with the Collaborative Innova- loss function is first designed to measure the confidence of the
tion Center of High Performance Computing, National University of Defense similarity between two hash codes of a pair of data samples,
Technology, Changsha 410073, China (e-mail: mmmyqmmm@gmail.com). and then based on that similarity loss function, a prediction
W.-S. Zheng is with the School of Data and Computer Science, Sun Yat-sen
University, Guangzhou 510006, China, and also with the Key Laboratory of loss function is proposed to evaluate whether the current
Machine Intelligence and Advanced Computing, Sun Yat-sen University, Min- hash model fits the current data under a structured prediction
istry of Education, Guangzhou 510006, China (e-mail: wszheng@ieee.org). framework. We then minimize the proposed prediction loss
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. function on the current input pair of data samples to update
Digital Object Identifier 10.1109/TNNLS.2017.2689242 the hash model. During the online update, we wish to make
2162-237X © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

the updated hash model approximate the model learned in the existing works considering active learning and online learning
last round as much as possible for retaining the most historical together [44]–[46], but they are not for hash function learning.
discriminant information during the update. An upper bound Although it is challenging to design hash models in an
on the cumulative similarity loss of the proposed online online learning mode, several hashing methods are related to
algorithm is derived, so that the performance of our online online learning [22], [41], [47]–[50]. Jain et al. [41] realized an
hash function learning can be guaranteed. online LSH by applying an online metric learning algorithm,
Since one-pass online learning only relies on the new data namely, LogDet Exact Gradient Online (LEGO) to LSH.
at the current round, the adaptation could be easily biased Since [41] is operated on LSH, which is a data-independent
by the current round data. Hence, we introduce a multimodel hash model, it does not directly optimize hash functions for
online strategy in order to alleviate such a kind of bias, where generating compact binary code in an online way. The other
multiple but not just one OH models are learned and they are five related works can be categorized into two groups: one is
expected to suit more diverse data pairs and will be selectively the stochastic gradient descent (SGD)-based online methods,
updated. A theoretical bound on the cumulative similarity loss including minimal loss hashing (MLH) [22], online supervised
is also provided. hashing [49], and Adaptive Hashing (AdaptHash) [50]; another
In summary, the contributions of this paper are as follows. group is matrix sketch-based methods, including online
1) Developing a weakly supervised online hash function sketching hashing (OSH) [47] and stream spectral binary
learning model. In our development, a novel similarity coding (SSBC) [48].
loss function is proposed to measure the difference of the MLH [22] follows the loss-adjusted inference used in struc-
hash codes of a pair of data samples in Hamming space. tured SVMs and deduces the convex–concave upper bound on
Following the similarity loss function is the prediction the loss. Since MLH is a hash model relying on SGD update
loss function to penalize the violation of the given for optimization, it can naturally be used for online update.
similarity between the hash codes in Hamming space. However, there are several limitations that make MLH unsuit-
Detailed theoretical analysis is presented to give a the- able for online processing. First, the upper loss bound derived
oretical upper loss bound for the proposed OH method. by MLH is actually related to the number of the historical
2) Developing a multimodel OH (MMOH), in which a samples used from the beginning. In other words, the upper
multimodel similarity loss function is proposed to guide loss bound of MLH may grow as the number of samples
the training of multiple complementary hash models. increases and, therefore, its online performance could not be
The rest of this paper is organized as follows. In Section II, guaranteed. Second, MLH assumes that all input data are
related studies are reviewed. In Section III, we present centered (i.e. with zero mean), but such a preprocessing is
our online algorithm framework, including the optimization challenging for online learning, since all data samples are not
method. Section IV further elaborates one left issue in available in advance.
Section III for acquiring zero-loss binary codes. Then, we give Online supervised hashing [49] is an SGD version of
analysis on the upper bound and time complexity of OH in the supervised hashing with error correcting codes (ECCs)
Section V, and extend our algorithm to a multimodel one in algorithm [51]. It employs a 0-1 loss function which outputs
Section VI. Experimental results for evaluation are reported either 1 or 0 to indicate whether the binary code generated
in Section VII and finally we conclude this paper in by existing hash model is in the codebook generated by error
Section VIII. correcting output codes algorithm. If it is not in the codebook,
the loss is 1. After replacing the 0-1 loss with a convex loss
II. R ELATED W ORK
function and dropping the nondifferentiable sign function in
Online learning, especially one-pass online learning, plays the hash function, SGD is applied to minimize the loss and
an important role for processing large-scale data sets, as it update the hash model online. AdaptHash [50] is also an
is time and space efficient. It is able to learn a model SGD-based methods. It defines a loss function the same as the
based on streaming data, making dynamic update possible. hingle-like loss function used in [22] and [52]. To minimize
In typical one-pass online learning algorithms [39], [40], when this loss, the authors approximated the hash function by a
an instance is received, the algorithm makes a prediction, differentiable sigmoid function and then used SGD to optimize
receives the feedback, and then updates the model based the problem in an online mode. Both Online supervised
on this new data sample only upon the feedback. Generally, hashing with Error Corrected Code (OECC) and AdaptHash
the performance of an online algorithm is guaranteed by the do not assume that data samples have zero mean as used
upper loss bound in the worst case. in MLH. They handle the zero-mean issue by a method similar
There are a lot of existing works of online algorithms to to the one in [52]. All these three SGD-based hashing methods
solve specific machine learning problems [39]–[43]. However, enable online update by applying SGD, but they all cannot
it is difficult to apply these online methods to online hash guarantee a constant loss upper bound.
function learning, because the sign function used in hash OSH [47] was recently proposed to enable learning a hash
models is nondifferentiable, which makes the optimization model on stream data by combining PCA hashing [27] and
problem more difficult to solve. Although one can replace the matrix sketching [53]. It first sketches stream samples into
sign function with sigmoid type functions or other approximate a small size matrix and meanwhile guarantees approximating
differentiable functions and then apply gradient descent, this data covariance, and then PCA hashing can be applied on this
becomes an obstacle on deriving the loss bound. There are sketch matrix to learn a hash model. Sketching overcomes
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

HUANG et al.: OH 3

the challenge of training a PCA-based model on sequential


data using limited storage. SSBC [48] is another learning to
hash method on stream data based on the matrix sketch [53]
skill. It applies matrix sketch processing on the Gaussian
affinity matrix in the spectral hashing algorithm [17]. Since
the sketched matrix reserves global information of previously Fig. 1. Similarity loss function. The top is for a pair of similar data samples,
observed data, the new update model may not be adapted well and the bottom is for a pair of dissimilar ones.
on new observed data samples after a large number of samples
have been sketched in the previous steps. In the structured hash function (2), f T WT x can be regarded as a
A preliminary version of this paper was presented in [52]. prediction value that measures the extent that the binary code f
Apart from more detailed analysis, this submission differs matches the hash code of x obtained through (1). Obviously,
from the conference version in the following aspects: 1) we f T W T x is maximized only when each element of f has the
have further modified the similarity loss function to evaluate same sign as that of WT x.
the hash model and to guide the update of the hash model;
2) we have developed a multimodel strategy to train multiple A. Formulation
models to improve OH; 3) we have modified the strategy of We start presenting the online hash function learning. In this
zero-loss codes inference, which suits the passive-aggressive paper, we assume that the sequential data are in pairs. Suppose
scheme better; and 4) much more experiments have been a new pair of data points xi t and xj t comes in the t th
conducted to demonstrate the effectiveness of our method. (t = 1, 2, . . .) round with a similarity label s t . The label
indicates whether these two points are similar or not and it
III. O NLINE M ODELING FOR H ASH F UNCTION L EARNING is defined as

Hash models aim to learn a model to map a given data t 1, if xi t and xj t are similar
vector x ∈ Rd to an r -bit binary code vector h ∈ {−1, 1}r . s =
−1, if xi t and xj t are not similar.
For r -bit linear projection hashing, the kth (1 ≤ k ≤ r ) hash
function is defined as We denote the new coming pair of data points xi t and xj t by
 t= [xi t , xj t ]. In the t th round, based on the hash projection
  1, if wkT x + bk ≥ 0
h k (x) = sgn wkT x + bk = matrix Wt learned in the last round, we can compute the hash
−1, otherwise codes of xi t and xj t denoted by hi t , hj t , respectively. Such a
where wk ∈ Rd is a projection vector, bk is a scalar threshold, pair of hash codes is denoted by t = [hi t , hj t ].
and h k ∈ {−1, 1} is the binary hash code. However, the prediction does not always match the given
Regarding bk in the above equation, a useful guideline similarity label information of t . When a mismatch case
proposed in [17], [27], and [54] is that the scalar threshold is observed, Wt , the model learned in the (t − 1)th round,
bk should be the median of {wkT xi }ni=1 , where n is the number needs to be updated for obtaining a better hash model Wt+1 .
of the whole input samples, in order to guarantee that half In this paper, the Hamming distance is employed to measure
of the output codes are 1 while the other half are −1. This the match or mismatch cases. In order to learn an optimal
guarantees achieving maximal information entropy of every hash model that minimizes the loss caused by mismatch cases
hash bit [17], [27], [54]. A relaxation is to set bk to the mean and maximizes the confidence of match cases, we first design
of {wkT xi }ni=1 and bk will be zero if the data are zero centered. a similarity loss function to quantify the difference between
Such a pretreatment not only helps to improve performance the pairwise hash codes t with respect to the corresponding
but also simplifies the hash function learning. Since data come similarity label s t , which is formulated as follows:
    
in sequence, it is impossible to obtain the mean in advance. max 0, Dh hi t , hj t − α , if s t = 1
In our online algorithm, we will estimate the mean after R( , s ) =
t t   t t  (3)
max 0, βr − Dh hi , hj , if s t = −1
a few data samples are available, update it after new data
samples arrive, and perform update of zero-centered operation where Dh (hi t , hj t ) is the Hamming distance between hi t and
afterward. Hence, the r -bit hash function becomes hj t , α is an integer playing as the similarity threshold, and β is
the dissimilarity ratio threshold ranging from 0 to 1. Generally,
h = h(x) = sgn(W T x) (1)
βr > α , so that there is a certain margin between the match
where W = [w1 , w2 , . . . , wr ] ∈ Rd×r is the hash projection and mismatch cases.
matrix and h = [h 1 (x), h 2 (x), . . . , h r (x)]T is the r-bit hash code. In the above loss function, a relaxation is actually introduced
Here, x is the data point after zero-mean shifting. by employing the threshold parameters α and βr , as shown
Unfortunately, due to the nondifferentiability of the sign in Fig. 1. α should not be too large, so that the similarity of
function in (1), the optimization of the hash model becomes the pairwise data samples can be preserved in the Hamming
difficult. To settle such a problem, by borrowing the ideas space. In contrast, βr should not be too small; otherwise two
from the structured prediction in structured SVMs [55], [56] dissimilar data points cannot be well separated in Hamming
and MLH [22], we transform (1) equivalently to the following space. From Fig. 1, we can see that the mismatch can be one
structured prediction form: of the following two cases: 1) the Hamming distance between
the prediction codes hi t and hj t is larger than α for a similar
h = arg max f T W T x. (2)
f∈{−1,1}r pair and 2) the Hamming distance between the prediction
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

codes hi t and hj t is smaller than βr for a dissimilar pair. Since we are formulating a one-pass learning algorithm,
These two measure the risk of utilizing the already learned the previously observed data points are not available for the
hash projection matrix Wt on a new pair of data points, learning in the current round, and the only information we
i.e., R(t , s t ). can make use of is the current round projection matrix Wt .
If the model learned in the last round predicts zero-loss hash In this case, we force that the newly learned Wt+1 should stay
code pair on a new pair, that is, R(t , s t ) = 0, our strategy close to the projection matrix Wt as much as possible so as to
is to retain the current model. If the model is unsatisfactory, preserve the information learned in the last round as much as
i.e., predicting inaccurate hash codes having R(t , s t ) > 0, possible. Hence, the objective function for updating the hash
we need to update the inaccurate previous hash projection projection matrix becomes
matrix Wt . 1
To update the hash model properly, we claim that the zero- Wt+1 = arg min ||W − Wt ||2F + Cξ
W 2
loss hash code pair t = [gi t , gj t ] for current round data pair t s.t.  (W) ≤ ξ and ξ ≥ 0
t
(7)
satisfying R(t , s t ) = 0 is available, and we use the zero-loss
hash code pair to guide update of the hash model, deriving the where · F is the Frobenius norm, ξ is a nonnegative auxiliary
updated Wt+1 toward a better prediction. We leave the details variable to relax the constraint on the prediction loss function
about how to obtain the zero-loss hash code pair presented t (W) = 0, and C is a margin parameter to control the
in Section IV. effect of the slack term, whose influence will be observed
Now, we wish to obtain an updated hash projection matrix in Section VII. Through this objective function, the difference
Wt+1 such that it predicts similar hash code pair toward the between the new projection matrix Wt+1 and the last one Wt is
zero-loss hash code pair t for the current input pair of data minimized, and meanwhile the prediction loss function t (W)
samples. Let us define of the new Wt+1 is bounded by a small value. We call the
T T above-mentioned model the OH model.
H t (W) = hi t W T xi t + hj t W T xj t (4)
Finally, we wish to provide a comment on the function
tT tT
G (W) = gi
t
W xi + gj
T t
W xj .
T t
(5) H t (W) in (4) and (6). Actually, an optimal case should be
to refine function H t (W) as a function of variables W and a
Given hash function (2) with respect to Wt , we have
code pair  = [fi , fj ] ∈ {−1, 1}r×2 as follows:
H t (Wt ) ≥ G t (Wt ), since hi t and hj t are the binary solutions
for xi and xj t for the maximization, respectively, while gi t
t
H t (W,  ) = fit W T xi t + fjt W T xj t
T T
(8)
and gj t are not. This also suggests that Wt is not suitable for
the generated binary code to approach the zero-loss hash code and then to refine the prediction loss function when an optimal
pair t , and thus a new projection Wt+1 has to be learned. update Wt+1 is used
When updating the projection matrix from Wt to Wt+1 , 
we expect that the binary code generated for xi t is gi t . Accord- t (Wt+1 ) = max H t (Wt+1 ,  ) − G t (Wt+1 ) + R(t , s t ).
∈{−1,1}r×2
ing to the hash function in structured prediction form in (2), (9)
T T
our expectation is to require gi t T Wt+1 xi t > hi t T Wt+1 xi t .
Similarly, we expect the binary code generated for xj t is gj t The above refinement in theory can make max H t (Wt+1 ,  ) −
T T
and this is also to require gj t T Wt+1 xj t > hj t T Wt+1 xj t . Com- G t (Wt+1 ) be a more rigorous loss on approximating the zero-
bining these two inequalities together, it would be expected loss hash code pair t . But, it would be an obstacle to the
that the new Wt+1 should meet the condition that G t (Wt+1 ) > optimization, since Wt+1 is unknown when max H t (Wt+1 ,  )
H t (Wt+1 ). To achieve this objective, we derive the following is computed. Hence, we avert this problem by implicitly intro-
prediction loss function t (W) for our algorithm: ducing an alternating optimization by first fixing  to be t ,
 then optimizing Wt+1 by Criterion (7), and finally predicting
t (W) = H t (W) − G t (W) + R(t , s t ). (6) the best  for max ∈{−1,1}r×2 H t (Wt+1 ,  ). This process can be
iterative. Although this may be useful to further improve
In the above loss function, t , t , and R(t , s t ) are constants
our online model, we do not completely follow this implicit
rather than variables dependent on Wt . R(t , s t ) can be treated
alternating processing to learn Wt+1 iteratively. This is because
a loss penalization. When used in Criterion (7) later, a small
data are coming in sequence and it would be demanded to
R(t , s t ) means a slight update is expected, and a large
process a new data pair after an update of the projection
R(t , s t ) means a large update is necessary. Note that the
matrix W. Hence, in our implementation, we only update Wt+1
square root of similarity loss function R(t , s t ) is utilized here,
once, and we provide the bound for R(t , s t ) under such a
because it enables an upper bound on the cumulative loss
processing in Theorem 2.
functions, which will be shown in Section V-A.
Note that if t (Wt+1 ) = 0, we can have G t (Wt+1 ) =
√ B. Optimization
H t (Wt+1 ) + R(t , s t ) > H t (Wt+1 ). Let 
t be the hash codes
of t computed using the updated Wt+1 by (2). Even though When R(t , s t ) = 0, t is the optimal code pair and t is the
G t (Wt+1 ) > H t (Wt+1 ) cannot guarantee that  t is exactly t , same as t , and thus t (Wt ) = 0. In this case, the solution to
it is probable that  t is very close to t rather than t . Criterion (7) is Wt+1 = Wt . That is, when the already learned
It, therefore, makes sense to force t (Wt+1 ) to be zero or close hash projection matrix Wt can correctly predict the similarity
to zero. label of the new coming pair of data points t , there is no
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

HUANG et al.: OH 5

need to update the hash function. When R(t , s t ) > 0, the Algorithm 1 OH
solution is INITIALIZE W1
Wt+1 = Wt + τ t t
(t − t )T for t = 1, 2, . . . do
 Receive a pairwise instance t and similarity label s t ;
t (Wt )
t
τ = min C, . (10) Compute the hash code pair t of t by Eq. (1);
|| t (t − t )T ||2F
Compute the similarity loss R(t , s t ) by Eq. (3);
The procedure for deriving the solution formulation (10) if R(t , s t ) > 0 then
when R(t , s t ) > 0 is detailed as follows. Get the zero-loss code pair t that makes R(t , s t ) = 0;
First, the objective function [see Criterion (7)] can be Compute the prediction loss t (Wt ) by Eq. (6);
Set τ t = min{C, ||t ( t(W )
t t
rewritten in the following when we introduce the Lagrange −t )T ||2F
};
multipliers Update Wt+1 = Wt + τ t t (t − t )T ;
L(W, τ t , ξ, λ) =
||W−Wt ||2F
+ Cξ + τ t (t (W) − ξ ) − λξ (11) else
2
Wt+1 = Wt ;
where τt
≥ 0 and λ ≥ 0 are the Lagrange multipliers. Then, end if
by computing ∂L/∂W = 0, ∂L/∂ξ = 0, and ∂L/∂τ t = 0, we can end for
have
∂L
0 =
∂W   T  For our online hash model learning, we assume that at
⇒ W = Wt + τ t xi t gi t − hi t + xj t (gj t − hj t )T least m data points have been provided in the initial stage;
= Wt + τ t t (t − t )T (12) otherwise, the online learning will not start until at least m data
∂L
0 = = C − τt − λ points have been collected, and then these m data points are
∂ξ
considered as the m anchors used in the kernel trick. Regarding
⇒ τ t = C − λ. (13)
the kernel used in this paper, we employ the Gaussian RBF
Since λ > 0, we have τ t < C . By putting (12) and (13) back kernel κ (x, y) = ex p(−||x − y||2 /2σ 2 ), where we set σ to 1 in
into (6), we obtain our algorithm.
t (W) = −τ t || t (t − t )T ||2F + t (Wt ). (14)
IV. Z ERO -L OSS B INARY C ODE PAIR I NFERENCE
Also, by putting (12)–(14) back into (11), we have In Section III-A, we have mentioned that our OH algorithm
1 2
L(τ ) = − τ t || t (t − t )T ||2F + τ t t (Wt ).
t relies on the zero-loss code pair t = [gi t , gj t ] which satisfies
2 R(t , s t ) = 0. Now, we detail how to acquire t .
By taking the derivative of L with respect to τ t and setting it
to zero, we get A. Dissimilar Case
∂L
0 = = −τ t || t (t − t )T ||2F + t (Wt ) We first present the case for dissimilar pairs. As mentioned
∂τ t in Section III-A, to achieve zero similarity loss, the Hamming
t (Wt )
⇒ τt = . (15) distance between the hash codes of nonneighbors should not
|| (t − t )T ||2F
t
be smaller than βr . Therefore, we need to seek t such
Since τ t < C , we can obtain that Dh (gi t , gj t ) ≥ βr . Denote the k th bit of hi t by hi t[k] ,
 and similarly we have hj t[k] , gi t[k] , andgj t[k] . Then, Dh (hi t , hj t ) =
t (Wt )
r
k=1 Dh (hi [k] , hj [k] ), where
t t t
τ = min C, . (16)
|| (t − t )T ||2F
t

  0, if hi t[k] = hj t[k]
In summary, the solution to the optimization problem in Dh hi t[k] , hj t[k] =
Criterion (7) is (10), and the whole procedure of the proposed 1, if hi t[k] = hj t[k] .
OH is presented in Algorithm 1. Let K1 = {k|Dh (hi t[k] , hj t[k] ) = 1} and K0 = {k|Dh (hi t[k] , hj t[k] ) =
C. Kernelization 0}. To obtain t , we first set gi t[k] = hi t[k] and gj t[k] = hj t[k] for
k ∈ K1 , so as to retain the Hamming distance obtained
Kernel trick is well known to make machine learning models
through the hash model learned in the last round. Next,
better adapted to nonlinearly separable data [21]. In this
in order to increase the Hamming distance, we need to make
context, a kernel-based OH is generated by employing explicit
Dh (gi t[k] , gj t[k] ) = 1 for k ∈ K0 . That is, we need to set1 either
kernel mapping to cope with the nonlinear modeling. In detail,
gi t[k] = −hi t[k] or gj t[k] = −hj t[k] , for all k ∈ K0 . Hence, we can
we aim at mapping data in the original space Rd into a feature
space Rm through a kernel function based on m (m < d) anchor pick up p bits, whose indices are in set K0 to change/update
points, and, therefore, we have a new representation of x which such that
can be formulated as follows:  
Dh (gi t , gj t ) = Dh hi t , hj t + p. (17)
T
z(x) = [κ(x, x(1) ), κ(x, x(2) ), . . . , κ(x, x(m) )] 1 The hash code in our algorithm is −1 and 1. Note that, g t = −h t
i [k] i [k]
where x(1) , x(2) , . . . , x(m) are m anchors. means set gi t[k] to be different from hi t[k]
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Now the problem is how to set p, namely, the number Algorithm 2 Inference of t for a Dissimilar Pair
of hash bits to update. We first investigate the relationship Calculate the Hamming distance Dh (hi t , hj t ) between hi t and
between the update of projection vectors and t . Note that W hj t ;
consists of r projection vectors wk (k = 1, 2, . . . , r ). From (12), Calculate p0 = βr
− Dh (hi t , hj t );
we can deduce that Compute δk for k ∈ K0 by Eq. (19);
     Sort δk ;
wkt+1 = wkt + τ t xi t gi t[k] − hi t[k] + xj t gj t[k] − hj t[k] . (18)
Set the corresponding hash bits of the p0 smallest δk opposite
It can be found that wkt+1 = wkt , when gi t[k] = hi t[k] and to the corresponding ones in t by following the rule in
gj t[k] = hj t[k] ; otherwise, wkt will be updated. So the more Eq.(20) and keep the others in t without change.
wk in W we update, the more corresponding hash bits of
all data points we subsequently have to update when applied
to real-world system. This takes severely much time which Algorithm 3 Inference of t for a Similar Pair
cannot be ignored for online applications. Hence, we should Calculate the Hamming distance Dh (hi t , hj t ) between hi t and
change hash bits as few as possible; in other words, we aim hj t ;
to update wk as few as possible. This means that p should be Calculate p0 = Dh (hi t , hj t ) − α ;
as small as possible, meanwhile guaranteeing that t satisfies Compute δk for k ∈ K1 by Eq. (19);
the constrain R(t , s t ) = 0. Based on the earlier discussion, Sort δk ;
the minimum of p is computed as p0 = βr
− Dh (hi t , hj t ) Set the corresponding hash bits of the p0 smallest δk opposite
by setting Dh (gi t , gj t ) = βr
, as p = Dh (gi t , gj t ) − Dh (hi t , hj t ) to the corresponding values in t by following the rule in
and Dh (gi t , gj t ) ≥ βr
≥ βr . Then, t is ready by selecting Eq.(20) and keep the others in t with no change.
p0 hash bits whose indexes are in K0 .
After determining the number of hash bits to update,
namely, p0 , the problem now becomes which p0 bits should B. Similar Case
be picked up from K0 . To establish the rule, it is necessary to Regarding similar pairs, the Hamming distance of the opti-
measure the potential loss for every bit of hit and htj . For this mal hash code pairs t should be equal or smaller than α . Since
purpose, the prediction loss function in (6) can be reformed the Hamming distance between the predicted hash codes of
as similar pairs may be larger than α , we should pick up p0 bits

T
2hi t[k] wkt xi t +
T
2hj t[k] wkt xj t + R(t , s t ). from set K1 instead of from set K0 , and set them opposite to
hi t[k]  =gi t[k] hj t[k]  =gj t[k]
the corresponding values in t so as to achieve R(t , s t ) = 0.
Similar to the case for dissimilar pairs as discussed earlier,
T T
This tells that hi t[k] wkt xi t or hj t[k] wkt xj t are parts of the the number of hash bits to be updated is p0 = Dh (hi t , hj t ) − α ,
prediction loss, and thus we use it to measure the potential but these bits are selected in K1 . We will compute δk for k ∈ K1
loss for every bit. The problem is which bit should be picked and pick up p0 bits with the smallest δk for update. Since the
up to optimize. For our one-pass online learning, a large whole processing is similar to the processing for the dissimilar
update does not mean a good performance will be gained, pairs, we only summarize the processing for similar pairs in
since every time we update the model only based on a new Algorithm 3 and skip the details.
arrived pair of samples, and thus a large change on the hash Finally, when wkt is a zero vector, δk is zero as well no
function would not suit the passed data samples very well. This matter what the values of hi t[k] , hj t[k] , xi t , and xj t are. This leads
also conforms to the spirit of passive-aggressive idea that the to the failure in selecting hash bits to be updated. To avert
change of an online model should be smooth. To this end, this, we initialize W1 by applying LSH. In other words, W1 is
we take a conservative strategy by selecting the p0 bits that sampled from a zero-mean multivariate Gaussian N (0, I ), and
corresponding to smallest potential loss as introduced in the we denote this matrix by W L S H .
following. First, the potential loss of every bit with respect to
H (Wt ) is calculated by V. A NALYSIS
 T T 
δk = min hi t[k] wkt xi t , hj t[k] wkt xj t , k ∈ K0 . (19) A. Bounds for Similarity Loss and Prediction Loss
T In this section, we discuss the loss bounds for the proposed
We only select the smaller one between hi t[k] wkt xi t and OH algorithm. For convenience, at step t , we define
hj t[k] wkt xj t , because we will never set t simultaneously by
T

gi [k] = −hi t[k] and gj t[k] = −hj t[k] for any k ∈ K0 . After sorting δk ,
t t
U = t (U) = H t (U) − G t (U) + R(t , s t ) (21)
the p0 smallest δk are picked up and their corresponding hash where U is an arbitrary matrix in Rd×r . Here, Ut is considered
bits are updated by the following rule: as the prediction loss based on U in the t th round.

gi [k] = −hi[k] ,
T
if hi t[k] wkt xi t ≤ hj t[k] wkt xj t
T We first present a lemma that will be utilized to prove
(20) Theorem 2.
gj [k] = −hj [k] , otherwise.
Lemma 1: Let ( 1 , s 1 ), · · · , ( t , s t ) be a sequence of pair-
The procedure of obtaining t for a dissimilar pair is wise examples, each with a similarity label s t ∈ {1, −1}. The
summarized in Algorithm 2. data pair t ∈ Rd×2 is mapped to a r -bit hash code pair
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

HUANG et al.: OH 7

t ∈ Rr×2 through the hash projection matrix Wt ∈ Rd×r . This deduces that
Let U be an arbitrary matrix in Rd×r . If τ t is defined as that  
in (10), we then have τ t (2t (Wt ) − τ t || t (t − t )T ||2F

≥ τ t (2t (Wt ) − t (Wt )) = τ t t (Wt ). (24)
τ (2 (W ) − τ || ( − 
t t t t t t t T 2
) || F t
− 2U ) ≤ ||U − W1 ||2F
t=1
According to the definition of prediction loss function in (6)
and the upper bound assumption, we know that for any t
where W1
is the initialized hash projection matrix that consists 
of nonzero vectors. R(t , s t ) ≤ t (Wt )
Proof: By using the definition || ( − t )T ||2F ≤ F 2 , and
t t

t = ||Wt − U||2F − ||Wt+1 − U||2F R(t , s t )
≤ C. (25)
F2
we can have


With these three inequalities and (16), it can be deduced that
 
t = ||Wt − U||2F − ||Wt+1 − U||2F 
t t t t (Wt )2
t=1 t=1 τ  (W ) = min , Ct (Wt )
= ||W1 − U||2F − ||Wt+1 − U||2F ≤ ||W1 − U||2F . (22) || t (t − t )T ||2F

R(t , s t ) 
From (12), we know Wt+1 = Wt + τt t( t  − t )T , so we can ≥ min , C R(t , s t )
rewrite t as || t (t − t )T ||2F
 
R(t , s t ) R(t , s t )
t = ||Wt − U||2F − ||Wt+1 − U||2F ≥ min ,
F2 F2
= ||Wt − U||2F − ||Wt − U + τ t t
(t − t )T ||2F R( , s )
t t
= . (26)
= ||Wt − U||2F − (||Wt − U||2F F2
+ 2τ t (H t (W) − G t (W) − (H t (U) − G t (U))) By combining (23) and (26), we obtain that
+ (τ ) || ( − 
t 2 t t
) || F )
t T 2 ∞ ∞
  R(t , s t ) 1 2

≥ −2τ t (( R(t , s t ) −  (W )) − ( R(t , s t ) − U
t t t
)) ≤ ||U − W || F + 2 τ t U
t
.
F2
t=1 t=1
− (τ t )2 || t (t − T )T ||2F
 t  Since, τt ≤ C for all t , we have
= τ t (2t (Wt ) − τ t || t (t − t )T ||2F − 2U .



By computing the sum of the left and the right of the above R( , s ) ≤ F
t t 2
||U − W1 ||2F + 2C t
U . (27)
inequality, we can obtain t=1 t=1


The theorem is proved. 
t ≥ τ t (2t (Wt ) − τ t ||( t
(t − t )T ||2F − 2U
t
). Note: In particular, if the assumption that there exists a
t=1 t=1
projection matrix satisfying zero or negative prediction loss for
Finally, putting (22) into the above equation, we prove any pair of data samples holds, there would exist a U satisfying
Lemma 1.  U
t ≤ 0 for any t . If so, the second term in inequality (27)
Theorem 2: Let ( 1 , s 1 ), · · · , ( t , s t ) be a sequence of pair- vanishes. In other words, the similarity loss is bounded by a
wise examples, each with a similarity label s t ∈ {1, −1} for constant, i.e., F 2 ||U − W1 ||2F . After a certain amount of update,
all t . The data pair t ∈ Rd×2 is mapped to a r -bit hash code the similarity loss for new data sample pairs becomes zero in
pair t ∈ Rr×2 through the hash projection matrix Wt ∈ Rd×r . such an optimistic case.
If || t (t −  )T ||2F is upper bounded by F 2 and √
the margin
parameter C is set as the upper bound of R(F2 ,s ) , then
t t

B. Time and Space Complexity


the cumulative similarity loss (3) is bounded for any matrix
U ∈ Rd×r , that is Based on the algorithm summarized in Algorithm 1, we can


find that the time of computing the prediction code for t is

R(t , s t ) ≤ F 2 ||U − W1 ||2F + 2C t
U O(dr ) and that of obtaining the similarity loss is O(r ). The
t=1 t=1 process of obtaining the zero-loss code pair t takes at most
where C is the margin parameter defined in Criterion (7). O(rlogr + r d) with O(r d) to compute all δk and O(rlogr )
Proof: Based on Lemma 1, we can obtain to sort hash bits according to δk . As for the update process

of the projection matrix, it takes O(r d). Therefore, the time
τ t (2t (Wt ) − τ t ||( t (t − t )T ||2F ) complexity for training OH at each round is O(dr + rlogr ).
t=1 Overall, if n pairs of data points participate in the training

stage, the whole time complexity is O((dr + rlogr )n). For
≤ ||U − W1 ||2F + 2 τ t U
t
. (23) the space complexity, it is O(d + dr ) = O(dr ), with O(d)
t=1
to store the data pairs and O(dr ) to store the projection
Based on (16), we get that matrix. Overall, the space complexity remains unchanged
t (Wt ) during training and is independent of the number of training
≥ τt.
|| t (t − t )T ||2F samples.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

VI. M ULTIMODEL O NLINE H ASHING and semantic neighbor search. First, four selected data sets are
In order to make the OH model more robust and less introduced in Section VII-A. And then, we evaluate the pro-
biased by current round update, we extend the proposed OH posed models in Section VII-B. Finally, we make comparison
from updating one single model to updating the T models. between the proposed algorithms and several related hashing
Suppose that we are going to train T models, which are models in Section VII-C.
initialized randomly by LSH. Each model is associated with
A. Data Sets
the optimization of its own similarity loss function in terms
of (3), denoted by Rm (tm , s t ) (m = 1, 2, · · · , T ), where tm is The four selected large-scale data sets are: Photo
the binary code of a new pair t predicted by the m th model Tourism [57], 22K LabelMe [58], GIST1M [59], and
at step t . At step t , if t is a similar pair, we only select CIFAR-10 [60], which are detailed in the following.
one of the T models to update. To do that, we compute the 1) Photo Tourism [57]: It is a large collection of 3-D
similarity loss function for each model Rm (tm , s t ), and then photographs, including three subsets, each of which has about
we select the model, supposed the m 0 th model that obtains the 100k patches with 64 × 64 grayscale. In the experiment,
smallest similarity loss, i.e., m 0 = arg minm Rm (tm , s t ). Note we selected one subset consisting of 104k patches taken from
that for a similar pair, it is enough that one of the models Half Dome in Yosemite. We extracted 512-D GIST feature
has positive output, and thus the selected model is the closest vector for each patch and randomly partitioned the whole data
one to suit this similar pair and is more easier to update. set into a training set with 98k patches and a testing set with
If t is a dissimilar pair, all models will be updated if the 6k patches. The pairwise label s t is generated based on the
corresponding loss is not zero, since we cannot tolerate an matching information. That is, s t is 1 if a pair of patches is
wrong prediction for a dissimilar pair. By performing OH matched; otherwise s t is −1.
in this way, we are able to learn diverse models that could 2) 22K LabelMe [58]: It contains 22/,019 images. In the
fit different data samples locally. The update of each model experiment, each image was represented by 512-D GIST
follows the algorithm presented in our section III.B. feature vector. We randomly selected 2k images from the
To guarantee the rationale of the MMOH, we also provide data set as the testing set, and set the remaining images as
the upper bound for the accumulative multimodel similarity the training set. To set the similarity label between two data
loss in Theorem 3. samples, we followed [21] and [27]: if either one is within the
Theorem 3: Let ( 1 , s 1 ), · · · , ( t , s t ) be a sequence of pair- top 5% nearest neighbors of the other measured by Euclidean
wise examples, each with a similarity label s t ∈ {1, −1} for distance, s t = 1 (i.e., they are similar); otherwise s t = −1
all t . The data pair t ∈ Rd×2 is mapped to a r -bit hash (i.e., they are dissimilar).
code pair t ∈ Rr×2 through the hash projection matrix 3) Gist1M [59]: It is a popular large-scale data set to
Wt ∈ Rd×r . Suppose || t (m t −  )T ||2 is upper bounded
 F
evaluate hash models [20], [61]. It contains one million
byF 2 , and the margin parameter C is set as the upper bound of unlabeled data with each data represented by a 960-D GIST
(( Rm ∗ (t , s t ))/F 2 ) for all m , where R ∗ (t , s t ) is an auxiliary feature vector. In the experiment, we randomly picked up
m m m
function defined as 500 000 points for training and the nonoverlapped 1000 points
⎧ for testing. Owing to the absence of label information, we uti-
⎨ Rm (m , s ),
⎪ t t if the m th model is selected for lize pseudolabel information by thresholding the top 5% of

Rm (tm , s t ) = update at step t

⎩ 0, the whole data set as the true neighbors of an instance based
otherwise. on Euclidean distance, so every point has 50 000 neighbors.
(28) 4) CIFAR-10 and Tiny Image 80M [60]: CIFAR-10 is
Then, for any matrix U ∈ Rd×r , the cumulative similarity a labeled subset of the 80M Tiny Images collection [62].
loss (3) is bounded, that is It consists of ten classes with each class containing 6k 32 × 32
color images, leading to 60k images in total. In the experiment,

T ∞
every image was represented by 2048-D deep features, and

Rm 
( tm , s t ) ≤ TF 2
||U − W1 ||2F + 2C t
U
t=1 m=1 t=1
59k samples were randomly selected to set up the training set
with the remained 1k as queries to search through the whole
where C is the margin parameter defined in Criterion (7).
80M Tiny Image collection.
Proof: Based on Theorem 2, the following inequality
For measurement, the mean average precision (mAP)
holds for m = 1, 2, . . . T :
[63], [64] is used to measure the performance of different


algorithms, and mAP is regarded as a better measure than

Rm 
( tm , s t ) ≤F 2
||U − W1 ||2F + 2C t
U .
precision and recall when evaluating the quality of results in
t=1 t=1
retrieval [63], [64]. All experiments were independently run
By summing these multimodel similarity losses of all models, on a server with CPU Intel Xeon X5650, 12-GB memory and
Theorem 3 is proved.  64-b CentOS system.

VII. E XPERIMENTS B. Evaluation of the Proposed Methods


In this section, extensive experiments were conducted In the proposed models, there are three key parameters,
to verify the efficiency and effectiveness of the proposed namely, β , C , and T . In this section, we mainly investigate the
OH models from two aspects: metric distance neighbor search influence of these parameters. Additionally, we will observe
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

HUANG et al.: OH 9

TABLE I
D EFAULT VALUES OF THE PARAMETERS IN MMOH

Fig. 3. mAP comparison results of MMOH with respect to different T values


on all data sets. (Best viewed in color.) (a) Photo Tourism. (b) 22K LabelMe.
(c) GIST1M. (d) CIFAR-10.

In summary, a moderately large β is better on the other three


Fig. 2. mAP comparison results of OH with respect to different β on
data sets. Based on the experimental results, β = 0.5 is set as a
all data sets. (Best viewed in color.) (a) Photo tourism. (b) 22K LabelMe. default value for OH on Photo Tourism and 22K LabelMe data
(c) GIST1M. (d) CIFAR-10. sets in the below-mentioned experiments, and β = 0.4 is the
default value for OH tested on GIST1M and CIFAR-10 data
sets.
the influence of the RBF kernel function on the proposed 2) Effect of the Number of Multiple Models T : Before the
models. investigation about the effect of the number of models T
As stated in section IV, the initial projection matrix W1 on MMOH, it should be noticed that when T = 1, MMOH
is randomly set based on Gaussian distribution, which is degrades to OH. And when T > 1, we use the multi-index
similar to the generation of W in LSH [6]. Thus, such an technique [65] to realize fast hash search. The influence of
initialization of W1 is denoted by W1 = W L S H . For the T on MMOH is observed by varying T from 1 to 4. Fig. 3
similarity threshold, α is set to 0, because in most applications, presents the experimental results of MMOH on the four data
the nearest neighbors of a given sample are looked up within 0 sets. From this figure, we can find that when more models
Hamming distance. The dissimilar ratio threshold, β , is set to are used (i.e., larger T ), the performance of MMOH on most
0.4 on CIFAR-10 and GIST1M and set to 0.5 on the other data sets except CIFAR-10 is always better. On 22K LabelMe
two data sets. Besides, the code length r is set as r = 64 in and GIST1M, the improvement of using more models is more
this section, and the RBF kernel is a default kernel used in clear, and it is less on Photo Tourism. On CIFAR-10, MMOH
the proposed models in the experiments. When a parameter seems not sensitive to different values of T . In summary,
is being evaluated, the others are set as the default values, the results in Fig. 3 suggest that MMOH performs overall
as shown in Table I. better and more stably when more models are used.
1) Effect of Dissimilarity Ratio Threshold β : As indicated 3) Effect of Margin Parameter C : In (10), C is the upper
in (3), β is used to control the similarity loss on dissimilar data. bound of τ t , and τ t can be viewed as the step size of the
A large β means that dissimilar data should be critically sepa- update. A large C means that Wt will be updated more toward
rated as far as possible in Hamming space. Thus, a too large β reducing the loss on the current data pair. In Theorem 2, a large
may lead to excessively frequent update of the proposed OH C is necessary to guarantee the bound on the accumulative loss
models. In contrast, a small β indicates that dissimilar data are for OH. In this experiment, the lower bound of C in Theorem 2

less critically separated. Therefore, a too small β may make was 0.017, since the maximum value R(t , s t ) was 8 and the
the hash model less discriminative. Consequently, a proper minimum value of F 2 was 479.
β should be set for the proposed OH models. The effect of C on the mAP performance was also evaluated
We investigate the influence of β on OH on different on all data sets by varying C from 1E − 5 to 1E − 1. From
data sets by varying β from 0.1 to 0.9. Fig. 2 presents the Fig. 4, we find that a larger C (i.e., C = 1E −1) is preferred on
experimental results. On all data sets, when β increases, all data sets. When C increases from C = 1E −5 to C = 1E −2,
the performance of OH becomes better and better at first. the final performance of OH improves by a large margin on all
However, when β is larger than 0.5, further increasing β may data sets. Further increasing the value of C can only slightly
lead to performance degradation. improve the final performance, especially on Photo Tourism,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

10 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

C. Comparison With Related Methods


To comprehensively demonstrate the efficiency and effec-
tiveness of the proposed OH and MMOH, we further compared
it with several related hashing algorithms in terms of mAP and
training time complexity.
1) Compared Methods: There is not much work on develop-
ing OH methods. To make a proper comparison, the following
three kinds of hashing methods were selected.
(a) KLSH [9], a variant of LSH [6] which uses RBF
kernel function to randomly sample the hash projec-
tion matrix W, is selected as the baseline comparison
algorithm. This algorithm is a data-independent hashing
method and thus considered as the baseline method.
(b) Five nonbatch-based hashing models were selected:
LEGO-LSH [41], MLH [22], SSBC [48], OSH [47],
and AdaptHash [50]. LEGO-LSH [41], another vari-
ant of LSH, utilizes an online metric learning to
Fig. 4. mAP comparison results of OH with C ranging from 0.00001 to 0.1. learn a metric matrix to process online data, which
(Best viewed in color.) (a) Photo Tourism. (b) 22K LabelMe. (c) GIST1M.
(d) CIFAR-10. does not focus on the hash function learning but on
metric learning. MLH [22] and AdaptHash [50] both
enable online learning by applying SGD to optimize
the loss function. SSBC [48] and OSH [47] are spe-
cially designed for coping with stream data by apply-
ing matrix sketching [53] on existing batch mode
hashing methods. Except AdaptHash and OSH, other
three methods all require that the input data should
be zero-centered, which is impractical for online learn-
ing, since one cannot have all data samples observed
in advance.
(c) Four batch mode learning hashing models were selected.
One is the unsupervised method Iterative Quantization
(ITQ) [19]. ITQ is considered as a representative method
for unsupervised hashing in batch mode. Since OH
is a supervised method, we selected three supervised
methods for comparison in order to see how an online
Fig. 5. Cumulative similarity loss of OH with C ranging from 0.00001 to 10.
(Best viewed in color.) Photo Tourism. approach approximates these off-line approaches. The
three supervised methods are supervised hashing with
kernel (KSH) [21], fast supervised hash (FastHash) [26],
22K LabelMe, and Gist1M. We also find that the performance and supervised discrete hashing (SDH) [16]. KSH is
would nearly not change when setting C > 1E −1 as this value a famous supervised hashing method to preserve the
is larger than (t (Wt ))/(|| t (t − t )T ||2F ) for most t we have pairwise similarity between data samples. FastHash and
investigated. Hence, C = 1E − 1 is chosen as a default value SDH are two recently developed supervised hashing
in the other experiments for OH. method in batch mode, where we run SDH for fast
Finally, we evaluated the effect of C on the cumulative search the same as in [66].
similarity loss experimentally. For this purpose, we took Photo 2) Settings: Since MLH, LEGO-LSH, and OH are based
Tourism data set as example and ran OH over more than 106 on pairwise data input, we randomly sampled data pairs in
loops by varying C from 1E − 5 to 1E1, where duplication sequence from the training set. For OSH, the chunk size of
of training data was allowed. The comparison result in Fig. 5 streaming data was set as 1000 and each data chunk was
shows how the cumulative similarity loss increases when the sequentially picked up from the training sequence as well.
number of iteration rounds increases. The lowest cumulative Additionally, the key parameters in the compared methods
loss is achieved by setting C = 0.1, which is the one were set as the ones recommended in the corresponding
larger but closest to the required lower bound value of C papers. For the proposed OH, the parameters were set
(i.e., 0.017). When C is too small, i.e., C < 1E − 3, the cumu- according to Table I and the number of models T is set to be
lative loss grows strongly. This verifies that C should be lower 4 for MMOH.
bounded in order to make the cumulative loss under control. 3) Comparison With Related Online Methods: We com-
If C ≥ 1E −3, the cumulative loss tends to increase much more pare OH with the selected related methods in two aspects:
slowly, as shown in Fig. 5. mAP comparison and training time comparison.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

HUANG et al.: OH 11

Fig. 6. mAP comparison results among different online hash models with
respect to different code lengths. (Best viewed in color.) (a) Photo Tourism.
(b) 22K LabelMe. (c) GIST1M. (d) CIFAR-10.

a) mAP comparisons: Fig. 6 presents the comparison


results among LEGO-LSH, MLH, SSBC, OSH, OH, and
MMOH with the code length varying from 16 to 128.
From this figure, we find that when the hash code length Fig. 7. Training time comparison among different algorithms on Photo
increases, the performance of MMOH and OH becomes better Tourism. (Best viewed in color.) (a) Training time comparison among dif-
ferent algorithms when the number of samples increases. (b) Training time
and better on all data sets. MMOH achieves the best per- comparison among different algorithms with different code lengths.
formance and OH achieves the second on almost all data
sets when the code length is larger than 32, except Photo linearly. However, it is evident that LEGO-LSH
Tourism where LEGO-LSH and OH perform very similarly. increases much faster than the other compared methods.
Specifically, as the bit length increases, MMOH performs In particular, OH, AdaptHash, and MMOH increase the
significantly better than OSH and MLH on all data sets, lowest, and in particular, OH takes only 0.0015 s (using
and it performs better than LEGO-LSH on 22K LabelMe, single thread run in a single CPU) for each pair for
GIST1M, and CIFAR-10. In addition, as the hash bit length update.
increases, it is more clear to see MMOH performed better 2) Compared with LEGO-LSH and OSH, the accumu-
than the compared methods on CIFAR-10. All these observa- lated training time of other three methods are consid-
tions demonstrate the efficiency and effectiveness of OH and erably much smaller. Specifically, when the number of
MMOH to the compared hashing algorithms on processing samples is 8 × 104 , the accumulated training time of
stream data in an online way. LEGO-LSH is ten times more as compared with MMOH
b) Training time comparisons: In this part, we investigate and MLH. Besides, the time of OSH is two times more
the comparison on the accumulated training time, which as compared with MMOH and MLH and is four times
is another main concern for online learning. Without loss more as compared with OH and AdaptHash.
of generality, the Photo Tourism data set was selected and 3) The training time of MMOH and MLH is very similar.
80k training points were randomly sampled to observe the The training time of OH is slightly less than AdaptHash,
performance of different algorithms. All results are averaged and it is always the least one.
over ten independent runs conducted on a server with Intel Second, we further investigate the training time of different
Xeon X5650 CPU, 12-GB memory, and 64-b CentOS system. methods when the hash code length increases, where the
For fairness, in each run, all experiments were conducted training sample size is fixed to be 80 000. The comparison
using only one thread. Specially, SSBC is excluded in this result is displayed in Fig. 7(b). From this figure, we find the
experiment, since its training time is significantly longer than following.
any other methods. 1) When the code length increases, the accumulated train-
First, we investigate the training time of different algorithms ing time of all algorithms increases slightly except
when the number of training samples increases. In this exper- LEGO-LSH. The training time of LEGO-LSH does not
iment, the hash bit length was fixed to 64 and the comparison change obviously when the code length is larger than 24.
result is presented in Fig. 7(a). From this figure, we find the This is because most of the training time was spent on
following. the metric training, which is independent of the hash
1) As the number of training samples increases, the accu- code length, while the generation of hash projection
mulated training time of all methods almost increases matrix in LSH costs very little time [41].
2) MMOH took considerably less accumulated training time than LEGO-LSH and OSH. This is because only a small part of the hash bits is updated in MMOH, which reduces the time cost of updating.
3) Compared with MLH, MMOH took a little more time when the bit length is smaller than 64. However, when the bit length is larger than 64, the training time of MMOH becomes slightly less than that of MLH. This phenomenon can be ascribed to the fact that the time spent on selecting the hash model to preserve in MMOH does not heavily depend on the code length, and the time cost of updating the hash model in MMOH grows more slowly than that in MLH.
4) OH took the least training time in all the comparisons. The training time of AdaptHash is slightly higher. In addition, the training time of OH and AdaptHash is only about half of the training time of MLH and MMOH.

From the above investigation, we can conclude that OH and MMOH are very efficient, consuming considerably less accumulated training time, and thus are more suitable for online hash function learning.
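As a rough illustration of how the accumulated training time reported above can be measured, the fragment below times each per-pair update in isolation and sums the elapsed time, so that data loading and evaluation are excluded. The function `update_on_pair` is only a placeholder for whichever online model is being timed; it is not part of the methods compared here.

```python
import time

def accumulated_training_time(model, pair_stream, update_on_pair,
                              checkpoint_every=10000):
    """Sum the wall-clock time spent inside the online updates only."""
    total = 0.0
    curve = []                                   # (pairs seen, seconds) points
    for i, (xi, xj, similar) in enumerate(pair_stream, start=1):
        start = time.perf_counter()
        update_on_pair(model, xi, xj, similar)   # one online update on a pair
        total += time.perf_counter() - start
        if i % checkpoint_every == 0:
            curve.append((i, total))
    return total, curve
```

Dividing `total` by the number of pairs gives the per-pair update cost of the kind quoted above for OH.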
Fig. 8. mAP comparison against KLSH and learning-based batch mode methods with different code lengths on all data sets. (Best viewed in color.) (a) Photo Tourism. (b) 22K LabelMe. (c) GIST1M. (d) CIFAR-10.

4) Comparison to Batch Mode Methods: Finally, we compare against batch mode (off-line) learning-based hashing models. The main objective here is not to show which is better, as online learning and off-line learning have different focuses. The comparison is rather to show how well an online model can approach the well-developed batch mode methods.

Fig. 8 presents the comparison results with hash code lengths varying from 16 to 128 on the four data sets. SDH is only conducted on Photo Tourism and CIFAR-10, as only these two data sets provide class label information. From this figure, it is evident that OH significantly outperforms KLSH for all numbers of hash bits on all data sets. Specifically, when the code length increases, the difference between OH and KLSH becomes much larger, especially on Photo Tourism and 22K LabelMe. When compared with the unsupervised method ITQ, OH outperforms it when the code length is larger than 32 b over all data sets. These comparison results demonstrate that OH is better overall than the compared data-independent method KLSH and the data-dependent unsupervised method ITQ.

For the comparison with the three supervised methods, the mAP of MMOH is the highest on the GIST1M data set, slightly higher than KSH and FastHash, when the code length is larger than 16 b. KSH and FastHash have their limitations on this data set. KSH cannot engage all data in training, as the similarity matrix required by KSH has a tremendous space cost (for instance, a dense similarity matrix over 10^6 samples would already occupy terabytes of memory), and the block graph cut technique in FastHash may not work well on this data set, as the supervised information is generated from the Euclidean distance between data samples. On the other three data sets, MMOH is not the best, but it performs better than the two unsupervised methods KLSH and ITQ. Indeed, from the performance aspect, when all labeled data are available at one time, using OH and MMOH is not the best choice as compared with batch mode supervised hashing methods. However, OH and MMOH solve a different problem from the supervised batch ones: they concern how to train a supervised hashing model on stream data in an online learning framework, while the batch ones do not. A bound on the cumulative loss function is necessary for an online learning model, while it is not necessary for batch mode methods. Hence, OH and MMOH have their unique merits.

VIII. CONCLUSION & DISCUSSION

In this paper, we have proposed a one-pass online learning algorithm for hash function learning called OH. OH updates its hash model in every step based on a pair of input data samples. We first reexpress the hash function as a form of structured prediction and then propose a prediction loss function accordingly. By penalizing the loss function with respect to the previously learned model, we update the hash model constrained by inferred optimal hash codes that achieve zero prediction loss. The proposed online hash function model ensures that the updated hash model is suitable for the current pair of data and has theoretical upper bounds on the similarity loss and the prediction loss. Finally, an MMOH is developed for a more robust OH. Experimental results on different data sets demonstrate that our approach gains satisfactory results, being both efficient in training time and effective in terms of mAP. As part of future work, it can be expected to consider updating the model on multiple data pairs at one time. However, the theoretical bound for online learning needs further investigation in this case.
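To make the overall procedure easier to follow, the sketch below gives a deliberately simplified one-pass update loop in the spirit of the algorithm summarized above: each incoming pair is encoded with the current projection matrix, a similarity loss is measured in Hamming space, and the matrix is nudged toward codes that would incur zero loss, with the step size capped by the parameter C, as in passive-aggressive learning [39]. This is an illustrative approximation only; it is not the exact structured loss, code-inference step, or update rule derived in the paper, and all names are assumptions.

```python
import numpy as np

def hash_codes(W, x):
    """r-bit code of sample x under the r x d projection matrix W."""
    return np.where(W @ x >= 0, 1, -1)

def simplified_online_update(W, xi, xj, similar, C=0.1):
    """One simplified online step on a labeled pair (xi, xj)."""
    hi, hj = hash_codes(W, xi), hash_codes(W, xj)
    r = W.shape[0]
    dist = np.count_nonzero(hi != hj)            # Hamming distance of the pair
    # similar pairs should be close, dissimilar pairs far (threshold r/2 here)
    loss = dist if similar else max(0, r // 2 - dist)
    if loss == 0:
        return W                                 # passive step: keep the model
    # a target code for xj that would give zero loss on this pair
    target = hi if similar else -hi
    # aggressiveness capped by C, in the spirit of passive-aggressive updates
    tau = min(C, loss / (xj @ xj + 1e-12))
    return W + tau * np.outer(target - hj, xj)

# one pass over a stream of labeled pairs, e.g.:
# for xi, xj, similar in pair_stream:
#     W = simplified_online_update(W, xi, xj, similar, C=0.1)
```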
REFERENCES

[1] A. R. Webb and K. D. Copsey, Statistical Pattern Recognition. Hoboken, NJ, USA: Wiley, 2003.
[2] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell, “Distance metric learning, with application to clustering with side-information,” in Proc. Neural Inf. Process. Syst., 2002, pp. 521–528.
[3] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. Neural Inf. Process. Syst., 2005, pp. 1473–1480.
[4] W.-S. Zheng, S. Gong, and T. Xiang, “Re-identification by relative distance comparison,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 653–668, Mar. 2013.
[5] M. Norouzi, A. Punjani, and D. Fleet, “Fast exact search in Hamming space with multi-index hashing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1107–1119, Jun. 2014.
[6] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proc. ACM Symp. Theory Comput., 2002, pp. 380–388.
[7] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proc. ACM Symp. Comput. Geometry, 2004, pp. 253–262.
[8] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: Min-hash and TF-IDF weighting,” in Proc. Brit. Mach. Vis. Conf., 2008, pp. 1–10.
[9] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proc. Int. Conf. Comput. Vis., 2009, pp. 2130–2137.
[10] Z. Tang, X. Zhang, and S. Zhang, “Robust perceptual image hashing based on ring partition and NMF,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 3, pp. 711–724, Mar. 2014.
[11] L. Zhang, Y. Zhang, X. Gu, J. Tang, and Q. Tian, “Scalable similarity search with topology preserving hashing,” IEEE Trans. Image Process., vol. 23, no. 7, pp. 3025–3039, Jul. 2014.
[12] A. Shrivastava and P. Li, “Densifying one permutation hashing via rotation for fast near neighbor search,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 557–565.
[13] Z. Yu, F. Wu, Y. Zhang, S. Tang, J. Shao, and Y. Zhuang, “Hashing with list-wise learning to rank,” in Proc. ACM Special Interest Group Inf. Retr., 2014, pp. 999–1002.
[14] X. Liu, J. He, and B. Lang, “Multiple feature kernel hashing for large-scale visual search,” Pattern Recognit., vol. 47, no. 2, pp. 748–757, Feb. 2014.
[15] B. Wu, Q. Yang, W. Zheng, Y. Wang, and J. Wang, “Quantized correlation hashing for fast cross-modal search,” in Proc. Int. Joint Conf. Artif. Intell., 2015, pp. 3946–3952.
[16] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 37–45.
[17] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. Neural Inf. Process. Syst., 2008, pp. 1753–1760.
[18] W. Liu, J. Wang, S. Kumar, and S. Chang, “Hashing with graphs,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 1–8.
[19] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Oct. 2011, pp. 817–824.
[20] J. Heo, Y. Lee, J. He, S. Chang, and S. Yoon, “Spherical hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Mar. 2012, pp. 2957–2964.
[21] W. Liu, J. Wang, R. Ji, Y. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2074–2081.
[22] M. Norouzi and D. Fleet, “Minimal loss hashing for compact binary codes,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 353–360.
[23] Y. Mu, J. Shen, and S. Yan, “Weakly-supervised hashing in kernel space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3344–3351.
[24] P. Zhang, W. Zhang, W.-J. Li, and M. Guo, “Supervised hashing with latent factor models,” in Proc. ACM Special Interest Group Inf. Retr., 2014, pp. 173–182.
[25] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” Assoc. Adv. Artif. Intell., vol. 1, no. 1, pp. 2156–2162, 2014.
[26] G. Lin, C. Shen, Q. Shi, A. V. D. Hengel, and D. Suter, “Fast supervised hashing with decision trees for high-dimensional data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1971–1978.
[27] J. Wang, O. Kumar, and S. Chang, “Semi-supervised hashing for scalable image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3424–3431.
[28] J. Wang, S. Kumar, and S. Chang, “Sequential projection learning for hashing with compact codes,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 1127–1134.
[29] J. Cheng, C. Leng, P. Li, M. Wang, and H. Lu, “Semi-supervised multi-graph hashing for scalable similarity search,” Comput. Vis. Image Understand., vol. 124, pp. 12–21, Jul. 2014.
[30] N. Quadrianto and C. H. Lampert, “Learning multi-view neighborhood preserving projections,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 425–432.
[31] M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. Davis, “Predictable dual-view hashing,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 1328–1336.
[32] D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization,” Assoc. Adv. Artif. Intell., vol. 1, no. 2, pp. 2177–2183, 2014.
[33] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, “Multimodal similarity-preserving hashing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 824–830, Apr. 2014.
[34] Y. Wei, Y. Song, Y. Zhen, B. Liu, and Q. Yang, “Scalable heterogeneous translated hashing,” in Proc. ACM Special Interest Group Knowl. Discovery Data Mining, 2014, pp. 791–800.
[35] D. Wang, X. Gao, X. Wang, and L. He, “Semantic topic multimodal hashing for cross-media retrieval,” in Proc. Int. Joint Conf. Artif. Intell., 2015, pp. 3890–3896.
[36] D. Wang, X. Gao, X. Wang, L. He, and B. Yuan, “Multimodal discriminative binary embedding for large-scale cross-modal retrieval,” IEEE Trans. Image Process., vol. 25, no. 10, pp. 4540–4554, Oct. 2016.
[37] Y. Zhen and D.-Y. Yeung, “Active hashing and its application to image and text retrieval,” Data Mining Knowl. Discovery, vol. 26, no. 2, pp. 255–274, 2013.
[38] Q. Wang, L. Si, Z. Zhang, and N. Zhang, “Active hashing with joint data example and tag selection,” in Proc. ACM Special Interest Group Inf. Retr., 2014, pp. 405–414.
[39] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” J. Mach. Learn. Res., vol. 7, pp. 551–585, Dec. 2006.
[40] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, Jan. 2010.
[41] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Proc. Neural Inf. Process. Syst., 2008, pp. 761–768.
[42] Y. Li and P. M. Long, “The relaxed online maximum margin algorithm,” in Proc. Neural Inf. Process. Syst., 1999, pp. 498–504.
[43] M. K. Warmuth and D. Kuzmin, “Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension,” J. Mach. Learn. Res., vol. 9, pp. 2287–2320, Oct. 2008.
[44] J. Silva and L. Carin, “Active learning for online Bayesian matrix factorization,” in Proc. ACM Conf. Knowl. Discovery Data Mining, 2012, pp. 325–333.
[45] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast kernel classifiers with online and active learning,” J. Mach. Learn. Res., vol. 6, pp. 1579–1619, Sep. 2005.
[46] W. Chu, M. Zinkevich, L. Li, A. Thomas, and B. Tseng, “Unbiased online active learning in data streams,” in Proc. ACM Conf. Knowl. Discovery Data Mining, 2011, pp. 195–203.
[47] C. Leng, J. Wu, J. Cheng, X. Bai, and H. Lu, “Online sketching hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 2503–2511.
[48] M. Ghashami and A. Abdullah. (Mar. 2015). “Binary coding in stream.” [Online]. Available: https://arxiv.org/abs/1503.06271
[49] F. Çakir and S. Sclaroff, “Online supervised hashing,” in Proc. IEEE Int. Conf. Image Process., Jun. 2015, pp. 2606–2610.
[50] F. Çakir and S. Sclaroff, “Adaptive hashing for fast similarity search,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1044–1052.
[51] F. Çakir and S. Sclaroff, “Supervised hashing with error correcting codes,” in Proc. ACM Int. Conf. Multimedia, 2014, pp. 785–788.
[52] L.-K. Huang, Q. Yang, and W.-S. Zheng, “Online hashing,” in Proc. Int. Joint Conf. Artif. Intell., 2013, pp. 1422–1428.
[53] E. Liberty, “Simple and deterministic matrix sketching,” in Proc. ACM Conf. Knowl. Discovery Data Mining, 2013, pp. 581–588.
[54] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary hashing for approximate nearest neighbor search,” in Proc. Int. Conf. Comput. Vis., 2011, pp. 1631–1638.
[55] T. Finley and T. Joachims, “Training structural SVMs when exact inference is intractable,” in Proc. Int. Conf. Mach. Learn., 2008, pp. 304–311.
[56] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” J. Mach. Learn. Res., vol. 6, pp. 1453–1484, Sep. 2005.
[57] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: Exploring photo collections in 3D,” ACM Trans. Graph., vol. 25, no. 3, pp. 835–846, 2006.
[58] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[59] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
[60] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
[61] Z. Jin, C. Li, Y. Lin, and D. Cai, “Density sensitive hashing,” IEEE Trans. Cybern., vol. 44, no. 8, pp. 1362–1371, Aug. 2014.
[62] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[63] A. Turpin and F. Scholer, “User performance versus precision measures for simple search tasks,” in Proc. ACM Special Interest Group Inf. Retr., 2006, pp. 11–18.
[64] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu, “Semi-supervised nonlinear hashing using bootstrap sequential projection learning,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 6, pp. 1380–1393, Jun. 2013.
[65] M. Norouzi, A. Punjani, and D. J. Fleet, “Fast search in Hamming space with multi-index hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3108–3115.
[66] L. Huang and S. J. Pan, “Class-wise supervised hashing with label embedding and active bits,” in Proc. Int. Joint Conf. Artif. Intell., 2016, pp. 1585–1591.

Long-Kai Huang received the B.Eng. degree from the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, in 2013. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore.
His current research interests include machine learning and computer vision, with a special focus on fast large-scale image search, nonconvex optimization, and their applications.

Qiang Yang received the M.S. degree from the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, in 2014, where he is currently pursuing the Ph.D. degree with the School of Data and Computer Science.
His current research interests include data mining algorithms, machine learning algorithms, evolutionary computation algorithms, and their applications to real-world problems.

Wei-Shi Zheng is currently a Professor with Sun Yat-sen University. He has authored over 90 papers, including over 60 publications in main journals, such as the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, and Pattern Recognition, and top conferences, such as the International Conference on Computer Vision, the Computer Vision and Pattern Recognition, and the International Joint Conference on Artificial Intelligence. His current research interests include distance metric learning, algorithms for fast search, and deep learning methods, with their particular applications to person/object association and activity understanding in visual surveillance.
He has joined the Microsoft Research Asia Young Faculty Visiting Programme. He was a recipient of the Excellent Young Scientists Fund of the NSFC and the Royal Society-Newton Advanced Fellowship.