
Learning a Complete Image Indexing Pipeline

Himalaya Jain¹,², Joaquin Zepeda³, Patrick Pérez¹, and Rémi Gribonval²

¹ Technicolor, Rennes, France    ² INRIA Rennes, France    ³ Amazon, Seattle, WA

arXiv:1712.04480v1 [cs.CV] 12 Dec 2017

Abstract

To work at scale, a complete image indexing system comprises two components: an inverted file index to restrict the actual search to only a subset that should contain most of the items relevant to the query, and an approximate distance computation mechanism to rapidly scan these lists. While supervised deep learning has recently enabled improvements to the latter, the former continues to be based on unsupervised clustering in the literature. In this work, we propose a first system that learns both components within a unifying neural framework of structured binary encoding.

1. Introduction

Decades of research have produced powerful means to extract features from images, effectively casting the visual comparison problem into one of distance computations in abstract spaces. Whether engineered or trained using convolutional deep networks, such vector representations are at the core of all content-based visual search engines. This applies particularly to example-based image retrieval systems, where a query image is used to scan a database for images that are similar to the query in some way: they are the same image but one has been edited (near-duplicate detection), they are images of the same object or scene (instance retrieval), or they depict objects or scenes from the same semantic class (category retrieval).

Deploying such a visual search system requires conducting nearest-neighbour search in a high-dimensional feature space. Both the dimension of this space and the size of the database can be very large, which imposes severe constraints if the system is to be practical in terms of storage (memory footprint of database items) and of computation (search complexity). Exhaustive exact search must be replaced by approximate, non-exhaustive search. To this end, two main complementary methods have emerged, both relying on variants of unsupervised vector quantization (VQ).

The first such method, introduced by Sivic and Zisserman [25], is the inverted file system. Inverted files rely on a partitioning of the feature space into a set of mutually exclusive bins. Searching in a database thus amounts to first assigning the query image to one or several such bins, and then ranking the resulting shortlist of images associated to these bins using the Euclidean distance (or some other distance or similarity measure) in feature space.

The second method, introduced by Jegou et al. [15], consists of using efficient approximate distance computations as part of the ranking process. This is enabled by feature encoders producing compact representations of the feature vectors that, further, do not need to be decompressed when computing the approximate distances. This type of approach, which can be seen as employing block-structured binary representations, superseded the (unstructured) binary hashing schemes that previously dominated approximate search.

Despite its impressive impact on the design of image representations [11, 1, 10, 23], supervised deep learning remains of limited use in the approximate search system itself. Most recent efforts focus on supervised deep binary hashing schemes, as discussed in the next section. As an exception, the work of Jain et al. [13] employs a block-structured approach inspired by the successful compact encoders referenced above. Yet the binning mechanisms that enable the usage of inverted files, and hence large-scale search, have so far been neglected.

In this work we introduce a novel supervised inverted file system along with a supervised, block-structured encoder that together specify a complete, supervised image indexing pipeline. Our design is inspired by the two successful indexing methods described above, while borrowing ideas from [13] to implement this philosophy. Our main contributions are as follows: (1) We propose the first, to our knowledge, image indexing system to reap the benefits of deep learning for both data partitioning and feature encoding; (2) our data partitioning scheme, in particular, is the first to replace unsupervised VQ by a supervised approach; (3) we take steps towards learning the feature encoder and inverted file binning mechanism simultaneously as part of the same learning objective; (4) we establish a wide margin of improvement over the existing

baselines employing state-of-the-art deep features, feature encoders and binning mechanisms.

2. Background

Approximating distances through compact encoding. Concerning approximate distance computations, two main approaches exist. Hashing methods [26], on the one hand, employ Hamming distances between binary hash codes. Originally unsupervised, these methods have recently benefited from progress in deep learning [27, 28, 30, 17, 18, 19, 7], leading to better systems for category retrieval in particular. Structured variants of VQ, on the other hand, produce fine-grained approximations of the high-dimensional features themselves through very compact codes [3, 8, 9, 12, 15, 16, 20, 29] that enable efficient look-up-table-based distance computations. Contrary to recent hashing methods, VQ-based approaches have not benefited from supervision so far. However, Jain et al. [13] recently proposed a supervised deep learning approach that leverages the advantages of structured compact encoding and yields state-of-the-art results on several retrieval tasks. Our work extends this supervised approach towards a complete indexing pipeline, that is, a system that also includes an inverted file index.

Scanning shorter lists with inverted indexes. For further efficiency, approximate search is restricted to a well-chosen fraction of the database. This pruning is carried out by means of an inverted file (IVF), which relies on a partitioning of the feature space into Voronoi cells defined using K-means clustering [14, 2]. Two things should be noted: the method used to build the inverted index is unsupervised, and it is independent of the way subsequent distance approximations are conducted (e.g., while VQ is used to build the index, shortlists can be scanned using binary embeddings [14]). In this work, we propose a unifying supervised framework: both the inverted index and the encoding of features are designed and trained together for improved performance. In the next section, we expose in more detail the existing tools used to design IVF/approximate-search pipelines, before moving to our proposal in Section 4.

3. Review of image indexing

Image indexing systems are based on two main components: (i) an inverted file and (ii) a feature encoder. In this section we describe how these two components are used in image indexing systems, thus laying out the motivation for the method we introduce in §4.

Inverted File (IVF). An inverted file relies on a partition of the database into mutually exclusive bins, a subset of which is searched at query time. The partitioning is implemented by means of VQ [25, 15, 2]: given a vector x ∈ R^d and a codebook D = [d_k ∈ R^d]_{k=1}^N, the VQ representation of x in D is obtained by solving¹

    n = argmin_k ||x − d_k||_2^2,    (1)

where n is the codeword index for x and d_n its reconstruction. Given a database {x_i}_i of image features, and letting n_i denote the codeword index of x_i, the database is partitioned into N index bins B_n. These bins, stored along with metadata that may include the features x_i or a compact representation thereof, are known as an inverted file. At query time, the bins are ranked by decreasing pertinence n_1, . . . , n_N relative to the query feature x* so that

    ||x* − d_{n_1}|| ≤ . . . ≤ ||x* − d_{n_N}||,    (2)

i.e., by increasing order of reconstruction error. Using this sorting, one can specify a target number of images T to retrieve from the database and search only the first B bins, where B is chosen so that Σ_{k=1}^{B−1} |B_{n_k}| ≤ T ≤ Σ_{k=1}^{B} |B_{n_k}|.

It is important to note that all existing state-of-the-art indexing methods employ a variant of the above-described mechanism that relies on K-means-learned codebooks D. To the best of our knowledge, ours is the first method to reap the benefits of deep learning to build an inverted file.

¹Notation: We denote [v_1, . . . , v_M] = [v_k ∈ R^d]_{k=1}^M the matrix in R^{d×M} having columns v_k ∈ R^d, or simply [v_k]_k. For scalars a_k, [a_k]_k denotes a column vector with entries a_k. The column vector obtained by stacking vectors v_k vertically is denoted COL(v_1, . . . , v_M). We further let v[k] denote the k-th entry of vector v.
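To make the bin assignment of (1) and the bin ranking of (2) concrete, the following is a minimal NumPy sketch (function and variable names are ours, not from the paper): it builds the inverted file from a feature matrix and, at query time, keeps the first B bins whose aggregate size reaches the target shortlist length T.

```python
import numpy as np

def build_ivf(X, D):
    """Assign each database feature (row of X) to its nearest
    codeword in D (columns are the codewords d_k), as in (1)."""
    # Squared distances between all features and all codewords.
    d2 = ((X[:, None, :] - D.T[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                      # n_i for each x_i
    return {n: np.flatnonzero(assign == n) for n in range(D.shape[1])}

def rank_bins_and_shortlist(x_q, D, bins, T):
    """Rank bins by increasing reconstruction error (2), then keep
    the first B bins whose total size reaches the target T."""
    d2 = ((D.T - x_q) ** 2).sum(axis=1)
    order = d2.argsort()                            # n_1, ..., n_N
    shortlist, B = [], 0
    for n in order:
        shortlist.extend(bins[n].tolist())
        B += 1
        if len(shortlist) >= T:
            break
    return order[:B], np.array(shortlist)
```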
Feature encoder. The inverted file outputs a shortlist of images with indices in ∪_{k=1}^{B} B_{n_k}, which needs to be efficiently ranked in terms of distance to the query. This is enabled by compact feature encoders that allow rapid distance computations without decompressing the features. It is important to note that the storage bitrate of the encoding affects not only storage cost but also search speed, as higher bitrates mean that bins need to be stored in secondary storage, where look-up speeds are a significant burden.

State-of-the-art image indexing systems use feature encoders that employ a residual approach: a residual is computed from each database feature x and its reconstruction d_n obtained as part of the inverted file bin selection in (1):

    r_n = x − d_n.    (3)

This residual is then encoded using a very high resolution quantizer. Several schemes exist [6, 15] that exploit structured quantizers to enable low-complexity, high-resolution quantization; herein we describe product quantizers and related variants [15, 20, 8]. Such vector quantizers employ a codebook C ∈ R^{d×K^M} whose codewords are themselves sums of codewords drawn from M smaller constituent codebooks C_m = [c_{m,k}]_k ∈ R^{d×K}, m = 1, . . . , M, that are mutually orthogonal (C_m^T C_l = 0 for all m ≠ l):
    C = [ Σ_{m=1}^{M} c_{m,k_m} ]_{(k_1, . . . , k_M) ∈ (1, . . . , K)^M}.    (4)

Accordingly, an encoding of r in this structured codebook is specified by the indices (k_1, . . . , k_M), which uniquely define the codeword c from C, i.e., the reconstruction of r in C. Note that the bitrate of this encoding is M log_2(K).
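As an illustration, the following hedged sketch encodes a residual with M constituent codebooks under the product-quantizer convention, in which each C_m acts on a disjoint sub-vector of r, so that the orthogonality assumption above holds by construction; names and shapes are our assumptions, not the paper's implementation.

```python
import numpy as np

def pq_encode(r, subcodebooks):
    """Product-quantizer encoding: subcodebooks[m] has shape
    (K, d/M) and covers the m-th sub-vector of r, so codewords of
    different blocks are orthogonal by construction (viewed as
    zero-padded d-dimensional vectors). Returns (k_1, ..., k_M)."""
    M = len(subcodebooks)
    sub = np.split(r, M)                 # d/M-dimensional sub-vectors
    return [int(((C - s) ** 2).sum(axis=1).argmin())
            for C, s in zip(subcodebooks, sub)]

# The bitrate is M * log2(K): e.g. M = 8 and K = 256 give 8 bytes.
```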
Asymmetric distance computation. Armed with such a representation for all database vectors, one can very efficiently compute an approximate distance between a query x* and all database features x ∈ {x_i, i ∈ ∪_{k=1}^{B} B_{n_k}} in the top-ranked bins. The residual of x* for bin B_n is

    r_n* = x* − d_n,    (5)

and the approach is asymmetric in that this uncompressed residual is compared to the compressed, reconstructed residual representation c of the database vectors x in bin B_n using the distance

    ||r_n* − c||_2^2 = Σ_{m=1}^{M} ||r_n* − c_{m,k_m}||_2^2.    (6)

We define the look-up tables (LUTs)

    z_{n,m} ≜ [ ||r_n* − c_{m,k}||_2^2 ]_k ∈ R^K    (7)

containing the distances between r_n* and all codewords of C_m. Building these LUTs enables us to compute (6) using Σ_{m=1}^{M} z_{n,m}[k_m], an operation that requires only M table look-ups and additions, establishing the functional benefit of the encoding (k_1, . . . , k_M).

To gain some insight into the above encoding, consider the one-hot representation b_m of the indices k_m given by

    b_m = [ ⟦l = k_m⟧ ]_l ∈ K_K,    (8)

where ⟦·⟧ denotes the Iverson bracket and

    K_K ≜ { a ∈ {0, 1}^K, ||a||_1 = 1 }.    (9)

Using the stacked column vectors

    b = COL(b_1, . . . , b_M) ∈ K_K^M and    (10)
    z_n = COL(z_{n,1}, . . . , z_{n,M}) ∈ R_+^{MK},    (11)

distance (6) can be expressed as

    ||r_n* − c||_2^2 = z_n^T b.    (12)

Namely, computing approximate distances between a query x* and the database features x ∈ {x_i, i ∈ B_n} amounts to computing an inner product between a bin-dependent mapping z_n ∈ R_+^{MK} of the query feature x* and a block-structured binary code b ∈ K_K^M derived from x. A search then consists of computing all such approximate distances for the B most pertinent bins and then sorting the corresponding images in increasing order of these distances.

It is worth noting that most of the recent supervised binary encoding methods [27, 28, 30, 17, 18, 19, 7] do not use structured binary codes of the form of b in (12). The main exception is SuBiC [13], which further uses a sorting score that is an inner product of the same form as (12).
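The sketch below illustrates this LUT-based asymmetric distance computation under the same product-quantizer (sub-vector) convention as the previous sketch, together with the equivalent inner-product form (12); helper names are ours.

```python
import numpy as np

def build_luts(r_q, subcodebooks):
    """LUTs z_{n,m}[k]: squared distances between the m-th sub-vector
    of the query residual and the k-th codeword of C_m, cf. (7)."""
    sub = np.split(r_q, len(subcodebooks))
    return [((C - s) ** 2).sum(axis=1) for C, s in zip(subcodebooks, sub)]

def adc_distance(luts, codes):
    """Approximate distance (6): M table look-ups and additions."""
    return sum(lut[k] for lut, k in zip(luts, codes))

def adc_inner_product(luts, codes, K):
    """The same value as adc_distance, written as z_n^T b, cf. (12):
    z_n stacks the LUTs and b is the stacked one-hot code of (10)."""
    z = np.concatenate(luts)
    b = np.zeros(len(luts) * K)
    for m, k in enumerate(codes):
        b[m * K + k] = 1.0
    return float(z @ b)
```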


4. A complete indexing pipeline

The previous section established how state-of-the-art large-scale image search systems rely on two main components: an inverted file and a functional residual encoder that produces block-structured binary codes. While compact binary encoders based on deep learning have been explored in the literature, inverted file systems continue to rely on unsupervised K-means codebooks.

In this section we first revisit the SuBiC encoder [13], and then show how it can be used to implement a complete image indexing system that employs deep learning methodology both at the IVF stage and at the compact encoder stage.
Figure 3. Proposed indexing architecture. The proposed indexing architecture consists of a bin selection component, a residual computation component, and a feature encoder. We use blocks with square corners (labeled with a weights matrix) to denote fully-connected linear operations, potentially followed by a ReLU or softmax (SM) nonlinearity. Blue, yellow, and green blocks are active, respectively, only at training time, only at testing (i.e. database indexing/querying) time, and at training/testing times. The residual block can be disabled to define a new architecture, as illustrated by the switch at the bottom of the diagram.

4.1. Block-structured codes

The SuBiC encoder in Fig. 1 was the first to leverage supervised deep learning to produce a block-structured code of the form b ∈ K_K^M in (10).

Figure 1. The SuBiC encoder operates on the feature vector x produced by a CNN to enable learning of (relaxed) block-structured codes b̃ (b). Blue, yellow, and green blocks are active, respectively, only at training time, only at testing time, and at training/testing times.

At learning time, the method relaxes the block-structured constraint. Letting

    ∆_K = { a ∈ R_+^K  s.t.  Σ_k a[k] = 1 }    (13)

denote the convex hull of K_K, the relaxation

    b̃ ∈ ∆_K^M    (14)

is enforced by means of a fully-connected layer of output size KM with ReLU activation, whose output z is fed to a block softmax non-linearity that operates as follows. Let z_m denote the m-th block of z ∈ R^{KM}, so that z = COL(z_1, . . . , z_M), and likewise let b̃_m ∈ ∆_K denote the m-th block of the relaxed code b̃ ∈ ∆_K^M. The block softmax applies a standard softmax non-linearity to each block z_m of z to produce the corresponding block b̃_m of b̃:

    b̃_m = [ exp(z_m[k]) / Σ_l exp(z_m[l]) ]_k.    (15)

At test time, the block softmax non-linearity is replaced by a block one-hot encoder that projects b̃ onto K_K^M. In practice, this can be accomplished by one-hot encoding the index of the maximum entry of z_m:

    b_m = [ ⟦k = argmax(z_m)⟧ ]_k.    (16)
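A minimal NumPy sketch of the block softmax (15) and its test-time replacement (16), assuming z is a flat length-KM vector; names are ours.

```python
import numpy as np

def block_softmax(z, M, K):
    """Training-time relaxation (15): a softmax applied independently
    to each of the M length-K blocks of z, giving b_tilde in Delta_K^M."""
    blocks = z.reshape(M, K)
    e = np.exp(blocks - blocks.max(axis=1, keepdims=True))  # stabilized
    return (e / e.sum(axis=1, keepdims=True)).reshape(-1)

def block_one_hot(z, M, K):
    """Test-time encoding (16): one-hot of the argmax of each block,
    giving the binary block-structured code b in K_K^M."""
    blocks = z.reshape(M, K)
    b = np.zeros(blocks.shape)
    b[np.arange(M), blocks.argmax(axis=1)] = 1.0
    return b.reshape(-1)
```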
The approach of [13] introduced two losses based on entropy that enforce the proximity of b̃ to K_K^M. The entropy of a vector p ∈ ∆_K, defined as

    E(p) = − Σ_{k=1}^{K} p[k] log_2(p[k]),    (17)

has a minimum equal to zero for deterministic distributions p ∈ K_K, motivating the use of the entropy loss

    ℓ_E(b̃) ≜ Σ_{m=1}^{M} E(b̃_m)    (18)

to enforce the proximity of the relaxed blocks b̃_m to K_K. This loss on its own, however, could lead to situations where only some elements of K_K are favored (c.f. Fig. 2), meaning that only a subset of the support of the b_m is used.

Figure 2. The discrete set K_3 of one-hot encoded vectors, its convex hull ∆_3, and the distribution of relaxed blocks b̃_m enforced by the SuBiC entropy losses. Omitting the negative batch entropy loss (19) would result in situations where p(b̃_m) is concentrated near only k < 3 of the elements of K_3.

Yet entropy likewise has a maximum of log_2(K) for uniform distributions p = (1/K)1. This property can be used to encourage uniformity in the selection of elements of K_K by means of the negative batch entropy loss, computed for a batch A = {b̃^(i)}_i of size |A| as

    ℓ_B(A) ≜ − Σ_{m=1}^{M} E( (1/|A|) Σ_i b̃_m^(i) ).    (19)

For convenience, we define the SuBiC loss computed on a batch A as the weighted combination of the two entropy losses, parametrized by the hyper-parameters µ, γ ∈ R_+:

    ℓ_S^{µ,γ}(A) ≜ (µ/|A|) Σ_{b̃∈A} ℓ_E(b̃) + γ ℓ_B(A).    (20)

It is important to point out that, unlike the residual encoder described in §3, the SuBiC approach operates on the feature vector x directly. Indeed, the SuBiC method is only a feature encoder, and does not implement an entire indexing framework.
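The two entropy losses and their combination (20) can be sketched as follows (plain NumPy on detached values, purely to illustrate the computation; a real implementation would use a differentiable framework):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """E(p) = -sum_k p[k] log2 p[k], cf. (17)."""
    return float(-(p * np.log2(p + eps)).sum())

def subic_loss(batch, M, K, mu, gamma):
    """SuBiC loss (20) on a batch of relaxed codes (each a length-MK
    vector): mean entropy loss (18) plus the negative batch entropy
    (19), weighted by mu and gamma respectively."""
    blocks = np.stack(batch).reshape(len(batch), M, K)
    # (18), averaged over the batch: sum of per-block entropies.
    l_e = np.mean([sum(entropy(b) for b in sample) for sample in blocks])
    # (19): negative entropy of the batch-averaged blocks.
    l_b = -sum(entropy(b) for b in blocks.mean(axis=0))
    return mu * l_e + gamma * l_b
```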
4.2. A novel indexing pipeline

We now introduce our proposed network architecture, which uses the method of [13] described above to build an entire image indexing system. The system we design implements the main ideas of the state-of-the-art pipeline described in §3.

Our proposed network architecture is illustrated in Fig. 3. The input to the network is the feature vector x consisting of activation coefficients obtained by running a given image I through a CNN feature extractor. We refer to this feature extractor as the base CNN of our system. Similarly to the design philosophy described in §3, our indexing system employs an IVF and a residual feature encoder. Accordingly, the architecture in Fig. 3 consists of two main blocks, Bin selection and Encoder, along with a Residual block that links these two main components.

Bin selection. The first block, labeled Bin selection in Fig. 3, can be seen as a SuBiC encoder employing a single block (i.e. M = 1) of size N, with the block one-hot encoder substituted by an argmax operation. The block consists of a single fully-connected layer with weight matrix W1 and ReLU activation, followed by a second activation using a softmax. When indexing a database image I, this block is responsible for choosing the bin B_n that I is assigned to, using the argmax of the coefficients z′. Given a query image I*, the same binning block is responsible for sorting the bins in decreasing order of pertinence B_{n_1}, . . . , B_{n_N} using the coefficients z′* ∈ R_+^N, so that

    z′*[n_1] ≥ . . . ≥ z′*[n_N],    (21)

in a manner analogous to (2).

(Residual) feature encoding. Inspired by the residual encoding approach described in §3, we consider a block analogous to the residual computation of (3) and (5). The approach consists of building a vector (denoting the ReLU as σ)

    R2 σ(R1 b̃′),    (22)

which is analogous to the reconstruction d_n of x obtained from the encoding n following the IVF stage (c.f. (1) and the discussion thereof), and subtracting it from a linear mapping of x:

    r = Qx − R2 σ(R1 b̃′).    (23)

Besides the analogy to indexing pipelines, one other motivation for the above approach is to provide the subsequent feature encoding with information from the IVF bin selection stage (i.e. b̃′) as well as from the original feature x. For completeness, as illustrated in Fig. 3, we also consider architectures that override this residual encoding block, setting r = x directly.

The final stage consists of an M-block SuBiC encoder operating on r and producing test-time encodings b ∈ K_K^M and training-time relaxed encodings b̃ ∈ ∆_K^M. Note that, unlike the residual approach described in §3, our approach does not incur the extra overhead required to compute LUTs using (7).
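To make the data flow concrete, here is a minimal NumPy sketch of the indexing-time pass under our reading of Fig. 3 and (21)-(23); the weight names follow the figure, but the shapes and the plain-array parametrization are our assumptions, and block_one_hot is the helper from the §4.1 sketch.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def index_image(x, W1, Q, R1, R2, W2, M, K, residual=True):
    """Indexing-time pass for a database feature x: pick the bin via
    the argmax of z' (bin selection), form the residual r of (23)
    (or set r = x in the non-residual variants), then compute the
    block-structured code b with the encoder."""
    z_prime = relu(W1 @ x)                 # bin-selection activations
    b_tilde_prime = softmax(z_prime)       # single-block relaxed code
    n = int(z_prime.argmax())              # assigned bin B_n
    if residual:
        r = Q @ x - R2 @ relu(R1 @ b_tilde_prime)   # cf. (23)
    else:
        r = x                              # override switch in Fig. 3
    z = relu(W2 @ r)                       # encoder activations
    b = block_one_hot(z, M, K)             # test-time code in K_K^M
    return n, b
```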
Searching. Given a query image I*, it is first fed to the pipeline in Fig. 3 to obtain (i) the activation coefficients z′* at the output of the W1 layer and (ii) the activation coefficients z* at the output of the W2 layer. The IVF bins are then ranked as per (21), and all database images {I_i, i ∈ ∪_{k=1}^{B} B_{n_k}} in the B most pertinent bins are sorted, based on their encoding b_i, according to their score

    z*^T b_i.    (24)
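A matching query-time sketch of this search, again with hypothetical helpers: bins are ranked by z′* as in (21) and candidates are sorted by the inner-product score (24), which we treat as a similarity (larger is better), in line with the SuBiC sorting score.

```python
import numpy as np

def search(x_q, W1, Q, R1, R2, W2, bins, codes, B, residual=True):
    """Query-time pass: rank bins by z'* (21), then score every image
    in the B top bins against the query mapping z* via (24).
    `bins` maps a bin index to an array of image ids; `codes` maps
    an image id to its stored binary code b_i (length MK)."""
    relu = lambda a: np.maximum(a, 0.0)
    z_prime = relu(W1 @ x_q)                # bin pertinences z'*
    top_bins = np.argsort(-z_prime)[:B]     # n_1, ..., n_B
    if residual:
        e = np.exp(z_prime - z_prime.max())
        r = Q @ x_q - R2 @ relu(R1 @ (e / e.sum()))
    else:
        r = x_q
    z = relu(W2 @ r)                        # query mapping z*
    cand = np.concatenate([bins[n] for n in top_bins])
    scores = np.array([z @ codes[i] for i in cand])   # z*^T b_i
    return cand[np.argsort(-scores)]        # best matches first
```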
Training. We assume we are given a training set {(I^(i), y^(i))}_i organized into C classes, where the label y^(i) ∈ (1, . . . , C) specifies the class of the i-th image. Various works on learning for retrieval have explored the benefit of using ranking losses, like the triplet loss and the pair-wise loss, as opposed to the cross-entropy loss successfully used in classification tasks [11, 27, 28, 30, 17, 18, 19, 7, 5]. Empirically, we have found that the cross-entropy loss yields good results in the retrieval task, and we adopt it in this work.

Given an image belonging to class c and a vector p ∈ ∆_C that is an estimate of the class membership probabilities, the cross-entropy loss is given by (the scaling is for convenience of hyper-parameter cross-validation)

    ℓ(p, c) = −(1/log_2 C) log_2 p[c].    (25)

Accordingly, we train our network by enforcing that the relaxed block-structured codes b̃′ and b̃ are good feature vectors that can be used to predict class membership. We do so by feeding each vector to a softmax classification layer (layers C1 and C2 in Fig. 3, respectively), thus producing estimates of class membership s′ and s in ∆_C (c.f. Fig. 3), from which we derive two possible task-related losses. Letting T denote a batch specified as a set of training-pair indices, these two losses are

    L_{1,α} = (1/|T|) Σ_{i∈T} [ α ℓ(s′^(i), y^(i)) + ℓ(s^(i), y^(i)) ]    (26)

    and L_2 = (1/|T|) Σ_{i∈T} ℓ(s′^(i) + s^(i), y^(i)),    (27)

where the scalar α ∈ {0, 1} is a selector variable. In order to enforce the proximity of the b̃′ and b̃ to K_N and K_K^M, respectively, we further employ the loss

    Ω_H = ℓ_S^{µ1,γ1}({b̃′^(i)}_{i∈T}) + ℓ_S^{µ2,γ2}({b̃^(i)}_{i∈T}),    (28)

which depends on the four hyper-parameters H = {µ1, γ1, µ2, γ2} (we discuss heuristics for their selection in §5). Accordingly, the general learning objective for our system is

    F_* = L_* + Ω_H,    (29)

and we consider three variants thereof:

(SuBiC-I) a non-residual variant with objective F_{1,1}, corresponding to independently training the bin selection block and the feature encoder;

(SuBiC-R) a residual variant with objective F_{1,0}, where the bin selection block is pre-trained and held fixed during learning; and

(SuBiC-J) a non-residual variant with objective F_2.
5. Experiments

Datasets. For large-scale image retrieval, we use three publicly available datasets to evaluate our approach: Oxford5K [21]², Paris6K [22]³ and Holidays [14]⁴. For large-scale experiments, we add 100K and 1M images from Flickr (Flickr100K and Flickr1M, respectively) as noise sets. For Oxford5K, bounding box information is not used. For Holidays, images are used without correcting their orientation.

For training, we use the Landmarks-full subset of the Landmarks dataset [4]⁵, as in [11]. We could only obtain 125,610 images for the full set due to broken URLs. In all our experiments, and for all approaches, we use Landmarks-full as the training set. For completeness, we also carry out category retrieval [13] tests using the Pascal VOC⁶ and Caltech-101⁷ datasets. For these tests, our method is trained on ImageNet.

Base features. The base features x are obtained from the ResNet version of the network proposed in [11]. This network extends the ResNet-101 architecture with region-of-interest pooling, fully-connected layers, and ℓ2-normalizations to mimic the pipeline used for instance retrieval. Their method enjoys state-of-the-art performance for instance retrieval, motivating its usage as the base CNN for this task.

Hyper-parameter selection. For all three variants of our approach (SuBiC-I, SuBiC-J, and SuBiC-R), we use N = 4096 bins and a SuBiC-(8, 256) encoder having M = 8 blocks of size K = 256 (corresponding to 8 bytes per encoded feature). These parameters correspond to commonly used values for indexing systems. To select the four hyper-parameters H = {µ1, γ1, µ2, γ2} (c.f. (29)), we first cross-validate just the bin selection block to choose µ1 = 5.0 and γ1 = 6.0. With these values fixed, we then cross-validate the encoder block to obtain µ2 = 0.6 and γ2 = 0.9. We use the same values for all three variants of our system.

Evaluation of feature encoder. First of all, we evaluate how SuBiC encoding performs on all the test datasets compared to the unsupervised PQ vector quantizer. We use M = 8 and K = 256 setups for both codes. SuBiC is trained for 500K batches of 200 training images, with µ = 0.6 and γ = 0.9. The results reported in Table 1 show that, as expected, SuBiC outperforms PQ, justifying its selection as the feature encoder in our system. For reference, the first row in the table gives the performance with uncompressed features. While high, each base feature vector has a storage footprint of 8 kilobytes (assuming 4-byte floating-point values). SuBiC and PQ, on the other hand, require only 8 bytes of storage per feature (1000× less).

Baseline indexing systems. We compare all three variants of our proposed indexing system against two existing baselines, as well as a straightforward attempt to use deep hashing as an IVF system:

(IVF-PQ) This approach uses an inverted file with N = 4096 bins followed by a residual PQ encoder with M = 8 blocks and constituent codebooks of size K = 256 (c.f. (4)), resulting in an 8-byte feature size. The search employs asymmetric distance computation. During retrieval, the top B = 2^n lists are retrieved and, for each n = 1, 2, . . ., the average mAP and average aggregate bin size T are plotted.

(IMI-PQ) The Inverted Multi-Index (IMI) [2] extends the standard IVF by substituting a product quantizer with M = 2 and K = 4096 in place of the vector quantizer. The resulting IVF has more than 16 million bins, meaning that, for practical test sets (containing close to 1 million images), most of the bins are empty. Hence, when employing IMI, we select shortlist sizes T for which to compute the average mAP to create our plots. Note that, given the small size of the IMI bins, the computation of the look-up tables z_n (c.f. (7)) represents a higher per-image cost for IMI than for IVF. Furthermore, the fragmented memory reads required can have a large impact on speed relative to the contiguous reads implicit in the larger IVF bins.

²www.robots.ox.ac.uk/~vgg/data/oxbuildings/
³www.robots.ox.ac.uk/~vgg/data/parisbuildings/
⁴lear.inrialpes.fr/~jegou/data.php
⁵sites.skoltech.ru/compvision/projects/neuralcodes/
⁶http://host.robots.ox.ac.uk/pascal/VOC/
⁷http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Method        Oxford5K   Oxford5K*   Paris6K   Holidays   Oxford105K   Paris106K
DIR [11]        84.94      84.09      93.58     90.32       83.52        89.10
PQ [15]         46.57      39.45      57.57     48.23       38.73        42.23
SuBiC [13]      53.25      46.06      71.28     60.52       46.88        58.27

Table 1. Instance retrieval with encoded features. Performance (mAP) comparison using 64-bit codes; the first row shows reference results with the original uncompressed features. When bounding box information is used for the Oxford5K dataset (column Oxford5K*), the performance degrades for both PQ and SuBiC, as both are trained on full images.

Figure 4. Large-scale image retrieval with complete pipelines. Plots of mAP vs. (average) shortlist size T, with one panel per dataset and scale: Oxford5K/Oxford105K/Oxford1M (top row), Paris6K/Paris106K/Paris1M (middle row), and Holidays/Holidays101K/Holidays1M (bottom row). Compared methods: IVF-PQ, IMI-PQ, DSH-SuBiC, SuBiC-I, SuBiC-J and SuBiC-R. For all methods except IMI-PQ, the n-th plotted point is obtained from all images in the first B = 2^n bins. For IMI-PQ, the mAP is computed on the first T responses.

(DSH-SuBiC) In order to explore possible approaches to including supervision in the IVF stage of an indexing system, we further considered using the DSH deep hash code [19] as a bin selector, carefully selecting the regularization parameter to be 0.03 by means of cross-validation. We train this network to produce 12-bit image representations corresponding to N = 4096 IVF bins, where each bin has an associated hash code. Images are indexed using their DSH hash and, at query time, the Hamming distance between the query's 12-bit code and each bin's code is used to rank the lists. For the encoder part, we used SuBiC with M = 8 and K = 256, the same setup used in Tab. 1.

Large-scale indexing. Fig. 4 shows the mAP performance versus the average number of retrieved images T for all three variants as well as the baselines described above. Note that the number of retrieved images is a measure of complexity since, for IVF, the time complexity of the system is dominated by the approximate distance computations in (12). For IMI, on the other hand, there is a non-negligible overhead on top of the approximate distance computation related to the large number of bins, as discussed above.

We present results for three datasets (Oxford5K, Paris6K, Holidays), at three different database scales (the original dataset, and also including noise sets of 100K and 1M images). Note that on Oxford5K and Paris6K, both SuBiC-I and SuBiC-J enjoy large advantages relative to all three baselines: at T = 300, the relative advantage of SuBiC-I over IMI-PQ is at least 19%. SuBiC-R likewise enjoys an advantage on the Paris6K dataset, and performs comparably to the baselines on Oxford5K. On Holidays, SuBiC-I outperforms IVF-PQ by a large margin (18% relative), but does not outperform IMI-PQ.
Figure 5. IMI variant of our approach and comparison for a fixed encoder. Plots of mAP vs. (average) shortlist size T on the same nine dataset/scale panels as Fig. 4, comparing an IMI variant of our method to the baselines when using the same (non-residual) feature encoder: IVF-SuBiC, IMI-SuBiC, SuBiC-I and SuBiC-I-IMI. Note the substantial relative improvements of SuBiC-I-IMI.

As discussed above, this comparison does not reflect the overhead implicit in an IMI index. To illustrate this overhead, we note that, when 1M images are indexed, the average (non-empty) bin size for IMI is 18.3, meaning that approximately 54.64 (that is, 1000/18.3) memory accesses and look-up table constructions need to be carried out per 1K images retrieved for each IMI query. This compares to an average bin size of 244.14 for IVF, and accordingly 4.1 contiguous memory reads and look-up table constructions. Note, on the other hand, that SuBiC-I readily outperforms IVF-PQ in all Holidays experiments.

Concerning the poor performance of SuBiC-R on Holidays, we believe this is due to the poor generalization ability of the system caused by the three extra fully-connected layers.

IMI extension. Given the high performance of IMI in the Holidays experiments of Fig. 4, we further consider an IMI variant of our SuBiC-I architecture. To implement this approach, we learn a SuBiC-(2, 4096) encoder (with µ = 4 and γ = 5). Letting z′_m denote the m-th block of z′, the (k, l) ∈ (1, . . . , 4096)² bins of SuBiC-IMI are sorted based on the score z′_1[k] + z′_2[l], as sketched below. For fairness of comparison, we use the same SuBiC-(8, 256) feature encoder for all methods, including the baselines, which are IVF and IMI with unsupervised codebooks (all methods are non-residual). The results, plotted in Fig. 5, establish that, for the same number of bins, our method readily outperforms the baseline IMI (and IVF) methods. Furthermore, given that we use the best-performing feature encoder (SuBiC) for all methods, this experiment also establishes that the SuBiC-based binning system that we propose outperforms the unsupervised IVF and IMI baselines.
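A hedged sketch of this pair scoring follows (exhaustive over the 4096 × 4096 grid for clarity; a practical implementation would traverse the grid lazily rather than scoring all of the roughly 16.8M pairs):

```python
import numpy as np

def rank_imi_bins(z1, z2, top):
    """Rank the (k, l) bins of SuBiC-IMI by the additive score
    z'_1[k] + z'_2[l] and return the `top` highest-scoring pairs,
    best first, as rows (k, l)."""
    grid = z1[:, None] + z2[None, :]             # full score grid
    flat = np.argsort(-grid, axis=None)[:top]    # best pairs first
    return np.column_stack(np.unravel_index(flat, grid.shape))
```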
Category retrieval. For completeness, we also carry out experiments on the category retrieval task, which has been the main focus of recent deep hashing methods [27, 28, 30, 17, 18, 19, 7]. In this task, a given query image is used to rank all database images, a correct match occurring for database images of the same class as the query image. For category retrieval experiments, we use VGG-M-128 base features [24], which have established good performance for classification tasks, and a SuBiC-(1, 8192) for bin selection. We use the ImageNet training set (1M+ images) to train, and the test (respectively, training) subsets of Pascal VOC and Caltech-101 as query (respectively, database) sets. We present results for this task in Fig. 6. Note that, unlike in the Holidays experiments of Fig. 4, SuBiC-R performs best on Caltech-101 and equally well to SuBiC-I on Pascal VOC, a consequence of the greater size and diversity of the ImageNet dataset relative to the Landmarks dataset.

Figure 6. Category retrieval. Comparing SuBiC-I and SuBiC-R to IVF-PQ on the category retrieval task (mAP vs. shortlist size T) on Pascal VOC and Caltech-101. All methods are trained on VGG-M-128 features of ImageNet images.

6. Conclusion

We present a full image indexing pipeline that exploits supervised deep learning methods to build an inverted file as well as a compact feature encoder. Previous methods have either employed unsupervised inverted file mechanisms, or employed supervision only to derive feature encoders. We establish experimentally that our method achieves state-of-the-art results in large-scale image retrieval.

References

[1] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proc. Int. Conf. Computer Vision, 2016.
[2] A. Babenko and V. Lempitsky. The inverted multi-index. In Proc. Conf. Comp. Vision Pattern Rec., 2012.
[3] A. Babenko and V. Lempitsky. Additive quantization for extreme vector compression. In Proc. Conf. Comp. Vision Pattern Rec., 2014.
[4] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In Proc. Europ. Conf. Computer Vision, 2014.
[5] C. Bilen, J. Zepeda, and P. Perez. The CNN news footage datasets: Enabling supervision in image retrieval. In EUSIPCO, 2016.
[6] Y. Chen, T. Guan, and C. Wang. Approximate nearest neighbor search by residual vector quantization. Sensors, 10(12):11259–11273, 2010.
[7] T.-T. Do, A.-D. Doan, and N.-M. Cheung. Learning to hash with binary deep neural network. In Proc. Europ. Conf. Computer Vision, 2016.
[8] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In Proc. Conf. Comp. Vision Pattern Rec., 2013.
[9] T. Ge, K. He, and J. Sun. Product sparse coding. In Proc. Conf. Comp. Vision Pattern Rec., 2014.
[10] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proc. Europ. Conf. Computer Vision, 2014.
[11] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In Proc. Europ. Conf. Computer Vision, 2016.
[12] H. Jain, P. Pérez, R. Gribonval, J. Zepeda, and H. Jégou. Approximate search with quantized sparse representations. In Proc. Europ. Conf. Computer Vision, 2016.
[13] H. Jain, J. Zepeda, P. Pérez, and R. Gribonval. SuBiC: A supervised, structured binary code for image search. In Proc. Int. Conf. Computer Vision, 2017.
[14] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proc. Europ. Conf. Computer Vision, 2008.
[15] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Machine Intell., 33(1):117–128, 2011.
[16] Y. Kalantidis and Y. Avrithis. Locally optimized product quantization for approximate nearest neighbor search. In Proc. Conf. Comp. Vision Pattern Rec., 2014.
[17] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proc. Conf. Comp. Vision Pattern Rec., 2015.
[18] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In Conf. Comp. Vision Pattern Rec. Workshops, 2015.
[19] H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In Proc. Conf. Comp. Vision Pattern Rec., 2016.
[20] M. Norouzi and D. Fleet. Cartesian k-means. In Proc. Conf. Comp. Vision Pattern Rec., 2013.
[21] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. Conf. Comp. Vision Pattern Rec., 2007.
[22] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proc. Conf. Comp. Vision Pattern Rec., 2008.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Conf. Comp. Vision Pattern Rec. Workshops, 2014.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[25] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. Int. Conf. Computer Vision, 2003.
[26] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014.
[27] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In Proc. AAAI Conf. on Artificial Intelligence, 2014.
[28] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. Image Processing, 24(12):4766–4779, 2015.
[29] T. Zhang, C. Du, and J. Wang. Composite quantization for approximate nearest neighbor search. In Proc. Int. Conf. Machine Learning, 2014.
[30] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. In Proc. Conf. Comp. Vision Pattern Rec., 2015.
