Learning A Complete Image Indexing Pipeline
baselines employing state-of-the-art deep features, encoders and binning mechanisms.

Given a feature codebook D = [d_k ∈ R^d]_{k=1}^N, the VQ representation of x in D is obtained by solving¹

$$n = \operatorname*{argmin}_{k} \| x - d_k \|_2 . \qquad (1)$$

$$C = \Big\{ \sum_{m=1}^{M} c_{m,k_m} \Big\}_{(k_1,\ldots,k_M)\,\in\,(1,\ldots,K)^M} . \qquad (4)$$

Figure 1. The SuBiC encoder operates on the feature vector x produced by a CNN to enable learning of (relaxed) block-structured codes b (b̃). Blue, yellow, and green blocks are active, respectively, only at training time, only at testing time and at training/testing times.

Accordingly, an encoding of r in this structured codebook is specified by the indices (k_1, …, k_M), which uniquely define the codeword c from C, i.e., the reconstruction of r in C. Note that the bitrate of this encoding is M log₂(K).
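As a quick check of the bitrate formula M log₂(K), using the M = 8, K = 256 configuration adopted later in the experiments:

```python
import math

# Bitrate of the structured encoding: one index per block, M blocks,
# each index taking one of K values -> M * log2(K) bits in total.
M, K = 8, 256
bits = M * math.log2(K)
print(bits)  # 64.0 bits, i.e. 8 bytes per encoded feature
```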
Asymmetric distance computation — Armed with such a representation for all database vectors, one can very efficiently compute an approximate distance between a query x* and all database features x ∈ {x_i, i ∈ ∪_{k=1}^{B} B_{n_k}} in top-ranked bins. The residual of x* for bin B_n is

$$r_n^* = x^* - d_n . \qquad (5)$$
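A minimal NumPy sketch of this bin ranking and per-bin residual computation; the array names and the toy sizes are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, B = 128, 16, 4                 # toy feature size, bin count, shortlist
D = rng.standard_normal((N, d))      # IVF codebook: one centroid d_n per bin B_n
x_query = rng.standard_normal(d)     # the query feature x*

# Rank bins by centroid distance to the query, then form the query residual
# r_n* = x* - d_n for each of the B top-ranked bins (c.f. (5)).
order = np.argsort(np.linalg.norm(D - x_query, axis=1))
residuals = {int(n): x_query - D[n] for n in order[:B]}
```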
Figure 3. Proposed indexing architecture. The proposed indexing architecture consists of a bin selection component, a residual computa-
tion component, and a feature encoder. We use blocks with square corners (labeled with a weights matrix) to denote fully-connected linear
operations, potentially followed by a ReLU or softmax (SM) nonlinearity. Blue, yellow, and green blocks are active, respectively, only at
training time, only at testing (i.e. database indexing / querying) time and at training/testing times. The residual block can be disabled to
define a new architecture, as illustrated by the switch at the bottom of the diagram.
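As a rough sketch of the training-time data flow in Figure 3 (NumPy, biases omitted; the weights dictionary and all shapes are hypothetical, named after the diagram's W1, R1, R2, Q and W2 labels):

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max())  # numerically stable softmax
    return e / e.sum()

def forward(x, P, M, K):
    """One training-time pass through the Fig. 3 architecture (sketch only)."""
    z1 = relu(P["W1"] @ x)                          # bin-selection activations z'
    b1 = softmax(z1)                                # relaxed one-block bin code
    r = P["Q"] @ x - P["R2"] @ relu(P["R1"] @ b1)   # residual, as in the text's eq. (23)
    z2 = relu(P["W2"] @ r)                          # encoder activations z
    b = np.apply_along_axis(softmax, 1, z2.reshape(M, K))  # block softmax
    return b1, r, b
```

At test time the softmaxes would be swapped for argmax/one-hot operations, per the caption's blue/yellow block distinction.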
4.1. Block-structured codes

The SuBiC encoder in Fig. 1 was the first to leverage supervised deep learning to produce a block-structured code of the form b ∈ K_K^M in (10). At learning time, the method relaxes the block-structured constraint. Letting

$$\Delta_K = \Big\{ a \in \mathbb{R}_+^K \ \ \mathrm{s.t.} \ \sum_k a[k] = 1 \Big\} \qquad (13)$$

denote the convex hull of K_K, said relaxation

$$\tilde{b} \in \Delta_K^M \qquad (14)$$

is enforced by means of a fully-connected layer of output size KM and ReLU activation with output z that is fed to a block softmax non-linearity that operates as follows: Let z_m denote the m-th block of z ∈ R^{KM} such that z = COL(z_1, …, z_M). Likewise, let b̃_m ∈ Δ_K denote the m-th block of the relaxed code b̃ ∈ Δ_K^M. The block softmax non-linearity operates by applying a standard softmax non-linearity to each block z_m of z to produce the corresponding block b̃_m of b̃:

$$\tilde{b}_m[k] = \frac{\exp(z_m[k])}{\sum_l \exp(z_m[l])} . \qquad (15)$$

At test time, the block softmax non-linearity is replaced by a block one-hot encoder that projects b̃ onto K_K^M. In practice, this can be accomplished by means of one-hot encoding of the index of the maximum entry of z_m:

$$b_m = \big[\, [\![ k = \operatorname{argmax}(z_m) ]\!] \,\big]_k . \qquad (16)$$

The approach of [13] introduced two losses based on entropy that enforce the proximity of b̃ to K_K^M. The entropy of a vector p ∈ Δ_K, defined as

$$E(p) = -\sum_{k=1}^{K} p[k] \log_2(p[k]) , \qquad (17)$$

has a minimum equal to zero for deterministic distributions p ∈ K_K, motivating the use of the entropy loss

$$\ell_E(\tilde{b}) \triangleq \sum_{m=1}^{M} E(\tilde{b}_m) \qquad (18)$$

to enforce the proximity of the relaxed blocks b̃_m to K_K. This loss on its own, however, could lead to situations where only some elements of K_K are favored (c.f. Fig. 2), meaning that only a subset of the support of the b_m is used. Yet entropy likewise has a maximum of log₂(K) for uniform distributions p = (1/K)1. This property can be used to encourage uniformity in the selection of elements of K_K by means of the negative batch entropy loss, computed for a batch A = {b̃^(i)}_i of size |A| using

$$\ell_B(A) \triangleq -\sum_{m=1}^{M} E\Big( \frac{1}{|A|} \sum_i \tilde{b}_m^{(i)} \Big) . \qquad (19)$$

For convenience, we define the SuBiC loss computed on a batch A as the weighted combination of the two entropy losses, parametrized by the hyper-parameters µ, γ ∈ R_+:

$$\ell_S^{\mu,\gamma}(A) \triangleq \frac{\mu}{|A|} \sum_{\tilde{b} \in A} \ell_E(\tilde{b}) + \gamma\, \ell_B(A) . \qquad (20)$$

It is important to point out that, unlike the residual encoder described in §3, the SuBiC approach operates on the feature vector x directly. Indeed, the SuBiC method is only a feature encoder, and does not implement an entire indexing framework.

4.2. A novel indexing pipeline

We now introduce our proposed network architecture that uses the method of [13] described above to build an entire image indexing system. The system we design implements the main ideas of the state-of-the-art pipeline described in §3.
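As a reference for what follows, the SuBiC encoder primitives reviewed in §4.1 — block softmax (15), test-time one-hot coding (16), and the entropy losses (17)–(20) — can be sketched in NumPy as follows (a toy illustration of our reading of the equations, not the authors' implementation):

```python
import numpy as np

def block_softmax(z, M, K):
    """(15): softmax applied independently to each of the M blocks of z."""
    zb = z.reshape(M, K)
    e = np.exp(zb - zb.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def block_one_hot(z, M, K):
    """(16): one-hot code of the maximum entry of each block (test time)."""
    zb = z.reshape(M, K)
    b = np.zeros_like(zb)
    b[np.arange(M), zb.argmax(axis=1)] = 1.0
    return b

def entropy(p):
    """(17): E(p) = -sum_k p[k] log2 p[k], with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def subic_loss(batch, mu, gamma):
    """(20): mu-weighted batch mean of the entropy loss (18),
    plus gamma times the negative batch entropy (19)."""
    A = np.asarray(batch)                             # shape (|A|, M, K)
    l_E = sum(entropy(bm) for b in A for bm in b)     # (18), totalled over the batch
    l_B = -sum(entropy(pm) for pm in A.mean(axis=0))  # (19)
    return mu / len(A) * l_E + gamma * l_B
```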
Our proposed network architecture is illustrated in Fig. 3. The input to the network is the feature vector x consisting of activation coefficients obtained by running a given image I through a CNN feature extractor. We refer to this feature extractor as the base CNN of our system.

Similarly to the design philosophy described in §3, our indexing system employs an IVF and a residual feature encoder. Accordingly, the architecture in Fig. 3 consists of two main blocks, Bin selection and Encoder, along with a Residual block that links these two main components.

Bin selection — The first block, labeled Bin selection in Fig. 3, can be seen as a SuBiC encoder employing a single block (i.e. M = 1) of size N, with the block one-hot encoder substituted by an argmax operation. The block consists of a single fully-connected layer with weight matrix W1 and ReLU activation followed by a second activation using softmax. When indexing a database image I, this block is responsible for choosing the bin B_n that I is assigned to, using the argmax of the coefficients z′.

Given a query image I*, the same binning block is responsible for sorting the bins in decreasing order of pertinence B_{n_1} ··· B_{n_N} using the coefficients z′* ∈ R^N_+ so that

$$z'^*[n_1] \geq \ldots \geq z'^*[n_N] , \qquad (21)$$

in a manner analogous to (2).

(Residual) feature encoding — Inspired by the residual encoding approach described in §3, we consider a block analogous to the residual computation of (3) and (5). The approach consists of building a vector (denoting ReLU as σ)

$$R_2\, \sigma(R_1 \tilde{b}') , \qquad (22)$$

that is analogous to the reconstruction d_n of x obtained from the encoding n following the IVF stage (c.f. (1) and discussion thereof), and subtracting it from a linear mapping of x:

$$r = Q x - R_2\, \sigma(R_1 \tilde{b}') . \qquad (23)$$

Besides the analogy to indexing pipelines, one other motivation for the above approach is to provide information to the subsequent feature encoding from the IVF bin selection stage (i.e. b̃′) as well as the original feature x. For completeness, as illustrated in Fig. 3, we also consider architectures that override this residual encoding block, setting r = x directly.

The final stage consists of an M-block SuBiC encoder operating on r and producing test-time encodings b ∈ K_K^M and training-time relaxed encodings b̃ ∈ Δ_K^M. Note that, unlike the residual approach described in §3, our approach does not incur the extra overhead required to compute LUTs using (7).

Searching — Given a query image I*, it is first fed to the pipeline in Fig. 3 to obtain (i) the activation coefficients z′* at the output of the W1 layer and (ii) the activation coefficients z* at the output of the W2 layer. The IVF bins are then ranked as per (21) and all database images I_i, i ∈ ∪_{k=1}^{B} B_{n_k} in the B most pertinent bins are sorted, based on their encoding b_i, according to their score

$$z^{*\top} b_i . \qquad (24)$$

Training — We assume we are given a training set {(I^(i), y^(i))}_i organized into C classes, where label y^(i) ∈ (1, …, C) specifies the class of the i-th image. Various works on learning for retrieval have explored the benefit of using ranking losses like the triplet loss and the pair-wise loss as opposed to the cross-entropy loss successfully used in classification tasks [11, 27, 28, 30, 17, 18, 19, 7, 5]. Empirically, we have found that the cross-entropy loss yields good results in the retrieval task, and we adopt it in this work.

Given an image belonging to class c and a vector p ∈ Δ_C that is an estimate of class membership probabilities, the cross-entropy loss is given by (the scaling is for convenience of hyper-parameter cross-validation)

$$\ell(p, c) = -\frac{1}{\log_2 C} \log_2 p[c] . \qquad (25)$$

Accordingly, we train our network by enforcing that the relaxed block-structured codes b̃′ and b̃ are good feature vectors that can be used to predict class membership. We do so by feeding each vector to a soft-max classification layer (layers C1 and C2 in Fig. 3, respectively), thus producing estimates of class membership s′ and s in Δ_C (c.f. Fig. 3) from which we derive two possible task-related losses. Letting T denote a batch specified as a set of training-pair indices, these two losses are

$$L_{1,\alpha} = \frac{1}{|T|} \sum_{i \in T} \Big[ \alpha\, \ell\big(s'^{(i)}, y^{(i)}\big) + \ell\big(s^{(i)}, y^{(i)}\big) \Big] \qquad (26)$$

$$\text{and} \quad L_2 = \frac{1}{|T|} \sum_{i \in T} \ell\big(s'^{(i)} + s^{(i)}, y^{(i)}\big) , \qquad (27)$$

where the scalar α ∈ {0, 1} is a selector variable. In order to enforce the proximity of the b̃′ and b̃ to K_N and K_K^M, respectively, we further employ the loss

$$\Omega_H = \ell_S^{\mu_1,\gamma_1}\big(\{\tilde{b}'^{(i)}\}_{i \in T}\big) + \ell_S^{\mu_2,\gamma_2}\big(\{\tilde{b}^{(i)}\}_{i \in T}\big) , \qquad (28)$$

which depends on the four hyper-parameters H = {µ1, γ1, µ2, γ2} (we discuss heuristics for their selection in §5).

Accordingly, the general learning objective for our system is

$$F_* = L_* + \Omega_H , \qquad (29)$$

and we consider three variants thereof:
(SuBiC-I) a non-residual variant with objective F_{1,1}, corresponding to independently training the bin selection block and the feature encoder;

(SuBiC-R) a residual variant with objective F_{1,0}, where the bin selection block is pre-trained and held fixed during learning; and

(SuBiC-J) a non-residual variant with objective F_2.

5. Experiments

Datasets — For large-scale image retrieval, we use three publicly available datasets to evaluate our approach: Oxford5K [21]², Paris6K [22]³ and Holidays [14]⁴. For large-scale experiments, we add 100K and 1M images from Flickr (Flickr100K and Flickr1M, respectively) as a noise set. For Oxford5K, bounding box information is not used. For Holidays, images are used without correcting orientation.

For training, we use the Landmarks-full subset of the Landmarks dataset [4]⁵, as in [11]. We could only get 125,610 images for the full set due to broken URLs. In all our experiments and for all approaches we use Landmarks-full as the training set.

For completeness, we also carry out a category retrieval [13] test using the Pascal VOC⁶ and Caltech-101⁷ datasets. For this test, our method is trained on ImageNet.

Base features — The base features x are obtained from the ResNet version of the network proposed in [11]. This network extends the ResNet-101 architecture with region-of-interest pooling, fully connected layers, and ℓ2-normalizations to mimic the pipeline used for instance retrieval. Their method enjoys state-of-the-art performance for instance retrieval, motivating its usage as base CNN for this task.

Hyper-parameter selection — For all three variants of our approach (SuBiC-I, SuBiC-J, and SuBiC-R), we use N = 4096 bins, and a SuBiC-(8, 256) encoder having M = 8 blocks of block size K = 256 (corresponding to 8 bytes per encoded feature). These parameters correspond to commonly used values for indexing systems. To select the four hyper-parameters H = {µ1, γ1, µ2, γ2} (c.f. (29)), we first cross-validate just the bin selection block to choose µ1 = 5.0 and γ1 = 6.0. With these values fixed, we then cross-validate the encoder block to obtain µ2 = 0.6 and γ2 = 0.9. We use the same values for all three variants of our system.

Evaluation of feature encoder — First of all, we evaluate how SuBiC encoding performs on all the test datasets compared to the PQ unsupervised vector quantizer. We use M = 8 and K = 256 setups for both codes. SuBiC is trained for 500K batches of 200 training images, with µ = 0.6 and γ = 0.9. The results reported in Table 1 show that, as expected, SuBiC outperforms PQ, justifying its selection as a feature encoder in our system. For reference, the first row in the table gives the performance with uncompressed features. While high, each base feature vector has a storage footprint of 8 kilobytes (assuming 4-byte floating points). SuBiC and PQ, on the other hand, require only 8 bytes of storage per feature (1000× less).

Baseline indexing systems — We compare all three variants of our proposed indexing system against two existing baselines, as well as a straightforward attempt to use deep hashing as an IVF system:

(IVF-PQ) This approach uses an inverted file with N = 4096 bins followed by a residual PQ encoder with M = 8 blocks and constituent codebooks of size K = 256 (c.f. (4)), resulting in an 8-byte feature size. The search employs asymmetric distance computation. During retrieval, the top B = 2^n lists are retrieved and, for each n = 1, 2, …, the average mAP and average aggregate bin size T are plotted.

(IMI-PQ) The Inverted Multi-Index (IMI) [2] extends the standard IVF by substituting a product quantizer with M = 2 and K = 4096 in place of the vector quantizer. The resulting IVF has more than 16 million bins, meaning that, for practical testing sets (containing close to 1 million images), most of the bins are empty. Hence, when employing IMI, we select shortlist sizes T for which to compute average mAP to create our plots. Note that, given the small size of the IMI bins, the computation of the look-up tables z_n (c.f. (7)) represents a higher cost per image for IMI than for IVF. Furthermore, the fragmented memory reads required can have a large impact on speed relative to the contiguous reads implicit in the larger IVF bins.

(DSH-SuBiC) In order to explore possible approaches to include supervision in the IVF stage of an indexing system, we further considered using the DSH deep hash code [19] as a bin selector, carefully selecting the regularization parameter to be 0.03 by means of cross-validation. We train this network to produce 12-bit image representations corresponding to N = 4096 IVF bins, where each bin has an associated hash code. Images are indexed using their DSH hash, and at query

² www.robots.ox.ac.uk/~vgg/data/oxbuildings/
³ www.robots.ox.ac.uk/~vgg/data/parisbuildings/
⁴ lear.inrialpes.fr/~jegou/data.php
⁵ sites.skoltech.ru/compvision/projects/neuralcodes/
⁶ http://host.robots.ox.ac.uk/pascal/VOC/
⁷ http://www.vision.caltech.edu/Image_Datasets/Caltech101/
Method      Oxford5K  Oxford5K*  Paris6K  Holidays  Oxford105K  Paris106K
DIR [11]    84.94     84.09      93.58    90.32     83.52       89.10
PQ [15]     46.57     39.45      57.57    48.23     38.73       42.23
SuBiC [13]  53.25     46.06      71.28    60.52     46.88       58.27

Table 1. Instance retrieval with encoded features. Performance (mAP) comparison using 64-bit codes; the first row shows reference results with original uncompressed features. When bounding box information is used for the Oxford5K dataset, the performance degrades for both PQ and SuBiC (column Oxford5K*), as both are trained on full images.
[Figure 4: plots of mAP vs. shortlist size T; panels include Holidays, Holidays101K and Holidays1M; legend: IVF-PQ, IMI-PQ, DSH-SuBiC, SuBiC-I, SuBiC-J, SuBiC-R.]

Figure 4. Large-scale image retrieval with complete pipelines. Plots of mAP vs. (average) shortlist size T. For all methods except IMI-PQ, the n-th plotted point is obtained from all images in the first B = 2^n bins. For IMI-PQ, the mAP is computed on the first T responses.
time, the Hamming distance between the query's 12-bit code and each bin's code is used to rank the lists. For the encoder part, we used SuBiC with M = 8 and K = 256, the same used in Tab. 1.

Large-scale indexing — Fig. 4 shows the mAP performance versus average number of retrieved images T for all three variants as well as the baselines described above. Note that the number of retrieved images is a measure of complexity: for IVF, the time complexity of the system is dominated by the approximate distance computations in (12). For IMI, on the other hand, there is a non-negligible overhead on top of the approximate distance computation, related to the large number of bins, as discussed above.

We present results for three datasets (Oxford5K, Paris6K, Holidays), on three different database scales (the original dataset, and when also including noise datasets of 100K and 1M images). Note that on Oxford5K and Paris6K, both SuBiC-I and SuBiC-J enjoy large advantages relative to all three baselines – at T = 300, the relative advantage of SuBiC-I over IMI-PQ is at least 19%. SuBiC-R likewise enjoys an advantage on the Paris6K dataset, and performs comparably to the baselines on Oxford5K.

On Holidays, SuBiC-I outperforms IVF-PQ by a large margin (18% relative), but does not outperform IMI-PQ.
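The access-count figures discussed in the text follow directly from the average bin sizes (a quick check, dividing a 1K-image shortlist by the average number of images per bin):

```python
# Expected number of bins visited (and look-up tables constructed) to gather
# a shortlist of 1000 images, given the average non-empty bin size.
imi_avg_bin, ivf_avg_bin = 18.3, 244.14
print(round(1000 / imi_avg_bin, 2))  # 54.64 fragmented reads for IMI
print(round(1000 / ivf_avg_bin, 1))  # 4.1 contiguous reads for IVF
```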
[Figure 5: plots of mAP vs. shortlist size T; panels: Oxford5K, Oxford105K, Oxford1M (top row) and Paris6K, Paris106K, Paris1M (bottom row); legend: IVF-SuBiC, IMI-SuBiC, SuBiC-I, SuBiC-I-IMI.]

Figure 5. IMI variant of our approach and comparison for fixed encoder. Comparison of an IMI variant of our method to the baselines, when using the same (non-residual) feature encoder. Note the substantial relative improvements of SuBiC-I-IMI.
As discussed above, this comparison does not reflect the overhead implicit in an IMI index. To illustrate this overhead, we note that, when 1M images are indexed, the average (non-empty) bin size for IMI is 18.3, meaning that approximately 54.64 memory accesses and look-up table constructions need to be carried out for each IMI query per 1K images. This compares to an average bin size of 244.14 for IVF, and accordingly 4.1 contiguous memory reads and look-up table constructions. Note, on the other hand, that SuBiC-I readily outperforms IVF-PQ in all Holidays experiments.

Concerning the poor performance of SuBiC-R on Holidays, we believe this is due to the poor generalization ability of the system caused by the three extra fully-connected layers.

IMI extension — Given the high performance of IMI for the Holidays experiments in Fig. 4, we further consider an IMI variant of our SuBiC-I architecture. To implement this approach, we learn a SuBiC-(2, 4096) encoder (with µ = 4 and γ = 5). Letting z′_m denote the m-th block of z′, the (k, l) ∈ (1, …, 4096)² bins of SuBiC-IMI are sorted based on the score z′_1[k] + z′_2[l]. For fairness of comparison, we use the same SuBiC-(8, 256) feature encoder for all methods including the baselines, which are IVF and IMI with unsupervised codebooks (all methods are non-residual). The results, plotted in Fig. 5, establish that, for the same number of bins, our method can readily outperform the baseline IMI (and IVF) methods. Furthermore, given that we use the best performing feature encoder (SuBiC) for all methods, this experiment also establishes that the SuBiC-based binning system that we propose outperforms the unsupervised IVF and IMI baselines.

Category retrieval — For completeness, we also carry out experiments in the category retrieval task, which has been the main focus of recent deep hashing methods [27, 28, 30, 17, 18, 19, 7]. In this task, a given query image is used to rank all database images, with a correct match occurring for database images of the same class as the query image. For category retrieval experiments, we use VGG-M-128 base features [24], which have established good performance for classification tasks, and a SuBiC-(1, 8192) for bin selection. We use the ImageNet training set (1M+ images) to train, and the test (training) subsets of Pascal VOC and Caltech-101 as a query (respectively, database) set.
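The pairwise bin ranking described in the IMI extension can be sketched as follows (illustrative NumPy; a production IMI would enumerate top pairs lazily with the multi-sequence algorithm of [2] rather than materializing all K² scores):

```python
import numpy as np

def rank_imi_bins(z1, z2, top):
    """Rank (k, l) bin pairs by the additive score z'_1[k] + z'_2[l]."""
    scores = z1[:, None] + z2[None, :]            # K x K table of pair scores
    best = np.argsort(-scores, axis=None)[:top]   # flat indices of top pairs
    return [tuple(map(int, np.unravel_index(i, scores.shape))) for i in best]
```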