Deep Linear Discriminant Analysis On Fisher Networks: A Hybrid Architecture For Person Re-Identification
arXiv:1606.01595v1 [cs.CV] 6 Jun 2016

Abstract—Person re-identification is to seek a correct match for a person of interest across views among a large number of imposters. It typically involves two procedures: non-linear feature extraction against dramatic appearance changes, and subsequent discriminative analysis in order to reduce intra-personal variations while enlarging inter-personal differences. In this paper, we introduce a hybrid architecture which combines Fisher vectors and deep neural networks to learn non-linear representations of person images in a space where the data can be linearly separable. We reinforce a Linear Discriminant Analysis (LDA) on top of the deep neural network such that linearly separable latent representations can be learnt in an end-to-end fashion. By optimizing an objective function modified from LDA, the network is enforced to produce feature distributions which have a low variance within the same class and a high variance between classes. The objective is essentially derived from the general LDA eigenvalue problem, and allows us to train the network with stochastic gradient descent and to back-propagate LDA gradients to compute the gradients involved in Fisher vector encoding. For evaluation we test our approach on four benchmark data sets in person re-identification (VIPeR [1], CUHK03 [2], CUHK01 [3], and Market1501 [4]). Extensive experiments on these benchmarks show that our model can achieve state-of-the-art results.

Index Terms—Linear discriminant analysis, deep Fisher networks, person re-identification.

I. INTRODUCTION

THE problem of person re-identification (re-id) refers to matching pedestrian images observed from disjoint camera views (some samples are shown in Fig. 1). Many approaches to this problem have been proposed, mainly to develop discriminative feature representations that are robust against pose/illumination/viewpoint/background changes [5], [6], [7], [8], [9], [10], or to learn distance metrics for matching people across views [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], or both jointly [2], [23], [24], [25].

To extract non-linear features against complex and large visual appearance variations in person re-id, some deep models based on Convolutional Neural Networks (CNNs) [2], [23], [24], [25], [26] have been developed to compute robust data-dependent feature representations. Despite the witnessed improvement achieved by deep models, the learning of parameters requires a large amount of training data in the form of matching pairs, whereas person re-id faces the small sample size problem [27], [28]. Specifically, only hundreds of training samples are available due to the difficulty of collecting matched training images. Moreover, the loss function used in these deep models is at the single-image level. A popular choice is the cross-entropy loss, which is used to maximize the probability of each pedestrian image. However, the gradients computed from the class-membership probabilities may help enlarge the inter-personal differences while being unable to reduce the intra-personal variations. To these ends, we are motivated to develop an architecture that can be comparable to deep convolution but is more suitable for person re-identification with moderate data samples. At the same time, the proposed network is expected to produce feature representations which could be easily separable by a linear model in its latent space, such that visual variation within the same identity is minimized while visual differences between identities are maximized.

Fisher kernels/vectors and CNNs can be related in a unified view in which there are many conceptual similarities between both architectures rather than structural differences [29]. Indeed, the encoding/aggregation step of Fisher vectors, which typically involves gradient filters, pooling, and normalization, can be interpreted as a series of layers that alternate linear and non-linear operations, and hence can be paralleled with the convolutional layers of CNNs. Some recent studies [29], [30], [31] have empirically demonstrated that Fisher vectors are competitive with representations learned by convolutional layers of CNNs, and the accuracy of the Fisher vector comes very close to that of AlexNet [32] in image classification. For instance, Sydorov et al. [29] transfer the idea of end-to-end training and discriminatively learned feature representations in neural networks to support vector machines (SVMs) with Fisher kernels. As a result, Fisher kernel SVMs can be trained in a deep way, by which classifier parameters and Gaussian mixture model (GMM) parameters are learned jointly from training data. The resulting Fisher kernels can provide state-of-the-art results in image classification [29]. While complex CNN models require a vast amount of training data to find good representations and task-dependent parameters, learning GMM parameters in Fisher kernels usually only demands a moderate amount of training data due to the nature of unsupervised learning. Thus, we are interested in developing a deep network based on Fisher vectors, which can learn effective features to match persons across views.
Authors are with The University of Adelaide, Australia; and Australian Research Council Centre of Excellence for Robotic Vision. Corresponding author: Chunhua Shen (e-mail: chunhua.shen@adelaide.edu.au).

Person re-identification can be reformulated as a process of discriminant analysis on a lower-dimensional space into which training data of matching image pairs across camera views are projected.
Fig. 2: Overview of our architecture. Patches from a person image are extracted and described by PCA-projected SIFT descriptors. These local descriptors are further embedded using Fisher vector encoding and aggregated to form the image-level representation, followed by square-rooting and $\ell_2$-normalization ($x_0$). Supervised layers $fc_1, \ldots, fc_L$, each involving a linear projection and a ReLU, transform the resulting Fisher vectors into a latent space on which an LDA objective function is directly imposed to preserve class separability.
Fig. 3: Sketch of a general DNN and our deep hybrid LDA network. In a general DNN, the outputs of the network are normalized by a softmax layer to obtain probabilities, and the objective with cross-entropy loss is defined on each training sample. In our network, an LDA objective is imposed on the hidden representations, and the optimization target is to directly maximize the separation between classes.
where $S_b$ is the between scatter matrix, defined as $S_b = S_t - S_w$. $S_t$ and $S_w$ denote the total scatter matrix and the within scatter matrix, which can be defined as

$$S_w = \frac{1}{C}\sum_c S_c, \quad S_c = \frac{1}{N_c - 1}\bar{X}_c^\top \bar{X}_c; \qquad S_t = \frac{1}{N-1}\bar{X}^\top \bar{X}; \tag{4}$$

where $\bar{X}_c = X_c - m_c$ are the mean-centered observations of class $\omega_c$ with per-class mean vector $m_c$. $\bar{X}$ is defined analogously for the entire population $X$.

In [40], the authors have shown that nonlinear transformations from deep neural networks can transform arbitrary data distributions into Gaussian distributions such that the optimal discriminant function is linear. Specifically, they designed a deep neural network constituted of unsupervised pre-optimization based on stacked Restricted Boltzmann Machines (RBMs) [41] and subsequent supervised fine-tuning. Similarly, the Fisher vector encoding part of our network is a generative Gaussian mixture model, which can not only handle real-valued inputs to the network but also model them continuously, making them amenable to linear classifiers. To some extent, one can view the first layers of our network as an unsupervised stage, but these layers need to be retrained to learn data-dependent features.

Assume discriminant features $z = Ax$ which are identically and independently drawn from a Gaussian class-conditional distribution; a maximum likelihood estimate of $A$ can be equivalently computed by a maximization of a discriminant criterion $Q_z : \mathcal{F} \to \mathbb{R}$, $\mathcal{F} = \{A : A \in \mathbb{R}^{l \times d}\}$, with $Q_z(A) = \max(\mathrm{trace}\{S_t^{-1} S_b\})$. We use the following theorem from the statistical properties of LDA features to obtain a guarantee on the maximization of the discriminant criterion $Q_z$ with the learnt hidden representation $x$.

a) Corollary: On top of the Fisher vector encoding, the supervised layers learn input-output associations leading to an indirectly maximized $Q_z$. In [42], it has been shown that multi-layer artificial neural networks with a linear output layer

$$z_n = W x_n + b \tag{5}$$

asymptotically maximize a specific discriminant criterion, evaluated in the $l$-dimensional space spanned by the last hidden outputs $z \in \mathbb{R}^l$, if the mean squared error (MSE) is minimized between $z_n$ and associated targets $t_n$. In particular, for a finite sample $\{x_n\}_{n=1}^N$, MSE minimization regarding the target coding

$$t_n^i = \begin{cases} \sqrt{N/N_i}, & \omega(x_n) = \omega_i \\ 0, & \text{otherwise} \end{cases}$$

where $N_i$ is the number of examples of class $\omega_i$, and $t_n^i$ is the $i$-th component of a target vector $t_n \in \mathbb{R}^C$, approximates the maximum of the discriminant criterion $Q_h$.

Maximizing the objective function in Eq. (3) essentially maximizes the ratio of the between- and within-class scatter, also known as separation. The linear combinations that maximize Eq. (3) lead to low variance in projected observations of the same class, and high variance in those of different classes, in the resulting low-dimensional space $L$. To find the optimum solution of Eq. (3), one has to solve the general eigenvalue problem $S_b e = v S_w e$, where $e$ and $v$ are the eigenvectors and corresponding eigenvalues. The projection matrix $A$ is a set of eigenvectors $e$. In the following section, we formulate LDA as an objective function for our hybrid architecture, and present back-propagation through the supervised layers and Fisher vectors.

D. Optimization

Existing deep neural networks in person re-identification [2], [23], [24] are optimized using sample-wise optimization with cross-entropy loss on the predicted class probabilities (see the next subsection). In contrast, we put an LDA layer on top of the neural network, by which we aim to produce features with low intra-class and high inter-class variability, rather than penalizing the misclassification of individual samples. In this way, optimization with the LDA objective operates on the properties of the distributions of the hidden representations generated by the hybrid net. In the following, we first briefly revisit deep neural networks with cross-entropy loss; then we present the optimization problem of the LDA objective on the deep hybrid architecture.

1) Deep neural networks with cross-entropy loss: It is very common to see that many deep models (with no exception in person re-identification) are built on top of a deep neural network (DNN) trained for a classification problem [29], [30], [32], [43]. Let $X$ denote a set of $N$ training samples $X_1, \ldots, X_N$ with corresponding class labels $y_1, \ldots, y_N \in \{1, \ldots, C\}$. A neural network with $P$ hidden layers can be represented as a non-linear function $f(\Theta)$ with model parameters $\Theta = \{\Theta_1, \ldots, \Theta_P\}$. Assume the network output $p_i = [p_{i,1}, \ldots, p_{i,C}] = f(X_i, \Theta)$ is normalized by the softmax function to obtain the class probabilities; then the network is optimized using Stochastic Gradient Descent (SGD) to seek the optimal parameters $\Theta$ with respect to some loss function $L_i(\Theta) = L(f(X_i, \Theta), y_i)$:

$$\Theta = \arg\min_\Theta \frac{1}{N}\sum_{i=1}^N L_i(\Theta). \tag{6}$$

For a multi-class classification setting, the cross-entropy loss is a commonly used optimization target, defined as

$$L_i = -\sum_{j=1}^C y_{i,j} \log(p_{i,j}), \tag{7}$$

where $y_{i,j}$ is 1 if sample $X_i$ belongs to class $y_i$ (i.e., $j = y_i$), and 0 otherwise. In essence, the cross-entropy loss tries to maximize the likelihood of the target class $y_i$ for each individual sample $X_i$ with the model parameters $\Theta$. Fig. 3 shows a snapshot of a general DNN with such a loss function. It has also been emphasized in [44] that objectives such as the cross-entropy loss do not impose direct constraints, such as linear separability, on the latent-space representation. This motivates the LDA-based loss function, under which hidden representations are optimized with the constraint of maximizing discriminative variance.
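For concreteness, the scatter matrices of Eq. (4) can be computed from a mini-batch of hidden features in a few lines of NumPy. This is an illustrative sketch rather than the authors' code; the function and argument names are our own:

```python
import numpy as np

def scatter_matrices(X, y, C):
    """Within, between, and total scatter of a mini-batch, following Eq. (4).

    X : (N, d) hidden representations; y : (N,) integer labels in [0, C).
    Returns (S_w, S_b, S_t) with the paper's convention S_b = S_t - S_w.
    """
    N, d = X.shape
    Xbar = X - X.mean(axis=0)                      # mean-centred population
    S_t = Xbar.T @ Xbar / (N - 1)                  # total scatter
    S_w = np.zeros((d, d))
    for c in range(C):
        Xc = X[y == c]
        Xc_bar = Xc - Xc.mean(axis=0)              # per-class centring
        S_w += Xc_bar.T @ Xc_bar / (len(Xc) - 1)   # class scatter S_c
    S_w /= C
    S_b = S_t - S_w                                # between scatter
    return S_w, S_b, S_t
```

Note that each class must contribute at least two samples to the mini-batch for the per-class divisor $N_c - 1$ to be valid.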
2) Optimization objective with linear discriminant analysis: A problem in solving the general eigenvalue problem $S_b e = v S_w e$ is the estimation of $S_w$, which overemphasizes high eigenvalues whereas small ones are underestimated too much. To alleviate this effect, a regularization term in the form of an identity matrix is added to the within scatter matrix: $S_w + \lambda I$ [45]. Adding this identity matrix can stabilize the small eigenvalues. Thus, the resulting eigenvalue problem can be rewritten as

$$S_b e_i = v_i (S_w + \lambda I) e_i, \tag{8}$$

where $e = [e_1, \ldots, e_{C-1}]$ are the resulting eigenvectors, and $v = [v_1, \ldots, v_{C-1}]$ the corresponding eigenvalues. In particular, each eigenvalue $v_i$ quantifies the degree of separation in the direction of eigenvector $e_i$. In [44], Dorfer et al. adapted LDA as an objective in a deep neural network by maximizing the individual eigenvalues. The rationale is that the maximization of the individual eigenvalues, which reflect the separation in the respective eigenvector directions, leads to the maximization of the discriminative capability of the neural net. Therefore, the objective to be optimized is formulated as:

$$\arg\max_{\Theta, \mathcal{G}} \frac{1}{C-1}\sum_{i=1}^{C-1} v_i. \tag{9}$$

However, directly solving the problem in Eq. (9) can yield trivial solutions, such as only maximizing the largest eigenvalues, since this produces the highest reward. From the perspective of classification, it is well known that the discriminant function overemphasizes the large distances between already separated classes, and thus causes a large overlap between neighboring classes [40]. To combat this matter, a simple but effective solution is to focus the optimization on the smallest (up to $m$) of the $C-1$ eigenvalues [44]. This can be implemented by concentrating only on the $m$ eigenvalues that do not exceed a predefined threshold for discriminative variance maximization:

$$L(\Theta, \mathcal{G}) = \arg\max_{\Theta, \mathcal{G}} \frac{1}{m}\sum_{i=1}^m v_i, \quad \text{with } \{v_1, \ldots, v_m\} = \{v_i \mid v_i < \min\{v_1, \ldots, v_{C-1}\} + \epsilon\}. \tag{10}$$

The intuition of formulation (10) is that the learnt parameters $\Theta$ and $\mathcal{G}$ are encouraged to embed the most discriminative capability into each of the $C-1$ feature dimensions. This objective function allows training the proposed hybrid architecture in an end-to-end fashion. In what follows, we present the derivatives of the optimization in (10) with respect to the hidden representation of the last layer, which enables back-propagation to update $\mathcal{G}$ via the chain rule.

a) Gradients of the LDA loss: In this section, we provide the partial derivatives of the optimization function (10) with respect to the output hidden representation $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of training samples in a mini-batch and $d$ is the feature dimension output by the last layer of the network. We start from the general LDA eigenvalue problem $S_b e_i = v_i S_w e_i$; the derivative of eigenvalue $v_i$ w.r.t. $X$ is defined as [46]:

$$\frac{\partial v_i}{\partial X} = e_i^\top \left( \frac{\partial S_b}{\partial X} - v_i \frac{\partial S_w}{\partial X} \right) e_i. \tag{11}$$

Recalling the definitions of $S_t$ and following [40], [44], [47], we can first obtain the partial derivative of the total scatter matrix $S_t$ w.r.t. $X$ as in Equation (12). Here $S_t[a,b]$ indicates the element in row $a$ and column $b$ of matrix $S_t$, and likewise for $X[i,j]$. Then, we can obtain the partial derivatives of $S_w$ and $S_b$ w.r.t. $X$:

$$\frac{\partial S_w[a,b]}{\partial X[i,j]} = \frac{1}{C}\sum_c \frac{\partial S_c[a,b]}{\partial X[i,j]}; \qquad \frac{\partial S_b[a,b]}{\partial X[i,j]} = \frac{\partial S_t[a,b]}{\partial X[i,j]} - \frac{\partial S_w[a,b]}{\partial X[i,j]}. \tag{13}$$

Finally, the partial derivative of the loss function formulated in (10) w.r.t. the hidden state $X$ is defined as:

$$\frac{\partial}{\partial X}\frac{1}{m}\sum_{i=1}^m v_i = \frac{1}{m}\sum_{i=1}^m \frac{\partial v_i}{\partial X} = \frac{1}{m}\sum_{i=1}^m e_i^\top \left( \frac{\partial S_b}{\partial X} - v_i \frac{\partial S_w}{\partial X} \right) e_i. \tag{14}$$

b) Backpropagation in Fisher vectors: In this section, we introduce the procedure for the deep learning of LDA with Fisher vectors. Algorithm 1 shows the steps of updating the parameters in the hybrid architecture. In fact, the hybrid network, consisting of Fisher vectors and supervised layers, updates its parameters by first initializing the GMM (line 5) and then computing $\Theta$ (lines 6-7). Thereafter, the GMM parameters $\mathcal{G}$ can be updated by gradient descent (lines 8-12). Specifically, line 8 computes the gradient of the loss function w.r.t. the GMM parameters $\pi$, $\mu$ and $\sigma$. Their influence on the loss is indirect through the computed Fisher vector, as in the case of deep networks, and one has to make use of the chain rule. Analytic expressions for the gradients are provided in the Appendix.

Evaluating the gradients numerically is computationally expensive due to the non-trivial couplings between parameters. A straightforward implementation gives rise to a complexity of $O(D^2 P)$, where $D$ is the dimension of the Fisher vectors and $P$ is the combined number of SIFT descriptors in all training images. Therefore, updating the GMM parameters could be orders of magnitude slower than computing the Fisher vectors, which has a computational cost of $O(DP)$. Fortunately, [29] provides some schemes to compute efficient approximate gradients as alternatives.

It is apparent that the optimization w.r.t. $\mathcal{G}$ is non-convex, and Algorithm 1 can only find a local optimum. This also means that the quality of the solution depends on the initialization. In terms of initialization, we can obtain a GMM by unsupervised Expectation Maximization, which is typically used for computing Fisher vectors. As a matter of fact, in deep networks layer-wise unsupervised pre-training is widely used for seeking a good initialization [48]. To update the gradients, we employ a batch setting with a line search (line 10) to find the most effective step size in each iteration. During optimization, the mixture weights $\pi$ and Gaussian variances $\Sigma$ should be ensured to remain positive. As suggested by [29], this can be achieved by internally reparameterizing and updating the logarithms of their values, from which the original parameters can be computed simply by exponentiation. A valid GMM parameterization also requires the mixture weights to sum up to 1. Instead of directly enforcing this constraint, which would require a projection step after each
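The regularized eigenvalue problem of Eq. (8) and the thresholded objective of Eq. (10) can be sketched as follows. This is an illustrative NumPy reimplementation, not the authors' code; the function name and the Cholesky-based reduction of the generalized problem to a standard symmetric eigenproblem are our own choices:

```python
import numpy as np

def lda_eigen_objective(S_b, S_w, C, lam=1e-3, eps=1.0):
    """Discriminative-variance objective of Eqs. (8)-(10), illustrative sketch.

    Solves S_b e = v (S_w + lam*I) e and averages the eigenvalues among the
    C-1 leading ones that fall below min(v) + eps.
    """
    d = S_b.shape[0]
    B = S_w + lam * np.eye(d)          # regularised within scatter, Eq. (8)
    # Reduce the generalised problem to a standard one: with B = L L^T,
    # S_b e = v B e  <=>  (L^-1 S_b L^-T) f = v f, where f = L^T e.
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    M = Linv @ S_b @ Linv.T
    v = np.linalg.eigvalsh(M)          # ascending eigenvalues
    v = v[::-1][:C - 1]                # keep the C-1 leading eigenvalues
    small = v[v < v.min() + eps]       # focus on the smallest ones, Eq. (10)
    return small.mean()
```

In training, this scalar would be maximized; automatic differentiation (or the analytic gradients of Eqs. (11)-(14)) supplies the derivative w.r.t. the batch features.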
$$\frac{\partial S_t[a,b]}{\partial X[i,j]} = \begin{cases} \dfrac{2}{N-1}\left(X[i,j] - \dfrac{1}{N}\sum_n X[n,j]\right), & \text{if } a = j,\ b = j \\[4pt] \dfrac{1}{N-1}\left(X[i,b] - \dfrac{1}{N}\sum_n X[n,b]\right), & \text{if } a = j,\ b \neq j \\[4pt] \dfrac{1}{N-1}\left(X[i,a] - \dfrac{1}{N}\sum_n X[n,a]\right), & \text{if } a \neq j,\ b = j \\[4pt] 0, & \text{if } a \neq j,\ b \neq j \end{cases} \tag{12}$$
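Since $S_t$ is quadratic in $X$, Eq. (12) can be checked exactly against a central finite difference. The sketch below is our own illustration, not the authors' code; it builds the closed-form derivative matrix for a fixed $(i, j)$ and compares it with the numerical one:

```python
import numpy as np

def St(X):
    """Total scatter matrix, Eq. (4)."""
    Xbar = X - X.mean(axis=0)
    return Xbar.T @ Xbar / (X.shape[0] - 1)

def dSt_dX(X, i, j):
    """Closed-form dS_t[a,b]/dX[i,j] from Eq. (12), assembled as a matrix."""
    N, d = X.shape
    Xbar = X - X.mean(axis=0)          # X[i,b] - (1/N) sum_n X[n,b]
    G = np.zeros((d, d))
    G[j, :] += Xbar[i, :] / (N - 1)    # rows with a = j
    G[:, j] += Xbar[i, :] / (N - 1)    # columns with b = j (a=b=j gets both)
    return G

# Central finite-difference check of Eq. (12).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
i, j, h = 2, 1, 1e-5
Xp, Xm = X.copy(), X.copy()
Xp[i, j] += h
Xm[i, j] -= h
numeric = (St(Xp) - St(Xm)) / (2 * h)
assert np.allclose(dSt_dX(X, i, j), numeric, atol=1e-6)
```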
update, we avoid this by deriving gradients for the unnormalized weights $\tilde{\pi}_k$, and then normalizing them as $\pi_k = \tilde{\pi}_k / \sum_j \tilde{\pi}_j$.

Algorithm 1: Deep Fisher learning with linear discriminant analysis.
Input: training samples $X_1, \ldots, X_n$, labels $y_1, \ldots, y_n$.
Output: GMM $\mathcal{G}$, parameters $\Theta$ of the supervised layers.
Initialize: initial GMM $\mathcal{G} = (\log \pi, \mu, \log \Sigma)$ by unsupervised expectation maximization [36], [38].
repeat
  compute Fisher vectors with respect to $\mathcal{G}$: $\Phi_i^{\mathcal{G}} = \Phi(x_i; \mathcal{G})$, $i = 1, \ldots, n$.
  solve LDA for the training set $\{(\Phi_i^{\mathcal{G}}, y_i), i = 1, \ldots, n\}$: $\Theta \leftarrow L(\Theta, \mathcal{G}) = \arg\max_\Theta \frac{1}{m}\sum_{i=1}^m v_i$ with $\{v_1, \ldots, v_m\} = \{v_i \mid v_i < \min(v_1, \ldots, v_{C-1}) + \epsilon\}$.
  compute gradients with respect to the GMM parameters: $\delta_{\log \pi} = \nabla_{\log \pi} L(\Theta, \mathcal{G})$, $\delta_\mu = \nabla_\mu L(\Theta, \mathcal{G})$, $\delta_{\log \Sigma} = \nabla_{\log \Sigma} L(\Theta, \mathcal{G})$.
  find the best step size $\eta^*$ by line search: $\eta^* = \arg\min_\eta L(\Theta, \mathcal{G}_\eta)$ with $\mathcal{G}_\eta = (\log \pi - \eta \delta_{\log \pi},\ \mu - \eta \delta_\mu,\ \log \Sigma - \eta \delta_{\log \Sigma})$.
  update GMM parameters: $\mathcal{G} \leftarrow \mathcal{G}_{\eta^*}$.
until stopping criterion fulfilled
Return: GMM parameters $\mathcal{G}$ and $\Theta$.

E. Implementation details

c) Local feature extraction: Our architecture is independent of a particular choice of local descriptors. Following [30], [36], we combine SIFT with color histograms extracted from the LAB colorspace. Person images are first rescaled to a resolution of 48 × 128 pixels. SIFT and color features are extracted over a set of 14 dense overlapping 32 × 32-pixel regions with a step stride of 16 pixels in both directions. For each patch we extract two types of local descriptors: SIFT features and color histograms. For SIFT patterns, we divide each patch into 4 × 4 cells and set the number of orientation bins to 8, obtaining 4 × 4 × 8 = 128-dimensional SIFT features. SIFT features are extracted in the L, A, B color channels and thus produce a 128 × 3 feature vector for each patch. For color patterns, we extract color features using 32-bin histograms at 3 different scales (0.5, 0.75 and 1) over the L, A, and B channels. Finally, the dimensionality of the SIFT features and color histograms is reduced with PCA to 77-dim and 45-dim, respectively. Then, we concatenate these descriptors with the (x, y) coordinates as well as the scale of the patches, thus yielding 80-dim and 48-dim descriptors. In our experiments, we use the publicly available SIFT/LAB implementation from [10]. In Fisher vector encoding, we use a GMM with K = 256 Gaussians. The per-patch Fisher vectors are aggregated with sum-pooling, square-rooted and $\ell_2$-normalized. One Fisher vector is computed on the SIFT and another one on the color descriptors. The two Fisher vectors are concatenated into a 256K-dim representation.

IV. COMPARISON WITH CNNS [32] AND DEEP FISHER NETWORKS [29]

d) Comparison with CNNs [32]: The architecture of deep Fisher kernel learning has a substantial number of conceptual similarities with that of CNNs. Like in a CNN, the Fisher vector extraction can also be interpreted as a series of layers that alternate linear and non-linear operations, and can be paralleled with the convolutional layers of a CNN. For example, when training a convolutional network on natural images, the first stages always learn essentially the same functionality, that is, computing local image gradient orientations and pooling them over small spatial regions. This is, however, the same procedure by which a SIFT descriptor is computed. It has been shown in [30] that Fisher vectors are competitive with representations learned by the convolutional layers of a CNN, and the accuracy comes close to that of the standard CNN model, i.e., AlexNet [32]. The main difference between our architecture and CNNs lies in the first layers of feature extraction. In CNNs, the basic components (convolution, pooling, and activation functions) operate on local input regions, and locations in higher layers correspond to the locations in the image they are path-connected to (a.k.a. the receptive field); however, CNNs are very demanding in computational resources and in the amount of training data necessary to achieve good generalization. In our architecture, as an alternative, the Fisher vector representation of an image effectively encodes each local feature (e.g., SIFT) into a high-dimensional representation, and then aggregates these encodings into a single vector by global sum-pooling over the whole image, followed by normalization. This serves a very similar purpose to CNNs, but trains the network at a smaller computational cost.

e) Comparison with deep Fisher kernel learning [29]: With respect to deep Fisher kernel end-to-end learning, our architecture exhibits two major differences. The first difference can be seen in the incorporation of stacked supervised fully connected layers in our model, which can capture long-range and complex structure on a person image. The second difference is the training objective function. In [29], an SVM classifier with a Fisher kernel is introduced that jointly learns the classifier weights and the image representation. However, this classification training can only classify an input image into a number of identities. It cannot be applied to an out-of-domain case where new testing identities are unseen in the training data. Specifically, these classification supervisory signals pull apart the features of different identities, since they have to be classified into different classes, whereas the classification signal has a relatively weak constraint on features extracted from
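The encoding step described above can be sketched as follows. This is a simplified illustration of an improved Fisher vector (gradients w.r.t. GMM means and diagonal variances only, with the paper's square-rooting and $\ell_2$ normalization; dimensions are generic rather than the paper's exact settings), not the authors' implementation:

```python
import numpy as np

def fisher_vector(patches, pi, mu, var):
    """Improved Fisher vector of a set of local descriptors (sketch).

    patches : (P, d) local descriptors; pi (K,), mu (K, d), var (K, d) are
    GMM mixture weights, means and diagonal variances. Returns a 2*K*d vector.
    """
    P, d = patches.shape
    K = pi.shape[0]
    # Posterior responsibilities gamma[p, k] under the diagonal-covariance GMM.
    log_p = -0.5 * (((patches[:, None, :] - mu[None]) ** 2) / var[None]
                    + np.log(2 * np.pi * var[None])).sum(-1)
    log_p += np.log(pi)[None]
    gamma = np.exp(log_p - log_p.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)
    # Aggregated gradients w.r.t. means and variances (sum-pooling over patches).
    diff = (patches[:, None, :] - mu[None]) / np.sqrt(var)[None]
    g_mu = (gamma[:, :, None] * diff).sum(0) / (P * np.sqrt(pi)[:, None])
    g_var = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (P * np.sqrt(2 * pi)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # square-rooting
    return fv / (np.linalg.norm(fv) + 1e-12)      # l2-normalization
```

With the paper's settings (K = 256, one such vector on the SIFT descriptors and one on the color descriptors), the two encodings are concatenated into the final image-level representation.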
TABLE II: Rank-1, -5, -10, -20 recognition rates of various methods on the VIPeR data set. The method Ensembles* takes two types of features as input: SIFT and LAB patterns.

Method | r = 1 | r = 5 | r = 10 | r = 20
Ensembles* [22] | 35.80 | 67.82 | 82.67 | 88.74
JointRe-id [24] | 34.80 | 63.32 | 74.79 | 82.45
LADF [12] | 29.34 | 61.04 | 75.98 | 88.10
SDALF [6] | 19.87 | 38.89 | 49.37 | 65.73
eSDC [10] | 26.31 | 46.61 | 58.86 | 72.77
KISSME [14] | 19.60 | 48.00 | 62.20 | 77.00
kLFDA [19] | 32.33 | 65.78 | 79.72 | 90.95
eBiCov [54] | 20.66 | 42.00 | 56.18 | 68.00
ELF [9] | 12.00 | 41.50 | 59.50 | 74.50
PCCA [17] | 19.27 | 48.89 | 64.91 | 80.28
RDC [18] | 15.66 | 38.42 | 53.86 | 70.09
RankSVM [21] | 14.00 | 37.00 | 51.00 | 67.00
DeepRanking [26] | 38.37 | 69.22 | 81.33 | 90.43
NullReid [28] | 42.28 | 71.46 | 82.94 | 92.06
Hybrid | 44.11 | 72.59 | 81.66 | 91.47

TABLE III: Rank-1, -5, -10, -20 recognition rates of various methods on the CUHK03 data set.

Method | r = 1 | r = 5 | r = 10 | r = 20
Ensembles* [22] | 52.10 | 59.27 | 68.32 | 76.30
JointRe-id [24] | 54.74 | 86.42 | 91.50 | 97.31
FPNN [2] | 20.65 | 51.32 | 68.74 | 83.06
NullReid [28] | 58.90 | 85.60 | 92.45 | 96.30
ITML [11] | 5.53 | 18.89 | 29.96 | 44.20
LMNN [52] | 7.29 | 21.00 | 32.06 | 48.94
LDM [53] | 13.51 | 40.73 | 52.13 | 70.81
SDALF [6] | 5.60 | 23.45 | 36.09 | 51.96
eSDC [10] | 8.76 | 24.07 | 38.28 | 53.44
KISSME [14] | 14.17 | 48.54 | 52.57 | 70.03
kLFDA [19] | 48.20 | 59.34 | 66.38 | 76.59
XQDA [20] | 52.20 | 82.23 | 92.14 | 96.25
Hybrid | 63.23 | 89.95 | 92.73 | 97.55

TABLE IV: Rank-1, -5, -10, -20 recognition rates of various methods on the CUHK01 data set.

Method | r = 1 | r = 5 | r = 10 | r = 20
Ensembles* [22] | 46.94 | 71.22 | 75.15 | 88.52
JointRe-id [24] | 65.00 | 88.70 | 93.12 | 97.20
SDALF [6] | 9.90 | 41.21 | 56.00 | 66.37
MLF [8] | 34.30 | 55.06 | 64.96 | 74.94
FPNN [2] | 27.87 | 58.20 | 73.46 | 86.31
LMNN [52] | 21.17 | 49.67 | 62.47 | 78.62
ITML [11] | 17.10 | 42.31 | 55.07 | 71.65
eSDC [10] | 22.84 | 43.89 | 57.67 | 69.84
KISSME [14] | 29.40 | 57.67 | 62.43 | 76.07
kLFDA [19] | 42.76 | 69.01 | 79.63 | 89.18
Generic Metric [49] | 20.00 | 43.58 | 56.04 | 69.27
SalMatch [34] | 28.45 | 45.85 | 55.67 | 67.95
Hybrid | 67.12 | 89.45 | 91.68 | 96.54

TABLE V: Rank-1 and mAP of various methods on the Market-1501 data set.

Method | r = 1 | mAP
SDALF [6] | 20.53 | 8.20
eSDC [10] | 33.54 | 13.54
KISSME [14] | 39.35 | 19.12
kLFDA [19] | 44.37 | 23.14
XQDA [20] | 43.79 | 22.22
Zheng et al. [4] | 34.40 | 14.09
Hybrid | 48.15 | 29.94

C. Comparison with state-of-the-art approaches

1) Evaluation on the VIPeR data set: Fig. 4 (a) shows the CMC curves up to rank-20 recognition rate, comparing our method with state-of-the-art approaches. It is obvious that our method delivers the best result in rank-1 recognition. To clearly present the quantitative comparison, we summarize the results at several top ranks in Table II. Specifically, our method achieves a 44.11% rank-1 matching rate, outperforming the previous best result of 42.28% by NullReid [28]. Overall, our method performs best at ranks 1 and 5, whereas NullReid is the best at ranks 10 and 20. We suspect that the training samples in VIPeR are too few to learn the parameters of moderate size in our deep hybrid model.

2) Evaluation on the CUHK03 data set: The CUHK03 data set is larger in scale than VIPeR. We compare our model with state-of-the-art approaches including deep learning methods (FPNN [2], JointRe-id [24]) and metric learning algorithms (Ensembles [22], ITML [11], LMNN [52], LDM [53], KISSME [14], kLFDA [19], XQDA [20]). As shown in Fig. 4 (b) and Table III, our method outperforms all competitors at all ranks. In particular, the hybrid architecture achieves a rank-1 rate of 63.23%, outperforming the previous best result reported by JointRe-id [24] by a noticeable margin. This improvement can be attributed to the availability of more training data, which are fed to the deep network to learn a data-dependent solution for this specific scenario. Also, CUHK03 provides more shots for each person, which is beneficial for learning feature representations with discriminative distribution parameters (within and between class scatter).

3) Evaluation on the CUHK01 data set: In this experiment, 100 identities are used for testing, with the remaining 871 identities used for training. As suggested by FPNN [2] and JointRe-id [24], this setting is suited for deep learning because it uses 90% of the data for training. Fig. 4 (c) compares the performance of our network with previous methods. Our method outperforms the state-of-the-art in this setting with a 67.12% rank-1 matching rate (vs. 65% by the next best method).

4) Evaluation on the Market-1501 data set: It is noteworthy that the VIPeR, CUHK03, and CUHK01 data sets use regular hand-cropped boxes, in the sense that pedestrians are rigidly enclosed in fixed-size bounding boxes, while Market-1501 employs the Deformable Part Model (DPM) detector [50], so that images undergo extensive pose variation and misalignment. Thus, we conduct experiments on this data set to verify the effectiveness of the proposed method in a more realistic situation. Table V presents the results, showing that our method outperforms all competitors in terms of rank-1 matching rate and mean average precision (mAP).

D. Discussions

1) Comparison with kernel methods: In the standard Fisher vector classification pipeline, supervised learning relies on kernel classifiers [36], where a kernel/linear classifier can be learned in the non-linear embedding space. We remark that our architecture with a single hidden layer has a close link to kernel classifiers, in which the first layer can be interpreted as a non-linear mapping Ψ followed by a linear classifier learned in the embedding space. For instance, Ψ is the feature map of the arc-cosine kernel of order one if Ψ instantiates random Gaussian projections followed by a ReLU non-linearity [56]. This shows a close connection between our model (with a single hidden layer and a ReLU non-linearity) and arc-cosine kernel classifiers. Therefore, we compare two alternatives with the arc-cosine kernel and the RBF kernel. Following [30], we train
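The link to the arc-cosine kernel noted above can be checked numerically: with random Gaussian projections followed by a ReLU, the inner product of the resulting features converges to the order-one arc-cosine kernel [56]. The sketch below is our own illustration, not taken from the paper; the sqrt(2/m) scaling is the standard random-features convention:

```python
import numpy as np

def arccos_kernel1(x, y):
    """Arc-cosine kernel of order one between two vectors."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
# One random layer: Gaussian projections + ReLU, as in the discussion above.
W = rng.standard_normal((200_000, 5))
phi = lambda v: np.sqrt(2.0 / W.shape[0]) * np.maximum(W @ v, 0.0)
mc = phi(x) @ phi(y)                  # Monte-Carlo estimate of the kernel
assert abs(mc - arccos_kernel1(x, y)) < 0.1 * arccos_kernel1(x, y)
```

With enough random projections the feature inner product matches the closed-form kernel to within a few percent, which is the sense in which a single ReLU hidden layer behaves like an arc-cosine kernel machine.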
VI. CONCLUSION

In this paper, we presented a novel deep hybrid architecture for person re-identification. The proposed network, composed of Fisher vectors and deep neural networks, is designed to produce linearly separable hidden representations for pedestrian samples with extensive appearance variations. The network is trained in an end-to-end fashion by optimizing a Linear Discriminant Analysis (LDA) eigenvalue-based objective function, where LDA-based gradients can be back-propagated to update the parameters in the Fisher vector encoding. We demonstrated the superiority of our method by comparing with the state of the art on four benchmarks in person re-identification.
VII. APPENDIX

[Fig. 5 panels: CMC recognition rate (%) vs. rank on VIPeR, CUHK03 and CUHK01; legend: 3 hidden layers (SVM), 1 hidden layer (SVM), arc-cosine kernel classifier, RBF kernel classifier.]
Fig. 5: Comparison with kernel methods using CMC curves on the VIPeR, CUHK03 and CUHK01 data sets.
[14] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, "Large scale metric learning from equivalence constraints," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012.
[15] Y. Wu, M. Mukunoki, T. Funatomi, M. Minoh, and S. Lao, "Optimizing mean reciprocal rank for person re-identification," in Proc. Advanced Video and Signal-Based Surveillance, 2011.
[16] K. Weinberger, J. Blitzer, and L. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. Advances in Neural Inf. Process. Syst., 2006.
[17] A. Mignon and F. Jurie, "PCCA: a new approach for distance learning from sparse pairwise constraints," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012, pp. 2666–2672.
[18] W. Zheng, S. Gong, and T. Xiang, "Re-identification by relative distance comparison," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 3, pp. 653–668, 2013.
[19] F. Xiong, M. Gou, O. Camps, and M. Sznaier, "Person re-identification using kernel-based metric learning methods," in Proc. Eur. Conf. Comp. Vis., 2014.
[20] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local
[41] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, vol. 9, no. 1, pp. 147–169, 1985.
[42] H. Osman and M. M. Fahmy, "On the discriminatory power of adaptive feed-forward layered networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 8, pp. 837–842, 1994.
[43] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015.
[44] M. Dorfer, R. Kelz, and G. Widmer, "Deep linear discriminant analysis," in Proc. Int. Conf. Learn. Representations, 2016.
[45] J. H. Friedman, "Regularized discriminant analysis," Journal of the American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.
[46] J. de Leeuw, "Derivatives of generalized eigen systems with applications," 2007.
[47] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, "Deep canonical correlation analysis," in Proc. Int. Conf. Mach. Learn., 2013.
[48] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554,
maximal occurrence representation and metric learning,” in Proc. IEEE 2006.
Conf. Comp. Vis. Patt. Recogn., 2015. [49] W. Li, R. Zhao, and X. Wang, “Human reidentification with transferred
[21] B. Prosser, W. Zheng, S. Gong, and T. Xiang, “Person re-identification metric learning,” in Proc. Asian Conf. Comp. Vis., 2012.
by support vector ranking,” in Proc. British Machine Vis. Conf., 2010. [50] P. F. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Ob-
[22] S. Paisitkriangkrai, C. Shen, and A. van den Hengel, “Learning to rank ject detection with discriminatively trained part-based models,” IEEE
in person re-identification with metric ensembles,” in Proc. IEEE Conf. Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, 2010.
Comp. Vis. Patt. Recogn., 2015. [51] B. Huang, J. Chen, Y. Wang, C. Liang, Z. Wang, and K. Sun, “Sparsity-
[23] L. Wu, C. Shen, and A. van den Hengel, “Personnet: Person re- based occlusion handling method for person re-identification,” in Mul-
identification with deep convolutional neural networks,” in CoRR timedia Modeling, 2015.
abs/1601.07255, 2016. [52] M. Hirzer, P. Roth, and H. Bischof, “Person re-identification by efficient
[24] E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning imposter-based metric learning,” in Proc. Int’l. Conf. on Adv. Vid. and
architecture for person re-identification,” in Proc. IEEE Conf. Comp. Sig. Surveillance, 2012.
Vis. Patt. Recogn., 2015. [53] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? metric learning
[25] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person approaches for face identification,” in Proc. IEEE Int. Conf. Comp. Vis.,
re-identification,” in Proc. Int. Conf. Patt. Recogn., 2014. 2009.
[26] S.-Z. Chen, C.-C. Guo, and J.-H. Lai, “Deep ranking for re-identification [54] B. Ma, Y. Su, and F. Jurie, “Bicov: A novel image representation for
via joint representation learning,” IEEE Transactions on Image Process- person re-identification and face verification,” in Proc. British Machine
ing, vol. 25, no. 5, pp. 2353–2367, 2016. Vis. Conf., 2012.
[27] L.-F. Chen, H.-Y. M. Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, “A new [55] S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep
lda-based face recognition system which can solve the small sample size network training by reducing internal covariate shift,” in CoRR,
problem,” Pattern Recognition, vol. 33, no. 10, pp. 1713–1726, 2000. abs/1502.03167, 2015.
[28] L. Zhang, T. Xiang, and S. Gong, “Learning a discriminative null [56] Y. Cho and L. Saul, “Kernel methods for deep learning,” in Proc.
space for person re-identification,” in Proc. IEEE Conf. Comp. Vis. Patt. Advances in Neural Inf. Process. Syst., 2009.
Recogn., 2016.
[29] V. Sydorov, M. Sakurada, and C. H. Lampert, “Deep fisher kernels - end
to end learning of the Fisher kernel GMM parameters,” in Proc. IEEE
Conf. Comp. Vis. Patt. Recogn., 2014.
[30] F. Perronnin and D. Larlus, “Fisher vectors meet neural networks: A
hybrid classification architecture,” in Proc. IEEE Conf. Comp. Vis. Patt.
Recogn., 2015.
[31] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep fisher networks
for large-scale image classification,” in Proc. Advances in Neural Inf.
Process. Syst., 2013.
[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton., “Imagenet classification
with deep convolutional neural networks,” in Proc. Advances in Neural
Inf. Process. Syst., 2012.
[33] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, “Local fisher
discriminant analysis for pedestrian re-identification,” in Proc. IEEE
Conf. Comp. Vis. Patt. Recogn., 2013.
[34] R. Zhao, W. Ouyang, and X. Wang, “Person re-identification by salience
matching,” in Proc. IEEE Int. Conf. Comp. Vis., 2013.
[35] T. S. Jaakkola and D. Haussler, “Exploiting generative models in
discriminative classifiers,” in Proc. Advances in Neural Inf. Process.
Syst., 1998.
[36] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classifica-
tion with the fisher vector: theory and practice,” Int. J. Comput. Vision,
vol. 105, no. 3, pp. 222–245, 2013.
[37] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil
is in the details: an evaluation of recent feature encoding methods,” in
Proc. British Machine Vis. Conf., 2011.
[38] F. Perronnin, J. Snchez, and T. Mensink, “Improving the fisher kernel for
large-scale image classification,” in Proc. Eur. Conf. Comp. Vis., 2010.
[39] R. A. Fisher, “The use of multiple measurements in taxonomic prob-
lems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[40] A. Stuhlsatz, J. Lippel, and T. Zielke, “Feature extraction with deep neu-
ral networks by a generalized discriminant analysis,” IEEE Transactions
on Neural Networks and Learning Systems, vol. 23, no. 4, pp. 596–608,
2012.