Information Sciences: Wenzhu Yan, Quansen Sun, Huaijiang Sun, Yanmeng Li

Information Sciences 516 (2020) 109–124
Joint dimensionality reduction and metric learning for image set classification
Wenzhu Yan∗, Quansen Sun, Huaijiang Sun, Yanmeng Li
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China
Article history: Received 14 April 2019; Revised 19 December 2019; Accepted 21 December 2019; Available online 24 December 2019
Keywords: Image set classification; Feature learning; Kernel; Dimensionality reduction; Metric learning; Heterogeneous space fusion
Abstract
Compared with the traditional classification task based on a single image, an image set contains more complementary
information, which is of great benefit to correctly classify a query subject. Thus, image set classification has
attracted much attention from researchers. However, the main challenge is how to effectively represent an image set
to fully exploit the latent discriminative feature. Unlike in previous works where an image set was represented by a
single or a hybrid mode, in this paper, we propose a novel multi-model fusion method across the Euclidean space to
the Riemannian manifold to jointly accomplish dimensionality reduction and metric learning. To achieve the goal of
our framework, we first introduce three distance metric learning models, namely, Euclidean-Euclidean,
Riemannian-Riemannian and Euclidean-Riemannian to better exploit the complementary information of an image set.
Then, we aim to simultaneously learn two mappings performing dimensionality reduction and a metric matrix by
integrating the two heterogeneous spaces (i.e., the Euclidean space and the Riemannian manifold space) into the
common induced Mahalanobis space in which the within-class data sets are close and the between-class data sets are
separated. This strategy can effectively handle the severe drawback of not considering the distance metric learning
when performing dimensionality reduction in the existing set based methods. Furthermore, to learn a complete
Mahalanobis metric, we adopt the L2,1 regularized metric matrix for optimal feature selection and classification.
The results of extensive experiments on face recognition, object classification, gesture recognition and handwritten
classification demonstrated well the effectiveness of the proposed method compared with other image set based
algorithms.
© 2019 Elsevier Inc. All rights reserved.
1. Introduction
Classification is one of the most important research topics in the fields of computer vision and machine learning.
Notably, numerous state-of-the-art methods for single-image classification have been proposed [1–3]. However, with
the rapid development of computer and image processing technologies, it has become convenient to acquire multiple
images of a subject in many real-world applications, including video surveillance, personal photo albums and camera
networks. Thus, learning from image sets has emerged as an important research topic [4–9]. Image set classification
provides more information to deal effectively with the typical appearance variations within images, including
illumination variations, viewpoint changes, and occlusions. Generally, a set based classification task consists of two key parts:
∗ Corresponding author.
E-mail addresses: ywznanj@163.com (W. Yan), sunquansen@njust.edu.cn (Q. Sun), sunhuaijiang@njust.edu.cn (H. Sun), yanmengli6@126.com (Y. Li).
https://doi.org/10.1016/j.ins.2019.12.041
0020-0255/© 2019 Elsevier Inc. All rights reserved.
Fig. 1. The input/output of image set classification case and the key problem.
Fig. 2. The widely used distance metrics. For image set classification (a), we can adopt distance metrics based on Euclidean point to Euclidean point, affine
hull to affine hull, manifold to manifold, and their hybrid model to compute the dissimilarity between two sets. For single image to set based classification
(b), we can obtain the dissimilarity by using the point to set distance metric. The part (c) is our fusing Point(Set) to Point(Set) distance metric used to
handle image set classification problem.
feature extraction and image classification, which are illustrated in Fig. 1. Feature extraction is the key problem;
its purpose is to extract discriminative features that exploit the rich information within the image sets. Image
classification aims to design effective classifiers/models to classify different query subjects. Based on these two
phases, a series of nonparametric set modeling methods based on subspace [10–12], manifold [5,13], affine/convex hull
[4,14,15], etc., have been
proposed to effectively deal with different visual classification tasks. Another line of research aims to use parametric statis-
tical models [7,9,16,17] to represent an image set. Specifically, some previous works adopt a single Gaussian model [16] or
Gaussian Mixture Model (GMM) [17] to precisely characterize the variations within images in a set. Then, the dissimilarity
between two sets can be measured by adopting the Kullback-Leibler (K-L) divergence. Furthermore, some single image based
algorithms [18–20] have been extended to handle the image set classification problem, and they eventually adopt a majority
voting strategy to classify a query set. To our knowledge, as the performance of a classifier largely relies on the quality
of features, how to construct effective model representations that are invariant and robust to many real-world variations
is the main challenge in image set classification. As shown in Fig. 1, our fusion model can exploit the latent
information across different spaces. In contrast, traditional methods generally represent an image set by a single
mode or a hybrid mode and thus do not consider the complementary correlation across different spaces. To be
specific, Fig. 2 illustrates the distance metrics based on different model representations. Other studies [7,21–24]
have shown that different modes represent the image set from different perspectives. Specifically, the mean vector
and the covariance matrix reflect two different statistical features of an image set, and these features provide
complementary information to fully represent the target set. As the mean vector is a representation point in the
Euclidean space and the covariance matrix lies on a specific Riemannian manifold, we can naturally define the
Euclidean to Euclidean and Riemannian to Riemannian distance metrics in their corresponding spaces. Furthermore,
to better exploit the potential correlation between the different representations of an image set, we incorporate
the point to set distance metric into the image set classification problem. This can be formulated as matching
Euclidean points with Riemannian points and can eventually enhance the classification performance. To
specifically describe the framework of fusing these two heterogeneous spaces in a unified
formulation, we employ the kernel trick [23,25,26] to map the multi-models into a common induced space. Moreover, we
adopt a data-dependent manifold kernel to fully exploit the geometry structure of manifold space by using the unlabeled
data, which are easily available nowadays. As described in [12,14,15,27,28], no matter how the set is modeled, it
usually requires principal component analysis (PCA) as a beneficial pre-processing tool, as it enables us to reduce
the computational burden and extract effective features that are robust to data noise. Some works [29,30] attempt to
consider the distance metric jointly with the dimensionality reduction strategy. For the application of single image based person re-identification in
[29], the intrapersonal and extrapersonal variations of subjects are described by multivariate Gaussian distributions. Then, a
joint dimension reduction and metric learning method is proposed by simultaneously learning a subspace and a restricted
quadratic discriminant analysis (RQDA) distance function. However, this method does not work well with the image set
classification problem when the image sets hold weak underlying distributional assumptions. Harandi et al. [30] aimed to
address the feature extraction problem in a low dimensional representation from a geometric perspective, which
incurs a heavy storage burden and high computational complexity. In this work, our new multi-model fusion method
jointly accomplishes dimensionality reduction and metric learning to exploit the different latent discriminative
features for the image set classification problem. To sum up, our work provides the following contributions: 1) Three
low dimensional distance representations (point to point, point to set and set to set) are described to model image
sets. 2) We design an efficient joint learning framework by simultaneously learning two mappings performing
dimensionality reduction and a metric matrix by integrating the two heterogeneous spaces (i.e., the Euclidean space
and the Riemannian manifold) into the common induced Mahalanobis space in which the within-class data sets are close
and the between-class data sets are separated. To make our model optimal for feature selection and classification,
we regularize the metric matrix by using the L2,1 norm to create a penalty of sparsity. 3) A new fast optimization
algorithm is also developed to solve the
resulting nonconvex problem. Extensive experiments on four visual classification tasks are conducted to demonstrate the
effectiveness of the proposed method.
The rest of this paper is structured as follows. In the next section, we give an overview of the set based classification
methods. Then, we illustrate the proposed method in detail and introduce the global objective function in Section 3. In
Section 4, we justify the effectiveness of our proposed methods via extensive experiments. Finally, the conclusion is pre-
sented in Section 5.
2. Related works
Numerous works have been proposed to deal with the image set based classification task. In this section, we provide an
overall review of the related works, which are discussed as follows.
Some works model an image set as a linear subspace, and then adopt Canonical Correlation Analysis (CCA) [31] to find
principal angles which can be used to calculate the subspace similarity [10,11]. Methods such as the Mutual Subspace
Method (MSM) [10] and Orthogonal Subspace Method (OSM) [11], have shown promising results. However, for sets with
large images and an extensive range of variations, these methods cannot effectively exploit all the information comprised
in the images. Based on CCA, Discriminative Canonical Correlations (DCC) [12] calculates the subspace similarity by in-
corporating the discriminative information. Thus, it can obtain better results. Some methods aim to use the affine hull or
convex hull to represent an image set [4]. For example, Sparse Approximated Nearest Points (SANP) adopts the affine hull
model to interpret invisible appearance variations and forces the nearest data points to be close by adding sparsity con-
straints [14]. Regularized Nearest Point (RNP) [15] represents an image set by using the regularized affine hull, which leads
to less model complexity than that of SANP. To find some represented prototypes to better measure the set to set distance,
Wang et al. [32] jointly learned the prototypes from the corresponding affine hull and a linear discriminative
projection to handle the image set classification problem. Moreover, based on the concept of Linear Regression
Classification (LRC) [33] for image reconstruction, a series of methods have been proposed to extend LRC to deal with
the set based classification task [6,34], including Dual Linear Regression Classification (DLRC) [6] and Pairwise
Linear Regression Classification (PLRC) [34]. However, for these methods based on the linear regression mechanism,
the dimension of the feature vectors should be much larger than the number of images in the combined new sets when
calculating the between-set dissimilarity by the distance between the virtual images reconstructed from the
original data sets. To improve the classification performance, SJSRC
[8] adopts a set-level joint sparse representation model to classify a query subject by using the minimal reconstruction
residual.
Methods modeling image sets as local linear manifold components can effectively capture the variation information
[5,13,35,36]. Manifold-Manifold Distance (MMD) [5] adopts the manifold to manifold distance to measure the set
dissimilarity. Manifold Discriminant Analysis (MDA) [13] extends MMD to further exploit the latent discriminative
information in a projected low dimensional space. Moreover, from a geometric perspective, Huang et al. [28]
represented an image set as a point on a Grassmann manifold and then performed a dimensionality reduction method
which aims to embed the original Grassmann manifold into a lower-dimensional manifold space where discriminative
features can be naturally exploited.
For metric learning, Log-Euclidean Metric Learning (LEML) [37] adopts Symmetric Positive Definite (SPD) matrices
to represent the image sets and operates directly on the logarithms of these SPD matrices to measure the distance
between sets. Localized Multi-Kernel Metric Learning (LMKML) [22] adopts the different order statistics of an image
set to learn a distance metric. However, as these statistics are combined in series directly, the redundancies in
LMKML may result in great computational complexity. Huang et al. [23] proposed a metric learning framework to handle
the Video-to-Still (Still-to-Video) problem by exploiting the mutual information across the Euclidean and Riemannian
manifold spaces, while for the Video to Video (set-set) problem, their framework only adopts a hybrid metric learning
method which lacks the ability to fuse the latent complementary information across the two heterogeneous spaces.
Additionally, some statistical models [7,9,16,17] have been proposed to effectively model image sets, including the single
Gaussian model [16] and the GMM [9,17,31]. The between set dissimilarity is eventually measured by adopting the KL-
divergence. However, these methods typically suffer from the problem that the query image set has weak statistical correla-
tions with the training sets, which leads to larger fluctuations in performance. By modeling an image set with its covariance
matrix, Covariance Discriminative Learning (CDL) [38] conducts the kernel discriminative analysis to address complex data
distribution.
Recently, deep learning has shown its potential capability to tackle the image set based classification task [39,40].
Hayat et al. [39] adopted a multilayer neural network to obtain the feature representation for an image set and used
the minimal representation residual to classify a query set. To explore the discriminative ability of deep networks,
Lu et al. [40] mapped the original data manifold into a feature space to enhance the classification performance.
Although deep learning based methods have achieved relatively good performance, they require a great number of
training image sets and powerful computational platforms.
3. The proposed approach
3.1. Problem formulation
In this paper, we focus on exploiting the multi-model fusion by jointly learning the lower-dimensional representation of
data sets and the Mahalanobis distance to effectively deal with the image set classification problem. First, we
introduce Definition 1.
Definition 1. The classical Mahalanobis distance (MD) can be used to define the distance between two points $x_i$ and
$x_j$ as follows:
\[ d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), \tag{1} \]
where $M$ is the Mahalanobis metric matrix, which is positive definite.
According to Definition 1, when we map the multiple models of a data set into the common space, the pairwise
multimodal distance metric can be described as in Definition 2.
Definition 2. Suppose that $x_i$ and $r_j$ are two modal representations of the image set. Then, the multimodal
Mahalanobis distance is
\[ d(x_i, r_j) = (f_x(x_i) - f_r(r_j))^T M_{xr} (f_x(x_i) - f_r(r_j)), \tag{2} \]
where the two mapping functions $f_x$ and $f_r$ are used to learn the distance metric across different modalities.
The positive definite matrix $M_{xr}$ is the Mahalanobis matrix.
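As a minimal sketch of Definition 1, the quadratic form in Eq. (1) can be evaluated as follows. The function name `mahalanobis_distance` and the plain-list representation of vectors and matrices are illustrative assumptions, not part of the original method.

```python
def mahalanobis_distance(x, y, M):
    """Return (x - y)^T M (x - y), the quadratic form written in Eq. (1).

    x, y: sequences of numbers with the same length d.
    M:    d x d metric matrix given as a list of rows (assumed symmetric).
    """
    diff = [a - b for a, b in zip(x, y)]
    # M_diff[i] = sum_j M[i][j] * diff[j]
    M_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in M]
    return sum(d_i * md_i for d_i, md_i in zip(diff, M_diff))
```

With the identity matrix as `M`, the function reduces to the squared Euclidean distance; a non-identity `M` reweights and correlates the coordinates. The multimodal case in Eq. (2) would apply the same form to the mapped representations.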
Given data sets $S$, the Euclidean model of $S$ is $X = [x_1, x_2, \ldots, x_n]$, $x_i \in \mathbb{R}^d$, with the
labels $L_i \in \{1, 2, \ldots, C\}$, where $C$ is the number of classes. The manifold model of $S$ is
$R = [r_1, r_2, \ldots, r_n]$, where each $r_i$ lies on the Riemannian manifold and shares its label with the
corresponding Euclidean data. We jointly fuse these two data models of $S$ to fully exploit the structure of the data
sets, as they provide complementary information to each other. Our new multi-model fusion method aims to jointly
accomplish dimensionality reduction (DR) and metric learning (ML) to exploit the latent discriminative feature for
image set classification. The flowchart of the proposed method is shown in Fig. 3.
As shown in Fig. 3, the image sets are represented by different models: Model 1 is represented in the Euclidean space
and Model 2 is modeled in the Riemannian manifold, respectively. Unlike in previous works where the schemes directly
operate on the target data in the original space, we adopt the kernel technique to map the multi-models of data sets into
the Hilbert space to obtain the nonlinear separable high-dimensional information. Specifically, the commonly used Radial
Basis Function (RBF) kernel is used in the Euclidean space, and the data-dependent kernel is adopted in the Riemannian
manifold space by using the unlabeled testing data constructed in a graph to better exploit the geometry structure of the
nonlinear manifold, as this semi-supervised learning strategy can strongly penalize the weak statistical correlations between
the training and testing manifold representations. Then, we jointly learn two projection matrices performing dimensionality
reduction (Px , Pr ) and a Mahalanobis metric matrix (M) by integrating the two heterogeneous Euclidean and Riemannian
spaces into a common space. This unified learning framework can yield feature extraction directly accomplished in a low
dimensional subspace, unlike methods that rely on PCA pre-processing. The learned distance metric enhances the
compactness of within-class sets and better separates between-class data samples. Notably, to obtain
Fig. 3. Joint dimensionality reduction and metric learning for image set classification.
the most useful basis elements, which benefit feature selection, we adopt the L2,1 regularization on the metric
matrix to create a sparsity penalty that yields effective feature interpretability. The L2,1-norm has been used in
various fields and is defined as [41]
\[ \|M\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{w} m_{ij}^2} = \sum_{i=1}^{d} \|M_i\|_2, \tag{3} \]
where $M \in \mathbb{R}^{d \times w}$, $m_{ij}$ is the $(i, j)$-th entry of $M$, and $M_i$ is the $i$-th row vector
of $M$. In short, our proposed method aims to integrate the dimensionality reduction and the sparse feature
extraction into a unified framework.
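To make the definition in Eq. (3) concrete, the following small sketch computes the L2,1-norm as the sum of the Euclidean norms of the rows. The helper name `l21_norm` is hypothetical and the matrix is assumed to be given as a list of rows.

```python
import math


def l21_norm(M):
    """Sum of the Euclidean norms of the rows of M, as in Eq. (3)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)
```

Because each row contributes its unsquared norm, the penalty tends to drive entire rows to zero, which is the row-sparsity effect that motivates its use for feature selection.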
3.2. Joint dimensionality reduction and metric learning (JDRML)
Once we obtain the two heterogeneous models of the target image sets from the Euclidean and Riemannian manifold
spaces, respectively, we further aim to transform them into a common space, in which the pairwise model distance
metric can be properly described. To achieve this, we adopt a kernel method to obtain the high dimensional
representation of the two models. The kernel mappings of the two models are introduced as follows.
For the Euclidean model, given two points $x_i$ and $x_j$, we adopt the commonly used RBF kernel
\[ k(x_i, x_j) = \exp\left( -\|x_i - x_j\|^2 / 2\sigma^2 \right). \tag{4} \]
For the manifold model, given two Riemannian manifold representations $r_i$ and $r_j$, we adopt the manifold based
kernel to encode the intrinsic structure of the data sets
\[ k(r_i, r_j) = \mathrm{tr}\left( \log(r_i) \log(r_j) \right). \tag{5} \]
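The two base kernels in Eqs. (4) and (5) can be sketched with NumPy as follows. This is an illustration under stated assumptions: the function names are hypothetical, $r_i$ and $r_j$ are assumed to be symmetric positive definite matrices (for example, regularized covariance matrices), and the matrix logarithm is computed through an eigen-decomposition.

```python
import numpy as np


def rbf_kernel(xi, xj, sigma=1.0):
    """RBF kernel of Eq. (4): exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def spd_log(r):
    """Matrix logarithm of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(r)
    # Scale each eigenvector column by the log of its eigenvalue.
    return (vecs * np.log(vals)) @ vecs.T


def log_euclidean_kernel(ri, rj):
    """Manifold kernel of Eq. (5): tr(log(ri) log(rj))."""
    return float(np.trace(spd_log(ri) @ spd_log(rj)))
```

Both functions return a scalar for one pair of inputs; a full kernel matrix would be obtained by evaluating them over all pairs of training representations.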
Furthermore, to fully specify the data manifold structure information, we use a semi-supervised setting that has
access to the available unlabeled data. We construct a fully trusted graph by using the manifold distance metric and
employ a kernel deformation strategy [25] to derive the new data dependent kernel as follows:
\[ \tilde{k}(r_i, r_j) = k(r_i, r_j) - k_{r_i}^T (I + LK)^{-1} L k_{r_j}, \tag{6} \]
where $L$ is the Laplacian matrix, $I$ is the identity matrix, $k$ is the kernel function in the original reproducing
kernel Hilbert space (RKHS), $K$ is the original kernel matrix, $k_{r_i} = [k(r_i, r_1), \ldots, k(r_i, r_n)]^T$ and
$k_{r_j} = [k(r_j, r_1), \ldots, k(r_j, r_n)]^T$.
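The deformation in Eq. (6) can be applied to all training pairs at once, since $k_{r_i}$ is the $i$-th column of $K$. The sketch below assumes a symmetric, non-negative affinity matrix is already available; how the graph is actually built is not specified here, and the function names are hypothetical.

```python
import numpy as np


def graph_laplacian(affinity):
    """Unnormalized graph Laplacian L = Deg - A of a symmetric affinity matrix."""
    return np.diag(affinity.sum(axis=1)) - affinity


def deformed_kernel_matrix(K, L):
    """Apply Eq. (6) to every pair of training points.

    Entry (i, j) equals k(r_i, r_j) - k_{r_i}^T (I + L K)^{-1} L k_{r_j},
    where k_{r_i} is the i-th column of the symmetric kernel matrix K.
    """
    n = K.shape[0]
    # Solve (I + L K) Z = L K instead of forming the inverse explicitly.
    correction = K @ np.linalg.solve(np.eye(n) + L @ K, L @ K)
    return K - correction
```

With a zero Laplacian the deformed kernel equals the original one, so the graph term acts purely as a data-dependent correction.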
On the basis of the above kernels, we obtain the implicit nonlinear kernel transformations which respectively map the
Euclidean space $\mathbb{R}^d$ and the Riemannian manifold into two RKHSs. Thereafter, we learn two projection
matrices $P_x$ and $P_r$ to obtain the lower-dimensional representations which preserve the energy of each model as
much as possible, and simultaneously pursue discriminant function based metric learning to obtain better
discriminative performance, i.e., learning the Mahalanobis matrix $M$ to reflect the class similarity. The
Mahalanobis matrix for each task is ensured to be positive semi-definite. Before fusing different models into a
unified framework, we first introduce three different Mahalanobis distances with the lower dimensional
representation based on Euclidean-Riemannian, Euclidean-Euclidean, and Riemannian-Riemannian metric learning. The
Mahalanobis distance between the Euclidean point $x_i$ and the manifold point $r_j$ can be described as follows:
\[ d_1(x_i, r_j) = \left( P_x^T \phi(x_i) - P_r^T \phi(r_j) \right)^T M \left( P_x^T \phi(x_i) - P_r^T \phi(r_j) \right). \tag{7} \]
Note that $P_x$ can be written as a linear combination of all training sets in the kernel space associated with the
Euclidean model, i.e., $P_x = \Phi(X) W_x$; similarly, $P_r$ can be expressed as $P_r = \Phi(R) W_r$ for the
Riemannian manifold representation. Then, we have
\[ P_x^T \phi(x_i) = W_x^T \Phi(X)^T \phi(x_i) = W_x^T K_{x_i}, \tag{8} \]
\[ P_r^T \phi(r_j) = W_r^T \Phi(R)^T \phi(r_j) = W_r^T K_{r_j}. \tag{9} \]
Thus, Eq. (7) can be rewritten as
\[ d_1(x_i, r_j) = \left( W_x^T K_{x_i} - W_r^T K_{r_j} \right)^T M \left( W_x^T K_{x_i} - W_r^T K_{r_j} \right). \tag{10} \]
Similarly, the Mahalanobis distance between the Euclidean points $x_i$ and $x_j$ in the low-dimensional space can be
expressed as
\[ d_2(x_i, x_j) = \left( W_x^T K_{x_i} - W_x^T K_{x_j} \right)^T M \left( W_x^T K_{x_i} - W_x^T K_{x_j} \right) \tag{11} \]
and the Mahalanobis distance between the manifold points $r_i$ and $r_j$ in the low-dimensional space can be written as
\[ d_3(r_i, r_j) = \left( W_r^T K_{r_i} - W_r^T K_{r_j} \right)^T M \left( W_r^T K_{r_i} - W_r^T K_{r_j} \right). \tag{12} \]
The two projection matrices $W_x$ and $W_r$ are learned to obtain the lower-dimensional representations which
preserve the energy of each model as much as possible. We constrain them as
\[ H = \|K_x - W_x^T W_x K_x\|_F^2 + \|K_r - W_r^T W_r K_r\|_F^2. \tag{13} \]
To learn a latent space whose Mahalanobis distance reflects the class similarity, we adopt a large-margin learning
criterion that keeps each input close to its neighbors with the same label and far away from inputs with different
class labels. We express the relation between the Euclidean model and the manifold model by a linear inequality
constraint $J_1$, considering the effect of the similarity and dissimilarity constraints. Similarly, $J_2$ and $J_3$
are used to represent the distances for the Euclidean and the manifold models, respectively:
\[ J_1 = \sum_{\substack{i,j=1 \\ l_{x_i} \neq l_{r_j}}}^{n} d_1(x_i, r_j) - \sum_{\substack{i,j=1 \\ l_{x_i} = l_{r_j},\, i \neq j}}^{n} d_1(x_i, r_j) \geq 1 - \xi_1, \tag{14} \]
\[ J_2 = \sum_{\substack{i,j=1 \\ l_{x_i} \neq l_{x_j}}}^{n} d_2(x_i, x_j) - \sum_{\substack{i,j=1 \\ l_{x_i} = l_{x_j},\, i \neq j}}^{n} d_2(x_i, x_j) \geq 1 - \xi_2, \tag{15} \]
\[ J_3 = \sum_{\substack{i,j=1 \\ l_{r_i} \neq l_{r_j}}}^{n} d_3(r_i, r_j) - \sum_{\substack{i,j=1 \\ l_{r_i} = l_{r_j},\, i \neq j}}^{n} d_3(r_i, r_j) \geq 1 - \xi_3, \tag{16} \]
where $\xi_i \geq 0$, $i = 1, \ldots, m$, $m = 3$, are the slack variables.
Inspired by the graph constraint and in order to effectively represent our model, we define the following three matrices:
\[ D(i, j) = \begin{cases} 1, & l_{x_i} = l_{r_j},\ i \neq j \\ -1, & l_{x_i} \neq l_{r_j} \\ 0, & \text{else,} \end{cases} \tag{17} \]
\[ D_x(i, j) = \begin{cases} 1, & l_{x_i} = l_{x_j},\ k_1(i, j) \\ -1, & l_{x_i} \neq l_{x_j},\ k_2(i, j) \\ 0, & \text{else,} \end{cases} \tag{18} \]
\[ D_r(i, j) = \begin{cases} 1, & l_{r_i} = l_{r_j},\ k_1(i, j) \\ -1, & l_{r_i} \neq l_{r_j},\ k_2(i, j) \\ 0, & \text{else,} \end{cases} \tag{19} \]
where $k_1(i, j)$ represents the nearest neighbors belonging to the same class and $k_2(i, j)$ represents those
belonging to a different class. Then, we rewrite Eqs. (14)–(16) in matrix formulation as
\[ J_1 = -\mathrm{tr}(G_1 M), \quad G_1 = W_x^T K_x A_x K_x^T W_x + W_r^T K_r A_r K_r^T W_r - 2 W_x^T K_x D K_r^T W_r, \tag{20} \]
\[ J_2 = -\mathrm{tr}(G_2 M), \quad G_2 = 2 W_x^T K_x A'_x K_x^T W_x - 2 W_x^T K_x D_x K_x^T W_x = 2 W_x^T K_x L_x K_x^T W_x, \tag{21} \]
\[ J_3 = -\mathrm{tr}(G_3 M), \quad G_3 = 2 W_r^T K_r A'_r K_r^T W_r - 2 W_r^T K_r D_r K_r^T W_r = 2 W_r^T K_r L_r K_r^T W_r, \tag{22} \]
where $A_x$, $A_r$, $A'_x$, $A'_r$ are diagonal matrices with $A_x(i, i) = \sum_{j=1}^{n} D(i, j)$,
$A_r(j, j) = \sum_{i=1}^{n} D(i, j)$, $A'_x(i, i) = \sum_{j=1}^{n} D_x(i, j)$, and
$A'_r(i, i) = \sum_{j=1}^{n} D_r(i, j)$, so that $L_x = A'_x - D_x$ and $L_r = A'_r - D_r$.
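The step from the pairwise sums to the trace form can be checked numerically. The sketch below builds the cross-modal indicator matrix of Eq. (17), assembles $G_1$ as in Eq. (20), and compares $\mathrm{tr}(G_1 M)$ with the explicit sum of $D(i, j)\, d_1(x_i, r_j)$. The names `Zx` and `Zr` are hypothetical shorthand for the projected representations $W_x^T K_x$ and $W_r^T K_r$ (one column per set), and $M$ is assumed to be symmetric.

```python
import numpy as np


def cross_modal_indicator(labels):
    """D of Eq. (17): +1 for same label (i != j), -1 for different labels, 0 on the diagonal."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    D = np.where(same, 1.0, -1.0)
    np.fill_diagonal(D, 0.0)
    return D


def g1_matrix(Zx, Zr, D):
    """G_1 of Eq. (20), with Zx = W_x^T K_x and Zr = W_r^T K_r."""
    Ax = np.diag(D.sum(axis=1))  # A_x(i, i) = sum_j D(i, j)
    Ar = np.diag(D.sum(axis=0))  # A_r(j, j) = sum_i D(i, j)
    return Zx @ Ax @ Zx.T + Zr @ Ar @ Zr.T - 2.0 * Zx @ D @ Zr.T


def weighted_distance_sum(Zx, Zr, D, M):
    """Explicit sum over all pairs of D(i, j) * d_1(x_i, r_j), with d_1 as in Eq. (10)."""
    total = 0.0
    n = D.shape[0]
    for i in range(n):
        for j in range(n):
            diff = Zx[:, i] - Zr[:, j]
            total += D[i, j] * (diff @ M @ diff)
    return total
```

For a symmetric $M$ the two quantities agree, and $J_1$ is the negative of this value, i.e., the between-class sum minus the within-class sum.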
Thus, we define the L2,1-norm regularized objective function as follows:
\[ \begin{aligned} \min_{M, W_x, W_r}\ & J = \|M\|_{2,1} + \lambda_1 \left( \|K_x - W_x^T W_x K_x\|_F^2 + \|K_r - W_r^T W_r K_r\|_F^2 \right) + \sum_{i=1}^{m} C_i \xi_i \\ \text{s.t.}\ & -\mathrm{tr}(G_i M) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, m,\ m = 3, \end{aligned} \tag{23} \]
where the L2,1-norm is adopted to obtain optimal sparse feature extraction and $\lambda_1$ balances the
regularization terms for the projection matrices $W_x$ and $W_r$. $\xi_i \geq 0$ $(i = 1, 2, 3)$ are the slack
variables used to penalize large distances for the three distance metrics based on Euclidean-Riemannian,
Euclidean-Euclidean, and Riemannian-Riemannian metric learning described in Eqs. (14)–(16), with the corresponding
balancing parameters $C_i$.
3.3. Optimization
The optimization problem in Eq. (23) is non-convex and therefore difficult to solve. We propose a fast and simple
algorithm that replaces the inequality constraints with equality constraints, a strategy that has been shown to be
effective for solving the quadratic programming problem in [42]. Thus, we rewrite Eq. (23) as
\[ \begin{aligned} \min_{M, W_x, W_r}\ & J = \|M\|_{2,1} + \lambda_1 \left( \|K_x - W_x^T W_x K_x\|_F^2 + \|K_r - W_r^T W_r K_r\|_F^2 \right) + \sum_{i=1}^{m} C_i \xi_i \\ \text{s.t.}\ & -\mathrm{tr}(G_i M) = 1 - \xi_i, \quad i = 1, \ldots, m,\ m = 3. \end{aligned} \tag{24} \]
The equality constraints give $\xi_i = 1 + \mathrm{tr}(G_i M)$, which turns the slack penalty into a trace term.
Then, after some modifications, we have
\[ \min_{M, W_x, W_r} J = \|M\|_{2,1} - \lambda_1 \mathrm{tr}\left( W_x^T K_x K_x^T W_x + W_r^T K_r K_r^T W_r \right) + \mathrm{tr}\left( (C_1 G_1 + C_2 G_2 + C_3 G_3) M \right). \tag{25} \]
As the optimization problem in Eq. (25) is not jointly convex in $W_x$, $W_r$, and $M$, we adopt an alternating
optimization scheme that solves for one variable with the others fixed. Each sub-optimization problem has a
closed-form solution. The optimization steps are shown in Algorithm 1.
Algorithm 1 JDRML.
Input: $K_x$, $K_r$, $D$, $D_x$, $D_r$, the tradeoff parameters $\lambda_1$, $C_1$, $C_2$, $C_3$, and the number of iterations $t_1$.
Output: $M$, $W_x$, $W_r$.
Initialize $M$, $W_x$, $W_r$.
repeat
    Step 1: Fix $W_x$, $W_r$ and solve $M$ via Eq. (28).
    Step 2: Fix $M$, $W_r$ and solve $W_x$ via Eq. (31).
    Step 3: Fix $M$, $W_x$ and solve $W_r$ via Eq. (34).
until $t_1$ iterations are reached.
Step 1: Learn $M$ with $W_x$ and $W_r$ fixed. Eq. (25) can be rewritten as
\[ \min_{M} J = \|M\|_{2,1} + \mathrm{tr}\left( (C_1 G_1 + C_2 G_2 + C_3 G_3) M \right), \tag{26} \]
then, we have
\[ \min_{M} J = \mathrm{tr}(M^T U M) + \mathrm{tr}\left( (C_1 G_1 + C_2 G_2 + C_3 G_3) M \right). \tag{27} \]
As features in different domains usually have sparse correspondences, the matrix $U$ is constrained to be sparse.
Then, we have
\[ \frac{\partial J}{\partial M} = U M + (C_1 G_1 + C_2 G_2 + C_3 G_3)^T = 0. \tag{28} \]
Considering that the Mahalanobis matrix $M$ needs to be positive semi-definite, we project $M$ onto the
semi-definite cone after every iterative step. $M$ can then be obtained by the procedure in Algorithm 2. The
following Lemma 1 shows that the iterative optimization procedure in Algorithm 2 is convergent.
Lemma 1. Suppose that $f_t^{va}$ is the value of the objective function in Eq. (26) at the $t$-th iteration. Then,
after the $(t+1)$-th iteration, we have $f_{t+1}^{va} \leq f_t^{va}$.
Algorithm 2 The procedure to obtain $M$.
Initialize the iteration number $t_2$.
repeat
    Update $M_{t+1}$ via
        Step 1: $M = -U^{-1} (C_1 G_1 + C_2 G_2 + C_3 G_3)^T$.
        Step 2: project $M$ onto the semi-definite cone by computing the eigen-decomposition of $M$.
    Update $U_{ii} = 1 / (2 \|M_i\|_2)$.
until converged.
Proof. We show the proof of Lemma 1 in the Appendix.
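The iteration in Algorithm 2 can be sketched as below. Several details are assumptions made only for illustration: `G` stands for the combined matrix $C_1 G_1 + C_2 G_2 + C_3 G_3$, $U$ is initialized to the identity, the reweighting uses the standard L2,1 form $U_{ii} = 1/(2\|M_i\|_2)$ with a small `eps` to avoid division by zero, $M$ is symmetrized before the projection, and a fixed iteration count replaces the convergence test.

```python
import numpy as np


def project_psd(M):
    """Project onto the positive semi-definite cone by clipping negative eigenvalues."""
    S = (M + M.T) / 2.0  # symmetrize first (assumption of this sketch)
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T


def solve_metric(G, n_iter=20, eps=1e-8):
    """Iteratively reweighted update of M with G = C1*G1 + C2*G2 + C3*G3 held fixed."""
    d = G.shape[0]
    U = np.eye(d)  # assumed initialization
    M = np.zeros((d, d))
    for _ in range(n_iter):
        M = -np.linalg.solve(U, G.T)  # Step 1: M = -U^{-1} G^T
        M = project_psd(M)            # Step 2: projection onto the PSD cone
        row_norms = np.linalg.norm(M, axis=1)
        U = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
    return M
```

The returned matrix is symmetric and positive semi-definite by construction, so it can be used directly in the distances of Eqs. (10)–(12).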

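Step 2: Learn $W_x$ with $M$ and $W_r$ fixed. Eq. (25) can be rewritten as
\[ \min_{W_x} J = \mathrm{tr}\left( -\lambda_1 W_x^T K_x K_x^T W_x + \left( C_1 W_x^T K_x A_x K_x^T W_x - 2 C_1 W_x^T K_x D K_r^T W_r + 2 C_2 W_x^T K_x L_x K_x^T W_x \right) M \right), \tag{29} \]
then,
\[ \frac{\partial J}{\partial W_x} = -\lambda_1 K_x K_x^T W_x + \left( C_1 K_x A_x K_x^T + 2 C_2 K_x L_x K_x^T \right) W_x M - 2 C_1 K_x D K_r^T W_r M = 0, \tag{30} \]
we have
\[ W_x M - \lambda_1 \left( C_1 K_x A_x K_x^T + 2 C_2 K_x L_x K_x^T \right)^{-1} K_x K_x^T W_x = 2 C_1 \left( C_1 K_x A_x K_x^T + 2 C_2 K_x L_x K_x^T \right)^{-1} K_x D K_r^T W_r M. \tag{31} \]
Eq. (31) is a Sylvester equation in $W_x$.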
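Eq. (31) has the generic Sylvester form $A X + X B = Q$ with $X = W_x$, $B = M$,
$A = -\lambda_1 (C_1 K_x A_x K_x^T + 2 C_2 K_x L_x K_x^T)^{-1} K_x K_x^T$, and $Q$ equal to the right-hand side. The following sketch solves the generic form through Kronecker vectorization using NumPy only; the function name is hypothetical, and for larger problems a dedicated routine such as `scipy.linalg.solve_sylvester` would be the more usual choice.

```python
import numpy as np


def solve_sylvester_kron(A, B, Q):
    """Solve A X + X B = Q for X via vec(A X + X B) = (I kron A + B^T kron I) vec(X).

    A: (n, n), B: (m, m), Q: (n, m). Uses column-major vectorization.
    A unique solution exists when A and -B share no eigenvalue.
    """
    n, m = Q.shape
    lhs = np.kron(np.eye(m), A) + np.kron(B.T, np.eye(n))
    x = np.linalg.solve(lhs, Q.flatten(order="F"))
    return x.reshape((n, m), order="F")
```

The linear system has size $nm \times nm$, so this direct approach is only practical for small projected dimensions; the same solver applies to Eq. (34) with the roles of the two models exchanged.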
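Step 3: Learn $W_r$ with $M$ and $W_x$ fixed. Eq. (25) can be rewritten as
\[ \min_{W_r} J = \mathrm{tr}\left( -\lambda_1 W_r^T K_r K_r^T W_r + \left( C_1 W_r^T K_r A_r K_r^T W_r - 2 C_1 W_r^T K_r D K_x^T W_x + 2 C_3 W_r^T K_r L_r K_r^T W_r \right) M \right), \tag{32} \]
then,
\[ \frac{\partial J}{\partial W_r} = -\lambda_1 K_r K_r^T W_r + \left( C_1 K_r A_r K_r^T + 2 C_3 K_r L_r K_r^T \right) W_r M - 2 C_1 K_r D K_x^T W_x M = 0, \tag{33} \]
we have
\[ W_r M - \lambda_1 \left( C_1 K_r A_r K_r^T + 2 C_3 K_r L_r K_r^T \right)^{-1} K_r K_r^T W_r = 2 C_1 \left( C_1 K_r A_r K_r^T + 2 C_3 K_r L_r K_r^T \right)^{-1} K_r D K_x^T W_x M. \tag{34} \]
Eq. (34) is a Sylvester equation in $W_r$.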
4. Experimental results
In this section, we present the experimental results to evaluate the proposed method compared with state-of-the-art
image-set classification methods on four visual classification tasks: face recognition [43–45], object classification [46], ges-
ture recognition [47] and digit classification [48].
4.1. Datasets and parameter settings
For the face recognition task, the challenging YouTube Celebrities (YTC) dataset [43] collected from YouTube has been
widely used to evaluate face recognition in previous works [4–6,12,28,37,38]. It contains more than 1000 video clips
belonging to 47 subjects. The face images within the YTC dataset have large variations in pose, illumination, and
expression, as well as low resolution. The Extended Yale Face Database B (EYaleB) [44] consists of 16,128 images of
28 classes. Nine face image sets per class are contained in this dataset. Some face examples from the YTC and EYaleB
datasets are shown in Fig. 4(a) and (b), respectively. Moreover, we use an up-to-date version of the COX dataset
[45] to evaluate the performance of video based face recognition under typical applications such as video
surveillance. This dataset contains 1000 different subjects captured with rich variations. Each subject has three videos.
For the object classification task, we use the benchmark ETH-80 object dataset [46]. It contains 80 object sets from
8 categories: apples, cars, cows, cups, dogs, horses, pears and tomatoes. Each category contains 10 subcategory sets
and the number of images in each set is approximately 41. Some object examples extracted from this dataset are shown
in Fig. 5.
We use the Cambridge Gesture dataset [47] to evaluate the action recognition task. This dataset contains nine hand
gesture classes. Each gesture includes 100 image sequences which can be divided into five illuminations and 10 motions
from each of two subjects. Some examples of these gestures are given in Fig. 6.
For the handwritten digits recognition task, the MNIST dataset contains a total of 70,000 image samples which can be
divided into 10 classes from 0 to 9. All the black and white digit images are resized to 20 × 20. Some exemplar
images are shown in Fig. 7.
Fig. 4. Some face examples from (a) YTC and (b) EYaleB, respectively.
Fig. 5. Some object images from ETH80 dataset.
Fig. 6. Some gesture images from hand Gesture dataset.
To evaluate the classification performance of our proposed method, we compare it with the following set based
methods: Discriminant Canonical Correlation Analysis (DCC) [12], Manifold-to-Manifold Distance (MMD) [5], Manifold
Discriminant Analysis (MDA) [13], Affine Hull based Image Set Distance (AHISD) [4], Convex Hull based Image Set
Distance (CHISD) [4], Sparse Approximated Nearest Points (SANP) [14], Regularized Nearest Points (RNP) [15],
Prototype Discriminative Learning (PDL) [32], Covariance Discriminant Learning (CDL) [38], Projection Metric Learning
(PML) [28], Discriminant Analysis on Riemannian manifold of Gaussian distributions (DARG) [7], and Cross
Euclidean-to-Riemannian Metric Learning (CERML) [23].
Fig. 7. Some exemplar images from MNIST dataset.
Fig. 8. ROC Curves on: (a) YTC and (b) EYaleB, respectively.
To obtain a fair comparison, the parameters of all methods are tuned empirically. Specifically, we adopt PCA
to retain 90% of the energy for DCC [12]. For AHISD, CHISD, SANP, and RNP, all the parameters are set according to Cevikalp
and Triggs [4], Hu et al. [14], and Yang et al. [15], respectively. For MMD and MDA, the number of local linear
patches is selected from the range 5 to 20. The other parameters of MDA follow [13]. For CDL, we use the kernel LDA, which
requires no other parameter configuration [38]. For PML, 95% of the total energy is preserved by PCA [28]. For
DARG, as the model based on the Mahalanobis and Log-Euclidean distance achieves the best recognition accuracy among the
variants in [7], we use this distance metric for all the visual classification tasks in our experiments. The video to video
classification case of CERML is adopted in these experiments [23]. For the experimental settings of the proposed JDRML¹, λ1
is selected from [0.1-0.3] in steps of 0.05, C1 is selected from [0.2, 0.5, 0.8], and C2 and C3 are each tuned over [0.0001, 0.005,
0.01, 0.015, 0.02]. The total number of iterations is set to 15. Other key parameters of these methods are
described in the following experiments.
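The "retain 90% (or 95%) of the energy" rule used for the PCA based baselines can be illustrated with a minimal sketch. It assumes that energy means the cumulative fraction of variance captured by the leading principal components; the function name `pca_energy_basis` and the toy data are hypothetical and are not taken from the released code.

```python
import numpy as np

def pca_energy_basis(X, energy=0.90):
    """Return the leading principal directions of X (rows are samples) that
    together retain at least `energy` of the total variance, and their count."""
    Xc = X - X.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to
    # the variance captured by each principal direction.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    # First index whose cumulative ratio reaches the threshold, as a count.
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return Vt[:k].T, k

rng = np.random.default_rng(0)
# Toy image set (an assumption): 50 frames of 20 x 20 pixels that lie close
# to a 3-dimensional linear subspace.
latent = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 400))
frames = latent + 0.01 * rng.normal(size=(50, 400))
basis, k = pca_energy_basis(frames, energy=0.90)
```

Here `basis` has one column per retained direction, so a set would be represented by the subspace spanned by these columns.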
4.2. Experimental results and analysis
In our experiments, for the face recognition task on the YTC and EYaleB datasets, we adopt the Viola-Jones algorithm
[49] to extract the face images, which are resized to 30 × 30 and 20 × 20, respectively. We follow the
experimental configurations provided in [5,8,14,32,47]. Three image sets of each individual are randomly selected for training
and the others for testing. For the COX dataset, each video frame is resized to 32 × 40 and histogram equalization is
adopted to reduce the lighting effects. We randomly select 300 subjects for training. The first two videos of the remaining 700
subjects are used to perform the video to video evaluation. The experimental results for the face recognition task are shown
in Table 1.
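The per-individual random split described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary layout, the function name `split_sets_per_subject` and the seed handling are hypothetical, and the toy index simply mimics a dataset with nine image sets per subject.

```python
import random

def split_sets_per_subject(set_ids_by_subject, n_train=3, seed=0):
    """Randomly pick `n_train` image sets of every subject for training
    and keep the remaining sets of that subject for testing."""
    rng = random.Random(seed)
    train, test = {}, {}
    for subject, set_ids in set_ids_by_subject.items():
        chosen = set(rng.sample(set_ids, n_train))
        train[subject] = [s for s in set_ids if s in chosen]
        test[subject] = [s for s in set_ids if s not in chosen]
    return train, test

# Hypothetical toy index: 4 subjects with 9 image sets each.
toy_index = {f"subject_{i}": [f"s{i}_{j}" for j in range(9)] for i in range(4)}
train_sets, test_sets = split_sets_per_subject(toy_index, n_train=3, seed=0)
```

Repeating the call with different seeds gives the random trials over which a mean and a standard deviation can be reported.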
From Table 1, we can observe that our proposed method achieves the best accuracy rates on these three face recognition
datasets. Specifically, on the YTC and COX datasets, all methods yield relatively lower identification rates, as these two
datasets include faces under a wide range of variations. Notably, the accuracy rates of the proposed JDRML reach 78.1%
and 93.44% on YTC and COX, respectively, which indicates that fusing different model representations of the target set
into a unified framework helps to fully exploit the mutual complementary information, and that extracting effective sparse
features naturally enhances the classification performance. Besides, the convex/affine hull based models (AHISD, CHISD, SANP
and RNP) show competitive classification performance compared with the multiple linear subspace methods (MMD
and MDA) on the YTC and EYaleB datasets. We can also see that the manifold based models (CDL, PML, DARG and CERML)
yield higher recognition rates than the other methods on the COX dataset, because the nonlinear manifold
based on the Riemannian or Grassmann structure takes the intrinsic geometry information into account. Furthermore, we also present
the Receiver Operating Characteristic (ROC) curves in Fig. 8(a) and (b) for different methods on YTC and EYaleB, respectively.
1 https://github.com/zhuyeqingma/myJDRML.
Table 1
Performance of all methods on different datasets (%).
Method  YTC         EYaleB      COX    Gesture  ETH80       MNIST        Year
DCC     68.5 ± 7.4  86.4 ± 8.7  62.53  64.7     85.3 ± 6.9  99.2 ± 4.87  2007
MMD     57.3 ± 7.9  71.9 ± 7.1  38.29  58.1     81.2 ± 6.5  93.8 ± 3.8   2008
MDA     52.3 ± 8.2  56.7 ± 7.4  65.83  21.4     65.4 ± 7.3  84.5 ± 1.2   2009
AHISD   72.5 ± 8.8  82.0 ± 6.5  53.03  18.1     71.0 ± 8.7  72.0 ± 2.1   2010
CHISD   70.9 ± 8.4  80.0 ± 7.2  56.90  18.3     68.0 ± 7.2  99.7 ± 5.1   2010
SANP    71.4 ± 6.5  82.0 ± 6.1  57.82  22.4     67.0 ± 6.3  99.8 ± 3.2   2012
RNP     71.5 ± 6.4  83.4 ± 5.8  58.07  35.6     67.0 ± 6.4  100 ± 0.0    2013
CDL     74.1 ± 8.2  89.1 ± 3.4  78.43  73.4     88.3 ± 6.2  100 ± 0.0    2012
PML     72.7 ± 7.6  84.7 ± 4.5  71.27  83.2     86.0 ± 6.5  100 ± 0.0    2015
DARG    76.4 ± 8.1  88.7 ± 6.5  83.71  31.2     84.4 ± 7.2  100 ± 0.0    2017
PDL     71.7 ± 8.6  89.5 ± 5.5  65.8   21.1     73.0 ± 6.2  99.7 ± 2.7   2017
CERML   76.6 ± 7.8  90.1 ± 5.2  90.31  83.7     85.0 ± 2.5  99.8 ± 6.7   2018
JDRML   78.1 ± 7.5  93.6 ± 3.6  93.44  84.6     94.0 ± 2.2  100 ± 0.0
Fig. 9. Convergence curves of JDRML on (a) YTC and (b) ETH80 datasets.
Fig. 10. Recognition accuracy of JDRML on YTC (a), ETH80(b) and Gesture (c) datasets with different λ1 .
The proposed method clearly outperforms the other methods by producing the highest true positive rates against all false
positive rates.
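For readers who wish to reproduce curves of this kind, the following minimal sketch shows how true positive and false positive rates can be obtained by sweeping a threshold over similarity scores. The text does not specify how the scores behind Fig. 8 were computed, so the scores, the labels and the function name `roc_points` are purely illustrative assumptions.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a threshold over `scores` (higher means more similar) and return
    the false positive and true positive rates at every cut.
    labels: 1 for genuine (same-class) pairs, 0 for impostor pairs.
    Tied scores are not merged in this sketch."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)        # genuine pairs accepted so far
    fp = np.cumsum(1 - sorted_labels)    # impostor pairs accepted so far
    tpr = np.concatenate(([0.0], tp / max(tp[-1], 1)))
    fpr = np.concatenate(([0.0], fp / max(fp[-1], 1)))
    return fpr, tpr

# Illustrative scores only; they are not results from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(toy_scores, toy_labels)
# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A method whose curve lies above the others at every false positive rate also has the largest area under the curve.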
To evaluate the classification performance on the ETH80 dataset, we randomly select half of the 10 object sets per category
for training and the rest for testing, and adopt five-fold cross validation experiments. The results of the comparison between
our method and other competing methods are shown in Table 1. As can be seen in the table, methods assuming that
image sets lie on the Riemannian manifold (CDL, PML, DARG and CERML) can better exploit the latent structure information.
Thus, they show relatively good performance. Our method achieves a high classification rate of 94%.
Table 2
Classification performance (%) of JDRML and JDRML-DK on different datasets.
Method    YTC         EYaleB      COX    Gesture  ETH80       MNIST
JDRML     78.1 ± 7.5  93.6 ± 3.6  93.44  84.6     94.0 ± 2.2  100
JDRML-DK  79.8 ± 7.9  95.2 ± 4.3  93.56  85.2     95.7 ± 3.4  100
Table 3
The classification performance and runtime for the set based methods.
          N = 270                                   N = 450
Method    p=10          p=50          p=80          p=10          p=50          p=80
PCA+DCC   0.72(21.3s)   0.80(51.2s)   0.84(83.6s)   0.74(28.8s)   0.85(77.6s)   0.90(134.9s)
PCA+MMD   0.15(0.12s)   0.14(0.12s)   0.18(0.14s)   0.18(0.18s)   0.16(0.17s)   0.28(0.21s)
PCA+RNP   0.41(0.011s)  0.42(0.02s)   0.45(0.05s)   0.41(0.09s)   0.43(0.12s)   0.52(0.17s)
DARG      0.58(101.4s)  0.60(110.9s)  0.62(118.8s)  0.53(82.5s)   0.64(120.2s)  0.67(132.5s)
CERML     0.35(22.5s)   0.43(58.8s)   0.47(91.1s)   0.46(22.1s)   0.61(50.2s)   0.56(87.8s)
JDRML     0.82(59.7s)                               0.88(98.1s)
For the hand gesture recognition task, the 100 videos of each gesture class are divided into five illumination sets (Set1,
Set2, Set3, Set4, and Set5). All video frames are resized to 20 × 20. Following the experimental protocol in [47], we
select Set5 for training and the rest of the sets (Set1, Set2, Set3, and Set4) for testing. From the results in Table 1, the
proposed method outperforms the other methods, while the classification performance of the convex/affine hull based
models (AHISD, CHISD, SANP and RNP) degrades significantly, as they are sensitive to outliers.
Unlike previous experimental settings on the MNIST database, which are oriented to single image classification, in
this experiment we adopt the set based strategy for the handwritten digit classification task. Thus, a comprehensive
insight into image set classification is given. Specifically, all images are divided into 300 sets. Each digit class includes 30
subsets, where each subset contains about 200 frames. 20 image sets of each class are randomly selected for training
and the others for testing. As shown in Table 1, most of the existing set based methods achieve 100% on this dataset. Thus,
set based models can greatly improve handwritten digit classification in this setting.
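One way to turn single images into image sets is sketched below. The text only states the number of sets per digit, so the random assignment of images to subsets, the function name `build_image_sets` and the balanced toy label vector are assumptions made for illustration.

```python
import numpy as np

def build_image_sets(labels, sets_per_class=30, seed=0):
    """Group sample indices into image sets: for each class, shuffle its
    indices and split them into `sets_per_class` nearly equal subsets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    image_sets = []  # list of (class_label, index_array) pairs
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for chunk in np.array_split(idx, sets_per_class):
            image_sets.append((int(c), chunk))
    return image_sets

# Toy label vector standing in for the real labels (balanced classes assumed).
toy_labels = np.repeat(np.arange(10), 600)
image_sets = build_image_sets(toy_labels, sets_per_class=30)
```

With 10 classes and 30 subsets per class this yields the 300 sets mentioned above; each entry stores the class label and the indices of the images in that set.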
To fully explore the manifold structure, we use the unlabeled testing data sets to construct the data dependent kernel and
then we set the number of the neighbors in Eq. (6) as 6. The results of the comparison between the proposed JDRML and
JDRML with the data dependent manifold kernel (JDRML-DK) are shown in Table 2. We can see in the table that JDRML-DK
shows the best classification performance, as a semi-supervised learning strategy with the unlabeled testing data samples
can strongly penalize weak statistical correlations between the training and testing manifold points.
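The data dependent kernel of Eq. (6) can be written in matrix form over all training and testing points as K̃ = K − K(I + LK)⁻¹LK, where L is a graph Laplacian. The sketch below is a minimal illustration under several assumptions that are not stated in the text: a symmetrized 6-nearest-neighbor graph with binary weights, an unnormalized Laplacian, and a Gaussian kernel on Euclidean toy points instead of points on a Riemannian manifold. All function names are hypothetical.

```python
import numpy as np

def knn_graph_laplacian(D, n_neighbors=6):
    """Unnormalized Laplacian of a symmetrized k-NN graph built from the
    pairwise distance matrix D (binary edge weights are an assumption)."""
    n = D.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        order = [j for j in np.argsort(D[i]) if j != i]
        W[i, order[:n_neighbors]] = 1.0
    W = np.maximum(W, W.T)  # keep an edge if either endpoint selects it
    return np.diag(W.sum(axis=1)) - W

def deformed_gram(K, L):
    """Matrix form of Eq. (6) over all points: K - K (I + L K)^{-1} L K."""
    n = K.shape[0]
    return K - K @ np.linalg.solve(np.eye(n) + L @ K, L @ K)

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 5))  # toy stand-in for train and test points
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
K = np.exp(-D ** 2 / (2.0 * 5.0))  # Gaussian kernel, bandwidth chosen ad hoc
L = knn_graph_laplacian(D, n_neighbors=6)
K_tilde = deformed_gram(K, L)
```

Because the unlabeled testing points enter through L, the deformed kernel reflects the geometry of both training and testing data, which is the semi-supervised effect exploited by JDRML-DK.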
4.3. Comparison of different features and runtime
In this section, we analyze the classification performance of set based methods with different numbers of dimensions.
For DARG, CERML and JDRML, the image sets are embedded into an RKHS by a non-linear transformation to obtain
the nonlinearly separable high-dimensional information of the original set data. Unlike DARG and CERML, the proposed JDRML
learns the low dimensional representation together with metric learning in a unified framework. We conduct experiments to evaluate
the role of dimensionality reduction for kernelized data sets and to analyze the classification performance of the PCA based
subspace model (PCA+DCC), the convex hull model (PCA+RNP), and the local linear subspace manifold method (PCA+MMD). In the
experiments, the number of training image sets is set to 270 and 450, and the feature dimension (p) is reduced to
the three values listed in Table 3, which also reports the recognition accuracies and the computational time on the Gesture dataset for
the different methods. From Table 3, we can see that the PCA based set classification methods run faster
with lower dimensional features, and that their recognition performance is clearly affected by the
selected feature dimension. Our proposed joint learning method outperforms most of the other set based classification
methods, and has comparable runtimes.
4.4. Parameter analysis
4.4.1. Convergence analysis

Theoretically, our objective function in Eq. (25) is convex in each variable when the others are fixed. We use the
data sets of YTC and ETH80 as examples to illustrate the optimization process of our method. The curves of the objective
function vs. the number of iterations are plotted in Fig. 9. We carry out a five-fold cross validation. From Fig. 9, we can see
that after several iterations, the value of the objective function becomes stable, and that our method is, to some extent,
insensitive to the number of iterations.
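The behavior of such an alternating scheme can be illustrated on a toy problem. The sketch below does not implement Eq. (25); it assumes a much simpler objective, the rank-one fit ‖A − uvᵀ‖²_F, which is likewise convex in one variable when the other is fixed, and it records the objective value after every sweep. The cap of 15 iterations mirrors the setting above; the function name and the stopping tolerance are hypothetical.

```python
import numpy as np

def alternating_rank1(A, max_iter=15, tol=1e-6, seed=0):
    """Alternately minimize ||A - u v^T||_F^2 over u and v and record the
    objective after every sweep (toy stand-in for a block-wise solver)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[1])
    history = []
    for _ in range(max_iter):
        u = A @ v / (v @ v)      # exact minimizer over u with v fixed
        v = A.T @ u / (u @ u)    # exact minimizer over v with u fixed
        history.append(float(np.linalg.norm(A - np.outer(u, v)) ** 2))
        # Stop once the decrease of the objective becomes negligible.
        if len(history) > 1 and history[-2] - history[-1] <= tol * max(history[-2], 1.0):
            break
    return u, v, history

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 20))
u, v, history = alternating_rank1(A)
```

Since each update solves its subproblem exactly, the recorded values never increase, which is the kind of curve plotted against the iteration number in a convergence figure.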
4.4.2. Performance analysis with different parameter settings

In this section, we evaluate the parameter sensitivity of the proposed JDRML on three datasets: YTC, ETH80, and Gesture.
The parameter λ1, which balances the regularization on the projection matrices Wx and Wr, is selected from [0.1-0.3] in
steps of 0.05. We show the recognition accuracies with different λ1 on the three datasets in Fig. 10. From the
figure, we can see that the recognition accuracies on the different datasets are relatively stable with different λ1.
Fig. 11. Recognition performance with different parameters of C1 and C2 on YTC dataset.
Fig. 12. Recognition performance with different parameters of C1 and C2 on ETH80 dataset.
Fig. 13. Recognition performance with different parameters of C1 and C2 on Gesture dataset.
Then, we further conduct experiments to evaluate the remaining three regularization parameters, C1, C2 and C3, which are used
to balance the contributions of the three parts (G1, G2, and G3) in Eq. (25). C1 is selected from [0.2, 0.5, 0.8]. With a fixed
C1, we obtain the recognition performance with different values of C2 and C3 on the three datasets in Figs. 11–13, where C2
and C3 are each tuned over [0.0001, 0.005, 0.01, 0.015, 0.02]. The color bar displays the recognition performance.
From Figs. 11–13, with a cross validation of C2 and C3 on the three datasets, we can select the optimal parameter settings
that give the best classification performance. For the three different visual classification tasks in our experiments, we set C1
to 0.5, as it shows relatively better classification performance in Fig. 11(b)-Fig. 13(b).
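The search over λ1, C1, C2 and C3 amounts to an exhaustive sweep of a small grid. The sketch below enumerates the candidate values as read from the text (the C2 and C3 values are taken as 0.0001, 0.005, 0.01, 0.015 and 0.02) and keeps the best configuration. The scoring function `toy_evaluate` is a hypothetical stand-in for training and validating JDRML with one setting; it is not part of the released code.

```python
from itertools import product

# Candidate values as read from the text.
lambda1_grid = [round(0.1 + 0.05 * i, 2) for i in range(5)]  # 0.1 to 0.3
c1_grid = [0.2, 0.5, 0.8]
c2_grid = [0.0001, 0.005, 0.01, 0.015, 0.02]
c3_grid = list(c2_grid)

def grid_search(evaluate):
    """Return the parameter tuple with the highest validation score."""
    best_params, best_score = None, float("-inf")
    for params in product(lambda1_grid, c1_grid, c2_grid, c3_grid):
        score = evaluate(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(lam, c1, c2, c3):
    """Hypothetical score that peaks at (0.2, 0.5, 0.01, 0.01); a real run
    would return the cross-validated accuracy for this setting."""
    return -((lam - 0.2) ** 2 + (c1 - 0.5) ** 2 + (c2 - 0.01) ** 2 + (c3 - 0.01) ** 2)

n_configs = len(lambda1_grid) * len(c1_grid) * len(c2_grid) * len(c3_grid)
best_params, best_score = grid_search(toy_evaluate)
```

Fixing C1 first and sweeping only C2 and C3, as done for Figs. 11–13, reduces the number of runs per dataset from the full product to 25 per value of C1.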
5. Conclusion
In this study, to deal effectively with the image set classification problem, we focus on a multi-model
fusion representation, which provides an effective strategy to fully extract the latent discriminative features. To accomplish
this, we first employ the kernel trick to map different representations of the image sets (in the Euclidean
space and on the Riemannian manifold) into the reproducing kernel Hilbert space. Thereafter, three Mahalanobis distance metric
learning models are given. Then, we jointly learn two projection matrices and a metric matrix by integrating
the two heterogeneous spaces, the Euclidean space and the Riemannian manifold, into a common induced space in which the energy of
each model is preserved as much as possible and the class similarity is reflected. Moreover, we adopt the L2,1 norm to
achieve sparse feature learning. Finally, we conduct extensive experiments on different visual classification tasks to evaluate
the classification performance of our JDRML. The experimental results indicate that the proposed joint learning
method outperforms the other state-of-the-art set based classification methods and needs fewer iterations to achieve convergence.
Although our method has achieved promising results, some aspects still deserve study in
the future. First, the proposed model only learns a single metric matrix, which may not be powerful enough to exploit the
specific information of each model representation. We can further adopt a multi-metric learning method to learn multiple
model-specific metric matrices in the resulting space. Second, as a valid kernel parameter is generally difficult to select, we
are motivated to learn an optimal kernel from a set of base kernels by using a multiple kernel learning technique.
Declaration of Competing Interest
We declare that we have no financial and personal relationships with other people or organizations that can inappropriately
influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service
and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled,
“Joint Dimensionality Reduction and Metric Learning for Image Set Classification”.
CRediT authorship contribution statement
Wenzhu Yan: Conceptualization, Methodology, Writing - original draft. Quansen Sun: Supervision. Huaijiang Sun: Writ-
ing - review & editing, Supervision. Yanmeng Li: Validation, Formal analysis, Writing - original draft.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Project No.61673220 and 61772272).
Appendix
Proof of Lemma 1. To prove the convergence of the optimization problem in Eq. (26), we first introduce
Lemma 2 as follows:
Lemma 2. For any two non-zero vectors u and v, the following inequality holds:
$$\|u\|_2 - \frac{\|u\|_2^2}{2\|v\|_2} \le \|v\|_2 - \frac{\|v\|_2^2}{2\|v\|_2}. \tag{35}$$
The detailed proof of Lemma 2 is similar to that in [41].
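For completeness, the inequality of Lemma 2 can be checked by expanding a square. The short derivation below is our own sketch for vectors u and v with v non-zero; it is not copied from [41].

```latex
\begin{aligned}
\left(\|u\|_2 - \|v\|_2\right)^2 \ge 0
&\;\Longrightarrow\; \|u\|_2^2 - 2\|u\|_2\|v\|_2 + \|v\|_2^2 \ge 0 \\
&\;\Longrightarrow\; \|u\|_2 - \frac{\|u\|_2^2}{2\|v\|_2} \le \frac{\|v\|_2}{2}
   = \|v\|_2 - \frac{\|v\|_2^2}{2\|v\|_2}.
\end{aligned}
```

The second step divides by 2‖v‖₂, which is positive because v is non-zero.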
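Suppose that the solution of $\min_{M} J = \operatorname{tr}(M^{T} U M)$ can be obtained by solving the generalized eigenvalue problem. Then,
$$M_{t+1} = \arg\min_{M} J, \qquad J = \operatorname{tr}(M^{T} U M). \tag{36}$$
Thus,
$$\operatorname{tr}(M_{t+1}^{T} U M_{t+1}) \le \operatorname{tr}(M_{t}^{T} U M_{t}). \tag{37}$$
It can be inferred that
$$\operatorname{tr}(M_{t+1}^{T} U M_{t+1}) + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \operatorname{tr}(M_{t}^{T} U M_{t}) + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t}\big). \tag{38}$$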
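Then, we can obtain
$$\sum_{k} \frac{\|m_{t+1}^{k}\|_2^2}{2\|m_{t}^{k}\|_2} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \sum_{k} \frac{\|m_{t}^{k}\|_2^2}{2\|m_{t}^{k}\|_2} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t}\big). \tag{39}$$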
According to Lemma 2, for each k we have
$$\|m_{t+1}^{k}\|_2 - \frac{\|m_{t+1}^{k}\|_2^2}{2\|m_{t}^{k}\|_2} \le \|m_{t}^{k}\|_2 - \frac{\|m_{t}^{k}\|_2^2}{2\|m_{t}^{k}\|_2}. \tag{40}$$
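Thus the following inequality holds:
$$\sum_{k} \|m_{t+1}^{k}\|_2 - \sum_{k} \frac{\|m_{t+1}^{k}\|_2^2}{2\|m_{t}^{k}\|_2} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \sum_{k} \|m_{t}^{k}\|_2 - \sum_{k} \frac{\|m_{t}^{k}\|_2^2}{2\|m_{t}^{k}\|_2} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t}\big). \tag{41}$$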
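Combining Eqs. (39) and (41), we have
$$\sum_{k} \|m_{t+1}^{k}\|_2 + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \sum_{k} \|m_{t}^{k}\|_2 + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t}\big). \tag{42}$$
Based on Eq. (3), we obtain
$$\|M_{t+1}\|_{2,1} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t+1}\big) \le \|M_{t}\|_{2,1} + \operatorname{tr}\big((C_1 G_1 + C_2 G_2 + C_3 G_3) M_{t}\big). \tag{43}$$
Therefore, the convergence of Eq. (26) is proved.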
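The quantities that appear in the proof can be made concrete with a small numerical sketch. It assumes that m^k denotes the k-th row of M and that U is the diagonal matrix with entries 1/(2‖m^k_t‖₂), which is the form suggested by the step from Eq. (38) to Eq. (39); the function names and the small constant that guards against zero rows are our own additions.

```python
import numpy as np

def l21_norm(M):
    """L2,1 norm: the sum of the Euclidean norms of the rows of M."""
    return float(np.linalg.norm(M, axis=1).sum())

def reweighting_matrix(M, eps=1e-12):
    """Diagonal U with U_kk = 1 / (2 * ||m^k||_2); eps guards zero rows."""
    row_norms = np.linalg.norm(M, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

# Toy matrix whose rows have norms 5, 13 and 17.
M_t = np.array([[3.0, 4.0], [5.0, 12.0], [8.0, 15.0]])
U = reweighting_matrix(M_t)
# With U built from M_t itself, tr(M_t^T U M_t) equals half the L2,1 norm.
surrogate = float(np.trace(M_t.T @ U @ M_t))
```

This shows why minimizing tr(MᵀUM) with U fixed from the previous iterate acts as a smooth surrogate for the L2,1 norm in the iterative procedure.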
References
[1] M.J. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Trans. Pattern Anal. Mach. Intell. 21 (12) (1999) 1357–1362.
[2] M. Korytkowski, L. Rutkowski, R. Scherer, Fast image classification by boosting fuzzy classifiers, Inf. Sci. 327 (2016) 175–182.
[3] C. Zhang, J. Cheng, Y. Zhang, J. Liu, C. Liang, J. Pang, Q. Huang, Q. Tian, Image classification using boosted local features with random orientation and
location selection, Inf. Sci. 310 (2015) 118–129.
[4] H. Cevikalp, B. Triggs, Face recognition based on image sets, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010,
pp. 2567–2573.
[5] R. Wang, S. Shan, X. Chen, W. Gao, Manifold-manifold distance with application to face recognition based on image set, in: Computer Vision and
Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 2008, pp. 1–8.
[6] L. Chen, Dual linear regression based classification for face cluster recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2014, pp. 2673–2680.
[7] W. Wang, R. Wang, Z. Huang, S. Shan, X. Chen, Discriminant analysis on Riemannian manifold of gaussian distributions for face recognition with image
sets, IEEE Trans. Image Process. 27 (1) (2018) 151–163.
[8] P. Zheng, Z.-Q. Zhao, J. Gao, X. Wu, A set-level joint sparse representation for image set classification, Inf. Sci. 448 (2018) 75–90.
[9] M. Harandi, M. Salzmann, M. Baktashmotlagh, Beyond Gauss: image-set matching on the Riemannian manifold of PDFs, in: Proceedings of the IEEE
International Conference on Computer Vision, 2015, pp. 4112–4120.
[10] O. Yamaguchi, K. Fukui, K.-i. Maeda, Face recognition using temporal image sequence, in: Automatic Face and Gesture Recognition, 1998. Proceedings.
Third IEEE International Conference on, IEEE, 1998, pp. 318–323.
[11] E. Oja, Subspace methods of pattern recognition, Pattern Recognition and Image Processing series, vol. 6, Research Studies Press, 1983.
[12] T.-K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Trans. Pattern Anal. Mach.
Intell. 29 (6) (2007) 1005–1018.
[13] R. Wang, X. Chen, Manifold discriminant analysis, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009,
pp. 429–436.
[14] Y. Hu, A.S. Mian, R. Owens, Sparse approximated nearest points for image set classification, in: Computer vision and pattern recognition (CVPR), 2011
IEEE conference on, IEEE, 2011, pp. 121–128.
[15] M. Yang, P. Zhu, L. Van Gool, L. Zhang, Face recognition based on regularized nearest points between image sets, in: Automatic Face and Gesture
Recognition (FG), 2013 10th IEEE International Conference and Workshops on, IEEE, 2013, pp. 1–7.
[16] G. Shakhnarovich, J.W. Fisher, T. Darrell, Face recognition from long-term observations, in: European Conference on Computer Vision, Springer, 2002,
pp. 851–865.
[17] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, T. Darrell, Face recognition with image sets using manifold density divergence, in: Computer
Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, IEEE, 2005, pp. 581–588.
[18] M. Zhang, R. He, D. Cao, Z. Sun, T. Tan, Simultaneous feature and sample reduction for image-set classification, in: AAAI, vol. 16, 2016, pp. 1401–1407.
[19] S.A. Shah, U. Nadeem, M. Bennamoun, F. Sohel, R. Togneri, Efficient image set classification using linear regression based image reconstruction, in:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 99–108.
[20] S.A.A. Shah, M. Bennamoun, F. Boussaid, Iterative deep learning for image set based face and object recognition, Neurocomputing 174 (2016) 866–874.
[21] Z. Huang, R. Wang, S. Shan, X. Chen, Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning, Pattern
Recognit. 48 (10) (2015) 3113–3124.
[22] J. Lu, G. Wang, P. Moulin, Localized multifeature metric learning for image-set-based face recognition, IEEE Trans. Circuits Syst. Video Technol. 26 (3)
(2016) 529–540.
[23] Z. Huang, R. Wang, S. Shan, L. Van Gool, X. Chen, Cross Euclidean-to-Riemannian metric learning with application to face recognition from video, IEEE
Trans. Pattern Anal. Mach. Intell. 40 (12) (2018) 2827–2840.
[24] X. Gao, Q. Sun, H. Xu, D. Wei, J. Gao, Multi-model fusion metric learning for image set classification, Knowl. Based Syst. 164 (2019) 253–264.
[25] Y. Wu, Y. Jia, P. Li, J. Zhang, J. Yuan, Manifold kernel sparse representation of symmetric positive-definite matrices and its applications, IEEE Trans.
Image Process. 24 (11) (2015) 3729–3741.
[26] G. Feng, H. Li, J. Dong, J. Zhang, Face recognition based on Volterra kernels direct discriminant analysis and effective feature classification, Inf. Sci. 441
(2018) 187–197.
[27] P. Zheng, Z.-Q. Zhao, J. Gao, X. Wu, Image set classification based on cooperative sparse representation, Pattern Recognit. 63 (2017) 206–217.
[28] Z. Huang, R. Wang, S. Shan, X. Chen, Projection metric learning on Grassmann manifold with application to video based face recognition, in: Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 140–149.
[29] S. Liao, Y. Hu, S.Z. Li, Joint dimension reduction and metric learning for person re-identification, arXiv:1406.4216 (2014).
[30] M. Harandi, M. Salzmann, R. Hartley, Joint dimensionality reduction and metric learning: a geometric take, in: Proceedings of the 34th International
Conference on Machine Learning-Volume 70, JMLR. org, 2017, pp. 1404–1413.
[31] H. Hotelling, Relations between two sets of variates, in: Breakthroughs in statistics, Springer, 1992, pp. 162–190.
[32] W. Wang, R. Wang, S. Shan, X. Chen, Prototype discriminative learning for face image set classification, in: Asian Conference on Computer Vision,
Springer, 2016, pp. 344–360.
[33] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (2010) 2106–2112.
[34] Q. Feng, Y. Zhou, R. Lan, Pairwise linear regression classification for image set retrieval, in: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 4865–4872.
[35] S. Chen, C. Sanderson, M.T. Harandi, B.C. Lovell, Improved image set classification via joint sparse approximated nearest subspaces, in: Proceedings of
the IEEE Conference on Computer Vision and pattern Recognition, 2013, pp. 452–459.
[36] H. Hu, Sparse discriminative multimanifold Grassmannian analysis for face recognition with image sets, IEEE Trans. Circuits Syst. Video Technol. 25
(10) (2015) 1599–1611.
[37] Z. Huang, R. Wang, S. Shan, X. Li, X. Chen, Log-euclidean metric learning on symmetric positive definite manifold with application to image set
classification, in: International Conference on Machine Learning, 2015, pp. 720–729.
[38] R. Wang, H. Guo, L.S. Davis, Q. Dai, Covariance discriminative learning: a natural and efficient approach to image set classification, in: Computer Vision
and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2496–2503.
[39] M. Hayat, M. Bennamoun, S. An, Deep reconstruction models for image set classification, IEEE Trans. Pattern Anal. Mach. Intell. 37 (4) (2015) 713–727.
[40] J. Lu, G. Wang, W. Deng, P. Moulin, J. Zhou, Multi-manifold deep metric learning for image set classification, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2015, pp. 1137–1145.
[41] F. Nie, H. Huang, X. Cai, C.H. Ding, Efficient and robust feature selection via joint l21-norms minimization, in: Advances in Neural Information Pro-
cessing Systems, 2010, pp. 1813–1821.
[42] M.A. Kumar, M. Gopal, Least squares twin support vector machines for pattern classification, Expert Syst. Appl. 36 (4) (2009) 7535–7543.
[43] M. Kim, S. Kumar, V. Pavlovic, H. Rowley, Face tracking and recognition with visual constraints in real-world videos, in: Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 2008, pp. 1–8.
[44] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose,
IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643–660.
[45] Z. Huang, S. Shan, R. Wang, H. Zhang, S. Lao, A. Kuerban, X. Chen, A benchmark and comparative study of video-based face recognition on COX face
database, IEEE Trans. Image Process. 24 (12) (2015) 5967–5981.
[46] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: Computer Vision and Pattern Recognition, 2003.
Proceedings. 2003 IEEE Computer Society Conference on, vol. 2, IEEE, 2003, pp. II–409.
[47] T.-K. Kim, R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, IEEE Trans. Pattern Anal. Mach.
Intell. 31 (8) (2009) 1415–1428.
[48] L. Deng, The MNIST database of handwritten digit images for machine learning research [best of the web], IEEE Signal Process. Mag. 29 (6) (2012)
141–142.
[49] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
