
Journal Pre-proof

Distance metric learning based on the class center and nearest neighbor
relationship

Yifeng Zhao, Liming Yang

PII: S0893-6080(23)00235-6
DOI: https://doi.org/10.1016/j.neunet.2023.05.004
Reference: NN 5658

To appear in: Neural Networks

Received date : 27 October 2022


Revised date : 25 April 2023
Accepted date : 1 May 2023

Please cite this article as: Y. Zhao and L. Yang, Distance metric learning based on the class center
and nearest neighbor relationship. Neural Networks (2023), doi:
https://doi.org/10.1016/j.neunet.2023.05.004.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.

© 2023 Published by Elsevier Ltd.



Distance metric learning based on the class center and nearest neighbor relationship

Yifeng Zhao1, Liming Yang1,2,*
1 College of Information and Electrical Engineering, China Agricultural University, Beijing
2 College of Science, China Agricultural University, Beijing, Haidian, 100083, China
* Corresponding author

Abstract

Distance metric learning has been a promising technology to improve the performance of algorithms related to distance metrics. The existing distance metric learning methods are either based on the class center or the nearest neighbor relationship. In this work, we propose a new distance metric learning method based on the class center and nearest neighbor relationship (DMLCN). Specifically, when centers of different classes overlap, DMLCN first splits each class into several clusters and uses one center to represent one cluster. Then, a distance metric is learned such that each example is close to the corresponding cluster center and the nearest neighbor relationship is kept for each receptive field. Therefore, while characterizing the local structure of data, the proposed method leads to intra-class compactness and inter-class dispersion simultaneously. Further, to better process complex data, we introduce multiple metrics into DMLCN (MMLCN) by learning a local metric for each center. Following that, a new classification decision rule is designed based on the proposed methods. Moreover, we develop an iterative algorithm to optimize the proposed methods. The convergence and complexity are analyzed theoretically. Experiments on different types of data sets, including artificial data sets, benchmark data sets and noise data sets, show the feasibility and effectiveness of the proposed methods.

Keywords: Distance metric learning; Class center; Nearest neighbor relationship; Multi-metric learning

Email address: cauyanglm@163.com (Liming Yang)

1 Introduction

In many practical problems, it is necessary to measure the distance between two examples. In face recognition, we need to measure the similarity between two pictures; in handwritten character recognition, we need to judge whether two characters are similar. In addition, many machine learning algorithms involve calculating the distance between two examples, such as k-nearest neighbors (KNN) (Cover and Hart, 1967), k-nearest neighbour based weighted reduced universum twin support vector machines for class imbalance learning (KWRUTSVM-CIL) (Ganaie and Tanveer, 2022) and least squares KNN-based weighted multiclass twin support vector machines (LS-KWMTSVM) (Tanveer et al., 2021). Therefore, choosing an appropriate distance metric is very important. In recent years, distance metric learning (Li and Tian, 2018) has become a very active research field and has been applied to various applications such as image set retrieval (Hu and Hauptmann, 2021) and face verification (Chong et al., 2020; Chen and Hu, 2021). The existing distance metric learning methods can be divided into two categories: class center-based methods and nearest neighbor relationship-based methods.
The class center-based methods (such as linear discriminant analysis (LDA) (Hart et al., 2000)) try to learn a transformation matrix that pulls every example close to the center of its class. LDA uses one center to describe the shape of each class and learns a linear mapping under which examples are pulled close to the corresponding class center and class centers are pushed away from each other. However, LDA may fail when dealing with data with multiple clusters per class. Therefore, Chen and Huang (2003) proposed a method named clustering based discriminant analysis (CDA), which uses clustering methods to detect the underlying clusters. Different from them, He et al. (2018) introduced a new loss function, called the triplet-center loss, into distance metric learning.
The advantage of class center-based methods is that they can detect potential clusters and lead to intra-class compactness and inter-class dispersion. Thus, they usually show good generalization when dealing with data with a simple distribution. Different from them, nearest neighbor relationship-based methods better characterize the local structure of data by learning a distance metric that keeps the nearest neighbor relationship (under the learned distance metric, two examples of the same class are close and two examples belonging to different classes are far away from each other). Thus, these methods show a stronger ability on data sets with complex decision boundaries, and in recent years many distance metric learning methods have been based on the nearest neighbor relationship.
According to the type of constraints, nearest neighbor relationship-based methods can be divided into three categories: pairwise constraints, triple constraints, and quadruplet constraints. Due to their simple form, a large part of distance metric learning methods use pairwise constraints, such as the convex model for support vector distance metric learning (CSV-DML) (Ruan et al., 2021a), the multi-birth metric learning model (MBML) (Ren et al., 2022), the nearest-neighbor search model for distance metric learning (NNS-DML) (Ruan et al., 2021b), positive-semidefinite constrained metric learning (PCML) and nonnegative-coefficient constrained metric learning (NCML) (Zuo et al., 2017). Other methods adopt triple constraints, such as the co-occurrence embedding regularized metric learning model (CRML) (Wu et al., 2020), distance metric learning using difference of convex functions (DC) programming (DML-dc) (Nguyen and De Baets, 2018), large margin nearest neighbor (LMNN) classification (Weinberger and Saul, 2009), and clustered multi-metric learning (CMML) (Nguyen et al., 2019). Different from them, Chen et al. (2017) proposed a deep quadruplet network for person re-identification.
Based on the type of the distance metric, nearest neighbor relationship-based methods can be divided into two types: linear and nonlinear distance metric learning methods. Most of the methods introduced above are linear distance metric learning methods. Some nonlinear distance metric learning methods are based on deep metrics, such as deep localized metric learning (DLML) (Duan et al., 2018) and deep adversarial metric learning (DAML) (Duan et al., 2020), and others are based on kernel functions (Kulis et al., 2013). Compared with nonlinear distance metrics, a single linear distance metric is less powerful. Although many linear distance metric learning methods can be kernelized using a unified framework (Kulis et al., 2013), extending a single metric to multiple metrics is a more common approach; methods such as CMML and multi-metric LMNN (MM-LMNN) (Weinberger and Saul, 2009) adopt this strategy. We focus on single-metric and multi-metric linear distance metric learning in this work.
Although nearest neighbor relationship-based methods show good performance, they ignore the potential clusters of each class. In this work, we propose a new distance metric learning method that considers both the center of each class and the nearest neighbor relationship of examples. The main contributions of this work are summarized as follows:

• We propose a new distance metric learning method based on the class center and nearest neighbor relationship (DMLCN). Further, we introduce multiple metrics into DMLCN (MMLCN). Both methods characterize the local structure of data and achieve intra-class compactness and inter-class dispersion simultaneously. The proposed DMLCN and MMLCN jointly consider the class center and the nearest neighbor relationship of examples.

• When the data distribution is complex, only one center may not depict the data structure well. To handle the case of overlapping centers of different classes, the proposed DMLCN splits each class into simple parts and uses one center to describe each part. Then, it learns a distance metric to pull every example close to the corresponding cluster center. More importantly, it tries to maintain the nearest neighbor relationship within every receptive field.

• Moreover, to enhance the ability to process data sets with complex structures, DMLCN is extended to MMLCN by learning multiple metrics. In addition, we customize a decision function for DMLCN and MMLCN, which can make full use of the learning result.

• An effective algorithm is designed to solve the proposed methods, which can obtain a more accurate solution. Moreover, the convergence and complexity of the algorithm are analyzed theoretically.

• We conduct extensive experiments to evaluate the performance of DMLCN and MMLCN in noise-free and noisy settings. Experimental results show that the proposed methods perform competitively compared with state-of-the-art distance metric learning methods.

In Section 2, we give the preliminaries of this paper, including notations and background knowledge. In Section 3, we introduce the proposed methods and give the solving algorithm. In Section 4, several experiments are designed to show the effectiveness of the proposed methods. Finally, conclusions and remarks are provided in Section 5.

2 Preliminaries

In this section, we introduce some notations that will be used in the remainder of this work and the background knowledge of distance metric learning.

2.1 Notations

Lowercase letters are used to indicate scalars, lowercase bold letters are used to denote vectors, and uppercase letters are used to represent matrices. M ⪰ 0 denotes that M is a positive semidefinite matrix. We use X = {(x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m)} to denote a training set from n classes, where x_i ∈ R^d is the i-th instance with label y_i ∈ {1, ..., n}. The index set of examples of the i-th class is defined as P_i = {j | y_j = i}, the number of examples of the i-th class is denoted as m_i, and the class center of the i-th class is defined as c_i = (1/m_i) \sum_{j ∈ P_i} x_j. The distance between x_i and x_j under M is recorded as d_M(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j).
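As a concrete reference for later sections, a minimal NumPy sketch of this distance computation is given below; the function name mahalanobis_sq is our own and is not taken from the paper.

import numpy as np

def mahalanobis_sq(xi, xj, M):
    # d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j), for a positive semidefinite M
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(diff @ M @ diff)

# With M = I this reduces to the squared Euclidean distance.
print(mahalanobis_sq([1.0, 2.0], [3.0, 1.0], np.eye(2)))  # 5.0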

2.2 Background

We give a brief review of class center-based methods and nearest neighbor relationship-based methods, and then introduce the basic knowledge of multi-metric learning.

2.2.1 Class center-based methods

LDA aims at learning a transformation matrix W = (w_1, ..., w_n) to achieve inter-class dispersion and intra-class aggregation, where each component w_i is a projection vector. LDA defines the within-class and between-class covariance matrices as:

S_w = \sum_{i=1}^{n} \sum_{j ∈ P_i} (x_j − c_i)(x_j − c_i)^T    (1)

S_b = \sum_{i=1}^{n} m_i (c_i − c)(c_i − c)^T    (2)

where c = (1/m) \sum_{i} x_i. It tries to maximize the following criterion function:

J(W) = \frac{W^T S_b W}{W^T S_w W}    (3)
LDA separates the classes by describing the shape of each class with a center, squeezing the examples of each class towards the corresponding center, and separating class centers from each other. LDA may fail when the centers of different classes coincide. To solve this problem, CDA uses clustering methods to detect the underlying clusters. Due to the limitations of these methods in dealing with complex data sets, in recent years more distance metric learning methods have been based on the nearest neighbor relationship. In fact, this simple learning strategy is effective for some data sets with a simple shape.

2.2.2 Nearest neighbor relationship-based methods

In recent years, a large part of distance metric learning methods have been based on the nearest neighbor relationship. Unlike class center-based methods, for a given example, nearest neighbor relationship-based methods try to learn a distance metric under which nearest neighbors with its label are close to it while nearest neighbors with other labels are far away from it. LMNN is one of the most representative distance metric learning methods based on the nearest neighbor relationship and aims at improving the performance of the KNN classifier. For an input example x_i, LMNN hopes to learn a distance metric under which the k nearest neighbors of x_i belong to the class of x_i. To realize this target, LMNN tries to pull the k nearest neighbors with the label of x_i into a small hyper-sphere and push examples with different labels out of a larger hyper-sphere. As another classic method, geometric mean metric learning (GMML) was proposed by Zadeh et al. (2016); it has a closed-form solution and low time complexity.

2.2.3 Multi-metric distance metric learning

When dealing with data sets with highly nonlinear decision boundaries, it is a good choice to split the data into simple parts and learn multiple distance metrics. MM-LMNN learns a distance metric for every class. For an example x_i, it constructs triplets {(x_i, x_j, x_k) | y_j = y_i and y_k ≠ y_i} and learns matrices M_j ⪰ 0 and M_k ⪰ 0 such that d_{M_k}(x_i, x_k) − d_{M_j}(x_i, x_j) ≥ 1. There are two distance matrices in one constraint of MM-LMNN, which ensures that the comparison of distances is meaningful. Multi-metric learning has become a research hotspot in recent years and different learning strategies have been proposed, such as local metrics facilitated transformation (LIFT) (Ye et al., 2020) and sparse compositional metric learning (SCML) (Shi et al., 2014).

3 Main contributions

3.1 Motivation

Class center-based methods lead to more compact intra-class and more dispersed inter-class structure for data sets with a simple distribution. Consider the data in (a) in Figure 1. As shown in (b) in Figure 1, class center-based methods try to pull examples close to the center of the class they belong to and push examples away from the centers of other classes. The learning goal is that examples gather around the center of their own class and are far away from the centers of other classes, as shown in (c) in Figure 1. However, nearest neighbor relationship-based methods only consider the nearest neighbor relationship, and the pull and push forces only exist between similar and dissimilar pairs, as shown in (d) in Figure 1. Consequently, they just try to keep the local structure, as shown in (e) in Figure 1. In this situation, class center-based methods may obtain better performance.
[Figure 1 about here]

Figure 1: Schematic illustration of the advantages of class center-based methods. For data set (a), each class has a simple cluster. Panels: (a) original data distribution; (b) learning process and (c) learning goal of class center-based methods; (d) learning process and (e) learning goal of nearest neighbor relationship-based methods. A part of the pull and push forces of the two types of methods is depicted in (b) and (d), respectively.

However, when dealing with data sets with complex decision boundaries, class center-based methods are less powerful. For example, for (a) in Figure 2, if the distance metric is not powerful enough to pull every example to the corresponding center, class center-based methods can only obtain the result shown in (c), and some examples belonging to the overlapping part of class 1 and class 2 may be misclassified. Nearest neighbor relationship-based methods just try to keep the nearest neighbor relationship; their learning goal is shown in (e) in Figure 2. In this situation, they can obtain better performance because the nearest neighbor relationship in the overlapping area of class 1 and class 2 is kept.

3.2 Distance metric learning based on the class center and nearest neighbor relationship (DMLCN)

We propose a method named distance metric learning based on the class center and nearest neighbor relationship (DMLCN), which can obtain compact intra-class and dispersed inter-class structure and characterize the local structure of data.
Before introducing the proposed method, we first give the concept of the receptive field (Hammer and Villmann, 2002): the receptive field R_i of c_i is defined as R_i = {x ∈ X | ∀ c_j: ||x − c_i||_2^2 ≤ ||x − c_j||_2^2}.
[Figure 2 about here]

Figure 2: Schematic illustration of the advantages of nearest neighbor relationship-based methods. Panels: (a) original data distribution; (b) learning process and (c) learning goal of class center-based methods; (d) learning process and (e) learning goal of nearest neighbor relationship-based methods. A part of the pull and push forces of these two kinds of methods is depicted in (b) and (d), respectively.
lP
For the data set in (a) in Figure 3, we can obtain the receptive fields of the center of class 1 and the center of class 2, as shown in (b) in Figure 3. For a given example, the learning purpose of most class center-based methods is to pull it close to the center of its class, because in the prediction stage, for a given example, we only need to determine which center is closer to it. However, this purpose may not be realized because of the complex decision boundary. For this case, we propose to learn a distance metric under which, for every receptive field, the nearest neighbors of the same class are closer while the nearest neighbors of different classes are far away from each other. In addition, each example is as close to the center of its class as possible. The main idea is illustrated by (b) in Figure 3. Finally, we hope to obtain the result shown in (c) in Figure 3. Although we cannot drag each example close to the corresponding class center, the nearest neighbor relationship in each receptive field is preserved. For a testing example, we first determine which receptive field it belongs to, and then use the KNN classifier in that receptive field. It is worth noting that this makes full use of the center of each class and the nearest neighbor relationship of the training examples. Next, we introduce how to pull every example close to the corresponding class center, and then give the method to pull the nearest neighbors of the same class close and push the nearest neighbors of different classes far away from each other.
[Figure 3 about here]

Figure 3: Schematic illustration of DMLCN. For the data set in (a), DMLCN tries to pull every example close to the corresponding class center and keep the nearest neighbor relationship in each receptive field. Panels: (a) original data distribution; (b) learning process of DMLCN; (c) learning goal of DMLCN. A part of the pull and push forces is illustrated in (b).
We construct a triplet set between c_i and c_j as T_ij = {(k, i, j) | k ∈ P_i and k ∉ P_j}. To pull examples close to the corresponding class center, we try to learn a distance metric matrix that satisfies the following constraints:

d_M(x_k, c_j) ≥ d_M(x_k, c_i)    (4)

where (k, i, j) ∈ T_ij and M ⪰ 0. Besides, we hope to keep a safe margin between clusters to increase the robustness of DMLCN. Most distance metric learning methods fix the margin as 1. However, this may be inappropriate when the distance between examples is much smaller or much larger than 1. So, we fix the margin as ρ_ij = α ||c_i − c_j||_2^2 for (4) and get

d_M(x_k, c_j) − d_M(x_k, c_i) ≥ ρ_ij    (5)

where α is a constant less than 1. We can observe that (5) can be viewed as a variant of the triplet-center loss (TCL) (He et al., 2018).
Based on the receptive field R_i, the index sets of the similar and dissimilar pairs are defined as S_i = {(j, k) | x_j ∈ R_i, x_k ∈ R_i and y_j = y_k} and D_i = {(j, k) | x_j ∈ R_i, x_k ∈ R_i and y_j ≠ y_k}, respectively. To pull the nearest neighbors with the same label as close as possible, we try to minimize \sum_{(j,k) ∈ S_i} d_M(x_j, x_k). It is natural to think of minimizing −\sum_{(j,k) ∈ D_i} d_M(x_j, x_k) to push the nearest neighbors with different labels away from each other. However, this may make the objective function unbounded. To solve this problem, we use the strategy adopted in geometric mean metric learning (GMML) (Zadeh et al., 2016). The authors found that we can increase d_M(x_j, x_k) by decreasing d_{M^{-1}}(x_j, x_k). This observation follows from the fact that the gradients of d_M(x_j, x_k) and d_{M^{-1}}(x_j, x_k) with respect to M point in nearly opposite directions. So, we can maximize \sum_{(j,k) ∈ D_i} d_M(x_j, x_k) by minimizing \sum_{(j,k) ∈ D_i} d_{M^{-1}}(x_j, x_k). Moreover, \sum_{(j,k) ∈ D_i} d_{M^{-1}}(x_j, x_k) is bounded below by 0 because M^{-1} is positive semidefinite. We replace M with M + βI, β > 0, to take care of problems due to possible ill-conditioning of M.
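This effect can be checked numerically. The following sketch is our own illustration (not code from the paper): it takes one gradient-descent step on d_{M^{-1}}(x_j, x_k) with respect to M and verifies that d_M(x_j, x_k) increases while d_{M^{-1}}(x_j, x_k) decreases.

import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
M = A @ A.T + 1e-2 * np.eye(d)          # a positive definite metric matrix
v = rng.standard_normal(d)              # v plays the role of x_j - x_k

def quad(M, v):                         # d_M(x_j, x_k) = v^T M v
    return float(v @ M @ v)

Minv = np.linalg.inv(M)
# Gradient of d_{M^{-1}}(x_j, x_k) with respect to M is -M^{-1} v v^T M^{-1}.
grad = -Minv @ np.outer(v, v) @ Minv
M_new = M - 0.1 * grad                  # one descent step on d_{M^{-1}}

print(quad(Minv, v) > quad(np.linalg.inv(M_new), v))  # True: d_{M^{-1}} decreased
print(quad(M_new, v) > quad(M, v))                    # True: d_M increased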

Finally, we give the model formulation of DMLCN as follows:

min_M \sum_{i=1}^{n} \sum_{(j,k) ∈ S_i} d_{M+βI}(x_j, x_k) + v_1 \sum_{i=1}^{n} \sum_{(j,k) ∈ D_i} d_{(M+βI)^{-1}}(x_j, x_k) + v_2 \sum_{i=1}^{n} \sum_{j ∈ X^n, j ≠ i} \sum_{(k,i,j) ∈ T_ij} ξ_ijk

s.t. d_{M+βI}(x_k, c_j) − d_{M+βI}(x_k, c_i) ≥ ρ_ij − ξ_ijk,  ξ_ijk ≥ 0,
     β > 0, v_1 > 0, v_2 > 0, M ⪰ 0    (6)

where v_1, v_2 are hyper-parameters, X^n = {1, 2, ..., n} and the ξ_ijk are slack variables.
However, for the data set in (a) in Figure 4, class 1 and class 2 cannot be represented by only one center each. What is more serious is that the class centers of the two classes coincide. In this situation, DMLCN loses effectiveness, because no distance metric matrix can simultaneously pull an example toward and push it away from the same center. As shown in (b) in Figure 4, splitting class 1 and class 2 into four clusters each and using one center to describe each cluster is a good choice. So, when the shape of a class cannot be depicted by a single center, we can use a clustering method to split every class into several parts and use one center to represent each simple part, as done by Chen and Huang (2003). This is equal to using multiple centers to represent every class. Supposing l_i clusters are obtained after applying the clustering method to the i-th class, for the whole data set we get l = \sum_{i=1}^{n} l_i clusters. P_i is redefined as the index set of examples of the i-th cluster, and the center of the i-th cluster is denoted as c_i. The receptive field R_i of the center c_i is redefined as R_i = {x ∈ X | ∀ c_j: ||x − c_i||_2^2 ≤ ||x − c_j||_2^2} (Hammer and Villmann, 2002). Then, we just need to replace the n in (6) with l.
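A minimal sketch of this preprocessing step is given below, using scikit-learn's k-means (the paper also uses k-means, see Section 3.7); the helper names split_classes and receptive_fields and the per-class cluster counts are our own illustrative choices, not the exact implementation used in the paper.

import numpy as np
from sklearn.cluster import KMeans

def split_classes(X, y, clusters_per_class):
    # Split each class into clusters; return all cluster centers and, for each
    # training example, the index of the cluster it was assigned to.
    centers, cluster_id, next_id = [], np.empty(len(y), dtype=int), 0
    for c, l_i in clusters_per_class.items():
        idx = np.where(y == c)[0]
        km = KMeans(n_clusters=l_i, n_init=10, random_state=0).fit(X[idx])
        for j in range(l_i):
            centers.append(km.cluster_centers_[j])
            cluster_id[idx[km.labels_ == j]] = next_id
            next_id += 1
    return np.vstack(centers), cluster_id

def receptive_fields(X, centers):
    # Receptive field assignment: index of the nearest center (squared Euclidean distance).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)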

3.3 Multi-metric learning based on the class center and nearest neighbor relationship (MMLCN)

We introduce multiple metrics into DMLCN to enhance its ability to process complex data. We try to learn a distance metric matrix for every center. Supposing x belongs to the i-th cluster, we obtain the following constraint set for x:

d_{M_j}(x, c_j) − d_{M_i}(x, c_i) ≥ ρ_ij    (7)

where i, j ∈ {1, ..., l}, i ≠ j, M_i ⪰ 0 belongs to center c_i and M_j ⪰ 0 belongs to center c_j.

[Figure 4 about here]

Figure 4: Schematic diagram of using a clustering algorithm to handle the overlapping of class centers and the inability of a single class center to represent a class. Panels: (a) a data set with coincident class centers; (b) the goal of using clustering methods.

When determining which receptive field an example belongs to, we only need to calculate the distance under the distance metric of the corresponding center and judge which center is closer. We can find that this is similar to the constraints in MM-LMNN, or it can be viewed as a multi-metric version of TCL. Further, we can extend DMLCN to multi-metric DMLCN (MMLCN) as

min_{M_1, ..., M_l} \sum_{i=1}^{l} \sum_{(j,k) ∈ S_i} d_{M_i+βI}(x_j, x_k) + v_1 \sum_{i=1}^{l} \sum_{(j,k) ∈ D_i} d_{(M_i+βI)^{-1}}(x_j, x_k) + v_2 \sum_{i=1}^{l} \sum_{j ∈ X^l, j ≠ i} \sum_{(k,i,j) ∈ T_ij} ξ_ijk

s.t. d_{M_j+βI}(x_k, c_j) − d_{M_i+βI}(x_k, c_i) ≥ ρ_ij − ξ_ijk,  ξ_ijk ≥ 0,
     β > 0, v_1 > 0, v_2 > 0, M_i ⪰ 0, M_j ⪰ 0    (8)

Reviewing formula (8), we can find that the introduction of multiple metrics improves the ability to deal with data with nonlinear decision boundaries. In the first and second terms of (8), different matrices are used for different receptive fields, which can better capture the characteristics of each receptive field. In addition, in the third term, we use different matrices, which can produce a stronger pulling or pushing force.

3.4 Testing Phase

For a given example, if we classify it to the class of its nearest cluster center, it may be misclassified in some situations, such as (a) in Figure 3. When the shape of a cluster is complex, although we cannot pull every example close to the corresponding cluster center, the nearest neighbor relationship is still kept. So, for MMLCN, for a given example x, we first search for the nearest cluster center by using the following criterion:

i = arg min_i d_{M_i+βI}(x, c_i)    (9)

Then we use the KNN classifier in the receptive field R_i, where the distance is calculated under the distance metric M_i. This decision rule is tailored for MMLCN and makes full use of the information learned by MMLCN. For DMLCN, we just need to replace M_i with M.
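A minimal sketch of this decision rule is given below; the function name predict, the array field_id (the receptive-field index of each training example) and the choice k = 3 are our own illustrative assumptions.

import numpy as np
from collections import Counter

def predict(x, Ms, centers, beta, X_train, y_train, field_id, k=3):
    # MMLCN decision rule: pick the nearest center under its own metric (cf. (9)),
    # then run a KNN vote inside that center's receptive field with the same metric.
    reg = [M + beta * np.eye(M.shape[0]) for M in Ms]
    d_centers = [(x - c) @ Mi @ (x - c) for Mi, c in zip(reg, centers)]
    i_star = int(np.argmin(d_centers))
    idx = np.where(field_id == i_star)[0]          # training examples in R_{i*}
    d = np.array([(x - X_train[t]) @ reg[i_star] @ (x - X_train[t]) for t in idx])
    nearest = idx[np.argsort(d)[:k]]
    return Counter(y_train[nearest]).most_common(1)[0][0]

For DMLCN, the same sketch applies with every entry of Ms equal to the single learned matrix M.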

3.5 Optimization algorithm

We only give the optimization algorithm of MMLCN, because DMLCN can be regarded as a special case of MMLCN. We can rewrite (8) as

min_{M_1, ..., M_l ⪰ 0} L(M_1, ..., M_l) = \sum_{i=1}^{l} \sum_{(j,k) ∈ S_i} d_{M_i+βI}(x_j, x_k) + v_1 \sum_{i=1}^{l} \sum_{(j,k) ∈ D_i} d_{(M_i+βI)^{-1}}(x_j, x_k) + v_2 \sum_{i=1}^{l} \sum_{j ∈ X^l, j ≠ i} \sum_{(k,i,j) ∈ T_ij} ℓ(d_{M_j+βI}(x_k, c_j) − d_{M_i+βI}(x_k, c_i) − ρ_ij)    (10)

where v_1, v_2 are hyper-parameters and ℓ is defined as:

ℓ(z) = −z if z < 0;  ℓ(z) = 0 if z ≥ 0    (11)

The optimization problem (10) can be solved by standard semidefinite programming techniques (Boyd
et al., 2004). Unfortunately, as the number of clusters increases, the time and space complexity will become
intolerable. So, we use the block-coordinate descent method (Tseng, 2001) to solve (10) which views every
local distance metric matrix Mi , i ∈ {1, 2, · · · , l} as a coordinate block. When solving Mi , other matrices
are fixed as constants. Next, we discuss how to solve the subproblem of Mi and then give the final algorithm
framework for solving all local distance metric matrices.


3.5.1 Subproblem of Mi

When M_j, j ≠ i, are fixed, solving (10) is equivalent to solving the following subproblem:

min_{M_i ⪰ 0} L_i(M_i) = \sum_{(j,k) ∈ S_i} d_{M_i+βI}(x_j, x_k) + v_1 \sum_{(j,k) ∈ D_i} d_{(M_i+βI)^{-1}}(x_j, x_k)
    + v_2 ( \sum_{j ∈ X^l, j ≠ i} \sum_{(k,i,j) ∈ T_ij} ℓ(d_{M_j+βI}(x_k, c_j) − d_{M_i+βI}(x_k, c_i) − ρ_ij)
    + \sum_{j ∈ X^l, j ≠ i} \sum_{(k,j,i) ∈ T_ji} ℓ(d_{M_i+βI}(x_k, c_i) − d_{M_j+βI}(x_k, c_j) − ρ_ji) )    (12)

We can rewrite (12) as:

min_{M_i ⪰ 0} L_i(M_i) = \sum_{(j,k) ∈ S_i} d_{M_i+βI}(x_j, x_k) + v_1 \sum_{(j,k) ∈ D_i} d_{(M_i+βI)^{-1}}(x_j, x_k)
    + v_2 ( \sum_{j ∈ X^l, j ≠ i} \sum_{t=(k,i,j) ∈ T_ij} ℓ(a_t) + \sum_{j ∈ X^l, j ≠ i} \sum_{t=(k,j,i) ∈ T_ji} ℓ(a_t) )    (13)

where a_(k,i,j) = d_{M_j+βI}(x_k, c_j) − d_{M_i+βI}(x_k, c_i) − ρ_ij, and a_(k,j,i) is defined analogously with the roles of i and j exchanged. Because (11) is nondifferentiable, we adopt the iterative sub-gradient projection algorithm (Weinberger and Saul, 2009) to solve (13). We can get the sub-gradient with respect to M_i as follows:

G(M_i) = ∂L_i(M_i)/∂M_i = \sum_{(j,k) ∈ S_i} (x_j − x_k)(x_j − x_k)^T − v_1 \sum_{(j,k) ∈ D_i} (M_i + βI)^{-1} (x_j − x_k)(x_j − x_k)^T (M_i + βI)^{-1}
    + v_2 ( \sum_{j ∈ X^l, j ≠ i} \sum_{t=(k,i,j) ∈ T_ij} (∂ℓ(a_t)/∂a_t)(∂a_t/∂M_i) + \sum_{j ∈ X^l, j ≠ i} \sum_{t=(k,j,i) ∈ T_ji} (∂ℓ(a_t)/∂a_t)(∂a_t/∂M_i) )    (14)

where ∂ℓ(z)/∂z = −1 if z < 0 and 0 if z ≥ 0, ∂a_(k,i,j)/∂M_i = −(x_k − c_i)(x_k − c_i)^T and ∂a_(k,j,i)/∂M_i = (x_k − c_i)(x_k − c_i)^T.
In the iterative sub-gradient projection algorithm, we give a starting point M_i^0 for M_i. In the t-th step, we update M_i by using M_i^{t−1/2} = M_i^{t−1} − ϵ G(M_i^{t−1}), where ϵ is the step size. To ensure M_i is positive semidefinite, we project M_i^{t−1/2} onto the positive semidefinite cone and get M_i^t = psd(M_i^{t−1/2}). Here, psd is the projection operator, that is, psd(M_i^{t−1/2}) = V max{0, Σ} V^T, where Σ is the diagonal matrix consisting of the eigenvalues of M_i^{t−1/2}, V is the orthogonal matrix of eigenvectors, and max{0, Σ} sets the negative elements of Σ to 0.
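A minimal NumPy sketch of the psd(·) operator is shown below; the helper name psd follows the paper's notation, while the symmetrization step is our own numerical safeguard.

import numpy as np

def psd(M):
    # Project a symmetric matrix onto the positive semidefinite cone:
    # eigendecompose, clip negative eigenvalues to zero, and recompose.
    M = (M + M.T) / 2.0
    eigvals, V = np.linalg.eigh(M)
    return V @ np.diag(np.maximum(eigvals, 0.0)) @ V.T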
The details of this algorithm are summarized in Algorithm 1. It combines the implementation of LMNN and a distance metric learning algorithm specialised for multi-label classification tasks (Gouk et al., 2016). In Algorithm 1, s is the lower bound on the total number of iterations, h is the tolerance on the number of times the objective function does not decrease, n_1 is the upper bound on the total number of iterations, ϵ is the step size, t records the number of total iterations, e_1 is the tolerance of the difference between the objective function values in the t-th and (t−1)-th iterations, k records the number of times that the value of the objective function does not decrease, pre records the value of the objective function at the previous iteration point, and temp records the value of the objective function at the current iteration point. If the given step size ϵ is too large, the value of the objective function may not decrease. The combination of t < s and ϵ = η_2 ϵ helps to find an appropriate step size. We use ϵ = η_1 ϵ to make the objective function drop faster. However, when t ≥ s, the increase of ϵ may lead to an early stop of Algorithm 1. To solve this problem, we use k < h and ϵ = η_2 ϵ to adjust ϵ to the right size. To prevent the algorithm from iterating too many times, t < n_1 is added to the stop condition. If h > 0, pre − temp > e_1 will not take effect. If h ≤ 0, pre − temp > e_1 takes effect but k < h does not. Note that Algorithm 1 helps find a more accurate solution and is easy to apply to many optimization problems.
Algorithm 1 Algorithm for solving the subproblem of M_i
1: Input: v_1, v_2, s, h, n_1, ϵ, e_1, M_i, X
2: Initial: t = 0, k = 0, pre = +∞, temp = L_i(M_i)
3: while ((t < s or k < h or pre − temp > e_1) and t < n_1) do
4:   T_i^{1/2} = M_i − ϵ G(M_i)
5:   T_i = psd(T_i^{1/2})
6:   if (L_i(T_i) < temp) then
7:     temp = L_i(T_i)
8:     M_i = T_i
9:     k = 0
10:    ϵ = η_1 ϵ (η_1 > 1)
11:  else
12:    k = k + 1
13:    ϵ = η_2 ϵ (η_2 < 1)
14:  end if
15:  t = t + 1
16: end while
17: Output: M_i
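The following Python sketch mirrors Algorithm 1 for one coordinate block. The callables loss_fn, grad_fn and psd_project stand for L_i(·), G(·) and psd(·); the default values of ϵ, s, h, n_1, e_1, η_1 and η_2 are our own illustrative choices, and pre is kept exactly as in the listing above.

import numpy as np

def solve_subproblem(M, loss_fn, grad_fn, psd_project, eps=1e-3,
                     s=20, h=5, n1=500, e1=1e-6, eta1=1.1, eta2=0.5):
    # Sub-gradient projection with an adaptive step size (cf. Algorithm 1).
    t, k = 0, 0
    pre, temp = np.inf, loss_fn(M)   # pre is not updated inside the loop, as in the listing
    while (t < s or k < h or pre - temp > e1) and t < n1:
        T = psd_project(M - eps * grad_fn(M))   # gradient step, then PSD projection
        loss_T = loss_fn(T)
        if loss_T < temp:                       # accept the step and enlarge the step size
            temp, M, k = loss_T, T, 0
            eps *= eta1
        else:                                   # reject the step and shrink the step size
            k += 1
            eps *= eta2
        t += 1
    return M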


3.5.2 Final algorithm

We use the block-coordinate descent method to obtain the optimal solution of (8). The upper bound on the total number of iterations and the tolerance of the difference between the objective function values in the t-th and (t−1)-th iterations are denoted by n_2 and e_2, respectively. We choose the identity matrix as the initial value of each local metric. In other words, (M_1^0, M_2^0, ..., M_l^0) = (I, I, ..., I), where I is the identity matrix of the same shape as the distance metric. In the t-th iteration, we first obtain the optimal solution of (12) using Algorithm 1 and update (M_1^t, M_2^t, ..., M_l^t) to (M_1^{t+1}, M_2^t, ..., M_l^t). Then, by solving the subproblem of M_2, (M_1^{t+1}, M_2^t, ..., M_l^t) is updated to (M_1^{t+1}, M_2^{t+1}, ..., M_l^t). After repeating this l times, the t-th iteration is finished and (M_1^t, M_2^t, ..., M_l^t) is replaced by (M_1^{t+1}, M_2^{t+1}, ..., M_l^{t+1}). When the value of the objective function no longer decreases, the solution of (8) is found. The final algorithm is summarized in Algorithm 2.

Algorithm 2 Final algorithm framework
1: Input: v_1, v_2, n_2, e_2, X
2: Initial: (M_1^0, M_2^0, ..., M_l^0) = (I, I, ..., I), t = 1, pre = +∞, temp = L(I, I, ..., I)
3: while (pre − temp ≥ e_2 and t < n_2) do
4:   i = 1
5:   for (i ≤ l) do
6:     Get M_i^t using Algorithm 1
7:   end for
8:   pre = temp
9:   temp = L(M_1^t, M_2^t, ..., M_l^t)
10:  t = t + 1
11: end while
12: Output: M_1^t, M_2^t, ..., M_l^t
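A corresponding Python sketch of the outer block-coordinate loop is given below; full_loss(Ms) and make_subproblem(i, Ms), which build L(·) and the per-block L_i(·) and G(·) with the other matrices held fixed, are hypothetical helpers, and solve_subproblem is the sketch shown after Algorithm 1.

import numpy as np

def mmlcn_bcd(l, d, full_loss, make_subproblem, psd_project, solve_subproblem,
              n2=50, e2=1e-4):
    # Block-coordinate descent over the local metrics M_1, ..., M_l (cf. Algorithm 2).
    Ms = [np.eye(d) for _ in range(l)]          # identity initialization
    pre, temp, t = np.inf, full_loss(Ms), 1
    while pre - temp >= e2 and t < n2:
        for i in range(l):                      # update each block with the others fixed
            loss_fn, grad_fn = make_subproblem(i, Ms)
            Ms[i] = solve_subproblem(Ms[i], loss_fn, grad_fn, psd_project)
        pre, temp = temp, full_loss(Ms)
        t += 1
    return Ms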

3.6 Convergence analysis

Theorem 1 For any M_1, M_2, ..., M_l ⪰ 0, we can get L(M_1, M_2, ..., M_l) ≥ 0 and L_i(M_i) ≥ 0, where i ∈ {1, 2, ..., l}.

Proof 1 We can get (M_1 + βI)^{-1}, (M_2 + βI)^{-1}, ..., (M_l + βI)^{-1} ≻ 0 according to M_1, M_2, ..., M_l ⪰ 0. So, we have the following conclusions:

\sum_{(j,k) ∈ S_i} d_{M_i+βI}(x_j, x_k) ≥ 0,   \sum_{(j,k) ∈ D_i} d_{(M_i+βI)^{-1}}(x_j, x_k) ≥ 0    (15)

for any i ∈ {1, 2, ..., l}. Observing (11), we know that ℓ is a nonnegative function. In other words, ℓ(a_t) ≥ 0 for any t = (k, i, j) ∈ T_ij, where i, j ∈ {1, 2, ..., l} and i ≠ j.
So, we have L(M_1, M_2, ..., M_l) ≥ 0 and L_i(M_i) ≥ 0 for any M_1, M_2, ..., M_l ⪰ 0.

Theorem 2 Supposing Algorithm 1 can find the optimal solution of (12), the sequence L(M_1^t, M_2^t, ..., M_l^t) generated by the optimization algorithm is monotonically decreasing and convergent.

Proof 2 In the (t + 1)-th iteration of Algorithm 2, we get M_1^{t+1} using Algorithm 1. If Algorithm 1 can find the optimal solution, we have

L_1(M_1^{t+1}) ≤ L_1(M_1^t)    (16)

Adding

\sum_{i ∈ X^l, i ≠ 1} \sum_{(j,k) ∈ S_i} d_{M_i^t+βI}(x_j, x_k) + v_1 \sum_{i ∈ X^l, i ≠ 1} \sum_{(j,k) ∈ D_i} d_{(M_i^t+βI)^{-1}}(x_j, x_k) + v_2 \sum_{i ∈ X^l, i ≠ 1} \sum_{j ∈ X^l, j ≠ i, j ≠ 1} \sum_{(k,i,j) ∈ T_ij} ℓ(d_{M_j^t+βI}(x_k, c_j) − d_{M_i^t+βI}(x_k, c_i) − ρ_ij)    (17)

to the left and right sides of inequality (16), we get

L(M_1^t, M_2^t, ..., M_l^t) ≥ L(M_1^{t+1}, M_2^t, ..., M_l^t)    (18)

Iteratively, we have

L(M_1^t, M_2^t, ..., M_l^t) ≥ L(M_1^{t+1}, M_2^t, ..., M_l^t) ≥ L(M_1^{t+1}, M_2^{t+1}, ..., M_l^t) ≥ ... ≥ L(M_1^{t+1}, M_2^{t+1}, ..., M_l^{t+1})    (19)

So, the sequence L(M_1^t, M_2^t, ..., M_l^t) generated by the optimization algorithm is monotonically decreasing. Furthermore, from Theorem 1, we know that L(M_1^t, M_2^t, ..., M_l^t) has the lower bound 0. So, the sequence L(M_1^t, M_2^t, ..., M_l^t) is also convergent.


3.7 Computational complexity

Because DMLCN can be regarded as a special case of MMLCN, only the complexity of MMLCN is analyzed in this section. To simplify the expression, we omit the constants in the time complexity. The working process of MMLCN includes two stages: using a clustering algorithm to split every class into simple parts, and learning the distance metrics M_1, ..., M_l. We choose k-means as the clustering algorithm because of its simplicity and efficiency. According to (Nguyen et al., 2019), the computational complexity of k-means is O(m·d·l·T), where T is the number of iterations. Before solving the optimization problem (8), we use matrices Q_i and N_i to store \sum_{(j,k) ∈ S_i} (x_j − x_k)(x_j − x_k)^T and \sum_{(j,k) ∈ D_i} (x_j − x_k)(x_j − x_k)^T, respectively, where i ∈ {1, 2, ..., l}. The computational complexity of calculating Q_i and N_i is O(|S_i|·d^2) and O(|D_i|·d^2), respectively, where |·| is the operator that returns the number of elements. When using Algorithm 1 to update M_i, the computational complexity of each iteration is O((l|P_i| + \sum_{j ∈ X^l, j ≠ i} |P_j|)·d^2 + d^3), which includes the calculation of G(M_i), L_i(M_i) and psd(M_i). So, the computational complexity of updating M_i is O(n_1((l|P_i| + \sum_{j ∈ X^l, j ≠ i} |P_j|)·d^2 + d^3)), and the computational complexity of Algorithm 2 is O(\sum_{i=1}^{l} (|S_i| + |D_i|)·d^2 + n_2·\sum_{i=1}^{l} n_1((l|P_i| + \sum_{j ∈ X^l, j ≠ i} |P_j|)·d^2 + d^3)), which is equivalent to O(\sum_{i=1}^{l} (|S_i| + |D_i|)·d^2 + n_1·n_2·l·(d^3 + m·d^2)). In this paper, we only search for one similar nearest example and one dissimilar nearest example for every example, so O(\sum_{i=1}^{l} (|S_i| + |D_i|)·d^2 + n_1·n_2·l·(d^3 + m·d^2)) can be simplified to O(n_1·n_2·l·(d^3 + m·d^2)). The total computational complexity is O(n_1·n_2·l·(d^3 + m·d^2) + m·d·l·T).
The computational burden of the algorithm increases with the number and dimension of examples. For data sets with a large number of examples, using online learning algorithms is a good choice (Nguyen et al., 2020). For data sets with high dimensions, we can replace M_i by L_i L_i^T and learn L_i directly, which ensures that M_i is a positive semidefinite matrix without the psd operation, as done by Ye et al. (2020). Here, L_i is a linear transformation matrix. However, this may make the model non-convex.

4 Experiments

To show the effectiveness of the proposed methods, several numerical experiments are designed. The proposed methods are compared with the baseline Euclidean distance metric and seven state-of-the-art distance metric learning methods, including LMNN, CMML, DML-dc, DMLMJ, GMML, ITML, and LMDML. The experiments are carried out in four steps. In the first step, we generate three two-dimensional toy sets to intuitively illustrate the power of the proposed methods. In the second step, experiments are conducted on fifteen benchmark data sets. In the third step, an ablation study is carried out to show the importance of the class center part and the nearest neighbor relationship part for DMLCN and MMLCN. Finally, we analyze the impact of noise on the proposed methods.
4.1 Experimental setting

To ensure the fairness of the experiments, all the experiments are carried out in MATLAB R2018a on a PC with an Intel Core i7-8700 processor (3.20GHz) and 16GB RAM. When the KNN classifier is used, the value of K is chosen as 3 for all of the compared methods. To ensure the same scale for each dimension, we normalize the value of every attribute of each example between 0 and 1. Ten-fold cross-validation is used to reduce the influence of contingency and increase the reliability of the experimental results. The commonly used evaluation criterion Accuracy (ACC) is adopted in this paper:

ACC = (TP + TN) / (TP + TN + FP + FN)    (20)

where TP, TN, FN, and FP denote true positives, true negatives, false negatives, and false positives, respectively. The parameter selection details are given as follows:

1. Euclidean: The Euclidean distance metric is used as the baseline distance metric, in which the distance between two points is calculated under the identity matrix.

2. CMML1 : Clustered multi-metric learning (Nguyen et al., 2019). CMML is a well-known multi-
metric learning method that aims at learning one global and several local distance metrics. It splits the
whole data into several clusters using k-means algorithm and learns a distance metric for each cluster.
In addition, the global distance metric is learned to maintain the common information among clusters.
The parameter α is chosen from {0.001,0.1,10} and parameter β is selected from {0.001,0.01,0.1}.
Jou

3. DML-dc2 : Distance metric learning using DC programming (Nguyen and De Baets, 2018). To get a
more robust distance metric, DML-dc replaces the hinge loss with the ramp loss function (Collobert
et al., 2006) which gives an upper bound to the value of the loss. It adopts a DC algorithm (DCA)
(Tao and Le Thi, 1997) to solve the non-convex optimization problem. The trade-off parameter λ is
chosen from {0.001,0.01,0.1,1} and parameter s is set to −1.
1 https://github.com/bacnguyencong/CMML
2 https://users.ugent.be/ bacnguye/DML-dc.v1.0.zip


4. DMLMJ3 : Distance metric learning through maximization of the Jeffrey divergence (Nguyen et al., 2017). DMLMJ tries to learn a distance metric under which the Jeffrey divergence between the distribution of the differences in the positive difference space and the distribution of the differences in the negative difference space is maximized. There are no parameters to select for DMLMJ.

5. GMML4 : Geometric mean metric learning. GMML enjoys a closed-form solution, which leads to a fast learning process. The parameter λ is fixed as 0.1. For parameter t, we search in {0.1, 0.3, 0.5, 0.7, 0.9} in the first step. If the highest accuracy is obtained at t = 0.1 or t = 0.9, t is selected from {0, 0.02, ..., 0.24} ∪ {0.00001, 0.001} or {0.76, 0.78, ..., 1} ∪ {0.9999, 0.995} in the second step. Otherwise, t is selected from {s − 0.14, s − 0.12, ..., s, ..., s + 0.12, s + 0.14}, where s is the best value of t obtained in the first step.

6. LMNN5 : As one of the most classical and successful distance metric learning methods, LMNN aims at learning a distance metric to improve the performance of the KNN classifier. For parameter u, we search in {0.125, 0.25, 0.5}.

7. ITML6 : Information theoretic metric learning algorithm (Davis et al., 2007). ITML formulates the distance metric learning problem as minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function. It can ensure the learned distance metric is positive semidefinite without the psd operation, so ITML is fast and scalable. The parameter γ is selected from {0.01, 0.1, 1, 10}.
8. LMDML7 : Large-margin distance metric learning (Nguyen et al., 2020). LMDML employs the principle of margin maximization to learn the distance metric. To obtain the optimal solution of LMDML, Nguyen et al. (2020) designed an efficient online algorithm based on stochastic gradient descent (SGD) (Robbins and Monro, 1951), called LMDML-A, which is fast when dealing with large-scale data sets. The parameter B of LMDML-A is searched in {0.1, 1, 10, 100}.
9. DMLCN and MMLCN: The parameters v_1 and v_2 are selected from {0.1, 0.3, 0.5, 0.7, 0.9, 3, 7} and {0.01, 0.1, 1, 10, 100, 1000}, respectively. In addition, we fix β as 10^{-8}.
3 https://users.ugent.be/bacnguy /DMLMJ.zip
4 https://codeload.github.com/PouriaZ/GMML/zip/refs/heads/master
5 https://codeload.github.com/gabeos/lmnn/zip/master
6 https://www.cs.utexas.edu/ pjain/itml/
7 http://users.ugent.be/ bacnguye/LMDML-A.v1.0.zip


4.2 Experiment on artificial data sets

We generate three two-dimensional toy sets to reveal the effectiveness of DMLCN and MMLCN intuitively.

• Toy set 1 (see (a) in Figure 5) is a 4-square chessboard data set, which is often used to test data-mining algorithms. We use the MATLAB toolbox to generate 500 positive samples and 500 negative samples.

• Toy set 2 (see (b) in Figure 5) consists of one diamond and four triangles. We generate this data set in two steps. In the first step, we generate 1000 two-dimensional examples; each example (x, y) satisfies −1 ≤ x ≤ 1 and −1 ≤ y ≤ 1. In the second step, we label an example (x, y) that satisfies |x + y| ≤ 1 and |y − x| ≤ 1 as positive; otherwise, we label it as negative.

• Toy set 3 (see (c) in Figure 5) is a mixed-Gaussian data set; a generation sketch is given after this list. Positive examples consist of 250 examples that obey N((0, 0)^T, diag{1, 1}) and 250 examples that obey N((4, 0.5)^T, diag{1, 0.75}); negative examples consist of 250 examples that obey N((2, 2.5)^T, diag{1, 0.7}) and 250 examples that obey N((6, 3.2)^T, diag{1, 1}). Here, x ∼ N((u_1, u_2)^T, diag{σ_1^2, σ_2^2}) indicates that the two dimensions of example x follow N(u_1, σ_1^2) and N(u_2, σ_2^2), respectively.
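As an illustration, toy set 3 can be generated with the following NumPy sketch (our own code based on the description above; the random seed is arbitrary).

import numpy as np

rng = np.random.default_rng(0)

def gaussian2d(mean, var, n=250):
    # n samples whose two coordinates independently follow N(mean[k], var[k])
    return rng.normal(loc=mean, scale=np.sqrt(var), size=(n, 2))

X_pos = np.vstack([gaussian2d([0.0, 0.0], [1.0, 1.0]),
                   gaussian2d([4.0, 0.5], [1.0, 0.75])])
X_neg = np.vstack([gaussian2d([2.0, 2.5], [1.0, 0.7]),
                   gaussian2d([6.0, 3.2], [1.0, 1.0])])
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])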
[Figure 5 about here]

Figure 5: Artificial data sets: (a) toy set 1, (b) toy set 2, (c) toy set 3.

For toy set 1 and toy set 3, the number of clusters of class 1 and class 2 is selected as 2. There is one diamond and four triangles in toy set 2, so we generate 4 clusters for class 2 and 1 cluster for class 1. The experimental results are summarized in Table 1. For every data set, the highest accuracy is bold and underlined, and the second-highest accuracy is bold. On toy set 1 and toy set 3, DMLCN and MMLCN perform better than the comparison methods. On toy set 2, MMLCN obtains the best performance while DMLCN performs poorly, because there are clusters with different distributions and multiple metrics can better capture the distribution of these clusters. On toy set 1 and toy set 2, compared with MMLCN, the improvements of the comparison methods over the baseline KNN classifier are not obvious. On toy set 3, both DMLCN and MMLCN obtain good performance by considering the center of every cluster and keeping the nearest neighbor relationship in each receptive field.

Table 1: Classification results (ACC, %) on the artificial data sets. The highest accuracy is bold and underlined, and the second-highest accuracy is bold.

Data Sets Euclidean CMML DML-dc DMLMJ GMML LMNN ITML LMDML DMLCN MMLCN
toy set 1 98.20±1.81 97.70±1.25 98.20±1.81 98.20±1.81 98.30±1.77 98.30±1.77 98.00±1.70 98.20±1.81 99.10±0.99 99.20±1.55
toy set 2 95.40±2.84 95.40±2.50 95.40±2.84 95.20±2.86 95.60±2.95 95.60±2.95 95.80±3.05 95.40±2.84 95.60±2.46 97.00±2.87
toy set 3 92.00±2.05 93.20±1.69 93.40±1.84 91.80±2.30 92.50±2.01 92.80±1.87 92.30±2.31 93.50±1.78 93.90±1.52 93.70±1.70

4.3 Experiment on benchmark data sets


We compare these distance metric learning methods on one image data set, COIL208, and fourteen data sets obtained from the Knowledge Extraction based on Evolutionary Learning (KEEL) machine learning repository9. We use principal component analysis (PCA) (Hart et al., 2000) to reduce COIL20 to 100 dimensions. The details of these data sets are summarized in Table 2. For convenience, the cluster number l_i of the i-th class is determined by the following principle:


1
 mi ≤ interval (21)

rna

 mi
li = ⌊ ⌋ interval < mi < z × interval (22)

 interval


z z × interval ≤ mi (23)

Here, mi is the number of examples of the i − th class, interval is a positive integer, z is the positive
integer that is bigger than 1 and ⌊ interval
mi
⌋ denotes the integer which is not bigger than mi
interval
. We use
(21) to ensure sufficient examples to train the distance metrics. Generating too many clusters will increase
Jou

computational complexity. So, (23) is used to limit the number of clusters. The value of interval and z are
selected as 100 and 4.
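A direct transcription of this rule into Python is shown below (the function name cluster_number is ours).

def cluster_number(m_i, interval=100, z=4):
    # Number of clusters l_i for a class with m_i examples, following (21)-(23).
    if m_i <= interval:
        return 1
    if m_i < z * interval:
        return m_i // interval
    return z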
8 http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
9 https://sci2s.ugr.es/keel/datasets.php


Table 2: The details of the benchmark data sets.

#  Data sets    #Examples  #Classes  #Features
1  appendicitis 106        2         7
2  balance      625        3         4
3  bands        365        2         19
4  glass        214        6         9
5  heart        270        2         13
6  iris         150        3         4
7  led7digit    500        10        7
8  newthyroid   215        3         5
9  page-blocks  5472       5         10
10 pima         768        2         8
11 tae          151        3         5
12 twonorm      7400       2         20
13 wdbc         569        2         30
14 wisconsin    683        2         9
15 COIL20       1440       20        1024

4.3.1 Results
In Table 3, the highest accuracy is bold and underlined, and the second-highest accuracy is bold. Rank 1 is assigned to the method obtaining the highest accuracy, rank 2 is assigned to the method obtaining the second-highest accuracy, and so on. We give the average rank in the last row of Table 3.
DMLCN and MMLCN outperform the Euclidean distance metric across all fifteen data sets, which reveals that the learned distance metric is helpful for improving the classification performance. Compared with Euclidean, the improvements in accuracy of MMLCN on balance, heart and tae, and the improvement in accuracy of DMLCN on heart, are more than 5%. Besides, from Table 3, we can find that on some data sets, such as bands, heart, iris, led7digit, twonorm and wisconsin, DMLCN performs best, and on some data sets, such as glass, twonorm and wdbc, MMLCN obtains the best performance. On other data sets, such as balance, heart, iris, tae and wisconsin, MMLCN also performs competitively. Besides, DMLCN and MMLCN obtain the two lowest average ranks. MMLCN performs better than DMLCN on balance, glass, newthyroid, page-blocks, tae and wdbc. However, MMLCN does not obtain an improvement on the other data sets. While enhancing the ability of DMLCN to process complex data, MMLCN also faces the risk of over-fitting.
The Friedman test and the Bonferroni-Dunn post-hoc test (Demšar, 2006) are adopted to statistically analyze the performance of the ten distance metric learning methods.
Table 3: Classification results (ACC, %) on the benchmark data sets. The highest accuracy is bold and underlined, and the second-highest accuracy is bold. Rank 1 is assigned to the method obtaining the highest accuracy, rank 2 is assigned to the method obtaining the second-highest accuracy, and so on.

Data Sets Euclidean CMML DML-dc DMLMJ GMML LMNN ITML LMDML DMLCN MMLCN
appendicitis 84.91±11.43 86.55±10.63 84.82±13.33 85.64±10.60 87.73±10.28 83.82±13.81 88.55±11.3 84.91±11.43 86.73±11.52 86.64±8.32
balance 83.67±3.56 91.35±1.90 91.36±3.40 93.60±3.31 83.68±4.49 81.28±2.71 91.03±2.76 85.26±8.75 87.04±4.00 92.80±1.35
bands 69.61±8.55 69.56±7.91 72.36±9.43 69.04±6.42 70.98±6.93 73.18±8.52 69.59±7.75 71.54±6.79 74.26±8.47 71.76±6.79
glass 68.70±5.28 66.88±4.61 68.70±5.28 70.58±7.40 70.11±6.64 71.08±7.87 66.41±6.96 71.06±7.42 68.72±5.18 71.08±6.66
heart 76.30±8.23 80.37±4.29 77.41±5.37 75.56±7.65 80.00±8.76 79.26±8.23 78.89±7.82 78.89±7.82 82.59±8.20 81.85±7.70
iris 96.00±6.44 96.00±4.66 96.00±6.44 96.67±4.71 98.00±3.22 96.00±4.66 97.33±3.44 96.00±6.44 98.00±3.22 97.33±3.44
led7digit 65.20±6.61 65.80±6.43 64.80±7.61 64.00±6.67 65.80±8.46 66.60±6.47 65.00±7.56 64.80±7.32 68.40±6.98 66.40±6.31
newthyroid 93.53±4.39 96.34±4.18 96.80±4.82 97.23±3.21 95.84±3.37 95.41±5.25 97.27±4.39 95.41±5.25 95.89±6.23 96.80±4.82
page-blocks 95.98±0.78 95.19±1.05 95.96±0.75 96.02±0.83 96.89±0.54 96.58±0.74 96.58±0.71 95.96±0.75 96.24±0.71 96.34±0.77
pima 72.91±4.22 75.39±3.30 72.91±4.22 71.74±4.24 73.83±4.71 72.79±4.91 73.18±4.57 73.19±5.16 74.61±5.22 73.17±6.08
tae 44.96±13.22 47.67±10.19 46.29±12.58 45.63±10.62 52.29±8.30 51.50±13.66 46.96±13.16 47.67±9.69 48.96±11.09 51.63±10.07
twonorm 96.39±0.42 97.35±0.51 97.04±0.37 97.08±0.70 97.31±0.34 96.64±0.39 96.74±0.67 97.07±0.50 97.62±0.47 97.62±0.36
wdbc 96.66±2.67 97.01±2.62 96.84±1.39 96.84±2.72 97.19±2.64 96.84±2.31 97.01±2.49 97.01±2.49 96.83±2.32 97.71±1.67
wisconsin 97.22±1.62 97.07±1.54 96.77±1.36 96.48±2.10 97.22±1.76 97.22±1.75 97.07±1.55 97.21±1.76 97.66±1.73 97.51±1.39
COIL20 97.71±1.73 99.10±0.66 99.72±0.67 99.58±0.67 99.24±1.06 100.00±0.00 99.38±0.83 99.03±1.05 99.79±0.34 98.61±0.87
Average rank 8.17 5.7 6.7 6.6 3.93 5.5 5.4 6.4 3.33 3.27

Under the null-hypothesis that all the algorithms are equivalent, the Friedman statistic can be computed as:

χ_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 − \frac{k(k+1)^2}{4} \right]

where R_j denotes the average rank of the j-th method, k is the number of methods and N is the number of data sets. However, according to (Demšar, 2006), Friedman's χ_F^2 is undesirably conservative and we usually use

F_F = \frac{(N−1) χ_F^2}{N(k−1) − χ_F^2}

which is distributed according to the F-distribution with k−1 and (k−1)(N−1) degrees of freedom. Using the average ranks in Table 3, we get χ_F^2 = 37.2855 and F_F = 5.3421. The critical value of F(9, 126) for α = 0.1 is 1.6817. So, we reject the null-hypothesis. To judge whether the compared methods are worse than DMLCN or MMLCN, we adopt the Bonferroni-Dunn test. Two methods are viewed as significantly different if the difference between their average ranks is larger than the critical difference:

CD = q_α \sqrt{\frac{k(k+1)}{6N}}

For α = 0.1 and k = 10, we get q_α = 2.539 and CD = 2.81. The differences in average rank between DMLCN (and MMLCN) and the comparison methods are shown in Table 4, where R_# − R_* denotes the difference between method # and method ∗. So we can conclude that DMLCN and MMLCN perform better than Euclidean, DML-dc, DMLMJ, and LMDML. Besides, both DMLCN and MMLCN perform competitively compared with the other distance metric learning methods because their average ranks are lower than those of the other methods.
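For reference, the reported statistics can be reproduced from the average ranks in Table 3 with the short Python check below (our own script, using the formulas above).

import math

ranks = [8.17, 5.7, 6.7, 6.6, 3.93, 5.5, 5.4, 6.4, 3.33, 3.27]   # average ranks from Table 3
N, k = 15, len(ranks)                                            # number of data sets and methods

chi2_F = 12 * N / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
CD = 2.539 * math.sqrt(k * (k + 1) / (6 * N))                    # q_alpha = 2.539 for alpha = 0.1

print(round(chi2_F, 4), round(F_F, 4), round(CD, 2))             # approx. 37.2855, 5.3421, 2.81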

Table 4: Algorithmic average rank differences. R_# − R_* denotes the difference between method # and method ∗.

R_# − R_DMLCN              Rank difference    R_# − R_MMLCN              Rank difference
R_Euclidean − R_DMLCN      4.84 > 2.81        R_Euclidean − R_MMLCN      4.9 > 2.81
R_CMML − R_DMLCN           2.37 > 0           R_CMML − R_MMLCN           2.43 > 0
R_DML-dc − R_DMLCN         3.37 > 2.81        R_DML-dc − R_MMLCN         3.43 > 2.81
R_DMLMJ − R_DMLCN          3.27 > 2.81        R_DMLMJ − R_MMLCN          3.33 > 2.81
R_GMML − R_DMLCN           0.6 > 0            R_GMML − R_MMLCN           0.66 > 0
R_LMNN − R_DMLCN           2.17 > 0           R_LMNN − R_MMLCN           2.23 > 0
R_ITML − R_DMLCN           2.07 > 0           R_ITML − R_MMLCN           2.13 > 0
R_LMDML − R_DMLCN          3.07 > 2.81        R_LMDML − R_MMLCN          3.13 > 2.81

4.3.2 Visualization
We use t-SNE to visualize the learning results to intuitively demonstrate the effect of DMLCN. After obtaining the distance metric matrix M by DMLCN, we map instance x_i to L^T x_i, where M = L L^T and L is a transformation matrix with the same shape as M. The Euclidean distance metric, which uses the identity matrix as the distance metric matrix, is used as a comparison. As shown in Figure 6, on the iris data set, except for several examples in class 2, class 2 and class 3 are separated from each other under the distance metric matrix learned by DMLCN. On the newthyroid data set, examples are closer to the corresponding class centers under the distance metric matrix learned by DMLCN than under the Euclidean distance metric matrix.

4.3.3 Parameter sensitivity



It is important to choose appropriate hyper-parameters for many machine learning methods. To analyze the influence of the parameters v_1 and v_2 on MMLCN, we give the relationship between ACC and the parameters in Figure 7. For MMLCN, the ACC shows an upward trend with the increase of v_2 on balance, iris, and led7digit. However, for glass, the accuracy decreases when v_2 is too large. For data sets with a simple distribution, it is appropriate to strengthen the class center part of MMLCN, and therefore a large value is needed for v_2.

[Figure 6 about here]

Figure 6: Visualization effects of DMLCN on benchmark data sets: (1) iris and (2) newthyroid, under (a) the Euclidean distance metric and (b) the distance metric learned by DMLCN.

In contrast, for data sets with highly nonlinear decision boundaries, if v_2 is too large, the first and second terms become powerless, with the result that the nearest neighbor relationship cannot be maintained. Choosing an appropriate value for v_1 is also important for MMLCN, because v_1 balances the first and second terms in (8). We can observe that changing v_1 influences the performance of MMLCN on these data sets.
[Figure 7 about here]

Figure 7: Sensitivity analysis of MMLCN: ACC as a function of v_1 and v_2 on appendicitis, balance, bands, glass, heart, iris, led7digit, newthyroid, page-blocks, pima, tae, twonorm, wdbc, wisconsin and COIL20.

4.4 Ablation study



The proposed methods consider both the center of each class and the nearest neighbor relationship. To
verify the effectiveness of the combination of these two parts, we implement four submodels:

1. DMLCC: Distance metric learning based on the class center. DMLCC drops the nearest neighbor relationship part of DMLCN, which leads to the following formulation:

$$\min_{M \succeq 0} \; \sum_{i=1}^{l} \sum_{j \in X_l,\, j \neq i} \; \sum_{(k,i,j) \in T_{ij}} \ell\!\left(d_{(M+\beta I)}(x_k, c_j) - d_{(M+\beta I)}(x_k, c_i) - \rho_{ij}\right) \qquad (24)$$


2. DMLNN: Distance metric learning based on the nearest neighbor relationship. DMLNN, which keeps only the nearest neighbor relationship part of DMLCN, is used to verify whether considering the center of each class is effective. The formulation of DMLNN is as follows (a code sketch in the spirit of this objective is given after the list):

$$\min_{M \succeq 0} \; \sum_{i=1}^{l} \sum_{(j,k) \in S_i} d_{(M+\beta I)}(x_j, x_k) + v_1 \sum_{i=1}^{l} \sum_{(j,k) \in D_i} d_{(M+\beta I)^{-1}}(x_j, x_k) \qquad (25)$$

3. MMLCC: Multi-metric learning based on the class center. MMLCC is a multi-metric version of DMLCC. We use MMLCC to verify the effectiveness of using the nearest neighbor relationship in MMLCN. The model formulation of MMLCC is given as follows:

$$\min_{M_1, M_2, \ldots, M_l \succeq 0} \; \sum_{i=1}^{l} \sum_{j \in X_l,\, j \neq i} \; \sum_{(k,i,j) \in T_{ij}} \ell\!\left(d_{(M_j+\beta I)}(x_k, c_j) - d_{(M_i+\beta I)}(x_k, c_i) - \rho_{ij}\right) \qquad (26)$$

4. MMLNN: Multi-metric learning based on the nearest neighbor relationship. MMLNN is a multi-metric version of DMLNN. It is used to show the role of the center of each class in MMLCN. We give the formulation of MMLNN as follows:

$$\min_{M_1, M_2, \ldots, M_l \succeq 0} \; \sum_{i=1}^{l} \sum_{(j,k) \in S_i} d_{(M_i+\beta I)}(x_j, x_k) + v_1 \sum_{i=1}^{l} \sum_{(j,k) \in D_i} d_{(M_i+\beta I)^{-1}}(x_j, x_k) \qquad (27)$$
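As referenced in the DMLNN item above, the following minimal sketch evaluates a DMLNN-style objective of the form (25) for a given metric matrix. The distance function, the flattened pair lists, and all numerical values are illustrative assumptions; in particular, the paper's exact definition of d and its per-class pair sets S_i and D_i should be substituted.

import numpy as np

def d_A(x, y, A):
    # Mahalanobis-type distance under a PSD matrix A; replace with the paper's
    # exact definition of d if it differs (e.g., squared vs. non-squared).
    diff = x - y
    return np.sqrt(diff @ A @ diff)

def dmlnn_objective(M, beta, v1, similar_pairs, dissimilar_pairs, X):
    # Objective in the spirit of (25): pull similar pairs together under M + beta*I
    # and push dissimilar pairs apart under the inverse metric (M + beta*I)^{-1}.
    A = M + beta * np.eye(M.shape[0])
    A_inv = np.linalg.inv(A)
    pull = sum(d_A(X[j], X[k], A) for j, k in similar_pairs)
    push = sum(d_A(X[j], X[k], A_inv) for j, k in dissimilar_pairs)
    return pull + v1 * push

# Toy usage with made-up index pairs (not the paper's pair construction).
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9]])
similar_pairs = [(0, 1), (2, 3)]
dissimilar_pairs = [(0, 2), (1, 3)]
M, beta, v1 = np.eye(2), 0.1, 0.5
print(dmlnn_objective(M, beta, v1, similar_pairs, dissimilar_pairs, X))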

Table 5: Results of ablation experiment. The best result is bold.

               Control group of DMLCN                  | Control group of MMLCN
Data sets      DMLCC        DMLNN        DMLCN         | MMLCC        MMLNN        MMLCN
appendicitis   83.91±11.20  82.00±9.78   86.73±11.52   | 85.82±8.36   82.00±11.51  86.64±8.32
balance        84.79±4.46   84.96±2.75   87.04±4.00    | 91.03±1.77   81.11±3.58   92.80±1.35
bands          66.54±6.46   73.45±9.16   74.26±8.47    | 65.22±11.77  69.31±11.54  71.76±6.79
glass          59.29±6.55   66.43±6.46   68.72±5.18    | 58.57±11.69  61.15±17.22  71.08±6.66
heart          81.11±7.29   76.30±9.11   82.59±8.20    | 81.11±7.90   72.96±11.85  81.85±7.70
iris           98.00±3.22   94.67±6.89   98.00±3.22    | 96.00±4.66   33.33±12.57  97.33±3.44

The experiment is carried out on appendicitis, balance, bands, glass, heart, and iris. Table 5 shows the experimental results, which include the control group of DMLCN and the control group of MMLCN. Compared with DMLCC and DMLNN, DMLCN obtains the best results on these data sets. MMLCN also performs better than MMLCC and MMLNN. The absence of either the nearest neighbor relationship part or the class center part degrades the performance of DMLCN and MMLCN. Besides, we can observe that MMLNN gets very low accuracy on iris. As shown in Figure 6, under the Euclidean distance metric, class 1 is separated from class 2 and class 3. This means that similar pairs may only be constructed in the receptive field of the center of class 1, which results in the learned matrix for class 1 being 0, where 0 denotes the matrix with all elements equal to 0. From Subsection 3.4, we can see that the test examples of class 2 and class 3 may then only search their k-nearest neighbors in the receptive field of the center of class 1. In this situation, the performance of MMLNN is degraded. For MMLCN, the constraints in (8) avoid this problem.

4.5 Sensitivity to noise

In order to analyze the impact of noise on DMLCN and MMLCN, we carry out an experiment on appendicitis, balance, bands, glass, heart and iris. We randomly change the labels of nearly 5%, 10% and 20% of the examples of these data sets to generate nearly 5%, 10% and 20% label noise. In DMLCN and MMLCN, the penalty for outliers is unbounded. To solve this problem, we introduce the smooth ramp loss (Wang et al., 2008) into DMLCN and MMLCN, and obtain DMLCN_SR and MMLCN_SR. The smooth ramp loss is defined as

$$HR^{hu}(z) = H_1^{hu}(z) + H_0^{hu}(z), \quad \text{where} \quad H_1^{hu}(z) = \begin{cases} 0 & z > 1+h \\[2pt] \dfrac{(1+h-z)^2}{4h} & |1-z| \le h \\[2pt] 1-z & z < 1-h \end{cases}, \qquad H_0^{hu}(z) = \begin{cases} 0 & z > h \\[2pt] -\dfrac{(h-z)^2}{4h} & |z| \le h \\[2pt] z & z < -h \end{cases}$$
where h is the Huber parameter, which is typically taken between 0.001 and 0.5. Here, we choose h = 0.25.
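A direct implementation of this loss, following the piecewise definition above, can be sketched as below; only the loss itself is shown here, not the DMLCN_SR or MMLCN_SR solvers.

import numpy as np

def smooth_ramp_loss(z, h=0.25):
    # Smooth ramp loss HR^hu(z) = H1^hu(z) + H0^hu(z) (Wang et al., 2008),
    # as reconstructed from the piecewise definition above.
    z = np.asarray(z, dtype=float)
    # Smoothed hinge at 1: 0 for z > 1+h, quadratic for |1-z| <= h, 1-z for z < 1-h.
    h1 = np.where(z > 1 + h, 0.0,
         np.where(z < 1 - h, 1.0 - z, (1 + h - z) ** 2 / (4 * h)))
    # Smoothed negative part: 0 for z > h, quadratic for |z| <= h, z for z < -h.
    h0 = np.where(z > h, 0.0,
         np.where(z < -h, z, -(h - z) ** 2 / (4 * h)))
    return h1 + h0

# The loss is bounded: it approaches 1 for very negative z and 0 for large z.
print(smooth_ramp_loss([-5.0, 0.0, 0.5, 1.0, 5.0]))

Because HR^hu(z) approaches 1 for large negative z, the penalty contributed by any single outlier is bounded, which is the property exploited in DMLCN_SR and MMLCN_SR.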
The model formulation of DMLCN_SR is defined as:

$$\begin{aligned} \min_{M \succeq 0} \; & \sum_{i=1}^{l} \sum_{(j,k) \in S_i} d_{(M+\beta I)}(x_j, x_k) + v_1 \sum_{i=1}^{l} \sum_{(j,k) \in D_i} d_{(M+\beta I)^{-1}}(x_j, x_k) \\ & + v_2 \sum_{i=1}^{l} \sum_{j \in X_l,\, j \neq i} \; \sum_{(k,i,j) \in T_{ij}} HR^{hu}\!\left(d_{(M+\beta I)}(x_k, c_j) - d_{(M+\beta I)}(x_k, c_i)\right) \end{aligned} \qquad (28)$$

The model formulation of MMLCN_SR is shown as:

$$\begin{aligned} \min_{M_1, M_2, \ldots, M_l \succeq 0} \; & \sum_{i=1}^{l} \sum_{(j,k) \in S_i} d_{(M_i+\beta I)}(x_j, x_k) + v_1 \sum_{i=1}^{l} \sum_{(j,k) \in D_i} d_{(M_i+\beta I)^{-1}}(x_j, x_k) \\ & + v_2 \sum_{i=1}^{l} \sum_{j \in X_l,\, j \neq i} \; \sum_{(k,i,j) \in T_{ij}} HR^{hu}\!\left(d_{(M_j+\beta I)}(x_k, c_j) - d_{(M_i+\beta I)}(x_k, c_i)\right) \end{aligned} \qquad (29)$$

The experimental results are shown in Table 6, where the higher accuracy is in bold. Here, DMLCN_SR is only compared with DMLCN and MMLCN_SR is only compared with MMLCN. As the noise increases, the accuracy obtained by these methods shows a downward trend. When noise is added, DMLCN_SR performs better than DMLCN on appendicitis, balance and iris, and MMLCN_SR performs better than MMLCN on iris. However, on the other data sets, the introduction of the smooth ramp loss has no such obvious effect. The impact of noise on the first and second terms of the objective functions in (6) and (8) has not been eliminated, because the smooth ramp loss cannot be directly introduced into these two terms.
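For reference, label noise of the kind used in this experiment can be generated by randomly reassigning a fraction of the labels to different classes. The sketch below is a generic illustration; the function name, the random seed, and the toy label vector are assumptions, and the paper's exact noise-generation procedure may differ.

import numpy as np

def add_label_noise(y, noise_ratio, rng=None):
    # Randomly flip roughly `noise_ratio` of the labels in y to a different class.
    rng = np.random.default_rng(rng)
    y_noisy = np.array(y, copy=True)
    classes = np.unique(y_noisy)
    n_flip = int(round(noise_ratio * len(y_noisy)))
    idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    for i in idx:
        # Pick a new label different from the current one.
        others = classes[classes != y_noisy[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy

y = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
print(add_label_noise(y, noise_ratio=0.2, rng=0))   # roughly 20% label noise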

Table 6: The experimental results on KEEL data sets with noise. The best result is bold. DMLCN_SR is only compared with DMLCN and MMLCN_SR is only compared with MMLCN.

0% noise / 5% noise:
Data sets      DMLCN        DMLCN_SR     MMLCN        MMLCN_SR     | DMLCN        DMLCN_SR     MMLCN        MMLCN_SR
appendicitis   86.73±11.52  89.45±12.76  86.64±8.32   86.73±9.40   | 80.27±16.36  83.09±12.56  81.00±13.55  81.18±14.97
balance        87.04±4.00   91.68±2.58   92.80±1.35   93.60±3.22   | 81.90±3.62   85.76±4.88   82.40±3.27   85.92±3.08
bands          74.26±8.47   74.27±9.36   71.76±6.79   71.50±9.28   | 69.36±8.80   68.82±10.44  67.44±7.79   68.18±9.27
glass          68.72±5.18   68.72±5.69   71.08±6.66   72.40±9.53   | 63.07±7.17   61.28±9.28   61.73±9.83   64.03±6.55
heart          82.59±8.20   81.11±6.40   81.85±7.70   78.89±8.91   | 75.56±9.27   74.81±4.55   74.07±8.73   73.70±11.24
iris           98.00±3.22   98.00±4.50   97.33±3.44   97.33±3.44   | 89.33±10.52  92.00±7.57   89.33±10.52  92.67±7.98

10% noise / 20% noise:
Data sets      DMLCN        DMLCN_SR     MMLCN        MMLCN_SR     | DMLCN        DMLCN_SR     MMLCN        MMLCN_SR
appendicitis   77.36±19.01  80.18±15.73  80.18±17.91  79.27±17.53  | 63.00±9.14   68.82±15.56  65.00±5.02   64.64±16.17
balance        76.46±4.54   80.65±3.53   78.39±2.72   79.66±4.23   | 63.99±6.68   68.31±7.82   64.17±5.33   64.14±7.16
bands          66.01±8.48   65.21±6.52   65.46±10.99  62.74±7.88   | 61.70±7.62   59.74±5.78   60.57±5.70   55.97±9.54
glass          62.14±9.21   59.85±7.14   59.39±9.70   59.39±4.85   | 53.27±9.44   49.59±9.25   50.48±8.23   43.51±10.98
heart          70.74±13.8   70.00±8.45   70.00±4.77   71.48±11.46  | 64.44±10.06  65.19±11.07  64.07±14.61  70.00±10.25
iris           83.33±9.03   87.33±9.66   84.00±10.98  85.33±8.78   | 68.67±12.19  74.67±14.33  71.33±10.45  76.00±11.84

5 Conclusion

The class center-based methods lead to more compact intra-class distributions and more dispersed inter-class distributions for data with simple distributions, while the nearest neighbor relationship-based methods can better characterize the local
structure of data. In this work, we propose a new distance metric learning method (DMLCN) that considers
both the class center and the nearest neighbor relationship. First, DMLCN splits each class into several
clusters and uses one center to represent one cluster. Then, DMLCN learns a distance metric, under which
each example is close to the corresponding cluster center and the nearest neighbor relationship is kept in
each receptive field at the same time. Further, to better characterize complex data structures, we introduce
multi-metric into DMLCN (MMLCN) by learning a local metric for each cluster center. Besides, we design

a classification decision function based on the proposed methods. Then, an iterative algorithm is developed
to solve the proposed methods. The convergence and complexity of the algorithm are analyzed theoretically.
Following that, experiments on different types of data sets are conducted, and different evaluation criteria are applied to test the proposed methods. Experiments on artificial data sets illustrate the effectiveness of the
proposed methods. Compared with the state-of-the-art distance metric learning methods, the performances
of the proposed methods are competitive on benchmark data sets. The ablation experiment is conducted to show the necessity of the class center part and the nearest neighbor relationship part of the proposed methods. In addition, we also analyze the influence of noise on DMLCN and MMLCN by introducing a robust loss function.
The computational burden of the proposed methods increases with the number of examples. In future work, we will focus on reducing the computational complexity while ensuring the convexity of DMLCN and MMLCN. In addition, reducing the impact of noise on the proposed methods is another direction for future work.

Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No. 11471010).

References

Boyd, S., Boyd, S. P., and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Chen, J. and Hu, J. (2021). Weakly supervised compositional metric learning for face verification. IEEE Transactions on Instrumentation and Measurement, 70:1–8.

Chen, W., Chen, X., Zhang, J., and Huang, K. (2017). Beyond triplet loss: A deep quadruplet network for person re-identification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1320–1329.

Chen, X. and Huang, T. (2003). Facial expression recognition: A clustering-based approach. Pattern Recognition Letters, 24(9):1295–1302.

Chong, S.-C., Ong, T.-S., and Chong, L.-Y. (2020). Discriminative spectral regression metric learning in unconstrained face verification. In 2020 8th International Conference on Information and Communication Technology (ICoICT), pages 1–6.

Collobert, R., Sinz, F., Weston, J., and Bottou, L. (2006). Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 201–208, New York, NY, USA. Association for Computing Machinery.

Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27.

Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. (2007). Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 209–216, New York, NY, USA. Association for Computing Machinery.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30.

Duan, Y., Lu, J., Feng, J., and Zhou, J. (2018). Deep localized metric learning. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2644–2656.

Duan, Y., Lu, J., Zheng, W., and Zhou, J. (2020). Deep adversarial metric learning. IEEE Transactions on Image Processing, 29:2037–2051.

Ganaie, M. and Tanveer, M. (2022). KNN weighted reduced universum twin SVM for class imbalance learning. Knowledge-Based Systems, 245:108578.

Gouk, H., Pfahringer, B., and Cree, M. (2016). Learning distance metrics for multi-label classification. In Asian Conference on Machine Learning, pages 318–333. PMLR.

Hammer, B. and Villmann, T. (2002). Generalized relevance learning vector quantization. Neural Networks, 15(8):1059–1068.

Hart, P. E., Stork, D. G., and Duda, R. O. (2000). Pattern classification. Wiley Hoboken.

He, X., Zhou, Y., Zhou, Z., Bai, S., and Bai, X. (2018). Triplet-center loss for multi-view 3D object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1945–1954.

Hu, T.-Y. and Hauptmann, A. G. (2021). Statistical distance metric learning for image set retrieval. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1765–1769.

Kulis, B. et al. (2013). Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364.

Li, D. and Tian, Y. (2018). Survey and experimental study on metric learning methods. Neural Networks, 105:447–462.

Nguyen, B. and De Baets, B. (2018). An approach to supervised distance metric learning based on difference of convex functions programming. Pattern Recognition, 81:562–574.

Nguyen, B., Ferri, F. J., Morell, C., and De Baets, B. (2019). An efficient method for clustered multi-metric learning. Information Sciences, 471:149–163.

Nguyen, B., Morell, C., and De Baets, B. (2017). Supervised distance metric learning through maximization of the Jeffrey divergence. Pattern Recognition, 64:215–225.

Nguyen, B., Morell, C., and De Baets, B. (2020). Scalable large-margin distance metric learning using stochastic gradient descent. IEEE Transactions on Cybernetics, 50(3):1072–1083.

Ren, Q., Yuan, C., Zhao, Y., and Yang, L. (2022). A multi-birth metric learning framework based on binary constraints. Neural Networks, 154:165–178.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Ruan, Y., Xiao, Y., Hao, Z., and Liu, B. (2021a). A convex model for support vector distance metric learning. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14.

Ruan, Y., Xiao, Y., Hao, Z., and Liu, B. (2021b). A nearest-neighbor search model for distance metric learning. Information Sciences, 552:261–277.

Shi, Y., Bellet, A., and Sha, F. (2014). Sparse compositional metric learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28.

Tanveer, M., Sharma, A., and Suganthan, P. (2021). Least squares KNN-based weighted multiclass twin SVM. Neurocomputing, 459:454–464.

Tao, P. and Le Thi, H. A. (1997). Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22:289–355.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494.

Wang, L., Jia, H., and Li, J. (2008). Training robust support vector machine with smooth ramp loss in the primal space. Neurocomputing, 71(13):3020–3025.

Weinberger, K. Q. and Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(9):207–244.

Wu, H., Zhou, Q., Nie, R., and Cao, J. (2020). Effective metric learning with co-occurrence embedding for collaborative recommendations. Neural Networks, 124:308–318.

Ye, H.-J., Zhan, D.-C., Li, N., and Jiang, Y. (2020). Learning multiple local metrics: Global consideration helps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(7):1698–1712.

Zadeh, P., Hosseini, R., and Sra, S. (2016). Geometric mean metric learning. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2464–2471, New York, New York, USA. PMLR.

Zuo, W., Wang, F., Zhang, D., Lin, L., Huang, Y., Meng, D., and Zhang, L. (2017). Distance metric learning via iterated support vector machines. IEEE Transactions on Image Processing, 26(10):4937–4950.
Declaration of Interest Statement

Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:
