
DISTANCE METRIC LEARNING BASED ON DIFFERENCE OF CONVEX FUNCTIONS PROGRAMMING

GUIDE – Mr. Rakesh Sanodiya


INTRODUCTION

 A supervised distance metric learning method that aims to improve the performance of nearest-neighbour classification.
 The method is based on the large-margin principle.
 The Mahalanobis distance metric is one of the most successful and well-studied frameworks for distance metric learning (DML).
 Mahalanobis distance metric learning can be formulated within a convex optimization framework.
 Difference of convex functions (DC) programming is used in this distance metric learning method.
MAHALANOBIS METRIC

 The Mahalanobis distance between two feature vectors xi and xj takes the following form:
dM(xi, xj) = √((xi − xj)ᵀ M (xi − xj)),
where M is a positive semidefinite (PSD) matrix.

 A positive semidefinite (PSD) matrix M is a symmetric matrix such that
xᵀ M x ≥ 0 for all vectors x.
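
As a quick illustration (not part of the slides), a minimal NumPy sketch of the Mahalanobis distance; the helper names are ours, and any matrix of the form LᵀL serves as an example PSD matrix M:

```python
import numpy as np

def mahalanobis_distance(xi, xj, M):
    """d_M(xi, xj) = sqrt((xi - xj)^T M (xi - xj)) for a PSD matrix M."""
    diff = xi - xj
    return float(np.sqrt(diff @ M @ diff))

# Any matrix of the form L^T L is symmetric PSD, hence a valid choice of M.
L = np.random.randn(3, 3)
M = L.T @ L
xi, xj = np.random.randn(3), np.random.randn(3)
print(mahalanobis_distance(xi, xj, M))
```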
MARGIN

 It is a geometric measure for evaluating the confidence of the predictions made by a classifier.
 It provides theoretical generalization bounds on the effectiveness of a classifier, i.e. the higher the confidence, the lower the generalization error the classifier obtains.
 The margin is defined as follows:
 φ(xi) = d(xi, NM(xi)) − d(xi, NH(xi))

where NM(xi) and NH(xi) are called the nearest miss and the nearest hit of xi, respectively.
HIT AND MISS EXAMPLES
 Hit examples:
Let xi be an example in X. The hit examples of xi are the elements of the set Hi consisting of the examples in X \ {xi} that share the same class label with xi,
i.e. Hi = { xj | j ∈ {1, . . . , n}, j ≠ i, yj = yi }.

 Miss examples:
Let xi be an example in X. The miss examples of xi are the elements of the set Mi consisting of the examples in X that do not share the same class label with xi,
i.e. Mi = { xj | j ∈ {1, . . . , n}, yj ≠ yi }.
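
As an illustrative sketch (assuming a plain Euclidean distance d; the function names are ours), the hit/miss sets and the margin φ(xi) of the previous slide can be computed as:

```python
import numpy as np

def hit_miss_sets(X, y, i):
    """Indices of the hit set Hi (same label as xi, j != i) and miss set Mi (different label)."""
    hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
    misses = [j for j in range(len(X)) if y[j] != y[i]]
    return hits, misses

def margin(X, y, i, dist):
    """phi(xi) = d(xi, nearest miss) - d(xi, nearest hit)."""
    hits, misses = hit_miss_sets(X, y, i)
    d_hit = min(dist(X[i], X[j]) for j in hits)
    d_miss = min(dist(X[i], X[j]) for j in misses)
    return d_miss - d_hit

# Example with the plain Euclidean distance as d.
euclid = lambda a, b: np.linalg.norm(a - b)
X = np.random.randn(8, 3)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
print(margin(X, y, 0, euclid))
```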
LARGE MARGIN DISTANCE METRIC
LEARNING APPROACH
 It maximizes the margin of the nearest neighbour classifier.
 d²M(xi, xl) − d²M(xi, xj) = [dM(xi, xl) − dM(xi, xj)] [dM(xi, xl) + dM(xi, xj)]
 We can rewrite the margin based on the distance metric dM as
φM(xi) = dM(xi, NM_M(xi)) − dM(xi, NH_M(xi))
= gi(M) − hi(M)
 where gi(M) = −min{ dM(xi, xj) | xj ∈ Hi },
hi(M) = −min{ dM(xi, xj) | xj ∈ Mi },
which are convex functions of M on the PSD cone S^D_+.
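
A minimal sketch of gi(M) and hi(M), reusing the mahalanobis_distance and hit_miss_sets helpers from the earlier sketches (names are ours):

```python
def g_i(X, y, i, M):
    """g_i(M) = -min{ d_M(xi, xj) : xj in Hi } (negative nearest-hit distance)."""
    hits, _ = hit_miss_sets(X, y, i)
    return -min(mahalanobis_distance(X[i], X[j], M) for j in hits)

def h_i(X, y, i, M):
    """h_i(M) = -min{ d_M(xi, xj) : xj in Mi } (negative nearest-miss distance)."""
    _, misses = hit_miss_sets(X, y, i)
    return -min(mahalanobis_distance(X[i], X[j], M) for j in misses)

def margin_M(X, y, i, M):
    """phi_M(xi) = g_i(M) - h_i(M) = d_M(xi, nearest miss) - d_M(xi, nearest hit)."""
    return g_i(X, y, i, M) - h_i(X, y, i, M)
```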
IDEA FOR MISCLASSIFICATION

 The goal is to maximize min{φM( xi ) | xi ∈ X }


 We can deal with the nonseparable case by adding penalty terms to the objective function:
 arg min_{M ⪰ 0} λ tr(M) + (1/n) Σ_{i=1}^n l(φM(xi))
where l is a loss function,
λ > 0 is a hyper-parameter controlling the trade-off between the margin violations and the regularization, and
tr(M), the trace of M, is equal to its nuclear norm since M is PSD.
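
A minimal sketch of this regularized objective (illustrative; loss is a generic callable, specialized to the ramp loss on the following slides, and margin_M is the helper sketched above):

```python
import numpy as np

def objective(X, y, M, loss, lam):
    """lam * tr(M) + (1/n) * sum_i loss(phi_M(xi))."""
    n = len(X)
    penalty = sum(loss(margin_M(X, y, i, M)) for i in range(n)) / n
    return lam * np.trace(M) + penalty
```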
RAMP LOSS FUNCTION

 It is a non–convex loss function.


 We consider the ramp loss
 Rs(z) = max{0, 1 − z} − max{0, s − z}
 where s < 1 is a parameter.
 The idea behind the ramp loss is to truncate large losses (Rs saturates at the constant 1 − s for z ≤ s), making the classifier more robust to outliers.
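
A minimal sketch of the ramp loss (illustrative):

```python
def ramp_loss(z, s=-0.5):
    """R_s(z) = max{0, 1 - z} - max{0, s - z}; equals 1 - s for z <= s and 0 for z >= 1."""
    return max(0.0, 1.0 - z) - max(0.0, s - z)

# Large negative margins (outliers) are capped at 1 - s = 1.5 rather than growing without bound.
print(ramp_loss(-10.0), ramp_loss(0.0), ramp_loss(2.0))   # 1.5 1.0 0.0
```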
CONTD..

Fig. An illustration of the ramp loss function with s = −0.5 and some convex loss functions.
APPLICATION OF RAMP LOSS FUNCTION
 Apply the ramp loss function to the margin φM(xi):
 l(φM(xi)) = max{0, 1 − gi(M) + hi(M)} − max{0, s − gi(M) + hi(M)}
 = max{1 + hi(M), gi(M)} − max{s + hi(M), gi(M)}

Substituting this into the objective function, we get

 G(M) = λ tr(M) + (1/n) Σ_{i=1}^n max{1 + hi(M), gi(M)}
 H(M) = (1/n) Σ_{i=1}^n max{s + hi(M), gi(M)}

 Now the objective function can be decomposed into a convex part G(M) and a concave part −H(M).
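
A minimal sketch of the G/H decomposition, reusing the g_i and h_i helpers sketched earlier (illustrative):

```python
import numpy as np

def G(X, y, M, lam):
    """Convex part: lam * tr(M) + (1/n) * sum_i max{1 + h_i(M), g_i(M)}."""
    n = len(X)
    return lam * np.trace(M) + sum(
        max(1.0 + h_i(X, y, i, M), g_i(X, y, i, M)) for i in range(n)) / n

def H(X, y, M, s):
    """H(M) = (1/n) * sum_i max{s + h_i(M), g_i(M)}; the objective is G(M) - H(M)."""
    n = len(X)
    return sum(max(s + h_i(X, y, i, M), g_i(X, y, i, M)) for i in range(n)) / n
```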
CONTD..

 So this problem can be cast as an instance of DC programming, given by
 arg min_{M ⪰ 0} G(M) − H(M)

 It is a non-smooth, non-convex optimization problem which can be solved by DC programming.
DML-dc: Distance metric learning using DC programming
 DCA is one of the most effective algorithms for solving DC programs.

 Essentially, the idea is to linearize the concave part and subsequently solve
the convex subproblem.

 When the objective function is differentiable, DCA can be seen as the concave-convex procedure (CCCP).

 Such algorithms have already been used in SVMs, clustering, regression, and so on.
Algorithm 1
 Input: parameter ε
 Output: M_{t+1} ⪰ 0
 Begin
• Let M0 ⪰ 0 be an initial solution
• Set the iteration counter t := 0
• Repeat
◦ Linearize the concave part by computing Ut ∈ ∂H(Mt)
◦ Compute M_{t+1} by solving the following convex semidefinite program:
   M_{t+1} := arg min_{M ⪰ 0} G(M) − ⟨M, Ut⟩
◦ Increase the iteration counter t := t + 1
• Until ‖Mt − M_{t+1}‖F ≤ ε
 End.
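
A sketch of the DCA outer loop under the same illustrative helpers as before. Here subgrad_H picks the active branch of each max term and uses the fact that a subgradient of √(vᵀMv) with respect to M is vvᵀ/(2 dM), while solve_subproblem is a placeholder for the convex subproblem solver (e.g. the projected subgradient method of Algorithm 2, sketched below):

```python
import numpy as np

def subgrad_H(X, y, M, s):
    """One subgradient U of H(M) = (1/n) * sum_i max{s + h_i(M), g_i(M)}."""
    n, D = X.shape
    U = np.zeros((D, D))
    for i in range(n):
        hits, misses = hit_miss_sets(X, y, i)
        j_hit = min(hits, key=lambda j: mahalanobis_distance(X[i], X[j], M))
        j_miss = min(misses, key=lambda j: mahalanobis_distance(X[i], X[j], M))
        gi = -mahalanobis_distance(X[i], X[j_hit], M)
        hi = -mahalanobis_distance(X[i], X[j_miss], M)
        # Active branch of max{s + h_i(M), g_i(M)}; a subgradient of -d_M(xi, xj) is -vv^T / (2 d_M).
        j = j_miss if s + hi >= gi else j_hit
        v = X[i] - X[j]
        d = mahalanobis_distance(X[i], X[j], M)
        U -= np.outer(v, v) / (2.0 * d + 1e-12)
    return U / n

def dca(X, y, lam, s, solve_subproblem, eps=1e-4, max_iter=50):
    """DCA outer loop: linearize H at Mt via Ut, then solve arg min_{M >= 0} G(M) - <M, Ut>."""
    M = np.eye(X.shape[1])                          # M0: any PSD initial solution
    for _ in range(max_iter):
        U = subgrad_H(X, y, M, s)                   # Ut in the subdifferential of H at Mt
        M_next = solve_subproblem(X, y, lam, U, M)  # convex SDP subproblem (Algorithm 2)
        if np.linalg.norm(M - M_next, 'fro') <= eps:
            return M_next
        M = M_next
    return M
```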
Algorithm 2
 Input: parameters T, η, Ut, m
 Output: Mt ⪰ 0
 Begin
• Let M0 ⪰ 0 be an initial solution
• Initialize all sets Hi and Mi containing only the m nearest examples
• For k ← 0 to T − 1
◦ If ((k + 1) mod some_constant) = 0, recompute all sets Hi and Mi
◦ Compute the subgradient Gk
◦ Set M_{k+1/2} := Mk − ηGk
◦ Project onto the PSD cone: M_{k+1} := Π_{S^D_+}(M_{k+1/2})
 End
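
A sketch of the projected subgradient step (illustrative; it continues the helpers above). The PSD projection Π_{S^D_+} clips negative eigenvalues; the restriction of Hi/Mi to the m nearest examples and their periodic recomputation from the slide are omitted for brevity:

```python
import numpy as np

def project_psd(M):
    """Projection onto the PSD cone: symmetrize, eigendecompose, clip negative eigenvalues."""
    M = (M + M.T) / 2.0
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T

def solve_subproblem(X, y, lam, U, M0, eta=0.01, T=100):
    """Projected subgradient sketch for arg min_{M >= 0} G(M) - <M, U>."""
    n, D = X.shape
    M = M0.copy()
    for k in range(T):
        Gk = lam * np.eye(D) - U                    # subgradient of lam*tr(M) - <M, U>
        for i in range(n):
            hits, misses = hit_miss_sets(X, y, i)
            j_hit = min(hits, key=lambda j: mahalanobis_distance(X[i], X[j], M))
            j_miss = min(misses, key=lambda j: mahalanobis_distance(X[i], X[j], M))
            gi = -mahalanobis_distance(X[i], X[j_hit], M)
            hi = -mahalanobis_distance(X[i], X[j_miss], M)
            # Active branch of max{1 + h_i(M), g_i(M)} contributes -vv^T / (2 d_M) / n.
            j = j_miss if 1.0 + hi >= gi else j_hit
            v = X[i] - X[j]
            d = mahalanobis_distance(X[i], X[j], M)
            Gk -= np.outer(v, v) / (2.0 * d + 1e-12) / n
        M = project_psd(M - eta * Gk)               # M_{k+1} := Pi_{S^D_+}(M_{k+1/2})
    return M
```

These two sketches mirror the nesting of Algorithms 1 and 2: dca calls solve_subproblem at every outer iteration.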
