METRIC LEARNING FOR MAXIMIZING MAP AND ITS APPLICATION TO CONTENT-BASED MEDICAL IMAGE RETRIEVAL

Wei Yang, Qianjin Feng, Zhentai Lu, Wufan Chen*

School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China

ABSTRACT

Low-level image features have limited power to describe high-level semantic concepts, which constrains content-based image retrieval (CBIR). To reduce this semantic gap and improve the retrieval performance of CBIR, we propose a distance metric learning method that learns a linear projection defining a distance metric which maximizes mean average precision (MAP). A smooth approximation of MAP is used as the objective function and optimized by gradient-based approaches to find the optimal linear projection (called MPP). MPP is applied to the retrieval of contrast-enhanced MRI images of brain tumors on a large dataset. The results demonstrate the effectiveness of MPP compared to state-of-the-art metric learning methods.

Index Terms: CBIR, metric learning, mean average precision, brain MRI

1. INTRODUCTION

Content-based image retrieval (CBIR) technology supports finding images with the same anatomical regions, similar lesions, or the same disease. It can benefit clinical decision making and medical education [1]. Two
intrinsic problems of CBIR are: (a) extraction of features to
represent the images, and (b) definition of the similarity
between image features [2]. Typically, low-level visual
features which characterize the color, texture, and shape of
the image or the region of interest are extracted for CBIR.
Then, the similarity between feature vectors is computed,
and the images with maximum similarity to the query image
are retrieved. The performance of a CBIR system highly
depends on the features and similarity measures used.
The representation of an image by the visual features
usually results in loss of information. At the same time, the
appearance of images has large variation. Lesions belonging
to the same pathological category but from different patients
or at different disease stages can present different
appearances on the images. Low-level visual features may
not directly link to the image category (semantic concept). It
is difficult to measure the image similarity associated with
the semantic concepts in the space of low-level features.
Using a common distance metric, such as Mahalanobis or
Euclidean distance as the similarity (dissimilarity) measure,
a CBIR system cannot achieve the desired performance.
Therefore, it is desirable for a CBIR system to reduce the
semantic gap between the low-level visual features and the
semantic concepts [3]. Distance metric learning (DML)
methods can be used to find a linear transformation that
projects the image features to a new feature space to reduce
this semantic gap [4,5]. It is expected that the distance
defined in the projected feature space can reflect the
difference between semantic concepts. Previous work has
shown that the appropriately-designed distance metrics can
improve CBIR performance compared to Euclidean distance
[4-6].
The goal of DML is to find a linear transformation matrix that projects the image features into a new feature space so as to optimize a predefined objective function. The
objective functions of many DML methods, such as kernel-
based distance metric learning [5], Xing's method [7], Local
Fisher Discriminant Analysis (LFDA) [8], and Large
Margin Nearest Neighbor (LMNN) [9], are designed for
clustering and K nearest neighbor (KNN) classification in
essence, and the ultimate goal of these DML methods is to
improve the accuracy of KNN classification or clustering.
However, the retrieval evaluation measures for CBIR
systems, such as mean average precision (MAP) and mean
reciprocal rank (MRR), are defined with respect to the
permutation of retrieved images for a given query, which
are very different from classification accuracy. Optimization
of the accuracy of KNN classification through DML may
not lead to optimal retrieval performance.
In this paper, we propose a new DML method that directly optimizes MAP. In our method, the non-continuous and non-differentiable MAP is smoothly approximated, and a gradient-based approach is then employed to optimize the approximated MAP. The proposed DML method finds a linear transformation matrix that projects the image features into a more discriminative feature space and defines a Mahalanobis distance maximizing MAP. We call this transformation the Maximum average Precision Projection (MPP).
We apply MPP to retrieval of contrast-enhanced MRI (CE-
MRI) brain images on a large dataset. The experimental
results show that MPP can significantly improve retrieval
performance.

2. METHOD
2.1. Distance metric learning
Let $\mathbf{L}$ be a $d \times D$ matrix and $\mathbf{W} = \mathbf{L}^T \mathbf{L}$. The (squared) Mahalanobis distance between feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ is:

$$ d_{ij}^2 = \|\mathbf{L}\mathbf{x}_i - \mathbf{L}\mathbf{x}_j\|^2 = (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{L}^T \mathbf{L} (\mathbf{x}_i - \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j)^T \mathbf{W} (\mathbf{x}_i - \mathbf{x}_j), $$

where $^T$ denotes the transposition of a matrix or vector. The
objective of DML is to find an optimal $\mathbf{L}$ or $\mathbf{W}$ that minimizes an objective function. The intuitive goal of DML is to keep intra-class data points close while separating inter-class data points as far as possible in the feature space projected by $\mathbf{L}$.
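For concreteness, a minimal NumPy sketch of this learned distance (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def mahalanobis_sq(L: np.ndarray, xi: np.ndarray, xj: np.ndarray) -> float:
    """Squared distance d_ij^2 = ||L xi - L xj||^2 = (xi - xj)^T W (xi - xj)."""
    diff = L @ (xi - xj)          # project the difference into the d-dim space
    return float(diff @ diff)     # equals (xi - xj)^T L^T L (xi - xj)

# Example: a 10 x 45 projection, matching the paper's d = 10, D = 45.
rng = np.random.default_rng(0)
L = rng.standard_normal((10, 45))
xi, xj = rng.standard_normal(45), rng.standard_normal(45)
W = L.T @ L                       # the equivalent PSD matrix W = L^T L
assert np.isclose(mahalanobis_sq(L, xi, xj), (xi - xj) @ W @ (xi - xj))
```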
Previous DML frameworks have been formed in two
ways: pair-wise and relative-comparison. Xing's method [7] is typically pair-wise, while the objective function of LMNN is defined in the relative-comparison way [9]. Since maximizing MAP is equivalent to minimizing (1 − MAP), previous work has shown that pair-wise and relative-comparison loss functions are upper bounds of (1 − MAP) [10]. However, these bounds are not tight, and they do not use the complete information of the ranking list, which may penalize the retrieval performance of the learned metrics. We therefore define an objective function that uses the information of the whole ranking list.

2.2. Retrieval measures and their approximations
To evaluate CBIR, several retrieval evaluation measures
have been proposed based on precision and recall. The most
common way to summarize the precision-recall curve into
one value is MAP. More precisely, precision at the top k retrieved images (Prec@k for short) is defined as the proportion of relevant images up to rank position k:
$$ \text{Prec@}k = \frac{1}{k} \sum_j rel_j \cdot \mathbf{1}\{\pi(\mathbf{x}_j) \le k\}, \qquad (1) $$

where $\mathbf{x}_j$ is a feature vector, $rel_j \in \{0, 1\}$ is the relevance label of $\mathbf{x}_j$ for a given query $\mathbf{x}_q$ (1 for relevant, and 0 for irrelevant), $\pi(\mathbf{x}_j)$ is the position or rank of $\mathbf{x}_j$ in the ranking list, and $\mathbf{1}\{\cdot\}$ is the indicator function. Average precision (AP) is defined as:
$$ \mathrm{AP} = \frac{1}{N^{+}} \sum_j rel_j \cdot \text{Prec@}\pi(\mathbf{x}_j), \qquad (2) $$

where $N^{+}$ is the number of relevant images. MAP is the average of AP over all queries. AP and MAP are neither continuous nor differentiable, so they cannot be directly optimized by gradient-based approaches.
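As a reference point before the smoothing, the exact measures of Eqs. 1-2 are simple to compute; a minimal NumPy sketch (helper names are ours):

```python
import numpy as np

def average_precision(dists: np.ndarray, rel: np.ndarray) -> float:
    """Exact AP of Eq. 2: rank by ascending distance, average Prec@rank over relevant items."""
    order = np.argsort(dists)                  # retrieved list, nearest first
    rel_sorted = rel[order].astype(float)
    ranks = np.arange(1, len(rel) + 1)
    prec_at_k = np.cumsum(rel_sorted) / ranks  # Prec@k of Eq. 1 for every cutoff k
    n_pos = rel_sorted.sum()
    return float((prec_at_k * rel_sorted).sum() / n_pos) if n_pos else 0.0

# Example: relevant items land at ranks 1 and 4, so AP = (1/1 + 2/4) / 2 = 0.75.
dists = np.array([0.2, 0.9, 0.4, 0.7, 0.5])
rel   = np.array([1, 0, 0, 1, 0])
print(average_precision(dists, rel))  # 0.75
```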
Several methods for optimizing retrieval measures have been developed for ranking [11-13]: they optimize upper bounds or approximations of the measures as surrogate objective functions. Previous studies have shown that directly optimizing the retrieval measures can achieve good ranking performance compared to other approaches [11,12], mainly because the retrieval measures are explicitly considered in the direct optimization approaches.
For a query sample $\mathbf{x}_q$, let $d_j$ denote the distance between $\mathbf{x}_q$ and $\mathbf{x}_j$. The rank of $\mathbf{x}_j$ can then be represented as:

$$ \pi(\mathbf{x}_j) = 1 + \sum_{k,\, k \ne j} \mathbf{1}\{d_j > d_k\} = 1 + \sum_{k,\, k \ne j} \mathbf{1}\{\Delta d_{jk} > 0\}, \qquad (3) $$

where $\Delta d_{jk} = d_j - d_k$ and $\mathbf{1}\{\cdot\}$ is the indicator function. The rank can be regarded as a function of the ranking scores $d_j$ ($j = 1, \ldots, N$). Due to the indicator function, the ranking function is non-continuous and non-differentiable.
The indicator function can be approximated by a sigmoid function such as the logistic function $S(z) = (1 + \exp(-\sigma z))^{-1}$, where $\sigma$ is a scaling parameter [12]. In this way, the rank $\pi(\mathbf{x}_j)$ can be approximated by a continuous and differentiable function $\hat{\pi}(\mathbf{x}_j)$:

$$ \hat{\pi}(\mathbf{x}_j) = 1 + \sum_{k,\, k \ne j} S(\Delta d_{jk}). \qquad (4) $$
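A sketch of this smoothed rank, with the indicator replaced by the logistic function (names and test values are ours):

```python
import numpy as np

def smooth_ranks(dists: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Approximate ranks pi_hat(x_j) = 1 + sum_{k != j} S(d_j - d_k), Eq. 4."""
    delta = dists[:, None] - dists[None, :]   # Delta d_jk = d_j - d_k
    S = 1.0 / (1.0 + np.exp(-sigma * delta))  # logistic surrogate for 1{. > 0}
    np.fill_diagonal(S, 0.0)                  # drop the k = j term
    return 1.0 + S.sum(axis=1)

dists = np.array([0.2, 0.9, 0.4])
print(smooth_ranks(dists, sigma=20.0))  # close to the true ranks [1, 3, 2]
```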
AP in Eq. 2 can be rewritten as follows:

$$ \mathrm{AP} = \frac{1}{N^{+}} \sum_j \frac{rel_j}{\pi(\mathbf{x}_j)} \sum_k rel_k \cdot \mathbf{1}\{\pi(\mathbf{x}_k) \le \pi(\mathbf{x}_j)\} = \frac{1}{N^{+}} \sum_j \frac{rel_j}{\pi(\mathbf{x}_j)} \left( rel_j + \sum_{k,\, k \ne j} rel_k \cdot \mathbf{1}\{d_j > d_k\} \right). \qquad (5) $$

The rank $\pi(\mathbf{x}_j)$ in Eq. 5 is approximated by $\hat{\pi}(\mathbf{x}_j)$ in Eq. 4, and the indicator function in Eq. 5 is approximated by the logistic function $S(z)$ again. Then, the approximation of AP is obtained as follows:

$$ \widehat{\mathrm{AP}} = \frac{1}{N^{+}} \sum_j \frac{rel_j}{\hat{\pi}(\mathbf{x}_j)} \left( rel_j + \sum_{k,\, k \ne j} rel_k \, S(\Delta d_{jk}) \right). \qquad (6) $$
This approximation of AP has been proposed in studies on information retrieval [11,13]; however, it was defined on ranking scores rather than distances, and its retrieval performance had not been tested. Other approximations of retrieval measures have also been proposed [11-13], but the formulation in Eq. 6 is more direct and simple.
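Combining Eqs. 4 and 6 gives the smooth AP used in the loss below; a minimal sketch (our names; with a steep sigmoid it approaches the exact AP):

```python
import numpy as np

def smooth_ap(dists: np.ndarray, rel: np.ndarray, sigma: float = 1.0) -> float:
    """AP_hat of Eq. 6: (1/N+) sum_j rel_j/pi_hat_j * (rel_j + sum_{k!=j} rel_k S(dd_jk))."""
    rel = rel.astype(float)
    delta = dists[:, None] - dists[None, :]   # Delta d_jk
    S = 1.0 / (1.0 + np.exp(-sigma * delta))  # logistic surrogate
    np.fill_diagonal(S, 0.0)
    pi_hat = 1.0 + S.sum(axis=1)              # smoothed ranks, Eq. 4
    inner = rel + S @ rel                     # rel_j + sum_{k != j} rel_k S(dd_jk)
    n_pos = rel.sum()
    return float((rel / pi_hat * inner).sum() / n_pos) if n_pos else 0.0

dists = np.array([0.2, 0.9, 0.4, 0.7, 0.5])
rel   = np.array([1, 0, 0, 1, 0])
print(smooth_ap(dists, rel, sigma=50.0))  # ~0.75, matching the exact AP
```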

2.3. Metric learning for directly maximizing MAP
We aim to learn a distance metric for CBIR that directly optimizes MAP on the given query pairs. A retrieval loss function is defined with respect to a linear transformation matrix $\mathbf{L}$ over the query pairs $(\mathbf{x}^q, \{\mathbf{x}_j^q,\ j = 1, 2, \ldots, N\})$, $q = 1, 2, \ldots, M$, as:

$$ \mathrm{Loss}_{\mathrm{AP}}(\mathbf{L}) = 1 - \sum_{q=1}^{M} \widehat{\mathrm{AP}}_q / M, \qquad (7) $$
where $M$ is the total number of query pairs. $\mathrm{Loss}_{\mathrm{AP}}$ is a smooth approximation of (1 − MAP). The loss function can also be defined with respect to $\mathbf{W}$ by substituting $\mathbf{L}^T \mathbf{L}$ with $\mathbf{W}$. However, $\mathbf{W}$ must be positive semidefinite to ensure non-negative distances and the triangle inequality, and its optimization requires semidefinite programming (SDP), which is time-consuming. Since $\mathrm{Loss}_{\mathrm{AP}}$ is non-convex with respect to both $\mathbf{L}$ and $\mathbf{W}$, we directly optimize it with respect to $\mathbf{L}$ to avoid SDP.
A popular regularization term is added to $\mathbf{L}$:

$$ \mathrm{Reg}(\mathbf{L}) = \mathrm{tr}(\mathbf{L}^T \mathbf{L}), \qquad (8) $$

where $\mathrm{tr}(\mathbf{A})$ denotes the trace of matrix $\mathbf{A}$. The objective function of MPP learning is defined as:

$$ R(\mathbf{L}) = \mathrm{Loss}_{\mathrm{AP}}(\mathbf{L}) + \lambda\, \mathrm{tr}(\mathbf{L}^T \mathbf{L}). \qquad (9) $$
Thus, the optimization formulation of MPP learning is:

$$ \mathbf{L}^{*} = \arg\min_{\mathbf{L} \in \mathbb{R}^{d \times D}} R(\mathbf{L}) = \arg\min_{\mathbf{L} \in \mathbb{R}^{d \times D}} \mathrm{Loss}_{\mathrm{AP}}(\mathbf{L}) + \lambda\, \mathrm{tr}(\mathbf{L}^T \mathbf{L}), \qquad (10) $$

where $\lambda$ is a weight parameter which controls the trade-off between the empirical loss and the regularization term.
Because $R(\mathbf{L})$ is continuous and differentiable, it can be minimized using gradient-based approaches; following the derivation in [12], its gradient is straightforward to obtain. Many gradient-based methods, such as conjugate gradient and L-BFGS, could be used. However, computing $R(\mathbf{L})$ and its gradient over all query pairs is expensive and impractical. A stochastic gradient descent (SGD) algorithm is therefore adopted for optimizing $R(\mathbf{L})$ in this paper, which converged quickly in our experiments.
The final solution for $\mathbf{L}$ depends strongly on the initial estimate $\mathbf{L}_0$ because the objective function is non-convex and its solution space contains many local minima. Ideally, the initial solution $\mathbf{L}_0$ should be close to the global optimum, and other DML algorithms can be applied to obtain it. Empirically, LFDA [8] with an adapted parameter (the number of nearest neighbors) yields an appropriate linear transformation with relatively high MAP. Thus, MPP is performed as follows. First, the feature vectors are transformed as $\mathbf{x} \to \mathbf{L}_{\mathrm{LFDA}} \mathbf{x}$, where $\mathbf{L}_{\mathrm{LFDA}}$ is the linear transformation matrix learned by LFDA. Then, SGD for MPP is run with an identity matrix or a random matrix as the initial solution $\mathbf{L}_0$. The final solution is $\mathbf{L}_{\mathrm{MaxIter}} \mathbf{L}_{\mathrm{LFDA}}$, where MaxIter is the maximum number of SGD iterations. An advantage of MPP is that it can also reduce the feature dimension by directly tuning the dimension of $\mathbf{L}$.
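To make the procedure concrete, here is a rough sketch of the training loop under simplifying assumptions: the features are assumed already transformed by $\mathbf{L}_{\mathrm{LFDA}}$, $\mathbf{L}_0$ is the identity, and the gradient is taken by finite differences purely for brevity (the paper uses the analytic gradient following [12]); smooth_ap is the helper sketched in Section 2.2, and all other names are ours.

```python
import numpy as np

def mpp_objective(L, X, labels, query_idx, lam=1e-4, sigma=1.0):
    """R(L) of Eq. 9: 1 - mean smooth AP over sampled queries + lam * tr(L^T L)."""
    Z = X @ L.T                                    # project all features by L
    ap_sum = 0.0
    for q in query_idx:
        d = np.sum((Z - Z[q]) ** 2, axis=1)        # squared distances to the query
        mask = np.arange(len(X)) != q              # database excludes the query itself
        rel = (labels == labels[q]).astype(float)  # same category = relevant
        ap_sum += smooth_ap(d[mask], rel[mask], sigma)
    return 1.0 - ap_sum / len(query_idx) + lam * np.trace(L.T @ L)

def sgd_step(L, X, labels, rng, lr=0.1, n_queries=2, eps=1e-5):
    """One SGD step; the finite-difference gradient is for illustration only (slow)."""
    q = rng.choice(len(X), size=n_queries, replace=False)
    base = mpp_objective(L, X, labels, q)
    grad = np.zeros_like(L)
    for i in range(L.shape[0]):
        for j in range(L.shape[1]):
            Lp = L.copy(); Lp[i, j] += eps
            grad[i, j] = (mpp_objective(Lp, X, labels, q) - base) / eps
    return L - lr * grad

# Toy run on stand-in LFDA-transformed features; MaxIter is 1000 in the paper.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6)); y = rng.integers(0, 3, size=60)
L = np.eye(6)                                      # L0 = identity, per Section 2.3
for _ in range(5):
    L = sgd_step(L, X, y, rng)
```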

3. EXPERIMENTS
In this section, the effectiveness of MPP for maximizing MAP is verified on a large CE-MRI brain image dataset.

3.1. Image data
The T1-weighted CE-MRI images were acquired at Nanfang Hospital, Guangzhou, China, and the General Hospital of Tianjin Medical University, China, from September 2005 to October 2010.
The dataset comprises 3,108 images including 705
meningiomas, 1,475 gliomas, and 928 pituitary tumors. In
the experiments, the tumor images of different views
(transverse, sagittal, and coronal) were considered together.
All tumors in the images were manually outlined by three
experienced radiologists. Three CE-MRI images with
outlined contours of brain tumors are shown in Fig. 1. The
manually outlined tumor contours and regions were used to
extract the visual features. We extracted the intensity,
texture, and shape features to characterize the tumor region,
including mean and variance of intensity, statistics of gray
level co-occurrence matrix, statistics of wavelet coefficients
of intensity in the tumor region, and statistics of wavelet
coefficients of the tumor shape. Finally, each tumor image
was represented by a 45-dimensional feature vector. Two images containing tumors of the same category were defined as relevant (similar); otherwise, they were considered irrelevant (dissimilar).
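The paper describes this 45-dimensional feature set only at a high level; purely as an indication of the kind of intensity and GLCM texture statistics involved, here is a scikit-image sketch (the function name, parameter choices, and feature subset are our assumptions, not the paper's pipeline):

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19

def tumor_features(region: np.ndarray) -> np.ndarray:
    """region: 2-D uint8 patch around the outlined tumor; returns a few illustrative features."""
    glcm = graycomatrix(region, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    texture = [graycoprops(glcm, p).mean()
               for p in ("contrast", "homogeneity", "energy", "correlation")]
    return np.array([region.mean(), region.var(), *texture])
```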


Fig. 1. Three contrast-enhanced T1-weighted MRI brain
images with the manually outlined contours of tumors.

3.2. Experimental settings
In the experiments, the performance of the DML algorithms was evaluated through five-fold cross-validation. Each image in the test set was used as a query to retrieve from the training set, and performance was reported over these queries. LFDA and LMNN were implemented as baseline DML methods, and the training set was directly fed to them. For LFDA, the number of nearest neighbors was set to 100, a value different from the suggestion in [8] but one which achieved higher MAP. Increasing the number of nearest neighbors for LMNN led to lower MAP; thus, the default parameters in the implementation provided by the authors were used [9]. For MPP, the number of SGD iterations was set to 1000, and the scaling parameter $\sigma$ (in the sigmoid function) and the regularization weight $\lambda$ were set to 1 and 0.0001, respectively. We selected the $\mathbf{L}$ with the highest MAP on the validation set from 10 random restarts to avoid poor local optima. To speed up MPP learning, each training query pair contained only 200 images randomly selected from the training set. The embedding dimension $d$ for the three DML algorithms was empirically set to 10, which led to satisfactory results.
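A sketch of this evaluation protocol, reusing the average_precision helper from the Section 2.2 sketch (all names ours):

```python
import numpy as np

def mean_average_precision(L, X_train, y_train, X_test, y_test):
    """MAP with each test image used as a query against the training set."""
    Z_train, Z_test = X_train @ L.T, X_test @ L.T   # project both sets by L
    aps = []
    for z, y in zip(Z_test, y_test):
        d = np.sum((Z_train - z) ** 2, axis=1)      # squared distances to the query
        rel = (y_train == y).astype(int)            # same tumor category = relevant
        aps.append(average_precision(d, rel))       # helper from the Section 2.2 sketch
    return float(np.mean(aps))
```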

3.3. Comparison of learned distance metrics
From Table 1, it can be seen that all three learned distance metrics achieve better performance than Euclidean distance, and MPP outperforms all the other algorithms in terms of MAP, Prec@10, and Prec@20. Fig. 2 shows the precision-recall curves and precision-scope curves of the different metrics. It is clear that MPP is superior to the three other metrics. At the top left portion of the precision-recall curves (where recall is small, Fig. 2a), the gap between the precision of MPP and LFDA is relatively small. However, at the middle portion of the precision-recall curves, the precision of MPP is much higher than that of LFDA or LMNN, which leads to a higher MAP (69.3%). Prec@k of the four distance metrics is presented in Fig. 2b. When the scope k is less than 5, Prec@k of MPP is slightly lower than that of LFDA; when k is greater than 5, Prec@k of MPP is consistently higher than that of the other metrics. Prec@10 and Prec@20 of MPP reach 78.0% and 77.7%, respectively.

Table 1. Retrieval performance of different distance metrics
Distance Metric MAP Prec@10 Prec@20
Euclidean 46.1% 67.5% 64.5%
LMNN 49.3% 76.3% 73.1%
LFDA 51.4% 77.0% 75.3%
MPP 69.3% 78.0% 77.7%


Fig. 2. (a) Precision-recall curves and (b) precision-scope
curves of different distance metrics.

4. CONCLUSION AND DISCUSSION
Low-level image features have limited descriptive power and often fail to directly describe high-level semantic concepts, so the performance of CBIR still needs to improve for practical applications. We propose the MPP learning method, which finds a linear projection for maximizing MAP. MPP can reduce the semantic gap between visual features and pathological categories and improve the retrieval performance of CBIR.
On a large CE-MRI image dataset of brain tumors, the
results demonstrate that MPP significantly outperforms the
commonly used Euclidean distance and the distance metrics
learned by LFDA and LMNN. MPP can also be used to
reduce the dimensionality of the features.
In the experiments, commonly used image features were extracted for the retrieval task. More powerful or task-specific features that incorporate more domain knowledge could further improve the performance of CBIR with MPP. There is also room to improve MPP learning itself, for example by learning multiple local projections for different regions of the feature space (similar to multi-metric LMNN) or by stabilizing the solution of MPP through aggregation. Additionally, the learning framework of MPP can be extended to maximize other retrieval measures, such as Prec@k, MRR, and NDCG for multilevel relevance. These are directions of our ongoing work.

ACKNOWLEDGEMENT
This work was supported by the Key Program of the NSFC (No. 30730036), the Major State Basic Research Development Program of China (No. 2010CB732500), and the NSFC (No. 31000450). The authors would like to thank Dr. Mark Foskey of the Department of Radiation Oncology, University of North Carolina, for revising the English of this paper.

REFERENCES

[1] H. Müller, N. Michoux, D. Bandon, and A. Geissbuhler, "A review of content-based image retrieval systems in medical applications--clinical benefits and future directions," Int. J. Med. Info., vol. 73, no. 1, pp. 1-23, 2004.
[2] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys, vol. 40, no. 2, pp. 1-60, 2008.
[3] H. Guan, S. Antani, L. R. Long, and G. R. Thoma, "Bridging the semantic gap using ranking SVM for image retrieval," in IEEE Int. Symp. Biomed. Imaging (ISBI), 2009.
[4] A. Frome, Y. Singer, and J. Malik, "Image retrieval and classification using local distance functions," in Conf. Neural Information Processing Systems (NIPS), 2006.
[5] H. Chang and D.-Y. Yeung, "Kernel-based distance metric learning for content-based image retrieval," Image Vis. Comput., vol. 25, no. 5, pp. 695-703, 2007.
[6] L. Yang, R. Jin, L. Mummert, R. Sukthankar, A. Goode, B. Zheng, S. C. Hoi, and M. Satyanarayanan, "A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 30-44, 2010.
[7] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell, "Distance metric learning with application to clustering with side-information," in Neural Info. Proc. Syst. (NIPS), 2002.
[8] M. Sugiyama, "Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis," J. Mach. Learn. Res., vol. 8, pp. 1027-1061, 2007.
[9] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learn. Res., vol. 10, pp. 207-244, 2009.
[10] W. Chen, T.-Y. Liu, Y. Lan, Z.-M. Ma, and H. Li, "Ranking measures and loss functions in learning to rank," in Neural Info. Proc. Syst. (NIPS), 2009.
[11] O. Chapelle and M. Wu, "Gradient descent optimization of smoothed information retrieval metrics," Information Retrieval, vol. 13, no. 3, pp. 216-235, 2010.
[12] T. Qin, T.-Y. Liu, and H. Li, "A general approximation framework for direct optimization of information retrieval measures," Information Retrieval, vol. 13, no. 4, pp. 375-397, 2010.
[13] M. Taylor, J. Guiver, S. Robertson, and T. Minka, "SoftRank: optimizing non-smooth rank metrics," in Int. Conf. Web Search and Web Data Mining, 2008.
