Content Based Image Retrieval
Dissertation
by
2008
Dekan: Prof. Dr. Bernard Nebel
This thesis stems from my work at the Chair of Pattern Recognition and Image Processing (LMB) at the University of Freiburg, Germany. This work would not have been possible without the bountiful love and grace of Lord Ganesha.
Firstly, I would like to thank Prof. Hans Burkhardt for supervising my work and for providing an environment very conducive to research, and Prof. Lars Schmidt-Thieme for agreeing to be the co-examiner of the thesis. I would like to thank my colleagues at LMB for many fruitful discussions and ideas. Especially, I would like to thank Dr. Alaa Halawani and Alexandra Teynor, with whom I had the opportunity to work directly. Many thanks to Stefan Teister and his team for providing competent IT administration behind the scenes. I am grateful to our group secretary Ms. Findlay for her help and encouragement on many occasions.
Last but not least, I would like to thank my parents for always believing in me and always providing support.
Lokesh Setia
New Delhi
November 2008
Te Deum
- Charles Reznikoff
Zusammenfassung
The ever-increasing amount of digital information has created the need for efficient information retrieval systems. Information that cannot easily be found again might, as the saying goes, just as well be lost. Because information comes in the most varied formats and types, the retrieval mechanisms must differ correspondingly. In the present work we are concerned with content-based image retrieval, which facilitates a user's interaction with an image database by means of automatic analysis of the image content.
We are aware of the ambiguities that exist even in very simple sentences of natural languages. A sentence such as "Flying planes can be dangerous" can carry different meanings ("It can be dangerous to fly planes" vs. "planes that are flying can be dangerous"). The same applies to images. Indeed, the old saying "A picture is worth a thousand words" is more apt here than anywhere else. This work is based on the view that such ambiguities are natural, and that the system should not favour one possible interpretation over another from the outset. The earliest image retrieval systems allowed the user to specify his viewpoint by granting access to internal system parameters; specifying these is, if not outright complicated, at least very tedious. A modern system attempts to learn the user's viewpoint and thus to place less responsibility on the user. This can be realised by means of relevance feedback, whereby the user gradually provides the system with more information in the hope of better results. One distinguishes between short-term relevance feedback, in which the collected data is discarded after each session, and long-term relevance feedback, in which data is collected over several sessions of one user, or even over several users. In this work, however, we deal exclusively with short-term relevance feedback, since in our view the ambiguities present in digital images cannot be modelled otherwise.
In the next part of the thesis, image search is treated as an extension of traditional text-based search engines.
The ever increasing amount of digital information has created a need for ef-
fective information retrieval systems. As it is said, information which cannot
be found easily is as good as lost. As information comes in various formats
and types, their retrieval mechanisms also need to differ correspondingly. In
this work, we deal with the task of content based image retrieval, in which
the system facilitates the interaction between a user and an image database by
automatic analysis of the image content.
We know about the ambiguities that can exist even in the simplest of phrases
in a natural language. As an example, a sentence such as “Flying planes can be
dangerous” can either mean “Flying planes are dangerous” or “Flying planes is
dangerous”. Images are no different. In fact, the old saying “A picture is worth
a thousand words” is as true here as it is anywhere. In this work, we take the
view that these ambiguities are natural, and thus the system should not commit to one viewpoint or the other from the very beginning. The earliest systems for image retrieval allowed the user to specify his or her viewpoint by giving access to internal system parameters, which can be complicated or tiring for the user. A modern system, on the other hand, seeks to learn this viewpoint while imposing fewer responsibilities on the user. This can be achieved using relevance
feedback, in which the user progressively gives the system more and more in-
formation, in return for better results. Relevance feedback can be short-term, in
which the data collected is discarded as soon as the session is over, and long-
term, in which the data can be collected over multiple sessions of one user, or
even over multiple users. In this work, however, we will constrain ourselves to
short-term relevance feedback, as in our view the ambiguities or the multiple
interpretations present in an image cannot be handled otherwise.
The later part of the thesis delves into image search as an offshoot of the
traditional text-based search engines. To this end, we explore the possibility
of annotating an image database using keywords. One advantage of this ap-
proach is that the user does not need to provide a suitable starting image for
the query. An equally important advantage is that the annotation process can
be carried out offline for the whole database, unlike relevance feedback which
must be carried out in real time. Apart from annotation, we show that fur-
ther data mining operations can also be carried out on image databases, which
can contribute to improving the effectiveness of the image search engine. We
conclude with the demonstration of various algorithms on a real life medical
image database in which very competitive results could be achieved in recent
international benchmarks.
Contents

1 Introduction
  1.1 Structure of the Document
  1.2 Contributions of this Thesis

6 Conclusion

References
Introduction
1 With a purely theoretical interest, we note the analysis in Krauss and Starkman (2004), according to which this growth rate must stop within 600 years at the latest, due to the fundamental limits of the universe.
In this work we will deal with both document retrieval and data mining scenar-
ios. As data type however, we will limit ourselves to 2-D digital images. More
specifically, we will consider vectorial features extracted from 2-D images, obtained through a feature extraction process f: R^{M×N} → R^d. We can state that
the algorithms presented in this thesis are directly usable with other media types, at
least as long as vectorial features can be extracted from them. This includes 3-D im-
ages, speech or music signals, and video. We limit our experiments to images
however, since the temporal dimension inherent in music and video signals
makes the user labelling process much slower, and thus interactive feedback is
much less effective compared to that in an image search system. The interested
reader may refer to Rho et al. (2007) and Wang et al. (2001) for retrieval issues
in other media types.
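To make the feature extraction map concrete, here is a minimal sketch (a hypothetical toy example, not one of the feature types developed later in this thesis): a normalised grey-value histogram maps an image in R^{M×N} to a vector in R^d.

```python
import numpy as np

def extract_features(image, d=16):
    # Toy feature extraction map R^{MxN} -> R^d: a normalised
    # grey-value histogram with d bins. Illustrative only.
    hist, _ = np.histogram(image, bins=d, range=(0, 256))
    return hist / hist.sum()
```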
hierarchy generation are useful in guiding the user to the intended images.
We conclude by capturing the main points proposed in the thesis, and giving
pointers to possible future research work in the area.
Fundamentals of Image Retrieval
Figure 2.1: An image retrieval use case at its most abstract level
verified visually by viewing them alongside the query image. In a real system,
however, finding an appropriate initial query image can be difficult. Indeed
what, some may ask, is the use of an image retrieval system if the user already
possesses a similar looking image? While this criticism does hold for many
image collections, there are scenarios where this is the preferred query model.
An example could be in clinical use, where a doctor would like to retrieve the
case history of patients with a similar radiograph image compared to that of
the current patient.
[Figure: block diagram of a CBIR system. The user's query image(s) pass through feature extraction; the resulting feature vector(s) are compared (similarity comparison, indexing & retrieval) against a feature database computed offline from the image collection, yielding the result list, which can be refined via relevance feedback.]
a) Invariance
This is the so-called necessary condition for invariance. For digital images,
the transformation groups required in an application are typically one or more
of the following:
In the above list, all except the monotonic intensity transformation are what
are known as geometric transformations. Only the location of the points is
transformed, while the value remains unchanged. For example, given an initial source point $(p_x, p_y)^T$, we have, for a translation, the destination point coordinates

$$\begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \qquad (2.2)$$

and, for the affine case,

$$\begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{pmatrix} \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, \qquad \begin{vmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{vmatrix} \neq 0 \qquad (2.3)$$
Differentiation The elements gS form an orbit in the feature space and can be controlled using a parameter vector λ (whose dimensionality f is equal to the number of degrees of freedom of G). An invariant feature should take a constant value along the orbit; thus one tries to find features satisfying the differential equations

$$\frac{\partial (g_\lambda S)}{\partial \lambda_i} = 0, \qquad \forall i = 1 \ldots f \qquad (2.4)$$
Integration As early as 1897, Hurwitz demonstrated the use of Haar integrals for generating invariant features (Hurwitz, 1897). One integrates over the group elements, which have been transformed using an (often non-linear) kernel function f:

$$F(S) = \frac{1}{|G|} \int_G f(gS)\, dg \qquad (2.5)$$
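For a finite transformation group, the integral in Equation 2.5 reduces to a group average, which can be sketched as follows (the kernel f here is an arbitrary illustrative choice):

```python
import numpy as np

def invariant_feature(image, kernel):
    # Discrete analogue of Equation 2.5: average the kernel f over all
    # elements of a finite group G, here the four 90-degree rotations.
    orbit = [np.rot90(image, k) for k in range(4)]
    return sum(kernel(g) for g in orbit) / len(orbit)

def kernel(im):
    # An arbitrary kernel that is itself NOT rotation-invariant:
    # the sum of the first image row.
    return float(im[0].sum())
```

Because the orbit of a rotated image is the same set of group elements, the averaged feature is identical for the image and any of its 90-degree rotations, even though the kernel alone is not invariant.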
b) Information content
Non-Vectorial Data
The scalar features extracted using the methods described in the previous sec-
tion must be combined in some way. The simplest way to group them is to cre-
ate a vector, with the scalar features being its members. This approach has the
In some cases, it might be practicable to skip the feature extraction step, and
instead build the complexity in the matching algorithm. For example, in the
case of medical images, an algorithm might perform image registration and
consider the quality of the registration as a similarity measure, or judge the
similarity through the amount of deformation that must be carried out on one
of the images (e.g. Keysers et al. (2007)). Again, the kernelized methods in this
work can be adapted to the special matching processes, while the possibility
for special feedback algorithms also exists here.
Figure 2.3: An example of the semantic gap problem. The two images possess
very similar colour and texture characteristics, but differ vastly as
far as the semantics are concerned. (A colour version of this figure is
available on Page 126)
concept detector for every high level concept such as mood directly from image
data. This approach, however, seems unlikely to work, as the number of pos-
sible high-level concepts can be very high, with possibly much higher intra-
concept variability as compared to low-level tasks. The other approach is to
have simple concept detectors and then use external rules to learn high-level
concepts. These rules could be combinational based on certain observations
and could also be learned automatically to a certain extent. Let us give an ex-
ample. Assume that we can predict with medium precision (say 70%) the following information from the content of an image: (a) the nationality of the people present, (b) an estimate of their location, and (c) the condition of the people's clothes.
Then rules such as the following might lead us to the desired result:
∀x:
    (nationality(x) = 'english')
  ∧ (location(x) = 'green_grounds')
  ∧ (clothes(x) = 'muddy')
  ⇒ playing_rugby(x)        (R1)

and

∀x:
  playing_rugby(x)
The rule R2 is simply dependent on the rule R1, however the rule R1 is
dependent on three sub-rules. Assuming that the detectors for the sub-rules
are statistically independent, the output of R1 is correct with a probability of only 0.7³ = 0.343. Thus, we can observe that the output of a cascaded detector de-
grades much faster than the output of its individual units. This, together with
the higher number of required training images, is the reason why semantic
labelling still eludes us.
and so on...
The most common performance measures used in the literature are precision
and recall (Feng et al., 2003). Assume that the user has seen k result images, out of which k_R are good results (i.e. relevant), and the remaining k_{NR} are non-relevant. Further, let the total number of images in the database be N, out of which N_R are relevant for the current query, and N_{NR} are not. Then, the measures are defined as follows:

$$\mathrm{Recall}_k = \frac{k_R}{N_R} \qquad (2.7)$$
The recall is thus the ratio of retrieved relevant images to the total number
of relevant images. By itself, it is not sufficient to measure system performance,
as one could increase k arbitrarily which would push Recall to 1 in the asymp-
totic case. Thus, one further defines:

$$\mathrm{Precision}_k = \frac{k_R}{k} \qquad (2.8)$$
which measures the precision after k images have been retrieved. An al-
ternative way to picture this is in set notation, as illustrated in Figure 2.6. Let the set of retrieved images be denoted by R and the set of groundtruth target images by T; then precision = |R ∩ T|/|R| and recall = |R ∩ T|/|T|.
The Precision and Recall values can be plotted against each other for dif-
ferent values of k, the result being known as a precision-recall graph, which is
well understood within the image retrieval community.
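The two measures can be computed directly from a ranked result list; a minimal sketch:

```python
def precision_recall_at_k(ranking, relevant, k):
    # ranking: database image ids, ordered by decreasing similarity.
    # relevant: ids of the ground-truth target images for this query.
    k_R = len(set(ranking[:k]) & set(relevant))  # relevant among top k
    return k_R / k, k_R / len(relevant)          # Equations 2.8 and 2.7
```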
in the feature space, and which decreases monotonically upon increasing dis-
tance from the query image. These functions are known as similarity measures.
Figure 2.7: Ranking induced by similarity functions. The left image shows the
spherical ranking due to Euclidean distance, and the right image
shows that due to the Manhattan distance. The point (3, 3) is used
as the query point. (A colour version of this figure is available on Page
126)
$$d_E(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2} \qquad (2.10)$$

$$d_M(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i| \qquad (2.11)$$

$$\mathrm{HI}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \min(x_i, y_i) \qquad (2.12)$$
2.4 Initial Scenario: Single Query Image
In particular, the relation HI(x, y) = 1 − d_M(x, y)/2 holds in case the vectors describe valid probability distributions (i.e., they sum up to one and have no negative elements). The special case of feature vectors containing some kind of histogram is especially interesting in the field of image retrieval, as many kinds of features of all main types (say, colour, texture and shape) come in the form of histograms. A popular measure between two probability distributions or their histogram-based approximations is the Kullback-Leibler divergence, which measures how compactly one distribution can be coded using the other (Feng et al., 2003).
$$\mathrm{KLD}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i} \qquad (2.13)$$
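These measures are straightforward to implement; the sketch below also checks numerically that, for valid probability distributions, histogram intersection and Manhattan distance satisfy HI(x, y) = 1 − d_M(x, y)/2.

```python
import numpy as np

def d_euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))   # Equation 2.10

def d_manhattan(x, y):
    return float(np.sum(np.abs(x - y)))           # Equation 2.11

def hist_intersection(x, y):
    return float(np.sum(np.minimum(x, y)))        # Equation 2.12

def kld(x, y):
    # Equation 2.13; assumes strictly positive entries in x and y.
    return float(np.sum(x * np.log(x / y)))
```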
Relevance Feedback in CBIR (Online Methods)
In words, the optimum region Ω0 is the one which maximises the expected
number of relevant documents, out of all possible regions (continuous or dis-
continuous) which contain a total of exactly n documents. This definition,
however, does not take into account the fact that the ranking of the documents
within the top-n retrieved documents can also be important to the user. We
could say that the above definition optimizes with respect to the precision at n
performance measure (see Section 2.3).
Incorporating other performance measures would require finding an opti-
mum traversal path to rank documents in the feature space. As an example,
if the function to be optimized is the area under the precision-recall graph, then
the optimum ranking is in the direction of decreasing values of P(rel | x), as it
ensures an incrementally highest precision value along the p-r curve. Not all
performance measures can be so compactly written in an analytic form.
Thus, theoretically a document retrieval system only needs to find Ω0 in or-
der to give the best possible results. However, in practice, with limited labelled
training data, this is not feasible and only approximations can be reached. This
is mainly because in Equation 3.1, P(rel | x) remains largely unknown. The
other term p(x) is usually not a problem as this term depends only on the dis-
tribution of the database images (whether labelled or unlabelled), for which a
much larger number of samples is available.
Relevance feedback algorithms can be regarded as approximations to the
original goal of learning P(rel | x). To this end, various assumptions and sim-
plifications are made, and it is the aim of this chapter to gain an insight into
their consequences.
The most common assumption that all relevance feedback algorithms use
to some extent is the smoothness assumption:
i.e. if two vectors x1 and x2 are close enough, then their expected labels
would be similar with high confidence.
The most common simplifications are on the form of the term P(rel | x). In-
stead of trying to determine the term precisely at every data point x, which
cannot be done anyway, one limits its complexity in a predefined way. This
is in line with Occam's razor², and also follows our experience with machine learning algorithms in general (e.g. the tradeoff between generalisation performance and overfitting in SVMs, where the classifier boundary with the lowest VC dimension, in other words the boundary with the least complexity, is the one to be preferred (Vapnik, 1995)). As an initial estimate, for example, an image retrieval system might constrain the distribution to be spherical around
the starting query image (Figure 2.7).
2 entia non sunt multiplicanda praeter necessitatem, or entities should not be multiplied beyond
necessity.
3.1.2 Visualisation
An important decision to make in any practical CBIR system is how to display
the results to the user. The most straightforward option is to show a linear list
of images with decreasing similarity values. In case the system wishes to query
the user about the relevance status of certain images other than the result im-
ages, this can be done using a second linear list in parallel with the first one.
However, the linear display system might not be the most efficient. An alter-
native is to lower the dimensionality of the image features, and display the im-
ages spatially in a 2-D grid (or possibly 3-D with appropriate navigation possi-
bilities). The advantage of this arrangement is that it conveys information not
just about the n similarity values of the query image to the result images, but
rather all of the $\binom{n+1}{2}$ similarities between the images. The dimensionality reduction can be easily performed using Principal Components Analysis (Duda et al.,
2000). However, we sometimes have the case where non-Euclidean metrics are
more appropriate, or even when the original data points x_i are not available at all, only their inter-similarities δ_ij = dis(x_i, x_j) according to some (dis-)similarity function. Principal Components Analysis cannot be used in this case. This can, however, be achieved using multi-dimensional scaling.
Multi-Dimensional Scaling
$$\{\mathbf{p}_i\} = \arg\min_{\{\mathbf{p}_i\}} \left[ \frac{\sum_{i,j} \left( f(\delta_{ij}) - \|\mathbf{p}_i - \mathbf{p}_j\| \right)^2}{\sum_{i,j} \|\mathbf{p}_i - \mathbf{p}_j\|^2} \right]^{1/2}$$
Figure 3.1: 2-D image distribution obtained for the shown images by using
simple visual features together with the MDS algorithm. (A colour
version of this figure is available on Page 127)
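A closed-form sketch of the idea is classical (Torgerson) MDS, which embeds a dissimilarity matrix via double centring and an eigendecomposition; for truly Euclidean input it reproduces the distances exactly. This is an illustrative variant, not the iterative stress minimisation defined above.

```python
import numpy as np

def classical_mds(D, dim=2):
    # D: (n, n) symmetric matrix of pairwise dissimilarities delta_ij.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]       # keep the largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```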
3.2 Relevance Feedback Algorithms
$$\mathbf{Q} = \mathbf{Q}_0 + \beta \sum_{k=1}^{n_r} \frac{\mathbf{R}_k}{n_r} - \gamma \sum_{k=1}^{n_s} \frac{\mathbf{S}_k}{n_s} \qquad (3.4)$$
where

Q is the new query vector,
Q_0 is the initial query vector,
R_k is the vector for the k-th relevant document,
S_k is the vector for the k-th non-relevant document, and
β and γ are adjustable parameters which control the contributions of the relevant and non-relevant documents, respectively.
The formula can be seen in action in Figure 3.2. The net effect of Rocchio's algorithm is to add a displacement vector to the original query, so that
the resulting vector in general is further away from the non-relevant docu-
ments and nearer to the relevant documents. However, an undesired effect
is that the magnitude of the new query vector can be drastically different for
some values of γ and β. To control this, the resulting vectors were always
renormalized. This is fine for the case of document retrieval, where at the
time the feature vector consisted of the occurrences of various keywords, and
normalized term frequency makes sense. In the case of other types of fea-
ture vectors, as would be the case in image retrieval, this renormalization step
can have undesired consequences. Another aspect to be noticed is that the
term frequency vector consists of only positive values, but through the update
formula negative values could arise. The formula was used in such a way
that these negative values were made zero, so that the non-relevant document
could only affect the frequency of the terms which actually appear in the rel-
evant documents. This modification may not be adaptable to other kinds of
feature vectors, such as in image retrieval.
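The update, together with the renormalisation and the clipping of negative components discussed above, can be sketched as follows (the parameter values are illustrative):

```python
import numpy as np

def rocchio_update(q0, relevant, non_relevant, beta=0.5, gamma=0.2):
    # Equation 3.4, followed by the two practical modifications:
    # negative components are set to zero, and the result is
    # renormalised to unit length.
    q = (q0
         + beta * np.mean(relevant, axis=0)
         - gamma * np.mean(non_relevant, axis=0))
    q = np.maximum(q, 0.0)          # zero out negative term weights
    norm = np.linalg.norm(q)
    return q / norm if norm > 0 else q
```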
We assume a vectorial representation of the features extracted from the image content, i.e. x = f(I), where x ∈ R^n and I ∈ R^{M×N}. This representation is
mathematically most convenient, although other schemes may be more natu-
rally suitable, especially for a parts-based representation of image content.
Given an initial query vector xq , the aim of the CBIR system is then to re-
trieve images, which are most likely to be desired by the user. We define the
set of these desired images as the target images. The aim of the CBIR system
can then be restated as to determine the true distribution of the target images in the
input feature space. This is a non-trivial problem. Firstly, there is no certainty
that such a distribution exists, as e.g. would occur if the extracted features are
inadequate for describing the user’s intended search. Secondly, the system has
access to very limited information provided by the user in the form of labelled
[Figure: two panels showing Rocchio's update for (β = 0.50, γ = 0.50) and (β = 0.50, γ = 0.20); q0 and q mark the query before and after the update.]
Figure 3.2: Application of Rocchio's formula for different parameter values. As is apparent, in its default form this can lead to a magnification of the feature vectors. In the figure, the crosses represent the positive examples, while the circles represent the negative ones. q0 is the initial query point, and q the query point after applying Rocchio's algorithm.
samples.
In the absence of further information apart from the query vector xq , it
seems intuitive to make the target distribution symmetric around xq , and de-
creasing as one moves away from xq (following the smoothness assumption, Sec-
tion 3.1). This is, in fact, what we do when the image search process is started with the help of a query image.
Active Learning An online learning system can decide what its weaknesses
are, and formulate queries for the user, which when answered would
accelerate the learning process considerably as compared to having just
randomly selected new labelled data.
Incremental Learning The ability to accommodate new data into the already trained system, which is usually faster than a complete retraining; this can be a significant advantage, as speed is often a critical issue in interactive systems.
These terms are borrowed from the machine learning literature, even though
the original use in pedagogy predates it. A learner is passive, if it has no con-
trol over which information it receives from the outside world. If in addition,
the learner formulates questions in order to speed up the learning process, the learner is termed active. In the context of image retrieval, an active relevance
feedback algorithm delivers at each feedback round, a current best list of re-
sults, and a list of images which, when labelled by the user, would most likely
be useful for the learning algorithm (user query images). In general, these
two lists will be different, as the best user query images will be the ones about which the system is most unsure, i.e., the ones which the system has evaluated neither as confidently positive nor as confidently negative.
Granularity of Feedback
Scale to fit range Here one defines a range, such as [−1, 1] or [0, 1], in which all the feature elements of the training vectors must lie. If the lowest observed value for a feature f is l, and the highest is u, this can be performed with

$$f_{\mathrm{new}} = \frac{f - l}{u - l} \qquad (3.5)$$
After the normalisation, all training vectors lie inside a unit hypercube.
The used constants l and u must be saved in order to be applied to new
test data when it becomes available. Of course, it is entirely possible for unseen vectors to fall outside the hypercube. Apart from this, the main disadvantage of this normalisation method is its susceptibility to outliers.
Scale to standard statistics In this method, the training data is scaled, typically to zero mean and unit variance per feature. If the feature element f has variance σ² and mean µ over all training vectors, then this can be performed by

$$f_{\mathrm{new}} = \frac{f - \mu}{\sigma} \qquad (3.6)$$
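Both normalisations can be sketched as follows; note that the constants are fitted on the training data only and then re-applied unchanged to any test data:

```python
import numpy as np

def fit_scalers(X):
    # Learn the normalisation constants on the training data; they must
    # be stored so that new test data can be scaled identically.
    return {"l": X.min(axis=0), "u": X.max(axis=0),
            "mu": X.mean(axis=0), "sigma": X.std(axis=0)}

def scale_to_range(X, c):
    return (X - c["l"]) / (c["u"] - c["l"])   # Equation 3.5

def scale_to_standard(X, c):
    return (X - c["mu"]) / c["sigma"]         # Equation 3.6
```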
where

$$\mu = E\{\mathbf{x}\} \qquad (3.8)$$

is the mean vector. The covariance matrix, being symmetric and positive semidefinite, can always be diagonalized as

$$\mathbf{C}_{xx} = \mathbf{V} \Lambda \mathbf{V}^T \qquad (3.9)$$

The transformation

$$\mathbf{x}' = \Lambda^{-1/2} \mathbf{V}^T (\mathbf{x} - \mu) \qquad (3.10)$$

"whitens" the ensemble, i.e. the transformed ensemble has zero mean
and identity covariance matrix. In practice however, whitening should
not be performed before a feature weighting or selection process due to
several reasons. Whitening modifies the entire feature space instead of
scaling each feature element individually. In the original feature space
there might have been some “good” and some “bad” features, but this
separation could easily get lost during the whitening process, especially
since the whitening, or indeed any other kind of feature normalisation,
must be done over the whole training data, and not class-wise (as oth-
erwise there would remain ambiguity over how to normalize the new
test data). Due to this reason, whitening is usually not performed before
applying other feature weighting algorithms.
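For completeness, the whitening transform of Equation 3.10 itself is easily sketched; the covariance of the transformed ensemble is indeed the identity:

```python
import numpy as np

def whiten(X):
    # Equations 3.8-3.10: x' = Lambda^{-1/2} V^T (x - mu).
    mu = X.mean(axis=0)
    C = np.cov(X - mu, rowvar=False)   # C_xx, symmetric positive semidefinite
    w, V = np.linalg.eigh(C)           # C = V Lambda V^T
    return (X - mu) @ V / np.sqrt(w)   # scale each projection by 1/sqrt(lambda)
```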
$$\mathbf{x}' = \operatorname{diag}(\boldsymbol{\alpha})\, \mathbf{x}$$

or, written out:

$$\mathbf{x}' = \begin{pmatrix} \alpha_1 & 0 & \cdots & 0 \\ 0 & \alpha_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \alpha_n \end{pmatrix} \mathbf{x}$$
means computationally easy even for moderately sized feature vectors. The
feature selection problem for a two-class classification task could be expressed
as an optimisation problem:
where y_i ∈ {−1, 1} are the data labels, ω and b are the parameters of the (linear) classification boundary, and ‖ω‖₀ is the zero norm of the weight vector ω, defined as the number of its non-zero components.
The formulation searches for a solution with the smallest number of features which still classifies all of the training points correctly. In case the classes are not linearly separable, or if the possibility of outliers needs to be incorporated, the formulation can be extended with slack variables. It is of interest to note that this optimization problem is the same as the optimum margin problem in support vector machines, except that instead of the L2 norm, the zero norm should be minimized. However, this is a crucial difference, and as shown by Amaldi and Kann (1998), the above problem is NP-hard. Weston et al. (2003) propose modifications to the above zero-norm minimisation problem in order to make the task realistically solvable.
It is interesting to note that, from a purely theoretical viewpoint, feature selection (or even feature weighting) can offer no improvements if an optimum
classifier is used. This is because the Bayes classifier, which outputs the class
ωk with the highest a-posteriori probability P(ωk | x), has a monotonically in-
creasing behaviour with respect to the number of features. This means that
adding features can never decrease the expected Bayesian error rate, and thus
does no harm (apart from increased computational and memory requirements).
In practice, however, the Bayes rule cannot be applied because the underly-
ing class-specific distributions P(x | ωk ) are usually unknown. If we insist
on approximating the densities P(x | ωk ), ∀k = 1, . . . , K through some means
(e.g. Parzen window, Gaussian mixture models etc.), then it is often beneficial
to work in a lower-dimensional space, as the number of data points required
to cover the feature space with a fixed average density increases exponentially
with the dimensionality of the feature space.
On the other hand, a feature ranking approach considers each feature element
in isolation, and assigns a goodness value to it. The features are sorted in de-
scending order according to their goodness value, and receive thereby a rank.
A subset of features can then be selected by either taking the top-k ranked fea-
tures, or by fixing a threshold directly to their goodness value. The main bene-
fit of feature ranking methods over subset selection methods is speed. They are
typically linear in the number of features, while the optimum subset selection
method has an exponential running time. The main drawback is the possible
performance impact. The fact that a feature is good by itself does not say much
about how it performs in conjunction with other features. It might be highly
correlated with other features, thus reducing its usefulness. As an example,
consider a pattern described by three features: A, B, and C, which are useful
for a particular classification task in this order. Let the feature vector contain
redundant copies of these features and be given by ( A, A, A, B, B, B, C, C, C ).
Now, a feature ranking method which considers each feature element in iso-
lation and outputs in the end a subset of size 3 would yield ( A, A, A), while
clearly the desired output in most cases is ( A, B, C ).
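This pitfall is easy to reproduce numerically. The sketch below ranks features on synthetic data containing the redundant copies (A, A, A, B, B, B, C, C, C), using the correlation-coefficient criterion described in the next subsection; the top-3 subset consists of the three copies of A. The data generation is purely illustrative.

```python
import numpy as np

def correlation_criterion(X, y):
    # Per-feature goodness in isolation: (mu+ - mu-)^2 / (var+ + var-).
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    return num / (pos.var(axis=0) + neg.var(axis=0))

rng = np.random.default_rng(5)
y = np.where(rng.random(200) < 0.5, 1, -1)
A = y + 0.1 * rng.standard_normal(200)      # most discriminative feature
B = 0.5 * y + 0.3 * rng.standard_normal(200)
C = 0.25 * y + 0.5 * rng.standard_normal(200)
X = np.stack([A, A, A, B, B, B, C, C, C], axis=1)
top3 = np.argsort(correlation_criterion(X, y))[::-1][:3]
```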
4 Wrapper based methods are those which do not have knowledge about the internals of
the classifier, i.e. treat it as a black box.
3.3 Feature Weighting and Selection Methods
Figure 3.3: Feature selection using hill climbing and forward search. Each node shows the selection mask and the goodness criterion for that particular subset. In the shown example, the hill-climbing algorithm does not reach the optimum node with a goodness of 0.53 (shown in green).
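The forward search of Figure 3.3 can be sketched as a greedy loop over single-feature additions; `evaluate` stands for a hypothetical goodness function on feature subsets:

```python
def forward_search(n_features, evaluate, max_size):
    # Hill climbing with forward search: start from the empty subset and
    # repeatedly add the single feature that improves the goodness
    # criterion most, stopping at a local optimum (which, as the figure
    # shows, need not be the global one).
    selected, best = set(), evaluate(set())
    while len(selected) < max_size:
        scored = [(evaluate(selected | {f}), f)
                  for f in range(n_features) if f not in selected]
        score, f = max(scored)
        if score <= best:
            break
        selected, best = selected | {f}, score
    return selected, best
```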
Correlation Coefficients
In this method, the goodness value of a feature is based on its correlation with
the output label. It is measured as

$$\alpha_k = \frac{(\mu_{k+} - \mu_{k-})^2}{\sigma_{k+}^2 + \sigma_{k-}^2}$$

where µ_{k+} and µ_{k−} are the means, and σ_{k+} and σ_{k−} the standard deviations, of the k-th feature for the positive and negative classes, respectively.
These methods work with the idea that, given a linear classifier for a two-class problem,

$$y = \operatorname{sign}(\langle \omega, \mathbf{x} \rangle + b)$$

the absolute values of the coefficients of the hyperplane normal ω are indicative of the importance of the features. This is illustrated in a 2-D toy example
Figure 3.4: In this toy example, feature x1 may be considered more important than x2 because of the normal vector's larger component along x1.
in Figure 3.4.
We use the following methods of obtaining the linear classifier:
Linear SVM In this method, the normal coefficients of the SVM hyperplane are used directly (see Section 3.4).
Linear Regression Here we select the hyperplane which minimises the squared
error for the output values predicted linearly from the training data.
where
l
(ωopt , bopt ) = arg min ∑ (yi − hω, xi i − b)2
ω, b i =1
1 1
P ( x | ωi ) = exp(− (x − µi ) T Σ−1 (x − µi )) (3.13)
(2π ) N/2 det(Σ) 1/2 2
3.4 Relevance Feedback using a Two-Class SVM 35
\[ \varphi : \mathbb{R}^n \to F, \qquad x \mapsto \varphi(x) \]
One then obtains a classification function of the form \( f(x) = \operatorname{sgn}(\langle w, \varphi(x) \rangle + b) \). Through the use of a kernel \( k(u, v) = \langle \varphi(u), \varphi(v) \rangle \), different boundaries can be obtained. In fact, the kernel function k leads to classifiers with maximum margin in some mapped feature space even if the mapping φ itself is not analytically defined, as long as the kernel satisfies Mercer's condition (Mercer, 1909).
It should be noted that correct classification alone is not the goal of a general-purpose CBIR system, as the concept of classes does not exist here in the strict sense. More important is an intelligent ordering of the results, as the user will most likely see only the top few results. This behaviour is already common in text-search engines, where, for example, some query keywords can lead to millions of hits. In a two-class SVM, since the sign of the function f(x) determines the decision, it is natural to order the images by decreasing values of f(x). This simple procedure, as we will see, provides good results. Furthermore, the user provides feedback not on the most positive images which are shown as intermediate results, but rather on the images for which the magnitude of f(x) is closest to zero, i.e. the images closest to the SVM boundary, as suggested in Tong and Chang (2001).
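One feedback round can be sketched as follows, assuming scikit-learn is available and using randomly generated stand-in features (all sizes and distributions are illustrative). Ranking by decreasing f(x) and selecting the images closest to the boundary for the next round follow the procedure described above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Hypothetical database features and one round of user feedback:
# a few images marked positive (+1) and a few negative (-1).
database = rng.normal(size=(200, 27))               # e.g. 27-dim colour-spatial features
X_fb = np.vstack([rng.normal(0.8, 1.0, (5, 27)),    # positive examples
                  rng.normal(-0.8, 1.0, (5, 27))])  # negative examples
y_fb = np.array([1] * 5 + [-1] * 5)

clf = SVC(kernel="rbf", gamma="scale").fit(X_fb, y_fb)
f = clf.decision_function(database)

# Retrieval: show images ordered by decreasing f(x).
ranking = np.argsort(-f)

# Active selection (Tong & Chang, 2001): ask for feedback on the
# images closest to the SVM boundary, i.e. with smallest |f(x)|.
query_next = np.argsort(np.abs(f))[:5]

print("top result:", ranking[0], "next queries:", query_next)
```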
\[ \min_{R \in \mathbb{R},\, \zeta \in \mathbb{R}^l,\, c \in F} \; R^2 + \frac{1}{\nu l} \sum_i \zeta_i \]
subject to
\[ \|\varphi(x_i) - c\|^2 \le R^2 + \zeta_i, \qquad \zeta_i \ge 0, \qquad i = 1, \dots, l \]
The \( \zeta_i \)'s are slack variables denoting the distance of an instance from the hypersphere; they are used to penalize outliers. If \( \zeta_i > 0 \), then the positive training instance \( x_i \) is detected as an outlier and lies outside of the hypersphere with radius R. The parameter \( \nu \in [0, 1] \) sets the trade-off between the radius of the ball and the number of training instances it encloses. If ν is chosen small, the hypersphere is allowed to grow so that more training instances can be put into the ball. If ν is chosen large, the hypersphere is kept small while a fraction of the training instances is allowed to lie outside. The primal form of the optimization problem can be transformed
into a dual form using Lagrangian multipliers αi . The corresponding Lagrange
function is:
\[ L(R, \zeta, c, \alpha) = R^2 + \frac{1}{\nu l} \sum_{i=1}^{l} \zeta_i + \sum_{i=1}^{l} \alpha_i \left( \|\varphi(x_i) - c\|^2 - R^2 - \zeta_i \right), \qquad \alpha_i \ge 0 \]
This function has to be minimised. For the minimum the following conditions
have to hold:
\[ \frac{\partial L}{\partial R} = 0 \;\Leftrightarrow\; 2R - 2R \sum_{i=1}^{l} \alpha_i = 0 \;\Leftrightarrow\; \sum_{i=1}^{l} \alpha_i = 1 \]
\[ \frac{\partial L}{\partial c} = 0 \;\Leftrightarrow\; \sum_{i=1}^{l} \alpha_i \left( 2c - 2\varphi(x_i) \right) = 0 \;\Leftrightarrow\; c = \frac{\sum_{i=1}^{l} \alpha_i \varphi(x_i)}{\sum_{i=1}^{l} \alpha_i} \;\Leftrightarrow\; c = \sum_{i=1}^{l} \alpha_i \varphi(x_i) \]
Now L should be minimised with respect to the \( \zeta_i \)'s, subject to \( \zeta_i \ge 0 \). So either \( \partial L / \partial \zeta_i = 0 \) if such a point exists, or \( \zeta_i = 0 \) and \( \partial L / \partial \zeta_i > 0 \):
\[ \frac{\partial L}{\partial \zeta_i} = \frac{1}{\nu l} - \alpha_i \ge 0 \;\Leftrightarrow\; \alpha_i \le \frac{1}{\nu l} \]
Now L can be rewritten without the \( \zeta_i \)'s:
\[ L(\alpha) = \sum_{i=1}^{l} \alpha_i \, \varphi(x_i) \cdot \varphi(x_i) - \sum_{i,j=1}^{l} \alpha_i \alpha_j \, \varphi(x_i) \cdot \varphi(x_j) \]
subject to
\[ 0 \le \alpha_i \le \frac{1}{\nu l}, \qquad \sum_i \alpha_i = 1 \]
The optimal α's can be computed by solving this dual problem with a QP optimization method. After that, the centre of the hypersphere can be calculated if the mapping φ(x) is known:
\[ c = \sum_i \alpha_i \varphi(x_i) \]
But the mapping φ(x) will be unknown in most cases. The decision function \( f(x) = \operatorname{sgn}(R^2 - \|\varphi(x) - c\|^2) \) can nevertheless be computed without the centre, by expanding it in terms of kernel evaluations between the support vectors.
The support vectors are those instances \( x_i \) with \( 0 < \alpha_i < 1/(\nu l) \); the \( x_i \)'s with \( \alpha_i = 1/(\nu l) \) are the outliers, and the \( x_i \)'s with \( \alpha_i = 0 \) are the instances lying truly inside the ball. The radius R is computed such that all support vectors lie on the hull of the hypersphere. This is the case if, for all support vectors, the argument of the sgn is zero.
This function returns positive values for points inside the hypersphere and negative values outside (note that although we use the term hypersphere, the actual decision boundary can be varied by choosing different kernel functions). The results are sorted on the basis of their "positiveness". Since the actual value of the function f(x) is not important, we can speed up the process by noting that the first two terms in the decision function are constants. Furthermore, the last term k(x, x) is also constant for many kernels (e.g. Gaussian). Thus, for such kernels, the images can be ranked simply on the basis of decreasing values of
\[ f'(x) = \sum_i \alpha_i \, k(x_i, x) \]
In case the distance of an instance to the centre of the hypersphere is needed, it can be calculated as:
\[ d(x) = \sqrt{ \sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j) \; - \; 2 \sum_i \alpha_i \, k(x_i, x) \; + \; k(x, x) } \]
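The shortcut above is easy to verify in code. The sketch below uses scikit-learn's `OneClassSVM` as a stand-in (its ν-formulation is equivalent to the hypersphere model for the Gaussian kernel) and checks that the full decision value and the reduced ranking function \( f'(x) = \sum_i \alpha_i k(x_i, x) \) differ only by a constant, so they induce the same ordering; all data are synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X_train = rng.normal(size=(60, 8))      # hypothetical positive examples
X_db = rng.normal(size=(300, 8))        # database to be ranked

gamma = 0.1
ocsvm = OneClassSVM(kernel="rbf", gamma=gamma, nu=0.2).fit(X_train)

# Full decision value: f(x) = sum_i alpha_i k(x_i, x) + const.
f_full = ocsvm.decision_function(X_db)

# Reduced ranking function f'(x) = sum_i alpha_i k(x_i, x), valid for
# kernels with constant k(x, x) such as the Gaussian.
K = rbf_kernel(X_db, ocsvm.support_vectors_, gamma=gamma)
f_prime = K @ ocsvm.dual_coef_.ravel()

# The two differ only by the constant offset, hence identical rankings.
assert np.allclose(f_full - f_prime, ocsvm.intercept_[0], atol=1e-6)
```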
3.6 Manifold learning 39
Step 2 Estimate the geodesic distance dG (pi , p j ) by calculating the shortest path
between pi and p j in the graph G .
Step 3 Apply classical MDS (see Section 3.1.2) on the geodesic distance matrix
DG to generate a lower dimensional Euclidean space which best matches
the manifold geometry.
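The steps above are the Isomap algorithm; a minimal sketch using scikit-learn's implementation on synthetic data (a 1-D spiral embedded in 3-D; all sizes and parameters are illustrative):

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(3)

# Hypothetical data: a 1-D manifold (spiral) embedded in 3-D.
t = np.sort(rng.uniform(0, 3 * np.pi, 400))
X = np.column_stack([t * np.cos(t), t * np.sin(t),
                     rng.normal(0, 0.05, t.size)])

# Steps 1-3 internally: k-NN neighbourhood graph, shortest-path
# geodesic distances, classical MDS on the geodesic distance matrix.
emb = Isomap(n_neighbors=12, n_components=2).fit_transform(X)

print(emb.shape)   # → (400, 2)
```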
Table 3.1: Shape ground truth used for the experiments, with a total of 429 images in 8 different categories. Sample images are shown alongside the class name. (A colour version of this table is available on Page 124)

Class      # of members
circle     263
diamond    11
shape-1    67
shape-2    16
shape-3    16
shape-4    24
shape-5    21
triangle   11
3.7 Experiments and Results 41
We perform the experiments on two databases. With the first database (trade-
mark images), ground truth is defined on the basis of a clearly defined rule:
Two images are similar if they have basically the same shape. The advantage
is that the efficiency of feature selection algorithms can be tested with ease.
With the second database (MPEG-7 content set), the ground truth is subjec-
tively defined by an end user. This is done because in most image databases
only subjective ground truth is available. We will first give a short description
of the used databases and the ground truth.
The trademark image database used in this work was first used in Wan Nural Jawahir (2006). The database consists of 1000 images depicting trademarks or logos of companies, associations, sport clubs, etc. The reason for choosing
this database is that the ground truth is very accurate and multiple meanings
can be associated with the search. For example, given a trademark image as
a query, the user might be looking for similar shapes, similar colours, or both
during his or her search. For the purpose of the experiments, we will constrain
ourselves to the ground truth arising from similar shape only.
Features
Zernike Moments The Zernike moments of the image \( f(\rho, \theta) \) are computed as
\[ Z_{pq} = \frac{2p+2}{N^2} \sum_{\gamma=1}^{N/2} \sum_{\psi=1}^{8\gamma} R_{pq}\!\left(\frac{2\gamma}{N}\right) \cos\!\left(\frac{\pi q \psi}{4\gamma}\right) f(\rho, \theta) \tag{3.14} \]
The Zernike moments capture the shape information present in the trademarks.
Colour Spatial Features Color spatial features describe the colour distribu-
tion with spatial knowledge of pixels in an image. These features can
be extracted by using a simple and quick process. The original image I
is divided into 3 x 3 blocks of the same size. Each block consists of three
color components : Red, Green and Blue. Their mean values are stored as
features. Though these features are very simple, they are quite suitable
for the small sized trademark images. Each trademark image ends up
having a color-spatial feature vector of 27 dimensions.
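The extraction just described can be sketched in a few lines of NumPy (the 96 × 96 test image is hypothetical; `array_split` is used so that image sides not divisible by 3 are also handled):

```python
import numpy as np

def colour_spatial_features(img: np.ndarray) -> np.ndarray:
    """Mean R, G, B per cell of a 3x3 grid -> 27-dimensional feature vector.

    img: H x W x 3 array of colour values.
    """
    feats = []
    for rows in np.array_split(img, 3, axis=0):       # 3 horizontal bands
        for cell in np.array_split(rows, 3, axis=1):  # 3 cells per band
            feats.extend(cell.reshape(-1, 3).mean(axis=0))  # mean R, G, B
    return np.asarray(feats)

# Usage on a hypothetical 96x96 RGB trademark image:
img = np.random.default_rng(4).integers(0, 256, (96, 96, 3)).astype(float)
fv = colour_spatial_features(img)
print(fv.shape)   # → (27,)
```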
Ground Truth
The ground truth for shape similarity was extracted manually. A total of 8
shape types were defined which covered a total of 429 images. The categories
and sample images are shown in Table 3.1.
Features
We use invariant image features based on the haar integral which were in-
troduced by Schulz-Mirbach (1995). Fast approximate invariant features were
successfully used for image retrieval by Siggelkow (2002). The invariant fea-
tures are constructed as follows. Let M = {M(i, j)}, 0 ≤ i < N, 0 ≤ j < M
be an image, with M(i, j) representing the gray-value at the pixel coordinate
(i, j). Let G be the transformation group of translations and rotations with ele-
ments g ∈ G acting on the images, such that the transformed image is gM. An
invariant feature must satisfy F ( gM) = F (M), ∀ g ∈ G. Such invariant features
can be constructed by integrating f ( gM) over the transformation group G.
\[ I(\mathbf{M}) = \frac{1}{|G|} \int_G f(g\mathbf{M}) \, dg \]
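As an illustration, the group integral can be approximated by enumerating a discretised group. The sketch below uses cyclic translations combined with 90-degree rotations on a square image, and an arbitrary local function \( f(\mathbf{M}) = \mathbf{M}(0,0)\,\mathbf{M}(0,1) \); both are simplifications of the continuous group and kernel functions used in the text. Averaging over the whole (finite) group makes the result exactly invariant:

```python
import numpy as np
from itertools import product

def invariant_feature(M: np.ndarray) -> float:
    """I(M) = 1/|G| * sum over g in G of f(gM), for the discretised group G
    of cyclic translations combined with 90-degree rotations.
    f(M) = M(0,0) * M(0,1) is an arbitrary example monomial."""
    n_rows, n_cols = M.shape
    total, count = 0.0, 0
    for r, di, dj in product(range(4), range(n_rows), range(n_cols)):
        g_M = np.roll(np.rot90(M, r), shift=(di, dj), axis=(0, 1))
        total += g_M[0, 0] * g_M[0, 1]
        count += 1
    return total / count

img = np.random.default_rng(5).random((8, 8))
# F(gM) = F(M): the feature is unchanged under group elements.
assert np.isclose(invariant_feature(img), invariant_feature(np.rot90(img)))
assert np.isclose(invariant_feature(img),
                  invariant_feature(np.roll(img, 3, axis=1)))
```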
Figure 3.7: First round results for an image from class number 2. The first
row contains the query image. The next two rows show the top
results using the shape features, and the next two rows using the
colour features. The query images for the next round of relevance
feedback were automatically selected from this pool.
Kernel                   k(x, y)
Linear                   x · y
Polynomial               (γ(x · y) + coef0)^d, γ > 0
RBF                      exp(−γ‖x − y‖²), γ > 0
Histogram Intersection   ∑_{i=1}^{n} min(x_i, y_i)
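The histogram intersection kernel is not built into most SVM libraries, but it can be supplied as a precomputed Gram matrix. A sketch with scikit-learn on hypothetical histogram features (all data, sizes, and the C value are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def hist_intersection_kernel(X, Y):
    """K[i, j] = sum_d min(X[i, d], Y[j, d])  (histogram intersection)."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

rng = np.random.default_rng(6)
# Hypothetical histogram features for positive and negative feedback:
# positives have extra mass in the first bin, so they are separable.
X_pos = rng.dirichlet(np.ones(16), size=20) + np.r_[0.2, np.zeros(15)]
X_neg = rng.dirichlet(np.ones(16), size=20)
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# scikit-learn accepts a precomputed Gram matrix for custom kernels.
clf = SVC(kernel="precomputed", C=10).fit(hist_intersection_kernel(X, X), y)
preds = clf.predict(hist_intersection_kernel(X, X))
print("training accuracy:", (preds == y).mean())
```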
3.8 Results with the Trademark Database 45
Figure 3.8: Second round results using a two-class SVM for relevance feedback. The +++ in the title indicates that the image was a positive image, while −−− indicates that it was a negative image.
Figure 3.9: Second round results after feature selection using correlation coef-
ficients. The best 20 features were used to generate the results. (A
colour version of this figure is available on Page 128)
Figure 3.10: First round results for an image from class number 4. The first
row contains the query image. The next two rows show the top
results using the shape features, and the next two rows using the
colour features. The query images for the next round of relevance
feedback were automatically selected from this pool.
Figure 3.11: Second round results using a two class SVM for relevance feed-
back.
Figure 3.12: Second round results after feature selection using correlation co-
efficients. The best 20 features were used to generate the results.
Figure 3.13: Feature weights obtained using (a) and (c): a linear SVM and (b)
and (d): correlation coefficients
Figure 3.14: Screenshot of our CBIR system after labeling the first set of query
images
[Figure: Comparison of two-class SVMs (one training round) and improvement of a two-class SVM with intersection kernel after each feedback round; precision is plotted. (a) Effect of different kernel functions (intersection, RBF, linear/sigmoid, polynomial of degree 5). (b) Results after multiple feedback rounds.]

[Figure: Comparison of the best one-class with the best two-class SVM (histogram intersection baseline, one-class SVM with intersection kernel, two-class SVM with intersection kernel), precision plotted over recall. (a) After the first round. (b) After six rounds of relevance feedback.]
In this chapter, we deal with methods that potentially involve intensive com-
putations on the whole database, hence they are unsuitable for use in rele-
vance feedback scenarios. These methods are nevertheless important for con-
tent based image retrieval applications.
The main motivation for offline methods is to learn about the dataset as a whole, i.e. not with respect to a particular query image. This learning can take various forms. The knowledge gained can be used, for example, to speed up the retrieval time of the system, but also to provide insights about the contents
of the database. If some kind of ground truth is available for the database,
this can be used to optimize the tunable system parameters in advance, for
example by selecting or weighting features in order to get results as close to
the ground truth as possible.
Since we will be using various clustering techniques in this chapter as well
as in Chapter 5, we summarize these shortly in the next section.
58 Offline methods in CBIR
Input space partitioning vs. data grouping There are cases in which we wish
to partition the input space into regions for each cluster (either hard- or
fuzzy partitioning). This is needed, for example, when new data becomes
available and must be assigned to one or more of the precalculated clus-
ters. Otherwise, it is sufficient to know the optimum groupings of input
feature vectors.
Flat vs. hierarchical membership Flat clustering methods output just the final clusters C1, . . . , Ck, where Ci is either the partitioned region for the ith cluster, or simply a subset of the set of input feature vectors. Many natural groupings, however, are hierarchical in nature. Consider, for example, the problem of clustering all of the natural languages in the world, which are all born out of just a few ancient languages.
\[ E = \sum_{l=1}^{k} \; \sum_{\{i \,\mid\, x_i \in C_l\}} \| x_i - \mu_l \|^2 \tag{4.1} \]
The algorithm must be seeded with initial values for the k cluster centers. At every iteration, new cluster memberships and cluster centers are computed, and it is guaranteed that the goodness measure E of Equation 4.1 can only decrease with each iteration. However, as with gradient descent algorithms, it runs the risk of ending up at a local minimum instead of the global one. Another drawback is that the number of clusters k must be provided in advance, although some work has been done on automatically determining k (e.g. Ishioka (2000), Ray and Turi (1999)), mainly by trying different numbers of clusters and choosing the best k based on a pre-defined cluster goodness criterion. The k-means algorithm is provided in Algorithm 4.1.
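The iteration referenced as Algorithm 4.1 can be sketched as follows; the deterministic farthest-point seeding is an illustrative implementation choice, not prescribed by the text:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain k-means (Lloyd's algorithm). The error E of Eq. 4.1
    is non-increasing over the iterations."""
    # Seeding: first data point, then repeatedly the farthest point.
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)

    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: new centers are the cluster means.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    E = ((X - centers[labels]) ** 2).sum()   # goodness measure of Eq. 4.1
    return labels, centers, E

# Two well-separated blobs are recovered as two clusters.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers, E = kmeans(X, k=2)
assert len(set(labels[:50].tolist())) == 1
assert len(set(labels[50:].tolist())) == 1
```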
\[ d(C_i, C_j) = \frac{\sum_{x \in C_i,\, y \in C_j} d(x, y)}{|C_i|\,|C_j|} \tag{4.3} \]
[Figure 4.1: Left: the 2-D data points x1 . . . x8. Right: the corresponding dendrogram, with the similarity value on the vertical axis.]
one that contains all of the training data. The resulting hierarchical tree can be visualized using a dendrogram. One axis of a dendrogram shows the data points, while the other axis indicates their grouping, as well as the distance at which the grouping was achieved. Figure 4.1 shows a sample dendrogram for the 2-D data points shown on the left. The dendrogram can be cut at any desired similarity level in order to achieve the requisite number of clusters.
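The agglomerative procedure, with the average-linkage distance of Equation 4.3, is available in SciPy; a small sketch mirroring the Figure 4.1 setup (the eight synthetic points and the cut threshold are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(8)
# Eight 2-D points forming two natural groups, as in Figure 4.1.
X = np.vstack([rng.normal(0, 0.1, (4, 2)), rng.normal(3, 0.1, (4, 2))])

# Agglomerative clustering with average linkage, i.e. the inter-cluster
# distance of Eq. 4.3. Z encodes the dendrogram and can be drawn with
# scipy.cluster.hierarchy.dendrogram.
Z = linkage(X, method="average")

# "Cutting" the dendrogram at a distance threshold yields flat clusters.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)   # two flat clusters
```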
For many keywords, this method performs reasonably well, as can be seen in the screenshot of Figure 4.2. As one can see there, there were over 20 million hits for the keyword ocean, and even if chances are that only a very small fraction of these actually contain images of oceans (i.e. the system exhibits an overall low precision), such a system is quite useful because the initial precision is high. Thus, a kind of economy of scale is at work here: with millions of hits, even a low overall precision leaves enough relevant images at the top of the list.
The low precision comes from the fact that the web data is unstructured,
apart from some structure imposed by natural language which is often am-
biguous without contextual knowledge. Except for the knowledge of the statis-
tics with which words or constellations of words appear on a webpage, there
is little known about the semantics which are conveyed by the page. If the
semantic web (Shadbolt et al. (2006), Cardoso and Sheth (2006)) ever becomes
a reality, it should go a long way towards improving this contextual form of
automatic image annotation.
In this work, however, we wish to deal with purely content-based automatic
annotation of images. This problem is much more difficult because of the
large amount of variability in the image content which is possible without any
change in the semantics conveyed by the image. However, it is an exciting and evolving research field, and a long-term research goal could be to bring machine performance closer to that of humans, who have the capacity to learn at least a few thousand concepts. The two possibilities of annotating images (contextual and content-based) should be considered as complementary rather than competing with each other, as there will always be situations where a collection of images is available without any further information, and thus only content-based annotation is applicable.
The tags can be broadly divided into two principled types:
Content Description These kinds of tags identify objects, describe scenes, etc.
occurring in the content of the images. Example keywords include people,
car, landscape, etc.
Meta-Data Tags Here we talk about imaging parameters or other available
information which might be difficult to reconstruct later using either au-
tomatic or manual methods. For example, most if not all digital cameras
nowadays add information such as shutter speed, ISO setting, date/time of
image capture, and even the location coordinates if a GPS receiver is built-
in, etc. By their very definition, it is either very difficult or impossible to
assign such tags from the pixel content alone.
Figure 4.2: Image search using keywords extracted using automated analysis
of the webpage containing the image. The screenshot shows the
results for the query keyword ocean using the commercial search
engine GoogleTM .
by the camera during the photo capture. As one can see from the example,
there could always be some tags which one could not expect the machine to
learn, as for example, the name of the city where an indoor image was taken.
However, this would be true even for a human observer unfamiliar with the
context of the image, thus the machine cannot be said to be inferior in such
situations.
We describe briefly some prior work in the field of automatic annotation.
Barnard et al. (2003) presented a scheme to link segmented image regions with
words, the results depending heavily on the quality of the segmentation. Vogel
(2004) assigned semantically meaningful labels to local image regions formed
by dividing the image into a rectangular grid. Li and Wang (2003) proposed
a statistical modeling approach using a 2-D Multiresolution Hidden Markov
Model for each keyword and choosing the keywords with higher likelihood
values. Cusano et al. (2003) use a multi-class SVM for annotation though their
scheme can hardly be judged due to their very small vocabulary consisting of
seven keywords.
In this work we describe our annotation methodology which consists of
a feature extraction, feature weighting, model evaluation and a keyword as-
signment routine (Setia and Burkhardt, 2006). Note that we sometimes use the
4.2 Automatic Annotation of Images 63
terms feature weighting and feature selection interchangeably, as once the sys-
tem has given a weight to each feature, they can always be ranked to select the
ones with the higher weights.
We describe briefly the outline of this section. We first give a description of
the visual features used, then present our feature weighting algorithm. Later
we give a description of our model based on the one-class SVM, and present
the results of the experiments. We conclude with a discussion and an outlook
for possible improvements and future work.
Figure 4.3: A tags example from the photo sharing website FlickrTM . a) An
example image, b) Labels supplied by users, and c) Tags automat-
ically added in the JPEG image’s EXIF section.
\[ x_1, \dots, x_l \in \mathbb{R}^n \]
\[ x_{l+1}, \dots, x_{l+m} \in \mathbb{R}^n \]
with \( m \gg l \). Furthermore, we denote the k-th component of the i-th feature vector by \( x_i^{(k)} \).
\[ \hat{p}(x^{(k)}) = \sum_{j=1}^{J} \pi_j \, \mathcal{N}(x^{(k)} \mid \Theta_j) \]
\[ \mathrm{avg}_k = \frac{1}{l} \sum_{i=1}^{l} \hat{p}\big(x^{(k)} = x_i^{(k)}\big) \]
The higher this average likelihood is, the more similar the feature is between the positive and the negative classes, and therefore the less discriminative it is. Thus, we define the weight for the k-th feature as
\[ w_k = 1 / \mathrm{avg}_k \]
The weights are normalized so that \( \sum_{k=1}^{n} w_k = 1 \). This has the effect that all models deliver optimum performance (tested through cross-validation) for about the same model parameters. The features are then weighted with \( w_k \) before being fed to the model computation routine (each model gets its own set of weights). We now show that the weighting scheme is effective and that the weights can in fact even be directly interpreted for our features. To do this, we plot in Fig. 4.4 the calculated weights for the 48 features for 4 Corel categories: churches, space1, forests and flora. The training data consisted of 40 images each in the positive class and the complete Corel collection of 60,000 images as the negative class (note that it is immaterial here whether the positive images are considered for determining the Gaussian mixture distribution or not, as we
have a very large number of samples available from the stochastic process).
The sequence of the 48 features is explained in the figure caption.
For the churches category, the maximum weight went to the edge features
corresponding to the directions 0◦ and 180◦ , i.e., the discriminative vertical
edges present in churches and other buildings (most images in the category
were taken upright). For the space1 category, the most discriminative feature
the system found was the 7th feature, which is the mean of the brightness (V)
component of the image (the images in the category are mostly dark). For the
forests category, texture features get more weight, as does the hue compo-
nent of the colour features. We however did find some categories where the
weights were somewhat counter-intuitive or difficult to interpret manually. An
example is the category flora in part d).
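The per-feature weighting scheme above can be sketched with scikit-learn's mixture models. The two-feature synthetic data, the component count, and all constants below are illustrative; one 1-D mixture is fitted per feature on the large background set, and the weight is the inverse average likelihood of the positive examples:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)

# Hypothetical setup: a large "background" set and a small positive class.
# Feature 0 is discriminative (positives are shifted); feature 1 is not.
bg = rng.normal(size=(2000, 2))
pos = np.column_stack([rng.normal(4.0, 1.0, 40), rng.normal(0.0, 1.0, 40)])

weights = []
for k in range(2):
    # One 1-D mixture model per feature, fitted on the background data.
    gmm = GaussianMixture(n_components=3, random_state=0).fit(bg[:, [k]])
    # Average likelihood of the positive examples under this model.
    avg_k = np.exp(gmm.score_samples(pos[:, [k]])).mean()
    weights.append(1.0 / avg_k)

weights = np.array(weights) / np.sum(weights)   # normalise to sum to 1
# The discriminative feature receives the (much) larger weight.
assert weights[0] > weights[1]
```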
\[ \min_{R \in \mathbb{R},\, \zeta \in \mathbb{R}^l,\, c \in F} \; R^2 + \frac{1}{\nu l} \sum_i \zeta_i \]
subject to
\[ \|\varphi(x_i) - c\|^2 \le R^2 + \zeta_i, \qquad \zeta_i \ge 0, \qquad i = 1, \dots, l \]
φ(xi ) is the i-th vector transformed to another (possibly higher-dimensional)
space using the mapping φ. c is the center and R the radius of the hyper-
sphere in the transformed space. With the kernel trick (Vapnik, 1995) it is
possible to work in the transformed space without ever calculating the map
φ(xi ) explicitly. This can be achieved by defining a kernel function k (xi , x j ) =
Figure 4.4: Feature weights for four sample Corel categories. Each row plots the feature weights and a few sample images from the category. The order of features in each graph is as follows: 1) 18 colour features: mean, variance and skewness of the hue (H) layer, followed by those of the S, V, HS, SV and VH layers. 2) 12 texture features: entropy of the H, V and D first-level decompositions, followed by the 2nd and 3rd levels. 3) 18 edge features: bins starting from 0° in anti-clockwise direction, with each bin having a span of 20°. (A colour version of this figure is available on Page 129)
\( \langle \varphi(x_i), \varphi(x_j) \rangle \), as the algorithm needs access only to the dot products between the vectors, and not to the actual vectors themselves.
The trade-off between the radius of the hypersphere and the number of outliers can be controlled by the single parameter \( \nu \in (0, 1) \). Using Lagrange multipliers, the above can be written in the dual form as:
\[ \max_{\alpha} \; \sum_i \alpha_i \, k(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j) \]
subject to
\[ 0 \le \alpha_i \le \frac{1}{\nu l}, \qquad \sum_i \alpha_i = 1 \]
The optimal α's can be computed with the help of QP optimization algorithms. The decision function then is of the form
\[ f(x) = \operatorname{sgn}\left( R^2 - \|\varphi(x) - c\|^2 \right) \]
This function returns positive values for points inside this hypersphere and negative values outside (note that although we use the term hypersphere, the actual decision boundary in the original space can be varied by choosing different kernel functions. We use a Gaussian kernel \( k(x, y) = \exp(-\gamma \|x - y\|^2) \), with γ and ν determined empirically through cross-validation). Since we need a rank for each keyword in order to annotate the image, we leave out the sign function, so that the results can be sorted on the basis of their "positiveness". Furthermore, it was found that the results are biased towards keywords whose training images are very dissimilar to each other, i.e., the models for which the R² term is high. Compact models are penalised, and therefore we use the following function instead for model evaluation:
Table 4.1: Sample annotation results from the system. The query images are
taken from the Corel Collection but did not belong to the train-
ing set. It can be observed that while the original category was
sometimes not found, it was due to the fact that the categories of-
ten overlapped, as the top matches do indeed contain very similar
categories, leading to robust annotation results. (A colour version of
this table is available on Page 125)
Original Category: africa
Top 8 Matches: wildlife_rare, architect, shells, dogs, mammals, newzealand, 197000, pastoral
Final Annotation: grass, animal, dog, rareanimal, shell, mammal, NewZealand, pastoral

Original Category: wl_ocean
Top 8 Matches: plants, foliage, can_park, US_garden, green, texture13, flower2, flora
Final Annotation: green, plant, flower, flora, foliage, leaf

Original Category: 189000
Top 8 Matches: tribal, 239000, thailand, 189000, groups, guard, indonesia, work
Final Annotation: people, cloth, perenial, face, life, tribal

Original Category: lizard1
Top Matches: microimg, design1, texture1, texture7, texture9, food2
Final Annotation: textures, texture, skins, natural, microimage
used. Each category is manually labelled² with a few descriptive keywords (typically 3 to 5). Each category consists of 100 colour images of size 384 × 256, out
of which we select 40 images randomly as training images. Normally for im-
age annotation we would be training a model for every annotation keyword,
and would annotate a query image with the keywords whose models evalu-
ate the query image most favourably. For this experiment however, we learn
a model for every Corel category instead of each annotation keyword. Then, for
the best k category matches (we experiment with k = {5, 8, 11, 14}), the cate-
gory keywords are combined and the keywords least likely to have appeared
by chance are taken for annotation, as in Li and Wang (2003). This scheme
favours infrequent words like waterfall and asian over common ones like
landscape and people.
To have an estimate of the discriminative performance of the system, we
perform a classification task with the 600 categories. The system attains an
accuracy of 11.3 % as compared to 11.88 % that of ALIP. However, as also
pointed out in Li and Wang (2003), many of the categories overlap (e.g. Africa
and Kenya) and it is not clear how much can be read from this classification
performance. Furthermore, we found that even when the best category match was incorrect in the sense of the category ground truth, it was often meaningful
with regard to the query image. We provide some annotation examples in
Table 4.1.
For a more controlled test, we take 10 distinct Corel Categories, namely
Africa, beach, buildings, buses, dinosaurs, elephants, flowers, food, horses
and mountains. The confusion matrix for this task is shown in Table 4.2. Over-
all, the system attains classification accuracy of 67.8% as compared to 63.6%
attained in ALIP.
Computation Time: All experiments were performed on an Intel Pentium
IV 2.80 GHz single-CPU machine running Debian Linux. Calculation of im-
age features takes about 1.2 seconds per image. Model computation with 40
training vectors per model takes only about 20 msec per model. A new query
image needs about 4 seconds to be fully annotated (this includes computation
time for feature extraction, evaluation of 600 models, and decision on unlikely
keywords), as compared to 20 minutes for the HMM-based approach in ALIP.
This makes our system faster by a factor of 300 (or 100 taking the clock speed
of the ALIP system into account). The system scales linearly with the number
of models.
2 We thank James Wang for making the category annotation available for comparison
Table 4.2: Confusion matrix (in %) for the classification task with 10 distinct Corel categories.
% Afr. bch. bld. bus. dns. elph. flow. hrs. mnt. food
Africa 66 6 12 0 0 2 2 2 6 4
beach 16 32 28 0 0 8 2 2 6 6
buildings 6 6 76 2 0 0 2 0 6 2
buses 0 0 30 64 0 0 0 6 0 0
dinosaurs 0 0 2 0 94 0 0 0 0 4
elephants 28 0 0 0 0 50 0 8 12 2
flowers 10 0 4 0 0 0 78 0 4 4
horses 6 2 6 0 0 2 2 72 10 0
mountains 4 4 10 0 0 0 6 0 70 6
food 0 2 14 0 2 0 2 0 4 76
• In the case of image retrieval, for giving the user efficient access to the
database. For example, once the user starts with a query image, she can
immediately start exploring images which belong to classes similar to the
3 We define a tree to be node balanced if each node is symmetric, i.e. the left and the right
subtrees at each node are of the same size. A completely balanced binary tree would require
that the number of classes l be a power of 2.
[Figure: Two dendrograms over the points x1 . . . x8, with the similarity value on the vertical axis.]
class of the query image. This has the potential for greatly enhancing the
user experience and shortening the search time.
In the case of image classification, the binary hierarchy tree can offer several
benefits:
• Faster classification: Widely used classifiers such as the SVM are originally designed as binary classifiers. Classifying l classes is usually done using multiple binary classifiers, such as in the one-vs-one approach, in which l(l − 1)/2 binary classifiers are trained and executed, and voting is performed over their outputs.
4.3 Hierarchy Discovery in Image Databases 75
• Better tuning possibility: The hierarchy tree based method offers more
opportunities for classifier tuning. For example, the leaf class nodes
grouped at a lower level would be in general very similar and might
distinguish themselves using a few particular features only. Feature se-
lection thus could be more effective in such a scenario. Furthermore, the
higher nodes might be better classified using a smaller value of the SVM
cost parameter C as compared to the lower nodes, in order to have better
generalizability.
Dissimilarity function
Clearly the most critical part of the proposed algorithm is the dissimilarity
function for a pair of classes Ci and Cj . However, this requires answering a
question which is as much philosophical as mathematical: when should two classes of images be considered similar? The dissimilarity function should take into account the final goal, which is a good visual grouping of classes, as well as good classification results using a hierarchical classification on the generated hierarchy tree. Simple functions, based on the mean
distance between the feature vectors contained in C p and Cq (Lei and Govin-
daraju, 2005), or the margin obtained by classifying C p and Cq using a SVM
(Chalasani et al., 2007), have been tried before. We propose two different matching schemes, which we name feature similarity and hyperplane similarity and which are described below. Both schemes operate using an intermediate class \( C_k \) for determining the dissimilarity. We define \( D^k_{pq} \) as the distance or dissimilarity between \( C_p \) and \( C_q \) using the intermediate class \( C_k \). The overall distance \( D_{pq} \) can thus be defined over all intermediate classes, excluding the classes in question:
D_{pq} = \sum_{k=1,\, k \neq p,q}^{C} D^{k}_{pq}    (4.4)
Other ways of combining the D^k_{pq} into D_{pq} are certainly possible but were not investigated in this work. We now describe the methods to calculate the D^k_{pq}, and the rationale behind them.
Feature Similarity
The SVM trained to separate C_p from the intermediate class C_k yields the hyperplane

\langle \omega^{pk}, x \rangle + b^{pk} = 0    (4.5)

with weight vector ω^{pk}. The feature-similarity dissimilarity compares the per-feature weight magnitudes of the two hyperplanes (p vs. k) and (q vs. k):

D^{k}_{pq} = \sum_{f=1}^{F} \bigl|\, |\omega^{pk}_{f}| - |\omega^{qk}_{f}| \,\bigr|    (4.7)
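A minimal sketch of this distance, assuming the two SVM weight vectors ω^{pk} and ω^{qk} are already available (the vectors below are hypothetical, for illustration only):

```python
import numpy as np

def feature_similarity_distance(w_pk, w_qk):
    # D^k_pq: compare the per-feature absolute weights of the two
    # hyperplanes; similar weight profiles indicate that classes p and q
    # are separated from the intermediate class k by the same features
    return float(np.sum(np.abs(np.abs(w_pk) - np.abs(w_qk))))

# hypothetical weight vectors for illustration
w_pk = np.array([0.9, -0.1, 0.0])
w_qk = np.array([-0.8, 0.2, 0.1])
print(feature_similarity_distance(w_pk, w_qk))
```

Note that only the magnitudes enter, so two hyperplanes relying on the same features with opposite signs still count as similar.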
Hyperplane Similarity
Here we use the dot product between the SVM hyperplane normals as the similarity measure, i.e.

D^{k}_{pq} = 1 - \frac{\langle \omega^{pk}, \omega^{qk} \rangle}{\|\omega^{pk}\| \, \|\omega^{qk}\|}.    (4.8)

Expanding the normals in terms of their support vectors x_i and y_j with dual coefficients α_i and β_j:

D^{k}_{pq} = 1 - \frac{\langle \omega^{pk}, \omega^{qk} \rangle}{\|\omega^{pk}\| \, \|\omega^{qk}\|}
           = 1 - \frac{\langle \sum_i \alpha_i \Phi(x_i), \sum_j \beta_j \Phi(y_j) \rangle}{\langle \sum_i \alpha_i \Phi(x_i), \sum_j \alpha_j \Phi(x_j) \rangle^{1/2} \, \langle \sum_i \beta_i \Phi(y_i), \sum_j \beta_j \Phi(y_j) \rangle^{1/2}}
           = 1 - \frac{\sum_i \sum_j \alpha_i \beta_j \langle \Phi(x_i), \Phi(y_j) \rangle}{\bigl( \sum_i \sum_j \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle \bigr)^{1/2} \bigl( \sum_i \sum_j \beta_i \beta_j \langle \Phi(y_i), \Phi(y_j) \rangle \bigr)^{1/2}}
           = 1 - \frac{\sum_i \sum_j \alpha_i \beta_j K(x_i, y_j)}{\bigl( \sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j) \bigr)^{1/2} \bigl( \sum_i \sum_j \beta_i \beta_j K(y_i, y_j) \bigr)^{1/2}}
which can be computed using kernel evaluations between the support vectors of the two hyperplanes. It should be noted that the bias terms b^{pk} and b^{qk} do not play a role in the calculation of D^k_{pq}, and it is not clear whether they should. After all, the role of the bias term is only to shift the hyperplane in the direction of the normal vector, thus making the first class more preferable (for positive bias) or the second class (for negative bias). At this stage it is interesting to note that one would expect a bias value of 0 for a neutral classifier. This statement is meant in the sense of test points x which are very dissimilar to all support vectors, i.e. K(x, x_i) ≈ 0, ∀i = 1 … N_SV. A classifier with a non-zero bias would assign all such points to exactly one of the classes, which may not be the desired behaviour, especially for distance-based kernels.
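The kernelized form of Eq. 4.8 can be sketched as follows. Here alpha and beta stand for the signed dual coefficients of the two SVMs, X and Y for their support vectors; the RBF kernel and all numeric values are illustrative assumptions, not taken from the thesis experiments:

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def hyperplane_distance(alpha, X, beta, Y, k=rbf):
    # D^k_pq = 1 - <w_pk, w_qk> / (||w_pk|| ||w_qk||), evaluated purely
    # through kernel values between support vectors (Eq. 4.8); the bias
    # terms b_pk, b_qk do not enter, as noted in the text
    cross = sum(a * b * k(x, y) for a, x in zip(alpha, X) for b, y in zip(beta, Y))
    norm_a = sum(a1 * a2 * k(x1, x2)
                 for a1, x1 in zip(alpha, X) for a2, x2 in zip(alpha, X)) ** 0.5
    norm_b = sum(b1 * b2 * k(y1, y2)
                 for b1, y1 in zip(beta, Y) for b2, y2 in zip(beta, Y)) ** 0.5
    return 1.0 - cross / (norm_a * norm_b)

alpha, X = [1.0, -1.0], [[0.0, 0.0], [1.0, 0.0]]
beta, Y = [1.0, -1.0], [[0.0, 0.1], [1.0, 0.1]]
print(hyperplane_distance(alpha, X, beta, Y))  # near 0: almost identical hyperplanes
```

Identical hyperplanes give distance 0, orthogonal normals give distance 1, which matches the cosine interpretation of Eq. 4.8.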
Figure 4.6: Feature weights between all pairs of classes from the MNIST
database. The gray value indicates the feature weight. The first
column and the first row show the classes, in the form of the mean
value of the images of the particular class.
80 Offline methods in CBIR
[Figure 4.7: dendrogram with class distance (600–1800) on the y-axis and class index (leaf order 4 9 7 2 3 5 8 6 1 0) on the x-axis.]
Figure 4.7: Binary classification tree generated for the MNIST digit database.
It can be seen from the dendrogram that the class for the digit 5 is similar to the class for the digit 8, that the digit 4 is visually similar to the digit 9, and so on.
Corel Collection
Figure 4.8: Sample images from the Corel categories used for the experiments.
(A colour version of this figure is available on Page 130)
[Figure 4.9: dendrogram over the 27 Corel categories (autumn, dogs, wildcats, wl_afric, plants, flora, texture1, texture2, peoplei, aviation, warplane, car_race, women, car_old1, mountain, children, ruins, churches, fashion1, roses, flower1, harbors, forests, beach, fractals, men, people1), with the merge distance (6–22) on the y-axis.]
Figure 4.9: Taxonomy for the 27 selected classes from the Corel collection
Chapter 5
Radiograph Annotation and Retrieval
5.1 Overview
In Chapters 3 and 4 we described various algorithms for learning in image retrieval, and compared them with state-of-the-art algorithms for each specific task at hand. However, there remains a gap between the description of the individual components and how they could fit together in a real-life image retrieval or classification system; this chapter intends to bridge that gap.
We chose for this purpose the IRMA radiograph database from the University Hospital, Aachen, Germany¹. It consists of 10,000 fully classified radiographs taken randomly from medical routine, out of which 1,000 radiographs are considered to be part of the test set. The aim is to find out how well current techniques can identify certain image parameters, such as image modality, body orientation, body region, and biological system examined, based on the content of the images. There are various reasons for choosing this database. Firstly, the number of classes (116) is high enough to make it a suitable candidate for annotation tasks, while the query-by-example usage paradigm still remains practical. Secondly, a fairly large number of reference results exist for the database, partly because it was used in the ImageCLEF Benchmark for three consecutive years (2005 through 2007, albeit in modified forms). Last but not least, it allows us to support the statement made earlier in the thesis about the applicability of the algorithms to local features extracted at
¹ We would like to thank Dr. TM Lehmann, Dept. of Medical Informatics, RWTH Aachen, Germany, for making the database available for research purposes.
86 Radiograph Annotation and Retrieval
Technical code (T) Defines the imaging modality used. This uses a maximum
of four positions, and contains a) The physical source (e.g. 1. x-ray or
2. ultrasound), b) Modality details (e.g. 12. fluoroscopy or 13. angiography),
c) Technique (e.g. 111. digital, 112. analog or 3. stereometry), and d) lastly
the subtechnique (e.g. 1111. tomography or 1114. parallel beam).
Directional code (D) Defines the body orientation. This uses three characters indicating a) The common orientation (e.g. 1. coronal or 2. sagittal), b) Detailed orientation (e.g. 11. posteroanterior), and c) lastly the functional orientation (e.g. 111. inspiration, 112. expiration or 113. valsalva).
Anatomical code (A) Indicates the body region examined. A total of nine ma-
jor regions are defined (e.g. 1. total body, 2. head/skull, 3. spine etc.). The
major region is followed by two hierarchical subcodes, (e.g. 3. spine,
31. cervical spine, 311. dens).
Biological code (B) Defines the biological system examined. This supplements
the anatomical code which is not precise enough. The first position iden-
tifies one of ten possible organ systems (e.g. 1. cerebrospinal system), and
the remaining positions help in identifying exactly the organ in question
(e.g. 11. central nervous system, 111. mesencephalon).
The final code is a character string of not more than thirteen (13) characters: {TTTT − DDD − AAA − BBB}. Table 5.1 shows four sample images from the database along with their codes in descriptive notation. There are three different usage scenarios where image processing and machine learning techniques can be applied to this database.
5.2 Database and Objectives 87
Table 5.1: Sample images from the IRMA database, together with their class
number (1-116), and code description for the four independent axes
[Table 5.1: e.g. 10015.png — Class #112; 10020.png — Class #58.]
Flat Classification In this case, each possible code combination (i.e. {TTTT − DDD − AAA − BBB}) is considered a base class. The rationale is that although the number of code combinations is quite high (∏_{i=1}^{4} N_i, where N_i is the number of leaf nodes in the hierarchy tree of the i-th axis), in practice only a fraction of them are applicable or likely, so the class can be assigned using standard multi-class classification techniques. One positive side-effect is that impossible code combinations, which can occur during automatic meta-data assignment, are ruled out in the flat classification scheme. However, a misclassification in the flat scheme can lead to multiple tags being wrongly assigned, which might have been averted by assigning tags on an axis-by-axis basis.
All of the above usage scenarios require robust and discriminative features.
We have developed for this purpose the so-called cluster cooccurrence ma-
trices, which use local features extracted around so-called interest points in
an image, and combine them using their spatial information to yield a single
global feature vector per image (Setia et al., 2008). We describe below the algo-
rithm to generate the cluster cooccurrence matrices.
5.3 Feature Extraction Algorithm 89
Interest Points Apply an interest point detector (in our case, the Loupias salient point detector). Sort the obtained saliency map, and take the N_s points with the highest saliency values for further computation.
Clustering Take a random subset of local feature vectors from all training images. Cluster these feature vectors into N_c clusters according to some optimization criterion. Save the cluster centers for later use with test images.
Cluster Co-occurrence Matrix The nearest cluster is calculated for every local feature vector of the image. The complete local feature vectors are discarded, and the only retained information is the index of the nearest cluster. Consider all possible salient-point pairs. A cluster co-occurrence matrix of size N_c × N_c is generated sector-wise (i.e. over radial and angle ranges), yielding a 4-D feature matrix. This 4-D matrix is flattened and used as the final feature vector for the image, for use, for example, with an independent classifier.
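The quantization step above (replacing each local feature vector by the index of its nearest cluster center) can be sketched as follows; the centers and descriptors are hypothetical:

```python
import numpy as np

def assign_cluster_indices(local_features, centers):
    # squared Euclidean distance from every local feature vector to every
    # cluster center, shape (n_points, n_clusters); keep only the argmin
    d = ((local_features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

centers = np.array([[0.0, 0.0], [10.0, 10.0]])        # from the training stage
local_features = np.array([[1.0, 0.5], [9.0, 9.5]])   # hypothetical descriptors
print(assign_cluster_indices(local_features, centers))  # -> [0 1]
```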
s = \sum_{i=1}^{l} |c_i|    (5.1)
Figure 5.1: 1000 Interest points with the highest saliency value for each of the
two images shown on the left. Although some interest points are
found in the non-discriminative parts of the images (for example,
the man-made object embedded in the chest or the text in the top-
right corner), local methods are still very robust to partial match-
ing.
The above scenario is repeated for every wavelet coefficient that exceeds a certain threshold, τ, which saves computation time by not investigating small wavelet coefficients. We end up with a matrix (which we shall call the “saliency map” here) representing the saliencies of the image pixels. The saliency map is then sorted, and in this work a fixed number of salient points (N_s) per image is taken. An alternative strategy is to fix a threshold and select all points having a saliency above it. Pixels very near to the image boundary (up to 6 pixels in our case) are not considered candidates for an interest point, as the local features cannot be accurately calculated there without introducing artifacts. The detected interest points for two sample images from the used database are shown in Figure 5.1.
LBP = \sum_{i=0}^{n-1} s(v_i - v_c) \, 2^{i}, \quad \text{where}    (5.2)

s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0, \end{cases}    (5.3)
where v_i and v_c are the gray values at a neighboring pixel and at the center pixel, respectively, and n gives the number of pixels in the circular neighborhood of the center pixel. Since only the sign of the difference (v_i − v_c) is considered, the effect of grayscale shifts is totally eliminated. Invariance against scaling of the grayscale is achieved by the s operator, as the sign of the difference is mapped to 0 or 1.
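A minimal sketch of the basic 8-neighbour LBP of Eqs. 5.2/5.3 on a 3×3 patch; the neighbour ordering below is an arbitrary choice (any fixed ordering works), and the sample patch is illustrative:

```python
import numpy as np

def lbp(patch):
    # threshold each neighbour against the centre (Eq. 5.3) and pack the
    # resulting bits into an integer (Eq. 5.2)
    vc = patch[1, 1]
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum((1 if v >= vc else 0) << i for i, v in enumerate(neighbours))

p = np.array([[5, 9, 2], [3, 6, 8], [1, 6, 7]])
print(lbp(p), lbp(p + 40))  # equal values: gray-value shifts do not change the code
```

The second call demonstrates the shift invariance discussed above: adding a constant to every pixel leaves all comparisons against the centre unchanged.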
It is obvious that the main disadvantage of these features is the discontinuity of the LBP operator (the s function), which makes them sensitive to noise; a small disturbance in the image may cause a big deviation in the feature. To overcome this problem, Schael (2004) introduced an operator which extends the step function in Equation 5.3 to a ramp function giving values in the range [0, 1]:
rel(x) = \begin{cases} 1 & \text{if } x < -e \\ \dfrac{e - x}{2e} & \text{if } -e \leq x \leq e \\ 0 & \text{if } e < x, \end{cases}    (5.4)
where e is a threshold parameter. This way, the features are much more robust against noise, but we also sacrifice the full invariance to monotonic grayscale transformations (although the features are still robust to these transformations). If e is set to zero, the rel function reduces to the simple LBP operator s.
Based on the relational operator defined in Equation 5.4, we define a relational function \mathcal{R}(x, y, r_1, r_2, \varphi, n) \mapsto \mathbb{R}^{n}, calculated on a salient point (x, y) of the image I as center. To simplify notation, let the individual output values of the function be given by

R_k = [\mathcal{R}(x, y, r_1, r_2, \varphi, n)]_k, \quad k = 1, \ldots, n

Then,

R_k = rel(I(x_2, y_2) - I(x_1, y_1)),

where

(x_1, y_1) = (x + r_1 \cos(k \cdot 2\pi/n), \; y + r_1 \sin(k \cdot 2\pi/n))

and

(x_2, y_2) = (x + r_2 \cos(k \cdot 2\pi/n + \varphi), \; y + r_2 \sin(k \cdot 2\pi/n + \varphi))
The process is illustrated in Figure 5.2. Bilinear interpolation is used for points not lying exactly on the image grid. Based on different combinations of r_1, r_2 and φ, local information at different scales and orientations can be captured. In this work, we use 3 sets of parameters, (0, 5, 0), (3, 6, π/2) and (2, 3, π), each with n = 12. The 3 subvectors are concatenated to yield a local feature vector of length 36 at each salient point. It is of interest to note that in applications where rotation invariance is desired, each subvector can simply be summed up to yield a rotation invariant descriptor.
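The relational operator of Eq. 5.4 and the circular sampling scheme above can be sketched as follows; the parameter values and the test image are illustrative assumptions:

```python
import numpy as np

def rel(x, e=8.0):
    # ramp operator of Eq. 5.4; approaches hard thresholding as e -> 0
    if x < -e:
        return 1.0
    if x > e:
        return 0.0
    return (e - x) / (2.0 * e)

def bilinear(img, x, y):
    # bilinear interpolation for points off the pixel grid
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def relational_features(img, x, y, r1, r2, phi, n=12):
    # R_k = rel(I(x2, y2) - I(x1, y1)) for k = 1..n, comparing pixels on
    # two circles of radius r1 and r2 around the salient point (x, y)
    feats = []
    for k in range(1, n + 1):
        t = k * 2.0 * np.pi / n
        x1, y1 = x + r1 * np.cos(t), y + r1 * np.sin(t)
        x2, y2 = x + r2 * np.cos(t + phi), y + r2 * np.sin(t + phi)
        feats.append(rel(bilinear(img, x2, y2) - bilinear(img, x1, y1)))
    return np.array(feats)

img = np.random.default_rng(0).uniform(0, 255, size=(32, 32))
v = relational_features(img, 16, 16, 3, 6, np.pi / 2)
print(v.shape)  # (12,)
```

Concatenating three such subvectors for the parameter sets listed above yields the 36-dimensional local descriptor; summing a subvector over k discards the starting orientation and gives the rotation invariant variant.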
The ensemble of local feature vectors extracted from all training images is clustered as explained in the next section. To remain computationally feasible, the process is carried out on 18,000 randomly chosen local feature vectors.
5.3.4 Clustering
Clustering can be understood as a grouping of similar objects. For vectorial data, this process has also been extensively studied in the branch of vector quantization.
[Figure 5.3 schematic: cluster centers c1, c2, c3; salient points labelled with cluster indices; sector-wise cooccurrence matrix.]
Figure 5.3: Schematic diagram depicting how the final features are reached.
A cluster index image is formed using the local feature vectors
around the salient points. For each sector of the ring, a cluster
co-occurrence matrix is formed by considering all pairs of salient
points whose orientation is the same as that of the sector with re-
spect to the center of the semi-circle. Taking the red-coloured point
with cluster index 4 as an example, the two other interest points
would be considered in the co-occurrence matrix.
Such a data-adapted codebook represents the feature space better than a uniformly spaced histogram whose bins have been derived more or less independently of the training data.
There are many possible approaches to clustering; in this work we use one of the most common algorithms, k-means clustering. K-means is an iterative algorithm which minimizes the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances (see Section 4.1.1). The k in k-means stands for the number of desired clusters and is an input to the algorithm. Other decisions to be made include an appropriate distance measure, which we simply take to be Euclidean, and the choice of initial clusters, which we take to be randomly chosen local feature vectors.
The number of clusters is denoted in this work by Nc and must be selected
carefully as the size of the final feature vector increases quadratically with Nc .
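A minimal Lloyd-style k-means in the spirit of the description above; for determinism, this sketch initializes with the first k points rather than random picks, and the toy data are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=50):
    # alternate assignment (nearest center, Euclidean) and update
    # (within-cluster mean) steps
    centers = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):           # guard against empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

X = np.array([[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.1, 10.0]])
centers, labels = kmeans(X, 2)
print(labels)  # -> [0 1 0 1]
```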
All-Invariant Accumulator
The simplest histogram that can be built from a cluster index image is a 1-D accumulator counting the number of times each cluster occurred in a given image. All spatial information regarding the interest points is lost in the process. Augmenting the accumulator with the pairwise distance between salient points yields:

F(c_1, c_2, d) = \#\{ (s_1, s_2) \mid (I(s_1) = c_1) \wedge (I(s_2) = c_2) \wedge (D_d < \|s_1 - s_2\|_2 < D_{d+1}) \}

where the extra dimension d runs from 1 to N_d (the number of distance bins).
The cross-validation results improve to about 68%, but it should be noted that this accumulator is rotation invariant (it depends only on the distance between salient points), while the images are upright. Incorporating unnecessary invariance leads to a loss of discriminative performance. Especially in this task of radiograph classification, mirroring the positions of interest points can lead to a completely different class; for example, a left hand becomes a right hand, and so on. We therefore incorporate orientation information in the next section.
S(c_1, c_2, d, a) = \{ (s_1, s_2) \mid I(s_1) = c_1 \,\wedge\, I(s_2) = c_2 \,\wedge\, D_d < \|s_1 - s_2\|_2 < D_{d+1} \,\wedge\, A_a < \angle(s_1, s_2) < A_{a+1} \}
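Counting the set S(c_1, c_2, d, a) directly gives the sector-wise co-occurrence matrix; the bin boundaries and points below are illustrative assumptions, and each unordered pair is counted once in this sketch (the text notes that covering [0, π) avoids double counting):

```python
import numpy as np

def cooccurrence_features(points, clusters, Nc, dist_bins, angle_bins):
    # M(c1, c2, d, a) = |S(c1, c2, d, a)|: each unordered salient-point
    # pair is counted once, in the distance bin of ||s1 - s2|| and the
    # angle bin of its orientation folded into [0, pi)
    Nd, Na = len(dist_bins) - 1, len(angle_bins) - 1
    M = np.zeros((Nc, Nc, Nd, Na), dtype=int)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx, dy = points[j] - points[i]
            dist = float(np.hypot(dx, dy))
            ang = float(np.arctan2(dy, dx)) % np.pi
            d = int(np.searchsorted(dist_bins, dist, side='right')) - 1
            a = int(np.searchsorted(angle_bins, ang, side='right')) - 1
            if 0 <= d < Nd and 0 <= a < Na:
                M[clusters[i], clusters[j], d, a] += 1
    return M.ravel()  # the flattened 4-D matrix is the feature vector

points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
clusters = [0, 1, 0]
f = cooccurrence_features(points, clusters, Nc=2,
                          dist_bins=[0.0, 5.0, 10.0],
                          angle_bins=[0.0, np.pi / 2, np.pi])
print(int(f.sum()))  # 3 pairs counted
```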
[Figure 5.4 thumbnails: 10246.png (29) d=0.0, 11290.png (29) d=120.2, 15421.png (29) d=131.6, 10251.png (64), 14195.png (64) d=99.9, 10962.png (64) d=102.8.]
Figure 5.4: Top 8 nearest neighbour results for different query radiographs.
The top left image in each box is the query image, results follow
from left to right, then top to bottom. The image caption contains
the image name and label in (brackets). Finally the distance ac-
cording to the L1 norm is displayed underneath.
M(c_1, c_2, d, a) = | S(c_1, c_2, d, a) |

The distance bin boundaries are measured in pixels (in comparison, the larger dimension of the images in the used database was always 512), and the angle bin boundaries are measured in radians. It can be seen that only the [0, π) angle range needs to be covered, as each salient point pair (s_1, s_2) would otherwise be counted twice in the matrix (with cluster bins swapped, and in an angle bin which lies at an angle of π radians from the other). A fuzzy accumulator could also be used to generate the matrix M, but this was not investigated in this work.
5.4 Results
As mentioned in the introduction, we apply these features to the IRMA database in three different usage scenarios: retrieval using the query-by-example (QBE) paradigm, classification into one of the given 116 base classes, and the assignment of tags, namely the technical, directional, anatomical and biological codes.
Figure 5.5: Hierarchy tree for the 116-category IRMA 2006 database
[Table 5.2: rows for This work, Previous Best, and Baseline Results; numeric values not preserved.]
Table 5.2: Results for the IRMA 05 Database. The comparison results are
taken from the ImageCLEF 2005 Benchmark, and from a recent im-
provement we are aware of.
The histogram intersection kernel

k(x, y) = \sum_{i=1}^{n} \min(x_i, y_i)    (5.5)

is positive definite. The proof is based on the idea that if each bin of the histogram is transformed to a binary vector, filled with N_b 1s, where N_b is the value of the histogram bin, followed by N_h 0s, where N_h is the sum of all the bins in the histogram, then the expression min(x_i, y_i) can be conveniently expressed as a scalar product of the corresponding binary vectors, thus satisfying Mercer's condition. Another important issue is that of normalization.
The cooccurrence features described here are not normalized, for the reason that the total size of the histogram can potentially be an important indicator of the class. In order to test this hypothesis, we normalize in different ways and compare the resulting performance. The normalization we propose can be performed directly on the kernel matrix in case it has already been calculated for the test and training data. In this way, repetitive computations can be avoided.
Given a kernel matrix consisting of K_{ij} = \{K\}_{ij}, the following normalization methods are applied:

a)
K_{ij} = \frac{K_{ij}}{K_{ii} + K_{jj}}    (5.6)

which is equivalent to:

K_{ij} = \frac{\sum_l \min(x_l, y_l)}{\sum_l x_l + \sum_l y_l}    (5.7)

and analogously:

b)
K_{ij} = \frac{K_{ij}}{\min(K_{ii}, K_{jj})}    (5.8)
Table 5.4: Part of the results of the medical annotation task in the ImageCLEF
2007 benchmark (lower score is better)
Method Score
Best Results (Tommasi et al., 2008) 26.84
Flat Classification 31.43
Axis-wise Flat Classification 45.48
Binary Classification Tree 47.94
c)
K_{ij} = \frac{K_{ij}}{\sqrt{K_{ii} + K_{jj}}}    (5.9)
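A sketch of the intersection kernel of Eq. 5.5 and the normalizations of Eqs. 5.6 and 5.8, applied directly on a precomputed Gram matrix as described above; the histograms are hypothetical:

```python
import numpy as np

def intersection_kernel(H):
    # Gram matrix of k(x, y) = sum_i min(x_i, y_i) (Eq. 5.5)
    # for rows of H (one unnormalized histogram per image)
    n = len(H)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.minimum(H[i], H[j]).sum()
    return K

def normalize(K, method):
    # all variants need only the diagonal K_ii = sum_l x_l,
    # so no further kernel evaluations are required
    d = np.diag(K)
    if method == 'sum':    # Eq. 5.6 / 5.7
        return K / (d[:, None] + d[None, :])
    if method == 'min':    # Eq. 5.8
        return K / np.minimum(d[:, None], d[None, :])
    raise ValueError(method)

H = np.array([[1.0, 2.0], [2.0, 1.0]])
K = intersection_kernel(H)
print(normalize(K, 'sum'))
```

Because K_ii equals the total histogram mass Σ_l x_l for the intersection kernel, the equivalence of Eqs. 5.6 and 5.7 follows directly.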
Our group participated in the 2006 and 2007 editions of the ImageCLEF benchmark², while for the 2005 edition we performed our experiments at a later date. The 2005 edition used a different version of the IRMA database than the later years. As can be seen in Table 5.2, results better than those previously known in the literature could be achieved.
For the 2007 edition, an error counting scheme as described in Deselaers et al. (2008) was used. Table 5.4 shows a sampling of the overall results. The binary classification tree which was used for some of the results is shown in Figure 5.5. Overall, it can be said that very competitive results could be achieved in this popular international benchmark.
2 http://imageclef.org
Chapter 6
Conclusion
In this thesis, we have attempted a holistic approach to the task of image retrieval. Acknowledging that no single methodology can serve all possible usage scenarios, we have shown the use of the following possibilities.
Query by Example The classic image retrieval scenario. The user provides a
query image and the system provides the best matches as the results.
Due to the extremely limited data that the user provides (just the query
image), many simplifications have to be made. Results can be improved
if these simplifications take into account the prior knowledge available
about the database.
Image Annotation The image database is partially labelled (in the form of
“training images”). The system learns the correlation between the image
features and the keywords attached to the image. This is used to label the
other images present in the database. This can be performed offline, and
thus computational complexity is not as critical an issue as in relevance
feedback. Furthermore, the amount of training data is typically orders of
magnitude higher than that available during relevance feedback. Image
annotation also does not suffer from the zero-page problem, i.e. the prob-
lem of finding a suitable starting image, since the search can be started
by specifying a keyword.
Data Mining Apart from image annotation, we argued that various data mining techniques can be applied to image databases.
• For specialized image collections, CBIR and relevance feedback have the potential to grow, as increased computing power and advances in feature extraction technology would lead to more satisfactory results. This includes specialized medical image collections, which could aid hospital staff in finding cases similar to the one at hand. The online consumer market is another area where CBIR could gain in popularity: users can search for products (clothing, jewellery, etc.) based on image similarity. An important prerequisite, however, is that the images are collected with this goal in mind. If all product images are collected in an inconsistent manner, then CBIR based product search would not be as useful.
• The user interface is a very important part of any interactive system, and content based image retrieval is no exception. In this work we mentioned several useful methods, such as multidimensional scaling and hierarchy construction, which could be used to build an intuitive graphical user interface. This is an area, however, where data collected from physiological experiments on human beings should be used. Questions such as: how should the image results be optimally presented? How many different kinds of feedback can a user provide on an image? can only be answered by experimentation with real subjects.
List of Figures
3.1 2-D image distribution obtained for the shown images by us-
ing simple visual features together with the MDS algorithm. (A
colour version of this figure is available on Page 127) . . . . . . . . . 24
3.2 Application of the Rocchio’s formula for different parameter val-
ues. As is apparent, in its default form this can lead to a magnifi-
cation of the feature vectors. In the figure, the crosses represent
the positive examples, while the circles represent the negative.
q0 is the initial query point, and q the query point after applying
the Rocchio’s Algorithm. . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Feature selection using hill climbing and forward search. Each
node shows the selection mask and the goodness criteria for that
particular subset. In the shown example, the hill climbing algo-
rithm does not reach the optimum node with a goodness of 0.53
(shown in green). . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 In this toy example, feature x1 may be considered more impor-
tant than x2 because of the normal vector . . . . . . . . . . . . . 34
3.5 One-Class SVM applied on a 2-D toy data set. (a) Using a lin-
ear kernel. (b) and (c) Using a gaussian kernel with different ν
and γ values. As can be seen, the one class SVMs are quite flex-
ible based on their parameters. (A colour version of this figure is
available on Page 126) . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 A spiral which is a 1-manifold in 2 − D space. The Geodesic
distance between two points is the length of the arc segment
connecting them, whereas the Euclidean distance is the straight-
line distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.7 First round results for an image from class number 2. The first
row contains the query image. The next two rows show the top
results using the shape features, and the next two rows using
the colour features. The query images for the next round of rel-
evance feedback were automatically selected from this pool . . 44
3.8 Second round results using a two class SVM for relevance feed-
back. The +++ in the title indicates that the image was a positive
image, while - indicates that it was a negative image. . . . . . 45
3.9 Second round results after feature selection using correlation co-
efficients. The best 20 features were used to generate the results.
(A colour version of this figure is available on Page 128) . . . . . . . 46
3.10 First round results for an image from class number 4. The first
row contains the query image. The next two rows show the top
results using the shape features, and the next two rows using
the colour features. The query images for the next round of rel-
evance feedback were automatically selected from this pool . . 47
3.11 Second round results using a two class SVM for relevance feed-
back. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.12 Second round results after feature selection using correlation co-
efficients. The best 20 features were used to generate the results 49
3.13 Feature weights obtained using (a) and (c): a linear SVM and (b)
and (d): correlation coefficients . . . . . . . . . . . . . . . . . . . 50
3.14 Screenshot of our CBIR system after labeling the first set of query
images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.15 Results after the initial query round . . . . . . . . . . . . . . . . 52
3.16 Results after the second query round . . . . . . . . . . . . . . . . 53
3.17 Precision Recall plots with two-class SVM . . . . . . . . . . . . . 54
3.18 Comparison of different retrieval methods . . . . . . . . . . . . 54
5.1 1000 Interest points with the highest saliency value for each of
the two images shown on the left. Although some interest points
are found in the non-discriminative parts of the images (for ex-
ample, the man-made object embedded in the chest or the text
in the top-right corner), local methods are still very robust to
partial matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Calculation of a set of relational features. A feature is formed
by applying the relational function to the gray-value difference
of the pixels lying on specific distance and phase to the salient
point in question (i.e. center of the circles) . . . . . . . . . . . . . 92
5.3 Schematic diagram depicting how the final features are reached.
A cluster index image is formed using the local feature vectors
around the salient points. For each sector of the ring, a clus-
ter co-occurrence matrix is formed by considering all pairs of
salient points whose orientation is the same as that of the sec-
tor with respect to the center of the semi-circle. Taking the red-
coloured point with cluster index 4 as an example, the two other
interest points would be considered in the co-occurrence matrix. 94
5.4 Top 8 nearest neighbour results for different query radiographs.
The top left image in each box is the query image, results fol-
low from left to right, then top to bottom. The image caption
contains the image name and label in (brackets). Finally the dis-
tance according to the L1 norm is displayed underneath. . . . . 97
5.5 Hierarchy tree for the 116-category IRMA 2006 database . . . . 99
A.1 Figure 2.3 (Page 11): An example of the semantic gap problem.
The two images possess very similar colour and texture charac-
teristics, but differ vastly as far as the semantics are concerned. 126
A.2 Figure 2.7 (Page 16): Ranking induced by similarity functions.
The left image shows the spherical ranking due to Euclidean
distance, and the right image shows that due to the Manhattan
distance. The point (3, 3) is used as the query point. . . . . . . . 126
A.3 Figure 3.5 (Page 38): One-Class SVM applied on a 2-D toy data
set. (a) Using a linear kernel. (b) and (c) Using a gaussian kernel
with different ν and γ values. As can be seen, the one class
SVMs are quite flexible based on their parameters. . . . . . . . 126
A.4 Figure 3.1 (Page 24): 2-D image distribution obtained for the
shown images by using simple visual features together with the
MDS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.5 Figure 3.9 (Page 46): Second round results after feature selection
using correlation coefficients. The best 20 features were used to
generate the results . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.6 Figure 4.4 (Page 68): Feature weights for four sample Corel cat-
egories. Each row plots the feature weights and a few sample
images from the category. The order of features in each graph is
as follows 1) 18 colour features: mean, variance and skewness
of the hue (H) layer, followed by that of S, V, HS, SV and VH
layers. 2) 12 texture features: Entropy of H, V and D first level
decomposition, followed by 2nd and 3rd levels 3) 18 edge fea-
tures: Bins starting from 0◦ degrees in anti-clockwise direction
with each bin having a span of 20◦ . . . . . . . . . . . . . . . . . . 129
A.7 Figure 4.8 (Page 82): Sample images from the Corel categories
used for the experiments . . . . . . . . . . . . . . . . . . . . . . . 130
List of Tables
3.1 Shape ground truth used for the experiments, with a total of
429 images in 8 different categories. Sample images are shown
alongside the class name (Colour version of this table is available on
Page 124) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 SVM kernels for Image Retrieval . . . . . . . . . . . . . . . . . . 44
4.1 Sample annotation results from the system. The query images
are taken from the Corel Collection but did not belong to the
training set. It can be observed that while the original category
was sometimes not found, it was due to the fact that the cate-
gories often overlapped, as the top matches do indeed contain
very similar categories, leading to robust annotation results. (A
colour version of this table is available on Page 125) . . . . . . . . . . 70
4.2 Confusion matrix for the 10-category classification task . . . . . 72
5.1 Sample images from the IRMA database, together with their
class number (1-116), and code description for the four inde-
pendent axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Results for the IRMA 05 Database. The comparison results are
taken from the ImageCLEF 2005 Benchmark, and from a recent
improvement we are aware of. . . . . . . . . . . . . . . . . . . . 100
5.3 Effect of kernel normalization on system performance . . . . . . 102
5.4 Part of the results of the medical annotation task in the Image-
CLEF 2007 benchmark (lower score is better) . . . . . . . . . . . 102
A.1 Table 3.1 (Page 40): Shape ground truth used for the experi-
ments, with a total of 429 images in 8 different categories. Sam-
ple images are shown alongside the class name . . . . . . . . . . 124
A.2 Table 4.1 (Page 70): Sample annotation results from the system.
The query images are taken from the Corel Collection but did
not belong to the training set. When the original category was
not found, this was usually because the categories overlap: the
top matches still contain very similar categories, leading to
robust annotation results. . . . . . . . . . . . . . . . . . . . . . 125
Bibliography
T. Cox and M. Cox. Multidimensional Scaling. Chapman and Hall, London, 1994.
ISBN 978-1584880943. 23
R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial
Intelligence, 97(1-2):273–324, 1997. ISSN 0004-3702.
doi:10.1016/S0004-3702(97)00043-X. 65
N. Shadbolt, T. Berners-Lee, and W. Hall. The semantic web revisited. IEEE
Intelligent Systems, 21(3):96–101, 2006. ISSN 1541-1672.
doi:10.1109/MIS.2006.62. 61
S. Tong and E. Chang. Support vector machine active learning for image
retrieval. In MULTIMEDIA '01: Proceedings of the Ninth ACM International
Conference on Multimedia, pages 107–118, New York, NY, USA, 2001. ACM.
ISBN 1-58113-394-4. doi:10.1145/500141.500159. 35
Colour Images
Selected colour images are reproduced in the following pages in order to en-
able grayscale printing of the remaining work.
Table A.1: Table 3.1 (Page 40): Shape ground truth used for the experiments,
with a total of 429 images in 8 different categories. Sample images
are shown alongside the class name
Class      # of members   Sample Images
circle     263
diamond    11
shape-1    67
shape-2    16
shape-3    16
shape-4    24
shape-5    21
triangle   11
125
Table A.2: Table 4.1 (Page 70): Sample annotation results from the system.
The query images are taken from the Corel Collection but did not
belong to the training set. When the original category was not
found, this was usually because the categories overlap: the top
matches still contain very similar categories, leading to robust
annotation results.
Columns: Query Image | Original Category | Top 8 Matches | Final Annotation
Query 1 — original category: africa; extracted terms: wildlife_rare, grass,
animal, architect, dog, rareanimal, shells, dogs, shell, mammals, mammal,
newzealand, NewZealand, 197000, pastoral, pastoral
Query 2 — original category: wl_ocean; extracted terms: plants, green,
foliage, plant, flower, can_park, US_garden, flora, green, foliage, leaf,
flora, texture13, flower2
Query 3 — original category: 189000; extracted terms: tribal, 239000,
thailand, people, cloth, 189000, groups, perenial, guard, face, life,
tribal, indonesia, work
Query 4 — original category: lizard1; extracted terms: microimg, design1,
textures, texture, texture1, skins, texture7, natural, microimage,
texture9, food2
Figure A.1: Figure 2.3 (Page 11): An example of the semantic gap problem.
The two images possess very similar colour and texture charac-
teristics, but differ vastly as far as the semantics are concerned.
Figure A.2: Figure 2.7 (Page 16): Ranking induced by similarity functions.
The left image shows the spherical ranking due to Euclidean dis-
tance, and the right image shows that due to the Manhattan dis-
tance. The point (3, 3) is used as the query point.
Figure A.4: Figure 3.1 (Page 24): 2-D image distribution obtained for the
shown images by using simple visual features together with the
MDS algorithm
User-supplied query images: 530.jpg, 8.jpg. Result images with distances:
138.jpg (0), 139.jpg (0.85596), 98.jpg (1.3248), 450.jpg (1.6025),
869.jpg (1.7168)
Figure A.5: Figure 3.9 (Page 46): Second-round results after feature selection
using correlation coefficients. The best 20 features were used to
generate the results
Figure panels, one per category (churches, space, forests, flora): feature
weight plotted against feature dimension (0–50), with separate curves for
colour, texture, and shape features.
Figure A.6: Figure 4.4 (Page 68): Feature weights for four sample Corel cat-
egories. Each row plots the feature weights and a few sample
images from the category. The order of features in each graph is
as follows: 1) 18 colour features: mean, variance and skewness of
the hue (H) layer, followed by those of the S, V, HS, SV and VH
layers; 2) 12 texture features: entropy of the H, V and D first-level
decompositions, followed by the 2nd and 3rd levels; 3) 18 edge
features: bins starting from 0° in the anti-clockwise direction,
each bin spanning 20°.
Figure A.7: Figure 4.8 (Page 82): Sample images from the Corel categories
used for the experiments
Categories shown: autumn, dogs, wildcats, wl_afric, plants, flora, texture1,
texture2, peoplei, aviation, warplane, car_race, women, car_old1, mountain,
children, ruins, churches, fashion1, roses, flower1, harbors, forests, beach,
fractals, men, people1