
Machine Learning Strategies

for
Content Based Image Retrieval

Dissertation

for the attainment of the doctoral degree

of the Faculty of Applied Sciences (Fakultät für Angewandte Wissenschaften)
at the Albert-Ludwigs-Universität
Freiburg im Breisgau

by

Lokesh Setia, M.Sc.

2008
Dean: Prof. Dr. Bernhard Nebel

Examination committee: Prof. Dr. Bernd Becker (chair)

Prof. Dr. Andreas Podelski (assessor)
Prof. Dr.-Ing. Hans Burkhardt (reviewer)
Prof. Dr. Dr. Lars Schmidt-Thieme (reviewer)

Date of the oral examination: 19 June 2008


Foreword

This thesis stems from my work at the Chair of Pattern Recognition and Image
Processing (LMB) at the University of Freiburg, Germany. This work would
not have been possible without the bountiful love and grace of Lord Ganesha.
Firstly, I would like to thank Prof. Hans Burkhardt for supervising my work
and for providing an environment very conducive to research, and Prof.
Lars Schmidt-Thieme for agreeing to be the co-examiner of the thesis. I would
like to thank my colleagues at LMB for many fruitful discussions and ideas.
Especially, I would like to thank Dr. Alaa Halawani and Alexandra Teynor, with
whom I had the opportunity to work directly. Many thanks to Stefan Teister
and his team for providing competent IT administration behind the scenes. I
am grateful to our group secretary Ms. Findlay for her help and encouragement
on many occasions.

Living in a foreign country can be difficult, especially for an extended
period of time. I consider myself most fortunate to have met a few friends
who have stood by me whenever I needed them. I wish to acknowledge
Herr Benno Schuett, Herr Andreas Hauser, Herr Andreas Schlegel, Prof. Ulrich
Kluge and Prof. Daniel Fischer. They have all been special to me in their own
distinctive ways, and a few words here could not do them justice. I would
also like to thank my Indian friends Amit Punde and Sandeep Gupta for all
the good times.

Last but not least, I would like to thank my parents for always believing
in me and always providing support.

Lokesh Setia

New Delhi
November 2008
Te Deum

Not because of victories
I sing,
having none,
but for the common sunshine,
the breeze,
the largess of the spring.

Not for victory
but for the day's work done
as well as I was able;
not for a seat upon the dais
but at the common table.

- Charles Reznikoff
Zusammenfassung (Summary)

The ever-increasing amount of digital information has created the need for
efficient information retrieval systems. Information that cannot easily be
found again, as the saying goes, might just as well be lost. Because information
comes in the most diverse formats and types, the search mechanisms must
differ correspondingly. In the present work we are concerned with content-based
image retrieval, which facilitates the user's interaction with an image
database by means of automatic analysis of the image content.

We are aware of the ambiguities that exist even in very simple sentences of
natural languages. A sentence such as "Flying planes can be dangerous" can
carry different meanings ("It can be dangerous to fly planes" vs. "planes that
are flying can be dangerous"). The same holds for images. Indeed, the old
proverb "A picture is worth a thousand words" is more fitting here than
anywhere else. This work is based on the view that such ambiguities are
normal, and that the system should not favour one possible meaning over
another from the outset. The earliest image retrieval systems allowed the user
to specify his viewpoint by granting access to internal system parameters.
Specifying these is, if not outright complicated, at least very tedious. A
modern system attempts instead to learn the user's viewpoint and thus to place
less responsibility on the user. This can be realised by means of relevance
feedback, whereby the user progressively gives the system more information in
the hope of better results. One distinguishes between short-term relevance
feedback, in which the collected data is discarded after each session, and
long-term relevance feedback, in which data is collected over several sessions
of one user or even over several users. In this work, however, we concern
ourselves exclusively with short-term relevance feedback, since in our view
the ambiguities present in digital images cannot be modelled otherwise.

In the next part of the work, image retrieval is analysed in depth as an
extension of traditional text-based search systems. To this end, the
possibility of annotating an image database with keywords is considered. One
advantage of this method is that the user does not need a suitable starting
image for the query at hand. An equally important advantage is that, unlike
relevance feedback, the annotation step can be carried out offline. Beyond
that, we show that further data mining operations can be performed on image
databases which help to increase the effectiveness of the image search engine.
Finally, we demonstrate the application of various algorithms on a real-world
medical database, on which very competitive results could be achieved.
Abstract

The ever-increasing amount of digital information has created a need for ef-
fective information retrieval systems. As it is said, information which cannot
be found easily is as good as lost. As information comes in various formats
and types, their retrieval mechanisms also need to differ correspondingly. In
this work, we deal with the task of content based image retrieval, in which
the system facilitates the interaction between a user and an image database by
automatic analysis of the image content.
We know about the ambiguities that can exist even in the simplest of phrases
in a natural language. As an example, a sentence such as “Flying planes can be
dangerous” can either mean “Flying planes are dangerous” or “Flying planes is
dangerous”. Images are no different. In fact, the old saying “A picture is worth
a thousand words” is as true here as it is anywhere. In this work, we take the
view that these ambiguities are natural, and thus that the system should not
favour one viewpoint over another from the very beginning. The earliest systems
for image retrieval allowed the user to specify his or her viewpoint by giving
access to their internal system parameters, which can be complicated or tiring for
the user. A modern system, on the other hand, seeks to learn this viewpoint
while imposing fewer responsibilities on the user. This can be achieved using
relevance feedback, in which the user progressively gives the system more and
more information in return for better results. Relevance feedback can be
short-term, where the data collected is discarded as soon as the session is
over, or long-term, where data can be collected over multiple sessions of one
user, or even over multiple users. In this work, however, we confine ourselves
to short-term relevance feedback, as in our view the ambiguities or the multiple
interpretations present in an image cannot be handled otherwise.
The later part of the thesis delves into image search as an offshoot of
traditional text-based search engines. To this end, we explore the possibility
of annotating an image database using keywords. One advantage of this ap-
proach is that the user does not need to provide a suitable starting image for
the query. An equally important advantage is that the annotation process can
be carried out offline for the whole database, unlike relevance feedback which


must be carried out in real time. Apart from annotation, we show that fur-
ther data mining operations can also be carried out on image databases, which
can contribute to improving the effectiveness of the image search engine. We
conclude with the demonstration of various algorithms on a real-life medical
image database, on which very competitive results could be achieved in recent
international benchmarks.
Contents

1 Introduction
  1.1 Structure of the Document
  1.2 Contributions of this Thesis

2 Fundamentals of Image Retrieval
  2.1 Components of a CBIR System
  2.2 Feature Extraction
    2.2.1 Semantic Gap
  2.3 Performance Measurement
  2.4 Initial Scenario: Single Query Image
    2.4.1 Similarity Measures

3 Relevance Feedback in CBIR (Online Methods)
  3.1 A Mathematical Model of Relevance Feedback
    3.1.1 User Model
    3.1.2 Visualisation
  3.2 Relevance Feedback Algorithms
    3.2.1 Rocchio’s Algorithm
    3.2.2 Relevance Feedback vs. Retraining
    3.2.3 Categorisation of Relevance Feedback Algorithms
  3.3 Feature Weighting and Selection Methods
    3.3.1 Feature Normalisation
    3.3.2 Feature Weighting
    3.3.3 Feature Selection
  3.4 Relevance Feedback using a Two-Class SVM
  3.5 Relevance Feedback using a One-Class SVM
  3.6 Manifold learning
  3.7 Experiments and Results
    3.7.1 Database: Trademark image database
    3.7.2 Database: MPEG-7 Content Set
  3.8 Results with the Trademark Database
  3.9 Results with the MPEG-7 Content Set
    3.9.1 Good kernel functions for invariant feature histograms
    3.9.2 Comparison of two-class and one-class SVM
    3.9.3 Improvements over multiple feedback rounds

4 Offline methods in CBIR
  4.1 Clustering Algorithms
    4.1.1 k-means Clustering
    4.1.2 Hierarchical Clustering
  4.2 Automatic Annotation of Images
    4.2.1 Feature Extraction
    4.2.2 Feature Weighting
    4.2.3 Model Computation
    4.2.4 Experiments and Discussion
    4.2.5 Conclusion and Future Outlook
  4.3 Hierarchy Discovery in Image Databases
    4.3.1 Algorithm for hierarchy tree construction
    4.3.2 Experiments and Results
    4.3.3 Computation Time

5 Radiograph Annotation and Retrieval
  5.1 Overview
  5.2 Database and Objectives
  5.3 Feature Extraction Algorithm
    5.3.1 Related Work
    5.3.2 Interest Point Detection
    5.3.3 Relational Features
    5.3.4 Clustering
    5.3.5 Building the Final Feature Vector
  5.4 Results
    5.4.1 Query-by-Example Results
    5.4.2 Classification Results

6 Conclusion

List of Figures
List of Tables
References
A Colour Images
Chapter 1

Introduction

We live in the digital age, burdened by information overdose. According to a
celebrated prediction made by Gordon Moore in 1965 (Moore, 1965), the density
of transistors on an integrated circuit, which translates roughly to computing
power, would double every two years, a prediction that has held its ground
until now.1 Similar exponential growth rates are also seen in other computing
technologies, e.g. in data storage capacities. Mankind has found quite a few
uses for these technologies, many of which were not possible even a decade
ago. Apart from textual information, we now have an increasing percentage of
information in other media, for example in the form of audio, video and
images, not to mention the raw data created by various industrial and
scientific applications.
Naturally, such a massive amount of data would be close to useless unless
there are efficient ways to access it. It can thus be said that we need
efficient information retrieval systems. In this context, we would like to
differentiate between a document retrieval system and a data mining system.
Both are information processing tools. By document retrieval we mean a system
that returns whole documents, or parts of them, which satisfy a particular
query criterion, while data mining can be understood as extracting information
in a fundamentally new form in order to analyse the data from a particular
point of view (see Figure 1.1). Most applications with large amounts of data
would probably need both.

1 With a purely theoretical interest, we note the analysis in Krauss and Starkman (2004),
according to which this growth rate must stop within 600 years at the latest, due to the
fundamental limits of the universe.


Figure 1.1: Information Retrieval Scenarios

Scope of this thesis

In this work we deal with both document retrieval and data mining scenarios.
As data type, however, we limit ourselves to 2-D digital images. More
specifically, we consider vectorial features extracted from 2-D images,
obtained through a feature extraction process R^{M×N} → R^d. We can state that
the algorithms presented in this thesis are directly usable with other media types, at
least as long as vectorial features can be extracted from them. This includes 3-D
images, speech or music signals, and video. We limit our experiments to images,
however, since the temporal dimension inherent in music and video signals
makes the user labelling process much slower, and thus interactive feedback is
much less effective than in an image search system. The interested reader may
refer to Rho et al. (2007) and Wang et al. (2001) for retrieval issues in
other media types.

1.1 Structure of the Document


This thesis is structured as follows. We begin by describing the standard com-
ponents of an image retrieval system, including common performance mea-
sures for evaluation purposes (Chapter 2). We then argue the need for rele-
vance feedback in an interactive retrieval setting. In Chapter 3 (Online learn-
ing) we define relevance feedback formally, and show the properties which
distinguish relevance feedback methods from a simple retraining with a larger
training set. Since interactive feedback has only limited potential due to the
computational time constraints imposed upon it, we propose in Chapter 4 two
approaches, both of which can be applied offline and are intended to aid the
user during retrieval. This includes annotation of images with preselected key-
words, and hierarchy construction of image categories. In Chapter 5, we employ
the widely used IRMA Radiograph Database to demonstrate all phases of an
image search system. This is a typical real-life application where all proposed
methods, namely image retrieval using a query image, image annotation and

hierarchy generation are useful in guiding the user to the intended images.
We conclude by summarising the main points proposed in the thesis, and by
giving pointers to possible future research work in the area.

1.2 Contributions of this Thesis


The major contributions of this thesis can be summarised as follows:

• Theoretical Goal of Relevance Feedback We define the theoretical
expectations one would have of a perfect relevance feedback system, and
show that all practical relevance feedback algorithms are either
approximations of, or simplifications to, such a system. Despite the
multitude of relevance feedback works existing in the literature, such
groundwork seems to be missing.

• Relevance Feedback Algorithms We propose new relevance feedback
algorithms for use in image retrieval, and compare the results with other
algorithms or heuristics existing in the literature. The experiments are
performed in a controlled environment with predefined features and ground
truth. We show the improvement which relevance feedback makes possible and
analyse the results.

• Automatic Image Annotation Although relevance feedback is a powerful
tool, it has its limitations due to the generally small sample size and
the limit on computation time demanded by an interactive setting. Thus,
we propose enhancements which can be applied offline in an image
retrieval system. An algorithm to automatically label images with
preselected keywords is presented, which allows the user to query in a
traditional text-based fashion. Encouraging results are obtained with the
widely used Corel image collection.

• Visual Taxonomy Generation An algorithm to automatically generate a
visual taxonomy tree for an annotated database is presented. Such a
taxonomy tree can greatly enhance the user experience in an image
retrieval system by grouping similar keywords and allowing the user to
traverse to them easily. Furthermore, we show that the algorithm can be
used to generate binary classification trees for multi-class
classification problems. Our experiments show that much faster
classification is possible with little or no impact on performance, when
compared to the standard one-vs-one or one-vs-rest approaches.
Chapter 2

Fundamentals of Image Retrieval

Let there exist a database D consisting of digital images. A user intends to
access the database, but linear browsing is not practical since such databases
are usually large. So how should the access be performed? This is the core of
the image retrieval problem, and is depicted symbolically in Figure 2.1. Since
there is no unique best solution to this problem that holds for each and every
database and for different retrieval scenarios, we will look at various
possibilities and compare their merits in this thesis.
The initial point of entry in a search engine is a query, which tells the system
about what the user is looking for. Textual search engines, for example, can
accept whole words or phrases as input query. A query in a text-based search
engine may be seen as a tiny document in itself. Indeed, most web search
engines now include a find similar web pages extension, which can be viewed as
a search request with a complete web page as the starting query.
The most natural extension of a textual query model for use in image re-
trieval would be a query-by-image-region model. The image region should con-
tain the object or objects which the user is looking for in an image. However,
there are many practical differences compared to the corresponding textual
query model. Unlike query text, which can be typed in from memory, a query
image region cannot usually be constructed on the fly (an exception may exist
for sketch retrieval, where the rough outline of the desired sketch may suffice).
Another difference is that while words are trivially extracted from text docu-
ments (e.g. using whitespace information), image segmentation into meaning-
ful regions is a very challenging problem whose solution still eludes us, save
for some specific problem settings (Cremers et al. (2006), Falcão et al. (2006)).
Due to this, the most popular query mechanism in image retrieval literature
has been the so-called query-by-example model, in which whole images are com-
pared. Its main advantage is ease of validation; the quality of the results can be

5
6 Fundamentals of Image Retrieval

Figure 2.1: An image retrieval use case at its most abstract level

verified visually by viewing them alongside the query image. In a real system,
however, finding an appropriate initial query image can be difficult. Indeed
what, some may ask, is the use of an image retrieval system if the user already
possesses a similar-looking image? While this criticism does hold for many
image collections, there are scenarios where this is the preferred query model.
An example is clinical use, where a doctor would like to retrieve the case
histories of patients whose radiographs are similar to that of the current
patient.

2.1 Components of a CBIR System


Figure 2.2 shows the building blocks of a typical content-based image retrieval
system, and how they relate to each other. First of all, in an offline opera-
tion, meaningful features are extracted from all images in the database. This
is a critical step and will be discussed in more detail in the next section. For
computational efficiency, the features can be indexed, for example in an R-tree
(Guttman, 1984) or in a K-D-B tree (Robinson, 1981).
The user starts the retrieval process by providing a query image. Other sce-
narios will be considered later in the thesis. The features for the query image
are extracted in exactly the same way as for the database images. The features
of the query image are then compared with the features of the database images
using some kind of a similarity measure. The database images which possess the
highest similarity measure value are returned as the result images.
With the help of relevance feedback, the user has the option to refine the re-
sults. The user is simply asked to give feedback on which result images are
relevant to his or her query and which are not. The system is then expected to
learn the distribution of the target images, i.e. the images the user is looking
for. The reader may refer to Halawani et al. (2006), which contains a survey of
the existing CBIR literature.
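
To make this pipeline concrete, here is a minimal Python sketch of the
similarity-comparison step from Figure 2.2, assuming the features have already
been extracted into arrays; the function and array names are illustrative
only, not part of any specific system:

import numpy as np

def retrieve(query_feature, db_features, k=10):
    # Rank all database images by Euclidean distance to the query feature
    # vector and return the indices of the k most similar ones.
    dists = np.linalg.norm(db_features - query_feature, axis=1)
    return np.argsort(dists)[:k]  # smallest distance = highest similarity

# Hypothetical usage: 1000 database images with 64-dimensional features.
db_features = np.random.rand(1000, 64)
print(retrieve(db_features[0], db_features, k=5))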

Figure 2.2: Block diagram of a content-based image retrieval system

2.2 Feature Extraction


Digital images possess a very high dimensionality. For example, an image of
size 1024 × 1024 pixels has more than a million dimensions when interpreted
as a vector. However, there is a huge amount of redundancy present in images.
Furthermore, it is often desirable to remove information content which is not
essential, or even counter-productive, for the particular retrieval task at
hand. This step of extracting meaningful information is termed feature
extraction. The most critical properties to be analysed for any kind of
feature are its information content and its invariance to certain
transformations of the data. These properties are discussed next:

a) Invariance

Given a transformation group G with an element g acting on the data S, a
feature F(S) is said to be invariant with respect to G (Burkhardt, 1979), iff

S' = gS \;\Rightarrow\; F(S') = F(S), \quad \forall g \in G \qquad (2.1)

This is the so-called necessary condition for invariance. For digital images,
the transformation groups required in an application are typically one or more

of the following:

- the group of translations

- the group of rotations

- the group of Euclidean motion

- the group of similarity transformations

- the group of affine transformations

- the group of monotonic intensity transformations

In the above list, all except the monotonic intensity transformation are what
are known as geometric transformations. Only the location of the points is
transformed, while the value remains unchanged. For example, given an initial
source point (p_x, p_y)^T, we have the destination point coordinates

\begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix} \qquad (2.2)

for the case of translation, and

\begin{pmatrix} p'_x \\ p'_y \end{pmatrix} = \begin{pmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{pmatrix} \begin{pmatrix} p_x \\ p_y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, \qquad \begin{vmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{vmatrix} \neq 0 \qquad (2.3)

for the general case of an affine transformation.


The number of independent tunable parameters in a transformation is known
as its degrees of freedom: for example, two and six for the case of translation
and affine transformations, respectively. In general, the variability within a
class grows exponentially with the degrees of freedom of the transformations
the class is allowed to undergo.
There are in principle three ways one can obtain invariant features (Burkhardt
and Siggelkow, 2001):

Normalization In this method one tries to find distinctive elements of a class
and to normalize the other elements with respect to them. For example, an image
of an object could be translated to have its center of gravity at a particular
point, or rotated to have its major axis aligned with the horizontal axis of
the image.

Differentiation The elements gS form an orbit in the feature space and can be
controlled using a parameter vector λ (whose dimensionality f is equal
to the degrees of freedom of G). An invariant feature should take a constant
value along the orbit; thus one tries to find features satisfying the
differential equation

\frac{\partial g_{\lambda} S}{\partial \lambda_i} = 0, \qquad \forall i = 1 \ldots f \qquad (2.4)

Integration As early as 1897, Hurwitz demonstrated the use of Haar integrals
for generating invariant features (Hurwitz, 1897). One integrates over the
group elements, which are first transformed using an (often non-linear) kernel
function f:

F(S) = \frac{1}{|G|} \int_{G} f(gS) \, dg \qquad (2.5)
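
As an illustration of the integration approach, the following sketch
approximates Equation 2.5 for the discrete group of cyclic image translations;
the particular kernel, which combines two pixel values at a fixed offset, is an
assumption made here purely for illustration:

import numpy as np

def haar_invariant(image, f):
    # Discrete version of Eq. (2.5) for the cyclic translation group:
    # average the kernel f over all translated versions of the image.
    M, N = image.shape
    total = 0.0
    for dy in range(M):
        for dx in range(N):
            total += f(np.roll(image, (dy, dx), axis=(0, 1)))
    return total / (M * N)

# Illustrative non-linear kernel: geometric mean of two pixels at a fixed offset.
kernel = lambda img: np.sqrt(img[0, 0] * img[0, 3])
image = np.random.rand(16, 16)
# The value is (up to numerics) unchanged if the image is cyclically shifted.
print(haar_invariant(image, kernel))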

b) Information content

Invariance towards the desired transformations alone does not guarantee good
performance for a classification or retrieval task. It is also important that
the features be discriminative for the classes at hand. As an extreme example,
consider the feature given by the average gray value of the image, i.e.
f = \frac{1}{MN} \sum_i \sum_j S(i, j). This feature is invariant against
Euclidean motion of objects present in the image, but unfortunately, it is
invariant against many more transformations. For example, one may permute the
pixel positions in an arbitrary manner without affecting the feature f, which
may be undesirable behaviour. Mathematically, one can express this property in
terms of what is known as the sufficient condition for invariance (Burkhardt, 1979):

F(S') = F(S) \;\Rightarrow\; S' = gS, \quad g \in G \qquad (2.6)

This means that if two patterns are mapped to the same point in the feature
space, then they must belong to the same equivalence class. If both the
necessary condition (Equation 2.1) and the sufficient condition are satisfied,
we say that the invariant mapping is complete.

Non-Vectorial Data

The scalar features extracted using the methods described in the previous
section must be combined in some way. The simplest way to group them is to
create a vector, with the scalar features as its elements. This approach has
the advantage that even unrelated features can be grouped together, as no
special property of the features is assumed. The algorithms presented in this
thesis are in general applicable to such vectorial features, as is indeed most
work in the pattern recognition literature. However, other data representation
formats such as strings, trees and graphs are gaining popularity, especially
in domains such as text mining and bioinformatics. Even a segmented image
could be thought of as an unordered set of its parts. Interactive retrieval is
of interest in many bioinformatics applications, such as protein retrieval,
where a vector-based feature approach might not be the best choice. As long as
some kind of similarity function can be defined between a pair of entities,
the kernel-based methods given in this thesis can be used right away, with
some guarantee of optimality even if the similarity function does not define a
positive semi-definite kernel (Haasdonk, 2005). Naturally, special feedback
algorithms optimised for other data types are also possible and might yield
better results.

Non-Standard Similarity Measures

In some cases, it might be practicable to skip the feature extraction step, and
instead build the complexity into the matching algorithm. For example, in the
case of medical images, an algorithm might perform image registration and
consider the quality of the registration as a similarity measure, or judge the
similarity through the amount of deformation that must be carried out on one
of the images (e.g. Keysers et al. (2007)). Again, the kernelized methods in this
work can be adapted to the special matching processes, while the possibility
for special feedback algorithms also exists here.

2.2.1 Semantic Gap


Because current feature extraction methods are not powerful enough to capture
all the subtle nuances present in natural images, we face what is known as the
problem of the semantic gap. The following definition was provided in
Smeulders et al. (2000), which is arguably the most influential review paper
in the field of content-based image retrieval:

“The semantic gap is the lack of coincidence between the information
that one can extract from the visual data and the interpretation that the
same data have for a user in a given situation.”

However, different researchers have adopted somewhat differing
interpretations, which leads to some confusion.

Figure 2.3: An example of the semantic gap problem. The two images possess
very similar colour and texture characteristics, but differ vastly as
far as the semantics are concerned. (A colour version of this figure is
available on Page 126)

In particular, some authors chose to use the word semantic, as in semantic
labelling or semantic classification, when clearly the terms visual labelling
and visual classification would have been closer to the intended meaning. The
primary difference lies in the variability. An image is a candidate for
semantic labelling by a keyword if a human being could associate the keyword
with the image either directly or indirectly, that is to say, on the basis of
semantics. On the other hand, visual labelling merely postulates that the
images possess a visual (low-level) content similarity. Figures 2.4 and 2.5
give examples of a possible label cars in its visual and semantic connotations.
An example of the semantic gap problem is shown in Figure 2.3.
The semantic gap exists because the features or information that we can
currently extract from the pixel content of digital images are not rich or
expressive enough to capture higher-level information, which human beings, on
the other hand, extract readily. As an example, consider a face detection
engine, which can detect, with a high enough precision (say 90 %), human
faces in a digital image. The high-level information missing for a particular
application could be, for example, the general mood present in the image. We
wish to understand the hurdles in learning such complex high-level concepts
from image data. There can be two fundamental approaches.

Figure 2.4: Sample images from a hypothetical visual class cars



One is to train a concept detector for every high-level concept, such as mood,
directly from image data. This approach, however, seems unlikely to work, as
the number of possible high-level concepts can be very high, with possibly
much higher intra-concept variability compared to low-level tasks. The other
approach is to have simple concept detectors and then use external rules to
learn high-level concepts. These rules could combine certain observations, and
could also be learned automatically to a certain extent. Let us give an
example. Assume that we can predict with medium precision (say 70 %) the
following information from the content of an image: a) the nationality of the
people present, b) an estimate of their location, and c) the condition of the
people's clothes. Then rules such as the following might lead us to the
desired result:

∀x,
  ((nationality(x) = 'english')
  ∧ (location(x) = 'green_grounds')
  ∧ (clothes(x) = 'muddy'))
  ⇒ playing_rugby(x)    (R1)

and

∀x,
  playing_rugby(x) ⇒ mood(x) = 'playful'    (R2)

Although probabilistic logic would be more natural, the above example uses
first-order predicate logic for simplicity.

Figure 2.5: Sample images from a hypothetical semantic class cars



The rule R2 depends only on the rule R1; the rule R1, however, depends on
three sub-rules. Assuming that the detectors for the sub-rules are
statistically independent, the output of R1 is correct with a probability of
only (0.7)^3 = 0.343. Thus, we observe that the output of a cascaded detector
degrades much faster than that of its individual units. This, together with
the higher number of required training images, is the reason why semantic
labelling still eludes us.

2.3 Performance Measurement


A timeless question in the field of content-based image retrieval has been how
to quantify the performance of a system. In our opinion, the real issue is not
finding good performance measures, but rather finding reliable ground truth
for the scenario under consideration. Ground truth here refers to the correct
classification, labelling, or retrieval result, which a system can achieve
only in the best possible case1.
Creating the ground truth for image retrieval or classification is in general
a challenging task. One must state clearly and unambiguously the criteria
that were used in creating the ground truth. For example, in a collection of
images of different objects, each object might refer to a separate class. The
separation may be achieved at a visual, or at a more semantic level. For the
query-by-example paradigm used in content-based image retrieval (CBIR),
one often creates so-called relevance lists. Each list contains, for a certain
query image, all images in the database which would be perceived as being
similar. Note that a list indicates only boolean information about each image
(relevant or non-relevant); attempts to extend this to a multi-valued discrete
variable are hampered by our own cognitive limits. Since similarity is
subjective, a common practice is to have multiple lists per query image, each
based on the perceptions of a different individual, and to average out the
results to reach a common ground truth. The situation becomes all the more
tricky when one has to evaluate relevance feedback results. This is because
the whole rationale behind relevance feedback is that an image can have
multiple interpretations which the machine is supposed to learn. A static list
runs contrary to this rationale.
In this thesis, we assume multiple interpretations for a query image (but just
one at a time, of course), and analyse how well an algorithm is able to learn
a particular interpretation. As an example, a query using an image
1 The term ground truth has its origins in cartography, satellite imagery and other remote
sensing techniques, where the truth literally lies on the ground.

which consists of a white rose in a garden may be interpreted as (considering
access to simple visual features only):

a) Return only images of white roses.

b) Return all roses, irrespective of their color.

c) Return all white flowers, of any kind.

and so on...

The most common performance measures used in the literature are precision
and recall (Feng et al., 2003). Assume that the user has seen k result images,
out of which k_R are good results (i.e. relevant) and the remaining k_{NR} are
non-relevant. Further, let the total number of images in the database be N,
out of which N_R are relevant for the current query, and N_{NR} are not. Then
the measures are defined as follows:

\mathrm{Recall}_k = \frac{k_R}{N_R} \qquad (2.7)

The recall is thus the ratio of retrieved relevant images to the total number
of relevant images. By itself, it is not sufficient to measure system
performance, as one could increase k arbitrarily, which would push the recall
to 1 in the asymptotic case. Thus, one further defines

\mathrm{Precision}_k = \frac{k_R}{k} \qquad (2.8)

which measures the precision after k images have been retrieved. An
alternative way to picture this is in set notation, as illustrated in
Figure 2.6. Let the set of retrieved images be denoted by R and the set of
ground-truth target images by T; then precision = |R ∩ T| / |R| and
recall = |R ∩ T| / |T|.
The Precision and Recall values can be plotted against each other for dif-
ferent values of k, the result being known as a precision-recall graph, which is
well understood within the image retrieval community.
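
A direct computation of both measures for a ranked result list might look as
follows; this is a small sketch with illustrative names, assuming the
ground-truth relevance list is available:

def precision_recall_at_k(ranked_ids, relevant_ids, k):
    # k_R: the number of relevant images among the first k results.
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    precision = hits / k               # Eq. (2.8)
    recall = hits / len(relevant_ids)  # Eq. (2.7)
    return precision, recall

# Hypothetical usage: (precision, recall) pairs for a precision-recall graph.
ranked = [4, 7, 1, 9, 3, 2]
relevant = {1, 2, 5}
print([precision_recall_at_k(ranked, relevant, k) for k in (1, 3, 6)])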

2.4 Initial Scenario: Single Query Image


In the query-by-example paradigm, the user begins the image search process by
providing an initial query image. In the absence of any further information, the
best a CBIR system can do is to assume that the target images are distributed
according to some symmetric function around the location of the query image
in the feature space, one which decreases monotonically with increasing
distance from the query image. Such functions are known as similarity measures.

Figure 2.6: Interpretation of Precision and Recall in set notation

2.4.1 Similarity Measures


We use both of the following terms in this thesis: similarity measure and
dissimilarity measure. They carry equivalent information in the sense that the
ranking induced by a similarity measure is simply the reverse of that induced
by the corresponding dissimilarity measure. A dissimilarity measure on a set X
is a function d : X × X → R. The measure is said to be a valid distance
metric if the following conditions are satisfied:

1. Non-negativity: d(x, y) ≥ 0, with equality holding only in the case x = y.

2. Symmetry: d(x, y) = d(y, x).

3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).

In general it is desirable to use measures which are valid distance metrics;
however, this does not automatically mean that a non-metric function would
perform poorly for a particular task. We give here a few common similarity
measures which can be used in image retrieval. In the following, the elements
of a feature vector are represented by the notation x = (x_1, x_2, . . . , x_n),
n being the dimensionality of the feature space.

Minkowski distance The general Minkowski distance of norm p (also referred
to as the distance induced by the L_p norm) is given by

d_{L_p}(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (2.9)

which leads to the following special cases:

Euclidean distance is the distance measure induced by the L_2 norm:

d_E(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2} \qquad (2.10)

Manhattan distance Also known as the city-block distance, it is induced by
the L_1 norm:

d_M(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i| \qquad (2.11)

The ranking induced by the Euclidean and the Manhattan distances is
illustrated in Figure 2.7.

Figure 2.7: Ranking induced by similarity functions. The left image shows the
spherical ranking due to the Euclidean distance, and the right image
shows that due to the Manhattan distance. The point (3, 3) is used
as the query point. (A colour version of this figure is available on Page
126)
It was shown in Swain and Ballard (1991) that the Manhattan distance is
closely related to the histogram intersection similarity measure:

\mathrm{HI}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} \min(x_i, y_i) \qquad (2.12)

In particular, the relation HI(x, y) = 1 − d_M(x, y)/2 holds in case the
vectors describe valid probability distributions (i.e., they sum up to one and
have no negative elements). The special case of feature vectors containing
some kind of histogram is especially interesting in the field of image
retrieval, as many kinds of features of all main types (say, colour, texture
and shape) come in the form of histograms. A popular measure between two
probability distributions, or their histogram-based approximations, is the
Kullback-Leibler divergence, which measures how compactly one distribution
can be coded using the other (Feng et al., 2003):

\mathrm{KLD}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i} \qquad (2.13)

Its symmetrical version is known as the Jeffrey divergence, and is defined as:

\mathrm{JD}(\mathbf{x}, \mathbf{y}) = \mathrm{KLD}\left(\mathbf{x}, \frac{\mathbf{x}+\mathbf{y}}{2}\right) + \mathrm{KLD}\left(\mathbf{y}, \frac{\mathbf{x}+\mathbf{y}}{2}\right) \qquad (2.14)
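
These measures translate directly into code. The sketch below implements
Equations 2.9 and 2.12 to 2.14; the small epsilon in the KLD is an
implementation guard against log(0) and division by zero, not part of the
definition:

import numpy as np

def minkowski(x, y, p=2):
    # Eq. (2.9); p=2 gives the Euclidean, p=1 the Manhattan distance.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def histogram_intersection(x, y):
    # Eq. (2.12); for normalised histograms HI = 1 - d_M / 2.
    return np.sum(np.minimum(x, y))

def kld(x, y, eps=1e-12):
    # Eq. (2.13), Kullback-Leibler divergence (not symmetric).
    return np.sum(x * np.log((x + eps) / (y + eps)))

def jeffrey(x, y):
    # Eq. (2.14), symmetrised KLD against the midpoint distribution.
    m = (x + y) / 2.0
    return kld(x, m) + kld(y, m)

# Hypothetical usage with two normalised 4-bin histograms.
h1 = np.array([0.4, 0.3, 0.2, 0.1])
h2 = np.array([0.1, 0.2, 0.3, 0.4])
print(minkowski(h1, h2, p=1), histogram_intersection(h1, h2), jeffrey(h1, h2))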
Chapter 3

Relevance Feedback in CBIR


(Online Methods)

In the last chapter we discussed the various components of an image retrieval
system. One of the most critical modules is that of feature extraction, which
we discussed in detail in Section 2.2. Each feature element has, in general,
its own view of the data. Thus, it is not far-fetched to assume that a
particular user viewpoint for a query image can be described using only a
subset of the features, or a subspace of the feature space1. The earliest
image retrieval systems, such as the QBIC system developed by IBM (Flickner
et al., 1995), allowed the user to specify weights for different feature types
(e.g. colour, texture etc.). This approach, however, is cumbersome for the end
user. If the features are more complex, this technique becomes unusable for
anyone except perhaps those familiar with the feature extraction algorithm.

In this work, we intend to learn the user viewpoint by asking for feedback
about the initial result images. This is an iterative process and can be
repeated as often as required, though the gains might become negligible at
some point. A survey of relevance feedback algorithms can be found in Zhou and
Huang (2003) and Crucianu et al. (2004). Before starting with algorithms for
relevance feedback, we will provide a mathematical model for what we are
trying to optimise.

3.1 A Mathematical Model of Relevance Feedback


We employ user feedback in order to improve the image retrieval results. It
is imperative to define this unambiguously in a mathematical framework, be-
1 the vector space spanned by the feature elements


fore solutions can be proposed. We define relevance feedback as optimisation
in a Bayesian framework. The simplest case occurs when content-based image
retrieval is understood as an exact two-class classification problem. Each
image in the database is either relevant in the context of the present query,
or non-relevant. The goal of the system is to learn this distinction in image
space. However, due to the high input dimensionality of the images, this is
rarely feasible. The system must work in a much lower-dimensional feature
space, where ambiguities unavoidably arise, as feature extraction is almost
always a many-to-one transformation. If the goal were to classify all images
in the database as either relevant or non-relevant with minimum overall error,
then the naive Bayes classifier would provide the optimum decision at each
feature point (one that minimises the expected error). However, minimising
this error is not the goal of a CBIR system, as the user would probably never
see all the images.
It is of utmost importance to have a mathematical model of relevance feed-
back (RF) before investigating any possible RF methods. We assume here a
database (e.g. of images) with N document objects. Each document object is
represented by a feature vector x in the feature space F. The vectors are dis-
tributed in the feature space according to a probability density function p(x)
in F. We define P(rel | x) as the probability that the document with the feature
vector x is relevant to the user in the current retrieval session. If the documents
are images with pixel values as the features, then it can be argued that P(rel | x)
is either exactly 1 or exactly 0 (in other words, the user finds a particular image
either relevant or not relevant; there is no uncertainty). However, ambigui-
ties could have arisen in the feature extraction stage, as the different images
leading to the same feature vector may not all belong to the same class (rel-
evant/not relevant). Furthermore, the user is assumed to be willing to view
a maximum of n result documents, i.e. a fraction ν = n/N of the complete
database.
The aim of the retrieval system can be stated as follows: learn a region
Ω_0 ⊂ F in the feature space, such that

\Omega_0 = \arg\max_{\Omega} \int_{\Omega} P(\mathrm{rel} \mid \mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x} \qquad (3.1)

subject to the constraint

\int_{\Omega} p(\mathbf{x}) \, d\mathbf{x} = \nu \qquad (3.2)

In words, the optimum region Ω0 is the one which maximises the expected
number of relevant documents, out of all possible regions (continuous or dis-
continuous) which contain a total of exactly n documents. This definition,
however, does not take into account the fact that the ranking of the documents
within the top-n retrieved documents can also be important to the user. We
could say that the above definition optimises with respect to the precision at n
performance measure (see Section 2.3).
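
For a finite database the integral in Equation 3.1 becomes a sum over images,
and the optimisation reduces to picking the n images with the highest
estimated relevance probability. A minimal sketch, assuming some estimate
p_rel is available:

import numpy as np

def top_n_region(p_rel, n):
    # Discrete analogue of Eq. (3.1): with one feature vector per database
    # image, the optimal region is simply the set of n images with the
    # highest estimated P(rel | x); p_rel is an array of length N.
    return np.argsort(-p_rel)[:n]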
Incorporating other performance measures would require finding an opti-
mum traversal path to rank documents in the feature space. As an example,
if the function to be optimized is the area under the precision-recall graph, then
the optimum ranking is in the direction of decreasing values of P(rel | x), as it
ensures an incrementally highest precision value along the p-r curve. Not all
performance measures can be so compactly written in an analytic form.
Thus, theoretically a document retrieval system only needs to find Ω0 in or-
der to give the best possible results. However, in practice, with limited labelled
training data, this is not feasible and only approximations can be reached. This
is mainly because in Equation 3.1, P(rel | x) remains largely unknown. The
other term p(x) is usually not a problem as this term depends only on the dis-
tribution of the database images (whether labelled or unlabelled), for which a
much larger number of samples is available.
Relevance feedback algorithms can be regarded as approximations to the
original goal of learning P(rel | x). To this end, various assumptions and sim-
plifications are made, and it is the aim of this chapter to gain an insight into
their consequences.
The most common assumption that all relevance feedback algorithms use
to some extent is the smoothness assumption:

\mathbf{x}_1 \approx \mathbf{x}_2 \;\Longrightarrow\; P(\mathrm{rel} \mid \mathbf{x}_1) \approx P(\mathrm{rel} \mid \mathbf{x}_2) \qquad (3.3)

i.e. if two vectors x1 and x2 are close enough, then their expected labels
would be similar with high confidence.
The most common simplifications concern the form of the term P(rel | x).
Instead of trying to determine the term precisely at every data point x, which
cannot be done anyway, one limits its complexity in a predefined way. This
is in line with Occam's razor2, and also follows our experience with machine
learning algorithms in general (e.g. the tradeoff between generalisation
performance and overfitting in SVMs, where the classifier boundary with the
lowest VC dimension, in other words the boundary with the least complexity,
is the one to be preferred (Vapnik, 1995)). As an initial estimate, for
example, an image retrieval system might constrain the distribution to be
spherical around the starting query image (Figure 2.7).

2 entia non sunt multiplicanda praeter necessitatem, or entities should not be multiplied beyond

necessity.

3.1.1 User Model


By defining a user model, we wish to clarify beforehand what the user is
actually looking for. Most researchers assume that the user is looking for a
category, or more precisely an equivalence class, of which the query image is
a member. Other models, however, are also plausible. Cox et al. (2000), for
example, assume that the user is looking for a particular target item, and
that the user's labelling of an image as relevant or non-relevant merely acts
as an indicator of its proximity to the desired target image. The question
“Which user model is most suited for content-based image retrieval?” cannot
be answered in general. Different databases, different users, and even
different queries for the same database and user may not fit easily into just
one of the user models stated above.
Theoretically, the mathematical model of Section 3.1 is general enough to
accommodate all possible user models, provided the function P(rel | x) is
known completely. For example, in the case of a category search user model, if
the category distributions are completely disjoint, the function would take a
value of exactly 1 wherever x is in the same category as x_q, x_q being the
query feature vector, and 0 otherwise. In the case of a strict target image
search, the function would be non-zero exactly at x = x_t, x_t being the
target feature vector being searched for.

On a practical level, when the function P(rel | x) cannot be estimated to
sufficient accuracy due to limited data, the question of the user model
becomes more important. This is because the supposition of a user model
automatically entails a simplification of the mathematical model, which, if
grossly incorrect, would lead to poor results.

3.1.2 Visualisation
An important decision to make in any practical CBIR system is how to display
the results to the user. The most straightforward option is to show a linear
list of images with decreasing similarity values. In case the system wishes to
query the user about the relevance status of certain images other than the
result images, this can be done using a second linear list in parallel with
the first one. However, the linear display might not be the most efficient. An
alternative is to lower the dimensionality of the image features, and display
the images spatially in a 2-D grid (or possibly in 3-D with appropriate
navigation possibilities). The advantage of this arrangement is that it
conveys information not just about the n similarity values between the query
image and the result images, but rather about all \binom{n+1}{2} pairwise
similarities between the images. The dimensionality reduc-

tion can be easily performed using Principal Components Analysis (Duda et al.,
2000). However, we sometimes have the case where non-Euclidean metrics are
more appropriate, or even where the original data points x_i are not available
at all, only their inter-dissimilarities δ_ij = dis(x_i, x_j) according to
some (dis-)similarity function. Principal components analysis cannot be used
in this case. A low-dimensional embedding can, however, still be achieved
using multi-dimensional scaling.

Multi-Dimensional Scaling

The multi-dimensional scaling (MDS) technique finds, for a given distance
matrix {∆}_ij = δ_ij, i, j = 1, . . . , n, a configuration of points {p_i} in a
low-dimensional Euclidean space R^d, according to the minimisation criterion

\{\mathbf{p}_i\} = \arg\min_{\{\mathbf{p}_i\}} \left[ \frac{\sum_{i,j} \left( f(\delta_{ij}) - \|\mathbf{p}_i - \mathbf{p}_j\| \right)^2}{\sum_{i,j} \|\mathbf{p}_i - \mathbf{p}_j\|^2} \right]^{1/2}

where f is a monotonic, metric-preserving function. This formulation is
due to Kruskal (Kruskal, 1964), where the above expression to be minimised is
termed STRESS. The minimisation process can be carried out iteratively3.
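
In practice the embedding can be computed with an off-the-shelf MDS
implementation. The sketch below assumes scikit-learn is available; note that
its stress normalisation differs slightly from Kruskal's formulation:

import numpy as np
from sklearn.manifold import MDS  # assumes scikit-learn is installed

# delta: a symmetric matrix of pairwise dissimilarities from any measure.
rng = np.random.default_rng(0)
delta = rng.random((8, 8))
delta = (delta + delta.T) / 2.0
np.fill_diagonal(delta, 0.0)

# 'precomputed' means we supply dissimilarities instead of raw feature vectors.
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
positions = mds.fit_transform(delta)  # one 2-D display point per image
print(positions.shape)  # (8, 2)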
Since only the relative dissimilarity values between the x_i are available, it
is evident that the STRESS value remains unchanged under similarity
transformations and/or reflections. Thus, if more than one MDS output is shown
to the user, and there are common images between them, the placement of the
images could theoretically differ widely and thus confuse the user. In such a
case, Procrustes analysis (Cox and Cox, 1994) can be used to align the outputs
to the best possible orientation and scale, as is attempted in Rubner and
Tomasi (2001). An example of an MDS display is shown in Figure 3.1.

3.2 Relevance Feedback Algorithms

3.2.1 Rocchio’s Algorithm


The field of relevance feedback is as old as the field of information retrieval,
which started with the application of text document retrieval. The earliest
works in relevance feedback involved the technique which is now known as
query modification using the vector space model. In the early 1970s, Rocchio pro-
posed the following query update formula (Rocchio, 1971):
3 We acknowledge Dr. Mark Strickert for providing the original implementation

Figure 3.1: 2-D image distribution obtained for the shown images by using
simple visual features together with the MDS algorithm. (A colour
version of this figure is available on Page 127)

Q = Q_0 + \beta \sum_{k=1}^{n_r} \frac{R_k}{n_r} - \gamma \sum_{k=1}^{n_s} \frac{S_k}{n_s} \qquad (3.4)

where
Q is the new query vector,
Q_0 is the initial query vector,
R_k is the vector of the k-th relevant document,
S_k is the vector of the k-th non-relevant document, and
β and γ are adjustable parameters which control the contributions of the
relevant and non-relevant documents, respectively.
The formula can be seen in action in Figure 3.2. The net effect of Rocchio's
algorithm is to add a displacement vector to the original query, so that the
resulting vector is in general further away from the non-relevant documents
and nearer to the relevant documents. However, an undesired effect is that the
magnitude of the new query vector can be drastically different for some values
of β and γ. To control this, the resulting vectors were always renormalised.
This is fine for document retrieval, where at the time the feature vector
consisted of the occurrence counts of various keywords, and a normalised term
frequency makes sense. For other types of feature vectors, as would be the
case in image retrieval, this renormalisation step can have undesired
consequences. Another aspect to be noted is that the term frequency vector
consists of only positive values, but through the update formula negative
values could arise. The formula was used in such a way that these negative
values were set to zero, so that the non-relevant documents could only affect
the frequency of the terms which actually appear in the relevant documents.
This modification may not be adaptable to other kinds of feature vectors, such
as those in image retrieval.
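
For vectorial image features, Equation 3.4 can be implemented directly. The
sketch below omits the renormalisation and negative-value clipping discussed
above, since those steps are specific to term-frequency vectors; the parameter
defaults are illustrative:

import numpy as np

def rocchio_update(q0, relevant, non_relevant, beta=0.5, gamma=0.2):
    # Eq. (3.4): move the query towards the mean of the relevant examples
    # and away from the mean of the non-relevant ones.
    q = q0.astype(float).copy()
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q -= gamma * np.mean(non_relevant, axis=0)
    return q

# Hypothetical usage in 2-D, as in Figure 3.2.
q0 = np.array([2.0, 3.5])
rel = np.array([[3.0, 4.0], [3.5, 4.5]])
nonrel = np.array([[1.0, 1.0]])
print(rocchio_update(q0, rel, nonrel))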
We assume a vectorial representation of the features extracted from the image
content, i.e. x = f(I), where x ∈ R^n, I ∈ R^{M×N}. This representation is
mathematically most convenient, although other schemes may be more naturally
suitable, especially for a parts-based representation of image content.
Given an initial query vector x_q, the aim of the CBIR system is to retrieve
the images which are most likely to be desired by the user. We define the set
of these desired images as the target images. The aim of the CBIR system can
then be restated as determining the true distribution of the target images in
the input feature space. This is a non-trivial problem. Firstly, there is no
certainty that such a distribution exists, as would be the case, e.g., if the
extracted features are inadequate for describing the user's intended search.
Secondly, the system has access to very limited information, provided by the
user in the form of labelled samples.


Figure 3.2: Application of Rocchio's formula for different parameter values
(left: β = 0.50, γ = 0.50; right: β = 0.50, γ = 0.20). As is
apparent, in its default form the update can lead to a
magnification of the feature vectors. In the figure, the crosses
represent the positive examples, while the circles represent the
negative ones. q0 is the initial query point, and q the query
point after applying Rocchio's algorithm.

In the absence of further information apart from the query vector x_q, it
seems intuitive to make the target distribution symmetric around x_q, and
decreasing as one moves away from x_q (following the smoothness assumption,
Section 3.1). This is, in fact, what we do when the image search process is
started with the help of a query image.

3.2.2 Relevance Feedback vs. Retraining


An important question that arises in the context of relevance feedback is: can we really make use of the fact that the data in relevance feedback arrives incrementally and is not available from the very start? Do we really need a separate discipline of online learning? The answer is yes, and there are fundamental and practical reasons to distinguish online learning from classical machine learning, including

Active Learning An online learning system can decide what its weaknesses are, and formulate queries for the user which, when answered, would accelerate the learning process considerably compared to just randomly selecting new labelled data.

Iterative training Some online learning methods can be specially programmed



to accommodate new data into the already trained system. This is usually faster than a complete retraining, a significant advantage since speed is often a critical issue in interactive systems.

It is, however, important to note that, as far as the learning machine is concerned, if the above factors do not play a role in an application, then a relevance feedback algorithm is no different from a direct learning algorithm working on all of the available data at any point. For the user, the difference remains that he or she does not have to perform any unnecessary labelling: the labelling is done in chunks until the desired satisfaction level is reached.

3.2.3 Categorisation of Relevance Feedback Algorithms


We differentiate between the following types of relevance feedback algorithms
in CBIR.

Active vs. Passive Learning

These terms are borrowed from the machine learning literature, even though
the original use in pedagogy predates it. A learner is passive, if it has no con-
trol over which information it receives from the outside world. If in addition,
the learner formulates questions in order to speed up the learning process, the
learner is termed as active. In the context of image retrieval, an active relevance
feedback algorithm delivers at each feedback round, a current best list of re-
sults, and a list of images which, when labelled by the user, would most likely
be useful for the learning algorithm (user query images). In general, these
two lists will be different, as the best user query images will be the ones about
which the system is most unsure, i.e., the ones which the system has evaluated,
neither as positively positive, nor as positively negative.

Short-term vs. Long-term Learning

Relevance feedback performed solely over a single session is referred to as short-term relevance feedback. This work concentrates principally on this type of feedback. It is the only sensible type if the aim of relevance feedback is to alleviate the problem that different users, or even the same user at different times, are looking for different things even when they supply the same query image. A long-term relevance feedback system, on the other hand, collects feedback information over multiple sessions, perhaps over multiple users. This would make sense under the assumption that there is a principal semantic meaning

to an image, so that semantic interrelationships between images can be captured which would otherwise not be possible using low-level features automatically extracted from the images.

Granularity of Feedback

Granularity of relevance feedback is an important design issue. Granularity here refers to the number of possible labels a user might attach to a result image: a relevance feedback algorithm could be based on both positive and negative feedback, or on positive feedback alone. Most existing relevance feedback systems use three levels: positive, neutral and negative (in numbers, +1, 0 and −1), while some use five levels, from +2 down to −2. Although in theory a higher number of feedback levels, or perhaps even a continuous measure of relevance, would be advantageous, it is generally understood that in this case a consistent answer may not be expected from the user.

3.3 Feature Weighting and Selection Methods

3.3.1 Feature Normalisation


Before one begins with feature weighting or feature selection, it is often bet-
ter to start with uniformly scaled data, though for some algorithms it makes
absolutely no difference. One of the following normalisation methods may be
applied.

Scale to fit range Here one defines a range, such as [−1, 1] or [0, 1], in which all the feature elements of the training vectors must lie. In general, if the lowest observed value for a feature f is l, and the highest u, scaling to [0, 1] can be performed with

f_new = (f − l) / (u − l)    (3.5)
After the normalisation, all training vectors lie inside a unit hypercube. The constants l and u must be saved in order to be applied to new test data when it becomes available. Of course, it is entirely possible for unseen vectors to fall outside the hypercube. Apart from this, the main disadvantage of this normalisation method is its susceptibility to outliers.

Scale to standard statistics In this method, the training data is typically scaled to zero mean and unit variance per feature. If the feature element f has variance σ² and mean µ over all training vectors, then this can be performed by

f_new = (f − µ) / σ    (3.6)

Whitening This is the most general form of feature normalisation, in which even the correlations between different feature elements are removed. Let the training vectors have a covariance matrix given by

C xx = E{(x − µ)(x − µ) T } (3.7)

where

µ = E{x} (3.8)

is the mean vector. The covariance matrix, being symmetric and positive
semidefinite, can always be diagonalized as

C xx = VΛV T (3.9)

where V is an orthonormal matrix containing the eigenvectors, and Λ a


diagonal matrix containing the eigenvalues of the matrix C xx . It can be
easily shown that the transformation:

x0 = Λ−1/2 V T (x − µ) (3.10)

“whitens” the ensemble, i.e. the transformed ensemble has zero mean and an identity covariance matrix. In practice, however, whitening should not be performed before a feature weighting or selection process. Whitening modifies the entire feature space instead of scaling each feature element individually: the original feature space might have contained some “good” and some “bad” features, but this separation can easily get lost during the whitening, especially since whitening, or indeed any other kind of feature normalisation, must be done over the whole training data and not class-wise (as otherwise there would remain ambiguity over how to normalise the new test data).
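The three normalisation methods can be sketched in Python as follows; the function names are ours, and, as noted in the comments, constant features and vanishing eigenvalues would need additional guarding in practice.

import numpy as np

def scale_to_range(X):
    """Scale each feature to [0, 1] (Equation 3.5); l and u must be kept
    for normalising future test data. Assumes no feature is constant."""
    l, u = X.min(axis=0), X.max(axis=0)
    return (X - l) / (u - l), l, u

def standardize(X):
    """Zero mean and unit variance per feature (Equation 3.6)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def whiten(X):
    """Whitening transform (Equation 3.10): zero mean, identity covariance."""
    mu = X.mean(axis=0)
    C = np.cov(X, rowvar=False)
    eigval, V = np.linalg.eigh(C)                      # C = V diag(eigval) V^T
    W = np.diag(1.0 / np.sqrt(eigval + 1e-12)) @ V.T   # Lambda^(-1/2) V^T
    return (X - mu) @ W.T, mu, W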

3.3.2 Feature Weighting


Given a set of images represented by feature vectors and labels {(x_i, y_i), i = 1, …, l}, the aim of feature weighting is to find a weight vector α = (α_1, α_2, …, α_n)^T ∈ R^n which, multiplied element-wise with each original feature vector x = (x_1, x_2, …, x_n)^T, yields a system which is superior according to some optimality criterion. The transformation can be expressed as

x′ = diag(α) · x

where diag(α) is the diagonal matrix with entries α_1, α_2, …, α_n, so that each feature element x_k is simply scaled by its weight α_k.

Thus we see that feature weighting is a linear transformation of the coordinate system of the feature space. But is this really helpful for an image retrieval or classification system? The answer lies in the inherent capabilities of the retrieval algorithm. If the algorithm is inherently capable of dealing with improperly scaled data without adversely affecting its performance, then external feature weighting is of little use. This, however, is not the
case with most retrieval and classification algorithms. For example, a simple
nearest neighbour classifier is wholly dependent on a similarity measure such
as the L2 -norm, where the performance of the system is significantly impaired
in case the data is inappropriately scaled. Even state-of-the-art classifiers such
as the Support Vector Machines cannot directly compensate for scaling of the
data, and would yield solutions which are optimum only in the scaled space
(the margin in an SVM being defined in terms of the Euclidean distance from
the hyperplane).

3.3.3 Feature Selection


Given a set of images represented by feature vectors and labels {(xi , yi ), i =
1, . . . , l }, the aim of feature selection is to find a weight vector α = (α1 , α2 , . . . , αn ) T
∈ {0, 1}n , such that the system using only the feature elements { x k | αk = 1}
performs optimally according to some predefined criteria. In this sense, fea-
ture selection is simply a special case of feature weighting in which the range
of αk ’s has been restricted to be binary. Even though this reduces the com-
plexity to some extent, the resulting feature subset selection problem is by no

means computationally easy even for moderately sized feature vectors. The
feature selection problem for a two-class classification task could be expressed
as an optimisation problem:

min_ω ‖ω‖_0^0    (3.11)

subject to: y_i (⟨ω, x_i⟩ + b) ≥ 1,   i = 1, …, l

where y_i ∈ {−1, 1} are the data labels, ω and b are the parameters of the (linear) classification boundary, and ‖ω‖_0^0 is the zero norm of the weight vector ω, defined as

‖ω‖_0^0 = card{ω_i | ω_i ≠ 0}    (3.12)

The formulation searches for a solution with the smallest number of features which still classifies all of the training points correctly. In case the classes are not linearly separable, or if the possibility of outliers needs to be incorporated, the formulation can be extended with slack variables. It is of interest to note that this optimisation problem is the same as the optimum margin problem in support vector machines, except that the zero norm should be minimised instead of the L2 norm. This is, however, a crucial difference, and as shown by Amaldi and Kann (1998), the above problem is NP-hard. Weston et al. (2003) propose modifications to the above zero-norm minimisation problem in order to make the task realistically solvable.
It is interesting to note that, from a purely theoretical viewpoint, feature selection (or even feature weighting) can offer no improvement if an optimum classifier is used. This is because the Bayes classifier, which outputs the class ω_k with the highest a-posteriori probability P(ω_k | x), behaves monotonically with respect to the number of features: adding features can never increase the expected Bayesian error rate, and thus does no harm (apart from increased computational and memory requirements).
In practice, however, the Bayes rule cannot be applied because the underly-
ing class-specific distributions P(x | ωk ) are usually unknown. If we insist
on approximating the densities P(x | ωk ), ∀k = 1, . . . , K through some means
(e.g. Parzen window, Gaussian mixture models etc.), then it is often beneficial
to work in a lower-dimensional space, as the number of data points required
to cover the feature space with a fixed average density increases exponentially
with the dimensionality of the feature space.

Direct Subset Selection Methods

A subset-based approach works by having a measure of goodness for every feature subset. Clearly, an exhaustive search for the best subset is prohibitively expensive. As an example, with just 256 features the number of subsets is already 2^256 ≈ 10^77, comparable to the estimated number of atoms in the observable universe.
Thus, wrapper-based methods4 usually must restrict themselves to some kind of heuristic search. This works as follows. Let all of the possible states for α be nodes of a graph. Now, instead of evaluating the goodness function for all nodes, we define a start node, a search-path function which defines the nodes that can be reached from a given node, and a stopping criterion which decides when it is appropriate to stop the search and output the best node encountered.
Consider for example the hill climbing algorithm which starts from an empty
set and considers features for addition only if there is an increase in the good-
ness value (see Figure 3.3). Even then, direct subset selection methods can be
slow for interactive applications like relevance feedback.
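The forward search of Figure 3.3 can be sketched as follows in Python; the goodness function is treated as a black box (wrapper), e.g. the cross-validated retrieval accuracy of a wrapped classifier, and all names are illustrative.

import numpy as np

def forward_selection(goodness, n_features, max_features=None):
    """Hill climbing with forward search: start from the empty subset and
    repeatedly add the single feature that most improves the goodness
    value; stop as soon as no addition brings an improvement."""
    selected, best = [], -np.inf
    remaining = set(range(n_features))
    while remaining and (max_features is None or len(selected) < max_features):
        scores = {f: goodness(selected + [f]) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break                                # local optimum reached
        best = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best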

Feature Ranking Methods

On the other hand, a feature ranking approach considers each feature element
in isolation, and assigns a goodness value to it. The features are sorted in de-
scending order according to their goodness value, and receive thereby a rank.
A subset of features can then be selected by either taking the top-k ranked fea-
tures, or by fixing a threshold directly to their goodness value. The main bene-
fit of feature ranking methods over subset selection methods is speed. They are
typically linear in the number of features, while the optimum subset selection
method has an exponential running time. The main drawback is the possible
performance impact. The fact that a feature is good by itself does not say much
about how it performs in conjunction with other features. It might be highly
correlated with other features, thus reducing its usefulness. As an example,
consider a pattern described by three features: A, B, and C, which are useful
for a particular classification task in this order. Let the feature vector contain
redundant copies of these features and be given by ( A, A, A, B, B, B, C, C, C ).
Now, a feature ranking method which considers each feature element in iso-
lation and outputs in the end a subset of size 3 would yield ( A, A, A), while
clearly the desired output in most cases is ( A, B, C ).

4 Wrapper based methods are those which do not have knowledge about the internals of
the classifier, i.e. treat it as a black box.

Figure 3.3: Feature selection using hill climbing and forward search. Each node shows the selection mask and the goodness value for that particular subset. In the shown example, the hill climbing algorithm does not reach the optimum node with a goodness of 0.53 (shown in green).

Correlation Coefficients

In this method, the goodness value of a feature is based on its correlation with the output label. It is measured as

α_k = (µ_{k+} − µ_{k−})² / (σ_{k+}² + σ_{k−}²)

where µ_{k+} and µ_{k−} are the means, and σ_{k+} and σ_{k−} the standard deviations, of the kth feature for the positive and negative classes, respectively.
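This criterion can be evaluated for all features at once; a minimal Python sketch, assuming labels y_i ∈ {−1, 1} and using a small constant to guard against constant features (function name is ours):

import numpy as np

def correlation_weights(X, y):
    """alpha_k = (mu_k+ - mu_k-)^2 / (sigma_k+^2 + sigma_k-^2),
    computed per feature (column) for a two-class labelling."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0)
    return num / (den + 1e-12)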

Linear Classifier Coefficients

These methods work with the idea that given a linear classifier for a two-class
problem:

y = sign(hω, xi + b)

the absolute values of the coefficients of the hyperplane normal ω are indicative of the importance of the features. This is illustrated in a 2-D toy example

Figure 3.4: In this toy example, feature x1 may be considered more important than x2, because the corresponding component of the hyperplane normal vector is larger

in Figure 3.4.
We use the following methods of obtaining the linear classifier:

Linear SVM In this method, the normal coefficients of the SVM hyperplane
(see Section 3.4)

f (x) = sgn(ω · φ(x) + b)

are directly used as feature weights.

Linear Regression Here we select the hyperplane which minimises the squared
error for the output values predicted linearly from the training data.

y pred = hωopt , xi + bopt

where

(ω_opt, b_opt) = arg min_{ω, b} ∑_{i=1}^{l} (y_i − ⟨ω, x_i⟩ − b)²

Linear Discriminant Analysis Linear Discriminant Analysis aims to find a linear combination of features which best separates the given classes. The classes are assumed to have normal distributions, i.e.,

P(x | ω_i) = (1 / ((2π)^{N/2} det(Σ)^{1/2})) exp( −(1/2) (x − µ_i)^T Σ^{−1} (x − µ_i) )    (3.13)

where Σ is the covariance matrix of the ensemble and µ_i is the mean vector of the class ω_i.

3.4 Relevance Feedback using a Two-Class SVM


Two-class SVM solves a classification problem by finding a maximum margin hyperplane that separates the positive training instances from the negative ones. Each training instance is represented as a vector x ∈ R^n and belongs to one of the two classes L = {−1, 1}. The instances lying closest to the hyperplane are called support vectors and are the only vectors affecting the hyperplane. In many cases the training instances are not linearly separable in the original feature space R^n. In this case they can be transformed nonlinearly into a higher dimensional feature space F with a mapping

φ : R^n → F,   x ↦ φ(x)

One then obtains a classification function of the form f(x) = sgn(ω · φ(x) + b). Through the use of a kernel k(u, v) = φ(u) · φ(v), different boundaries can be obtained. In fact, the kernel function k leads to a classifier with maximum margin in some mapped feature space even if the mapping φ itself is not analytically defined, as long as the kernel satisfies Mercer's condition (Mercer, 1909).
It should be noted that correct classification alone is not the goal of a general-purpose CBIR system, as the concept of classes does not exist here in the strict sense. More important is an intelligent ordering of the results, as the user will most likely see only the top few results. This behaviour is already common in text-search engines, where some query keywords can lead to millions of hits. With a two-class SVM, since the sign of f(x) defines the decision boundary, it is natural to order the images on the basis of their decreasing values of f(x). This simple procedure, as we will see, provides good results. Furthermore, the user provides feedback not on the most positive images, which are shown as intermediate results, but rather on the images for which the magnitude of f(x) is closest to zero, i.e. the images closest to the SVM boundary, as suggested in Tong and Chang (2001).
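A sketch of one such feedback round in Python using the scikit-learn library; the function name and parameter values are ours, and the RBF kernel merely stands in for the histogram intersection kernel used later, which could instead be supplied as a precomputed kernel matrix.

import numpy as np
from sklearn.svm import SVC

def svm_feedback_round(X_db, X_pos, X_neg, n_results=20, n_queries=15):
    """Train a two-class SVM on the labelled images, rank the database by
    decreasing f(x), and pick active query images with the smallest |f(x)|,
    i.e. the images closest to the boundary (Tong and Chang, 2001)."""
    X_train = np.vstack([X_pos, X_neg])
    y_train = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    clf = SVC(kernel="rbf", C=10.0).fit(X_train, y_train)
    f = clf.decision_function(X_db)
    results = np.argsort(-f)[:n_results]          # current best results
    queries = np.argsort(np.abs(f))[:n_queries]   # most uncertain images
    return results, queries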

3.5 Relevance Feedback using a One-Class SVM


One-Class SVMs were first proposed by Schölkopf et al. (2001). One-Class
SVMs are binary functions which capture regions in input space where the
probability density lives (i.e. its support). Here we are interested only in the
distribution of the relevant images. We try to find a hypersphere which con-
tains most of the user-supplied relevant images while being as small as possi-
ble. This can be written in primal form as:

min_{R ∈ R, ζ ∈ R^l, c ∈ F}   R² + (1/(νl)) ∑_i ζ_i

subject to

‖φ(x_i) − c‖² ≤ R² + ζ_i,   ζ_i ≥ 0,   i = 1, …, l
The ζ_i’s are slack variables denoting the distance of an instance from the hypersphere; they are used to penalise outliers. If ζ_i > 0 then the positive training instance x_i is detected as an outlier and lies outside the hypersphere of radius R. The parameter ν ∈ [0, 1] sets the trade-off between the radius of the ball and the number of training instances it encloses. If ν is chosen to be small, the hypersphere is allowed to grow so that more training instances can be put into the ball. If ν is chosen to be large, the hypersphere is kept small while a fraction of the training instances is allowed to lie outside. The primal form of the optimisation problem can be transformed into a dual form using Lagrange multipliers α_i. The corresponding Lagrange
function is:
L(R, ζ, c, α) = R² + (1/(νl)) ∑_{i=1}^{l} ζ_i + ∑_{i=1}^{l} α_i ( ‖φ(x_i) − c‖² − R² − ζ_i ),   with α_i ≥ 0
This function has to be minimised. At the minimum the following conditions must hold:

∂L/∂R = 0 ⇔ 2R − 2R ∑_{i=1}^{l} α_i = 0 ⇔ ∑_{i=1}^{l} α_i = 1

∂L/∂c = 0 ⇔ ∑_{i=1}^{l} α_i (2c − 2φ(x_i)) = 0 ⇔ c = ( ∑_{i=1}^{l} α_i φ(x_i) ) / ( ∑_{i=1}^{l} α_i ) = ∑_{i=1}^{l} α_i φ(x_i)

The centre c is thus completely determined by α alone, and the radius R is also not independent. So the Lagrange function can be written using only the independent variables ζ and α.


L(ζ, α) = (1/(νl)) ∑_{i=1}^{l} ζ_i + ∑_{i=1}^{l} α_i ( ‖φ(x_i) − ∑_{j=1}^{l} α_j φ(x_j)‖² − ζ_i )

        = (1/(νl)) ∑_{i=1}^{l} ζ_i + ∑_{i=1}^{l} α_i [ φ(x_i)·φ(x_i) + ∑_{j,k=1}^{l} α_j α_k φ(x_j)·φ(x_k) − 2 ∑_{j=1}^{l} α_j φ(x_j)·φ(x_i) ] − ∑_{i=1}^{l} α_i ζ_i

        = (1/(νl)) ∑_{i=1}^{l} ζ_i + ∑_{i=1}^{l} α_i φ(x_i)·φ(x_i) − ∑_{i,j=1}^{l} α_i α_j φ(x_i)·φ(x_j) − ∑_{i=1}^{l} α_i ζ_i

where the last step uses ∑_i α_i = 1.

Now L should be minimised with respect to the ζ_i’s, subject to ζ_i ≥ 0. So either ∂L/∂ζ_i = 0 if such a point exists, or ζ_i = 0 and ∂L/∂ζ_i > 0. This gives

∂L(ζ, α)/∂ζ_i = 1/(νl) − α_i ≥ 0 ⇔ α_i ≤ 1/(νl)
Now L can be rewritten without the ζ_i’s:

L(α) = ∑_{i=1}^{l} α_i φ(x_i)·φ(x_i) − ∑_{i,j=1}^{l} α_i α_j φ(x_i)·φ(x_j)

This leads to the following dual form:

min_α ∑_{i,j} α_i α_j k(x_i, x_j) − ∑_i α_i k(x_i, x_i)

subject to

0 ≤ α_i ≤ 1/(νl),   ∑_i α_i = 1
The optimal α’s can be computed by solving this dual problem with the help
of a QP optimization method. After that the centre of the hypersphere can be
calculated, if the mapping φ(x) is known:

c = ∑_i α_i φ(x_i)

But the mapping φ(x) will be unknown in most cases. The decision function f(x) = sgn(R² − ‖φ(x) − c‖²) can be computed without the centre using the

Figure 3.5: One-Class SVM applied on a 2-D toy data set. (a) Using a linear kernel. (b) and (c) Using a Gaussian kernel with different ν and γ values. As can be seen, one-class SVMs are quite flexible depending on their parameters. (A colour version of this figure is available on Page 126)

corresponding kernel function:

f(x) = sgn( R² − ∑_{i,j} α_i α_j k(x_i, x_j) + 2 ∑_i α_i k(x_i, x) − k(x, x) )

The support vectors are those instances x_i with 0 < α_i < 1/(νl); the x_i’s with α_i = 1/(νl) are the outliers, and the x_i’s with α_i = 0 are the instances lying truly inside the ball. The radius R is computed such that all support vectors lie on the hull of the hypersphere; this is the case if, for all support vectors, the argument of the sgn is zero.
This function returns positive values for points inside the hypersphere and negative values outside (note that although we use the term hypersphere, the actual decision boundary can be varied by choosing different kernel functions). The results are sorted on the basis of their “positiveness”. Since the actual value of the function f(x) is not important, we can speed up the process by noting that the first two terms in the decision function are constants. Furthermore, the last term k(x, x) is also constant for many kernels (e.g. the Gaussian). Thus, for such kernels, the images can be ranked simply on the basis of decreasing values of

f′(x) = ∑_i α_i k(x_i, x)
In case the distance of an instance to the centre of the hypersphere is needed, it can be calculated as

d(x) = √( ∑_{i,j} α_i α_j k(x_i, x_j) − 2 ∑_i α_i k(x_i, x) + k(x, x) )
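In practice the whole procedure is available in standard libraries. The following Python sketch uses scikit-learn, whose OneClassSVM implements the hyperplane formulation of Schölkopf et al. (2001); for translation-invariant kernels such as the Gaussian this coincides with the hypersphere described above, so ranking by the decision value is equivalent to ranking by f′(x). The function name and parameter values are illustrative.

import numpy as np
from sklearn.svm import OneClassSVM

def one_class_feedback(X_db, X_pos, nu=0.2, gamma=0.5, n_results=20):
    """Fit a one-class SVM on the relevant images only and rank the whole
    database; larger decision values mean deeper inside the support."""
    ocsvm = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_pos)
    scores = ocsvm.decision_function(X_db)
    return np.argsort(-scores)[:n_results]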

3.6 Manifold learning


The target images during a query might be distributed in a small subspace
around the query image. As such, linear dimensionality reduction techniques
may be applied to learn such a subspace. However, it is not ruled out that the
subspace has non-linear properties, and thus appears to have a much higher
dimensionality than it actually has, when linear techniques are applied. For ex-
ample, consider that the samples are points on a sphere, let us say, the earth’s
surface. If the points can be traversed only along the surface, then Euclidean
relationships clearly do not hold. Locally, however, the neighbourhood of a
point can be well approximated using 2-dimensional maps. This behaviour
can be captured using manifolds. In this context, one defines the geodesic dis-
tance which is the shortest distance between two points along the manifold.
Figure 3.6 shows an example of a 1-manifold in 2D space.
Given a distance matrix (Euclidean or other metric) between a set of points
{pi }, a standard method for estimating manifolds is to leave out the larger dis-
tances and trust only the remaining ones. Tenenbaum et al. (2000) implement
this using the 3-step Isomap algorithm:

Step 1 Determine which points are neighbours on the manifold M, based on their Euclidean distance d_E(p_i, p_j) and an upper threshold τ. This relationship is stored as a weighted undirected graph G.

Step 2 Estimate the geodesic distance dG (pi , p j ) by calculating the shortest path
between pi and p j in the graph G .

Step 3 Apply classical MDS (see Section 3.1.2) on the geodesic distance matrix
DG to generate a lower dimensional Euclidean space which best matches
the manifold geometry.
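A compact Python sketch of these three steps, assuming the neighbourhood graph comes out connected (an assumption discussed below); the classical MDS step is implemented directly via double centering, and the function name is ours.

import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(D, tau, n_components=2):
    """D: Euclidean distance matrix, tau: neighbourhood threshold."""
    # Step 1: keep only edges not longer than tau (inf marks a non-edge)
    W = np.where(D <= tau, D, np.inf)
    # Step 2: geodesic distances as shortest paths in the graph
    DG = shortest_path(W, method="D", directed=False)
    if np.isinf(DG).any():
        raise ValueError("graph is disconnected; increase tau")
    # Step 3: classical MDS on the geodesic distance matrix
    n = DG.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (DG ** 2) @ J                   # double centering
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:n_components]  # largest eigenvalues
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))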

Although it is highly plausible that low-dimensional manifolds exist in the case of image retrieval, the main issue with manifold learning algorithms such as Isomap is that the number of close image pairs is usually small for most databases, and for normal values of τ it is entirely possible that the resulting graph G is disconnected, so that the complete geodesic distance matrix cannot be calculated (most entries are equal to infinity). If we further insist on using only user-supplied perceptive distances, i.e. set d_E(p_i, p_j) to a small value in case the user judges the pair to be similar and to a large value otherwise, then the resulting graph is even sparser. For this reason, we do not formally perform manifold learning in the context of relevance feedback.

Table 3.1: Shape ground truth used for the experiments, with a total of 429 images in 8 different categories. Sample images are shown alongside the class name. (A colour version of this table is available on Page 124)

Class      # of members
circle     263
diamond    11
shape-1    67
shape-2    16
shape-3    16
shape-4    24
shape-5    21
triangle   11

Figure 3.6: A spiral, which is a 1-manifold in 2-D space. The geodesic distance between two points is the length of the arc segment connecting them, whereas the Euclidean distance is the straight-line distance.

3.7 Experiments and Results

We perform the experiments on two databases. With the first database (trade-
mark images), ground truth is defined on the basis of a clearly defined rule:
Two images are similar if they have basically the same shape. The advantage
is that the efficiency of feature selection algorithms can be tested with ease.
With the second database (MPEG-7 content set), the ground truth is subjec-
tively defined by an end user. This is done because in most image databases
only subjective ground truth is available. We will first give a short description
of the used databases and the ground truth.

3.7.1 Database: Trademark image database

The trademark image database used in this work was first used in Wan Nu-
ral Jawahir (2006). The database consists of 1000 images depicting trademarks
or logos of companies, associations, sport clubs etc. The reason for choosing
this database is that the ground truth is very accurate and multiple meanings
can be associated with the search. For example, given a trademark image as
a query, the user might be looking for similar shapes, similar colours, or both
during his or her search. For the purpose of the experiments, we will constrain
ourselves to the ground truth arising from similar shape only.

Features

Two kinds of features are extracted from the images:

Zernike Moments Zernike moments are constructed from a set of complex polynomials defined inside the unit circle, together with the radial polynomial R_pq. 25 Zernike moments of order 0 to 8 in p and q are extracted using the following function:

Z_pq = ((2p + 2) / N²) ∑_{γ=1}^{N/2} ∑_{ψ=1}^{N} R_pq(2γ/N) cos(πqψ/N) f(ρ, θ)    (3.14)

The Zernike moments capture the shape information present in the trade-
marks.

Colour Spatial Features Colour spatial features describe the colour distribution with spatial knowledge of the pixels in an image. These features can be extracted using a simple and quick process: the original image I is divided into 3 × 3 blocks of the same size, and each block consists of three colour components (red, green and blue), whose mean values are stored as features. Though these features are very simple, they are quite suitable for the small trademark images. Each trademark image thus ends up having a colour-spatial feature vector of 27 dimensions.
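A minimal Python sketch of this extraction; the function name is ours, and the image is assumed to be an RGB array of shape (height, width, 3):

import numpy as np

def colour_spatial_features(img):
    """Divide the image into a 3 x 3 grid and store the mean of each of
    the three colour channels per block: 3 * 3 * 3 = 27 dimensions."""
    h, w, _ = img.shape
    feats = []
    for i in range(3):
        for j in range(3):
            block = img[i * h // 3:(i + 1) * h // 3,
                        j * w // 3:(j + 1) * w // 3]
            feats.extend(block.reshape(-1, 3).mean(axis=0))
    return np.asarray(feats)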

Ground Truth

The ground truth for shape similarity was extracted manually. A total of 8
shape types were defined which covered a total of 429 images. The categories
and sample images are shown in Table 3.1.

3.7.2 Database: MPEG-7 Content Set


We decided to also use this database because it uses subjective ground truth,
which is usually the case in image retrieval. The database consists of around
2400 images from the MPEG-7 content set5 . A total of 15 images were used
as query images. For each image, the similar images were manually selected
from the whole database, thus describing the ground truth which was used for
the experiments.
5 We acknowledge Tristan Savatier, Aljandro Jaimes, and the Department of Water Resources, California, for providing them under the Licensing Agreement for the MPEG-7 Content Set (MPEG 98/N2466).

Features

We use invariant image features based on the haar integral which were in-
troduced by Schulz-Mirbach (1995). Fast approximate invariant features were
successfully used for image retrieval by Siggelkow (2002). The invariant fea-
tures are constructed as follows. Let M = {M(i, j)}, 0 ≤ i < N, 0 ≤ j < M
be an image, with M(i, j) representing the gray-value at the pixel coordinate
(i, j). Let G be the transformation group of translations and rotations with ele-
ments g ∈ G acting on the images, such that the transformed image is gM. An
invariant feature must satisfy F ( gM) = F (M), ∀ g ∈ G. Such invariant features
can be constructed by integrating f ( gM) over the transformation group G.
I(M) = (1/|G|) ∫_G f(gM) dg

which for a discrete image is approximated using summations:

I(M) = (1/(P N M)) ∑_{t_0=0}^{N−1} ∑_{t_1=0}^{M−1} ∑_{φ=0, step 2π/P}^{2π(1−1/P)} f(gM)

The summations are replaced by a histogramming operation, which leads to higher robustness against occlusion or background changes while preserving invariance, although structural information is lost.
We use f(X) = (X(4, 0) · X(0, 8))^{1/2} applied to each colour layer of the RGB space to yield a 3D histogram of 8 × 8 × 8 = 512 bins.
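The following Python sketch shows one plausible reading of this feature for a single colour layer, sampling the two grey values on circles of radius 4 and 8 (phase-shifted by 90 degrees) for P rotation angles; for brevity it builds a per-layer 1-D histogram instead of the joint 8 × 8 × 8 RGB histogram, and 8-bit grey values are assumed. Both the interpretation of the offsets and the function name are our own.

import numpy as np
from scipy.ndimage import map_coordinates

def invariant_histogram(layer, P=8, bins=8):
    """f(X) = (X(4,0) * X(0,8))^(1/2), evaluated for every pixel and P
    rotations; histogramming replaces the summation for robustness."""
    n, m = layer.shape
    yy, xx = np.mgrid[0:n, 0:m].astype(float)
    vals = []
    for p in range(P):
        phi = 2 * np.pi * p / P
        a = map_coordinates(layer, [yy + 4 * np.sin(phi),
                                    xx + 4 * np.cos(phi)],
                            order=1, mode="wrap")
        b = map_coordinates(layer, [yy + 8 * np.sin(phi + np.pi / 2),
                                    xx + 8 * np.cos(phi + np.pi / 2)],
                            order=1, mode="wrap")
        vals.append(np.sqrt(a * b).ravel())
    hist, _ = np.histogram(np.concatenate(vals), bins=bins, range=(0, 255))
    return hist / hist.sum()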

3.8 Results with the Trademark Database


We show the effectiveness of the feature selection procedures through the feature weights assigned during the relevance feedback process. Figures 3.7 through 3.12 show visually the relevance feedback results for two different query images, using two different methods: a linear SVM and correlation coefficients. Since other weighting methods produce similar results, we refrain from providing their images. Figure 3.13 shows the corresponding weights assigned to the features. Since the ground truth was defined
on the basis of the shape features, the best case scenario would be that they
get the highest possible weights and the color features get weights which are
as low as possible. Of course, this is unrealistic to achieve with such a small
sample set. Even then, it can be seen that the feature weights achieved are in
the correct direction, compared with the initial weights which were assigned
equally.

Figure 3.7: First round results for an image from class number 2. The first row contains the query image. The next two rows show the top results using the shape features, and the next two rows using the colour features. The query images for the next round of relevance feedback were automatically selected from this pool.

Table 3.2: SVM kernels for Image Retrieval

Kernel                   k(x, y)
Linear                   x · y
Polynomial               (γ (x · y) + coef0)^d,  γ > 0
RBF                      exp(−γ ‖x − y‖²),  γ > 0
Histogram Intersection   ∑_{i=1}^{n} min(x_i, y_i)

[Figure panels: “User-supplied Images” (file names with +++/−−− relevance labels), “Result Images” (file names with their SVM ranking scores), and “Active Query Images” (file names with their uncertainty scores).]

Figure 3.8: Second round results using a two-class SVM for relevance feedback. The +++ in the title indicates that the image was a positive example, while −−− indicates that it was a negative one.

[Figure panels: “User-supplied Images” (file names with +++/−−− relevance labels) and “Result Images” (file names with their distance scores).]

Figure 3.9: Second round results after feature selection using correlation coef-
ficients. The best 20 features were used to generate the results. (A
colour version of this figure is available on Page 128)


Figure 3.10: First round results for an image from class number 4. The first row contains the query image. The next two rows show the top results using the shape features, and the next two rows using the colour features. The query images for the next round of relevance feedback were automatically selected from this pool.

[Figure panels: “User-supplied Images” (file names with +++/−−− relevance labels), “Result Images” (file names with their SVM ranking scores), and “Active Query Images” (file names with their uncertainty scores).]

Figure 3.11: Second round results using a two-class SVM for relevance feedback.

[Figure panels: “User-supplied Images” (file names with +++/−−− relevance labels) and “Result Images” (file names with their distance scores).]

Figure 3.12: Second round results after feature selection using correlation coefficients. The best 20 features were used to generate the results.

[Four plots of feature weights versus feature index: “Feature Weights: Linear SVM” and “Feature Weights: Correlation Coefficients” for query images #138 and #387, each distinguishing the shape features from the colour features.]
Figure 3.13: Feature weights obtained using (a) and (c): a linear SVM and (b)
and (d): correlation coefficients

Figure 3.14: Screenshot of our CBIR system after labeling the first set of query
images

Figure 3.15: Results after the initial query round



Figure 3.16: Results after the second query round



[Two precision-recall plots: the left compares two-class SVMs with intersection, RBF, linear/sigmoid and polynomial (degree 5) kernels after one training round; the right shows the two-class SVM with intersection kernel after feedback rounds 1 to 6.]
(a) Effect of different kernel functions (b) Results after multiple feedback rounds

Figure 3.17: Precision Recall plots with two-class SVM

[Two precision-recall plots comparing histogram intersection ranking, a one-class SVM with intersection kernel, and a two-class SVM with intersection kernel, after the first and after six rounds of relevance feedback.]

(a) After first round (b) After six rounds of Relevance Feedback

Figure 3.18: Comparison of different retrieval methods



3.9 Results with the MPEG-7 Content Set


We perform experiments using the one class and the two class SVM. Feature
selection experiments are not performed as no feature ground truth exists for
this database.

3.9.1 Good kernel functions for invariant feature histograms


Selecting a good kernel function is critical to the performance of an SVM clas-
sifier. However, there exists no automatic method to find the optimum kernel
function for a particular data set. Moreover, in CBIR a new SVM needs to be
trained for each new query. Therefore, the best tuned parameters for a partic-
ular query image need not work well for all possible queries.
We tune the parameters for four kernel functions: the linear, Gaussian, polynomial and histogram intersection (HI) kernels, based on the ground truth we have for this database. The kernels are shown in Table 3.2. The first three kernels are common in the SVM literature; the HI kernel we try based on our prior knowledge of our histogram-based features. Indeed, it has also been shown in previous work that L1-based similarity measures are perceptually closer to human similarity definitions than L2-based measures. In Chapelle et al. (1999), for example, interesting results are reported with the Laplacian kernel, which is similar to the HI kernel.
Figure 3.17(a) shows the results with different kernels using two-class SVM
with three relevant and five irrelevant images. As expected, the Histogram
Intersection kernel performs better than the others.

3.9.2 Comparison of two-class and one-class SVM


This is a very interesting comparison. On the one hand, one expects a two-class
scheme to perform better as it uses all the information that the user provided
to the system, i.e. some relevant and some not relevant images. But on the
other hand, one can safely assume that although the relevant images might
form a cluster in the feature space, the irrelevant images may not, as they can
belong to any of the remaining classes in the database. Thus, if these few ir-
relevant images are say randomly distributed over the feature space then they
could possibly be of no help to a classifier which is trying to learn a decision
boundary separating the relevant images from the rest.
Figure 3.18(a) and 3.18(b) compare the results of two-class SVM vs. one-
class SVM after the first and sixth round of relevance feedback respectively.

Also shown for comparison is a simple ranking method based on the L1 similarity measure (histogram intersection). As can be seen, the two-class SVM does not perform as well as the one-class SVM after the first round, as there are hardly enough samples (positive and negative) for the classifier. After six rounds, however, the two-class SVM outperforms the other methods.

3.9.3 Improvements over multiple feedback rounds


Figure 3.17(b) shows the Precision-Recall graph after multiple rounds of Rel-
evance Feedback with two-class SVM have been performed. As can be seen
from the graph, Relevance Feedback almost always leads to iterative improve-
ment, but reaches a point of diminishing returns. The improvement in the
initial rounds is very encouraging. The fact that the results could not be improved beyond a saturation level could be due either to limitations in the discriminatory performance of the features used or to a learning limitation of the classifier. Our understanding is that perfect retrieval could not be attained with this combination of classifier and features because some images in our ground truth were only semantically similar while being visually very dissimilar.
Chapter 4

Offline methods in CBIR

In this chapter, we deal with methods that potentially involve intensive com-
putations on the whole database, hence they are unsuitable for use in rele-
vance feedback scenarios. These methods are nevertheless important for con-
tent based image retrieval applications.
The main motivation for offline methods is to learn about the dataset as a whole, i.e. not with respect to a particular query image. This learning can take various forms. The knowledge gained can be used, e.g., to speed up the retrieval time of the system, but also to provide insights about the contents
of the database. If some kind of ground truth is available for the database,
this can be used to optimize the tunable system parameters in advance, for
example by selecting or weighting features in order to get results as close to
the ground truth as possible.
Since we will be using various clustering techniques in this chapter as well
as in Chapter 5, we summarize these shortly in the next section.

4.1 Clustering Algorithms


Clustering is an unsupervised machine learning problem1 . In everyday lan-
guage, it may be defined as a grouping of “similar” objects. A similarity func-
tion and a clustering algorithm are the two main issues to decide during the
clustering process. In the following, we will assume the availability of n fea-
ture vectors, x1 , . . . , xn , and the aim is to assign them to k clusters.
Various kinds of clustering algorithms have been developed in the machine
learning literature (Jain and Dubes, 1988). They may be classified, for example,
along the following axes:
1 Unsupervised learning refers to the scenario where only unlabelled data is available


Deterministic vs. fuzzy membership Deterministic methods assign each given


data point to a single cluster whereas in fuzzy methods a data point is al-
lowed to be part of multiple clusters, each with a certain probability.

Input space partitioning vs. data grouping There are cases in which we wish
to partition the input space into regions for each cluster (either hard- or
fuzzy partitioning). This is needed, for example, when new data becomes
available and must be assigned to one or more of the precalculated clus-
ters. Otherwise, it is sufficient to know the optimum groupings of input
feature vectors.

Flat vs. hierarchical membership Flat clustering methods output just the final clusters C_1, …, C_k, where C_i is either the partitioned region for the ith cluster, or simply a subset of the set of input feature vectors. Many natural groupings, however, are hierarchical in nature. Consider, for example, the problem of clustering all of the natural languages in the world, which are all born out of just a few ancient languages.

4.1.1 k-means Clustering


The k-means algorithm follows an iterative procedure to minimise the overall intra-cluster variance

E = ∑_{l=1}^{k} ∑_{i | x_i ∈ C_l} ‖x_i − µ_l‖²    (4.1)
The algorithm must be seeded with initial values for the k cluster centers. At every iteration new cluster memberships and cluster centers are computed, and it is guaranteed that the goodness measure E of Equation 4.1 can only decrease with each iteration. However, as with gradient descent algorithms, it runs the risk of ending up at a local minimum instead of the global one. Another drawback is that the number of clusters k must be provided in advance, although some work has been done on determining k automatically (e.g. Ishioka (2000), Ray and Turi (1999)), mainly by trying different numbers of clusters and choosing the best k according to a pre-defined cluster goodness criterion. The k-means algorithm is provided in Algorithm 4.1.


Algorithm 4.1 k-means: µ = K-Means(x_1, …, x_n, µ_1, …, µ_k, maxIter)
Input: data points x_1, …, x_n, initial cluster centers µ = {µ_1, …, µ_k}, and maximum number of iterations maxIter
 1: j ← 0
 2: repeat
    // Cluster membership reassignment
 3:   for all data points x_i do
 4:     assign x_i to cluster C_p where p = arg min_l ‖x_i − µ_l‖
 5:   end for
    // Cluster center recomputation
 6:   for all cluster centers µ_l do
 7:     µ_l^old ← µ_l
 8:     µ_l ← (1/|C_l|) ∑_{i | x_i ∈ C_l} x_i
 9:   end for
10:   j ← j + 1
11: until (j = maxIter ∨ µ = µ^old)
12: return cluster centers µ = {µ_1, …, µ_k}
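A direct Python transcription of Algorithm 4.1 follows; the handling of empty clusters, which the pseudocode leaves unspecified, is our own choice (the old center is kept).

import numpy as np

def k_means(X, mu, max_iter=100):
    """Alternate membership reassignment and center recomputation until
    the centers stop moving or max_iter is reached."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)                  # nearest-center assignment
        mu_new = np.array([X[labels == l].mean(axis=0)
                           if np.any(labels == l) else mu[l]
                           for l in range(len(mu))])
        if np.allclose(mu_new, mu):                # centers unchanged: done
            break
        mu = mu_new
    return mu, labels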

4.1.2 Hierarchical Clustering

Hierarchical clustering methods follow either a bottom-up approach, where one starts with the input patterns and merges them successively to form new cluster groups, or alternatively a divisive (top-down) approach. Here we will only engage ourselves with bottom-up clustering, which is also known as agglomerative clustering. The agglomerative clustering algorithm starts with N groups, each containing exactly one training instance. At each iteration, the two closest groups are merged into a single group. The distance between two data points may be defined in terms of any distance metric, e.g. the Euclidean or city-block distance. In agglomerative clustering, however, an additional distance, namely the distance between two clusters, must be defined. The most common choices are single linkage, where the distance between two clusters is defined as

d(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y)    (4.2)

and average linkage, defined as

d(C_i, C_j) = ( ∑_{x ∈ C_i, y ∈ C_j} d(x, y) ) / ( |C_i| |C_j| )    (4.3)

This merging or linkage process continues until a single cluster remains,



[Two panels: “Input Points x1 .. x8” in 2-D space and the “Corresponding Dendrogram”, with the similarity value on the vertical axis.]

Figure 4.1: Sample points in two-dimensional space and their corresponding dendrogram using single linkage.

one that contains all of the training data. The resulting hierarchical tree can be visualized using a dendrogram. One axis of a dendrogram shows the data points, while the other axis indicates their grouping, as well as the distance at which the grouping was achieved. Figure 4.1 shows a sample dendrogram for the 2-D data points shown on the left. The dendrogram can be cut at any desired similarity level in order to achieve the requisite number of clusters.
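The whole procedure is available in SciPy; a minimal Python sketch with made-up 2-D points (the coordinates are illustrative, not those of Figure 4.1):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[3.0, 3.9], [3.3, 3.4], [3.9, 3.5], [4.0, 3.2],
              [3.7, 3.7], [3.5, 3.1], [3.4, 3.3], [3.1, 3.8]])
Z = linkage(X, method="single")     # Eq. 4.2; method="average" gives Eq. 4.3
clusters = fcluster(Z, t=0.4, criterion="distance")  # cut the dendrogram
dendrogram(Z)                       # visualise the merges (needs matplotlib)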

4.2 Automatic Annotation of Images


Annotation of images is defined as the process of assigning meaningful labels (also known as keywords or tags) to images. Images have been annotated since long before the advent of the digital era: news corporations have used annotation to manage their huge collections of archived photographs, and even private users annotate their photo albums to aid recall later on. It is, therefore, not surprising that electronic keyword-based search came into existence quite early, and remains, in fact, the popular choice to date.
Assigning the keywords manually to each and every image becomes cum-
bersome and impractical, if not impossible, for very large data sets. Thus,
we seek possibilities of semi-automatic or fully-automatic means for achiev-
ing this goal. A very practical extension of web search engines as they crawl
the internet or an internal document database, is to associate with every dis-
covered digital image, the text surrounding the image; the rationale being that
this text refers to the content of the image with a high probability. And indeed,

for many keywords, this method performs reasonably well, as can be seen in
the screenshot of Figure 4.2. As one can see there, there were over 20 million
hits for the keyword ocean, and even if there are good chances that only a
very small fraction of these actually contain images of oceans (i.e. the system
exhibits an overall low precision), such a system is quite useful because the precision among the top-ranked results is high. Thus, sheer scale works in the system's favour here.
The low precision comes from the fact that the web data is unstructured,
apart from some structure imposed by natural language which is often am-
biguous without contextual knowledge. Except for the knowledge of the statis-
tics with which words or constellations of words appear on a webpage, there
is little known about the semantics which are conveyed by the page. If the
semantic web (Shadbolt et al. (2006), Cardoso and Sheth (2006)) ever becomes
a reality, it should go a long way towards improving this contextual form of
automatic image annotation.
In this work, however, we wish to deal with purely content-based automatic
annotation of images. This problem is much more difficult because of the
large amount of variability in the image content which is possible without any
change in the semantics conveyed by the image. However, it is an exciting and
evolving research field, and a long-term research goal could be to bring the
machine performance closer to that of humans, who have the capacity to learn
atleast a few thousand concepts. The two possibilities of annotating images
(contextual and content-based) should be considered as complimentry and not
as competing with each other, as there would always be some situations where
a collection of images is available without any further information, and thus
only content-based annotation is applicable.
The tags can be broadly divided into two principled types:

Content Description These kinds of tags identify objects, describe scenes, etc.
occurring in the content of the images. Example keywords include people,
car, landscape, etc.
Meta-Data Tags Here we talk about imaging parameters or other available
information which might be difficult to reconstruct later using either au-
tomatic or manual methods. For example, most if not all digital cameras
nowadays add information such as shutter speed, ISO setting, date/time of
image capture, and even the location coordinates if a GPS receiver is built-
in, etc. By their very definition, it is either very difficult or impossible to
assign such tags from the pixel content alone.

Figure 4.3(a) shows an image from a photo-sharing community website


Flickr. Shown alongside are the user-supplied tags as well as the tags added

Figure 4.2: Image search using keywords extracted using automated analysis
of the webpage containing the image. The screenshot shows the
results for the query keyword ocean using the commercial search
engine GoogleTM .

by the camera during the photo capture. As one can see from the example,
there could always be some tags which one could not expect the machine to
learn, as for example, the name of the city where an indoor image was taken.
However, this would be true even for a human observer unfamiliar with the
context of the image, thus the machine cannot be said to be inferior in such
situations.
We describe briefly some prior work in the field of automatic annotation.
Barnard et al. (2003) presented a scheme to link segmented image regions with
words, the results depending heavily on the quality of the segmentation. Vogel
(2004) assigned semantically meaningful labels to local image regions formed
by dividing the image into a rectangular grid. Li and Wang (2003) proposed
a statistical modeling approach using a 2-D Multiresolution Hidden Markov
Model for each keyword and choosing the keywords with higher likelihood
values. Cusano et al. (2003) use a multi-class SVM for annotation though their
scheme can hardly be judged due to their very small vocabulary consisting of
seven keywords.
In this work we describe our annotation methodology which consists of
a feature extraction, feature weighting, model evaluation and a keyword as-
signment routine (Setia and Burkhardt, 2006). Note that we sometimes use the

terms feature weighting and feature selection interchangeably, as once the sys-
tem has given a weight to each feature, they can always be ranked to select the
ones with the higher weights.
We describe briefly the outline of this section. We first give a description of
the visual features used, then present our feature weighting algorithm. Later
we give a description of our model based on the one-class SVM, and present
the results of the experiments. We conclude with a discussion and an outlook
for possible improvements and future work.

4.2.1 Feature Extraction


To demonstrate the effectiveness of the feature weighting and model evaluation modules, we use a small set of simple and well-known visual features comprising the following:
Colour Features: Colour features are widely used for image representation
because of their simplicity and effectiveness. We use color moments calculated
on HSV images. For each of the three layers we compute the layer mean, layer
variance and layer skewness respectively. This yields a 9-dimensional vector.
Since this does not incorporate any interlayer information, we calculate three
new layers SV, VH and HS non-linearly by point-wise multiplication of pairs
from original layers and calculate the same 3 moments also for the new lay-
ers. The final 18-dimensional vector outperformed a 512-bin 3D joint colour
histogram in CBIR tests that we performed.
Texture Features: Texture features can describe many visual properties that
are perceived by human beings, such as coarseness, contrast etc. (Tamura et al.,
1978) . We use the Discrete Wavelet Transformation (DWT) for calculating tex-
ture features. The original image is recursively subjected to the DWT using
the Haar (db1) wavelet. Each decomposition yields 4 subimages which are
the low-pass filtered image and the wavelets in three orientations: horizontal,
vertical and diagonal. We perform 4 levels of decomposition, and for the orientation subimages we use the entropy (− ∑_{i=1}^{L} H(i) · log(H(i)), with H ∈ R^L being the normalised intensity histogram of the subimage) as the feature, thus resulting in a 12-dimensional vector.
Edge Features: Shape features are particularly effective when the image
background is uncluttered and the object contour dominates. We use the
edge-orientation histogram (Vailaya et al., 1998), which we compute directly
on gray-scale images by first calculating the gradient at each point. For all
points where the gradient magnitude exceeds a certain threshold, the gradient
direction is binned correspondingly into the histogram. We use an 18-bin
histogram, which yields bins of size 20 degrees each.

Figure 4.3: A tag example from the photo-sharing website Flickr™. (a) An
example image; (b) labels supplied by users: India, Ladakh, trekking,
geotagged, Leh-Manali trek, Padum-Wanla; (c) tags automatically added in the
JPEG image's EXIF section:

    Camera                         Canon PowerShot SD400
    Exposure                       0.002 sec (1/640)
    Aperture                       f/5.6
    Focal Length                   5.8 mm
    Exposure Bias                  0/3 EV
    Flash                          Flash did not fire, auto mode
    Date and Time (Original)       2005:08:09 23:17:39
    Compressed Bits per Pixel      3 bits
    Shutter Speed                  298/32
    Maximum Lens Aperture          95/32
    Metering Mode                  Pattern
    Sensing Method                 One-chip colour area sensor
    Latitude                       N 33° 41' 16.638"
    Longitude                      E 76° 55' 55.578"
    Altitude                       3457
    Compression                    JPEG
    City                           1 km S of Honia
    Country/Primary Location Name  India


The final feature vector is a concatenation of the above three vectors and
has a dimensionality of 48. In order to give an equal initial weight to all the
features, they are normalised using the scale to fit range method (see Section
3.3.1).

4.2.2 Feature Weighting


A large number of feature selection or feature weighting methods have been
proposed in the machine learning literature. The interested reader can refer to
Kohavi and John (1997) for an overview of some of the popular alternatives.
The main distinction is between the so-called Filter methods, which compute
a ranking for the features without taking the inducer (classifier) into account,
and the Wrapper methods, which search the set of feature subsets for the
optimal subset for the specific inducer. In Chapter 3 we mentioned a few
feature selection methods suitable for relevance feedback.
In this section, we propose a feature weighting method suitable for the im-
age annotation problem. Image annotation with keywords can be interpreted
as a classification problem but with two distinct characteristics: a) The number
of classes (keywords) can be very large, and b) An image object can belong
to multiple classes simultaneously (in other words, an image is usually anno-
tated with multiple keywords). Thus, traditional feature weighting methods
for multi-class classification are not only overloaded with the high number of
classes, but would also give incorrect weights because of the overlap between
the classes.
Our final aim is to learn a model for each class (keyword) based on a few
training images. If we consider the training data for all the classes collectively,
the properties of the ensemble become evident: the classes overlap, data be-
longing to the positive class (the class in question) is limited, but the data be-
longing to the negative classes is huge and spread around the feature space.
Thus a multi-class classifier or a feature selection method based on it would
not easily find decision boundaries or relevant features. We show that it is
indeed possible to weight the features effectively for each class, taking into ac-
count the general distribution of the features. Let us start with some data
terminology. Let the training samples belonging to the positive class be given
by

$x_1, \ldots, x_l \in \mathbb{R}^n$

and the training examples in all the negative classes by

$x_{l+1}, \ldots, x_{l+m} \in \mathbb{R}^n$

with $m \gg l$. Furthermore, we denote the $i$-th feature vector component-wise
by

$x_i = [\, x_i^{(1)} \;\, x_i^{(2)} \, \ldots \, x_i^{(n)} \,]$
All features are first normalised to zero mean and unit variance. Then we
estimate the distribution of each feature independently using the complete
training data. We use a Gaussian mixture model (Russell and Norvig (2003),
pp. 724–727) with three components to estimate the density

$\hat{p}(x^{(k)}) = \sum_{j=1}^{J} \pi_j \, \mathcal{N}(x^{(k)} \,|\, \Theta_j)$

where $\mathcal{N}$ is the normal distribution with parameters
$\Theta_j = (\mu_j, \sigma_j)$, and $\pi_j$ is the weight of the $j$-th
component, with $\sum_{j=1}^{J} \pi_j = 1$. The density is estimated
using the expectation-maximization method.
We define the average likelihood for feature $k$, averaged only over the
images of the positive class, as

$\mathrm{avg}_k = \frac{1}{l} \sum_{i=1}^{l} \hat{p}(x^{(k)} = x_i^{(k)})$

The higher the average likelihood, the more similar this feature is between
the positive and the negative classes, and therefore the less discriminative
it is. Thus, we define the weight of the $k$-th feature as

$w_k = 1 / \mathrm{avg}_k$
The weights are normalized so that $\sum_{k=1}^{n} w_k = 1$. This has the
effect that all models deliver optimum performance (tested through
cross-validation) for about the same model parameters. The features are then
weighted with $w_k$ before being fed to the model computation routine (each
model gets its own set of weights).

We now show that the weighting scheme is effective and that the weights can
in fact be directly interpreted for our features. To this end, we plot in
Fig. 4.4 the calculated weights of the 48 features for four Corel categories:
churches, space1, forests and flora. The training data consisted of
40 images each in the positive class and the complete Corel collection of
60,000 images as the negative class (note that it is immaterial here whether
the positive images are included when determining the Gaussian mixture
distribution, as a very large number of samples from the stochastic process
is available). The sequence of the 48 features is explained in the figure
caption.
For the churches category, the maximum weight went to the edge features
corresponding to the directions 0◦ and 180◦ , i.e., the discriminative vertical
edges present in churches and other buildings (most images in the category
were taken upright). For the space1 category, the most discriminative feature
the system found was the 7th feature, which is the mean of the brightness (V)
component of the image (the images in the category are mostly dark). For the
forests category, texture features get more weight, as does the hue compo-
nent of the colour features. We however did find some categories where the
weights were somewhat counter-intuitive or difficult to interpret manually. An
example is the category flora in part d).

4.2.3 Model Computation


We assume that the presence or absence of a keyword in an image can be tested
independently of the other keywords. Though this is not necessarily true, it
is a reasonable assumption that keeps the complexity of the overall system in
check. Otherwise, the system would need access to the conditional
probabilities of keywords given the presence of other keywords.
We propose a slightly modified one-class Support Vector Machine (SVM)
as our model. One-Class SVMs were introduced by Schölkopf et al. (2001).
One-Class SVMs are binary functions which capture regions in the input space
where the probability density lies (i.e. its support). We train a one-class SVM
for every keyword with the aim to determine subspaces in the feature space
where most of the data for that keyword is present.
One-Class SVMs are the solution to the following optimization problem:
Find a hypersphere in Rn which contains most of the training data and is at
the same time as small as possible. This can be written in primal form as:

$\min_{R \in \mathbb{R},\; \zeta \in \mathbb{R}^l,\; c \in F} \quad R^2 + \frac{1}{\nu l} \sum_i \zeta_i$

subject to

$\|\phi(x_i) - c\|^2 \leq R^2 + \zeta_i, \qquad \zeta_i \geq 0, \quad i = 1, \ldots, l$

$\phi(x_i)$ is the $i$-th vector transformed to another (possibly
higher-dimensional) space using the mapping $\phi$; $c$ is the center and $R$
the radius of the hypersphere in the transformed space. With the kernel trick
(Vapnik, 1995) it is possible to work in the transformed space without ever
calculating the map $\phi(x_i)$ explicitly. This is achieved by defining a
kernel function $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, as the
algorithm needs access only to the dot products between the vectors, and not
to the actual vectors themselves.

[Figure 4.4: four plots of feature weight versus feature dimension (1–48),
one each for the categories churches, space1, forests and flora, with the
colour, texture and shape feature ranges marked.]

Figure 4.4: Feature weights for four sample Corel categories. Each row plots
the feature weights and a few sample images from the category. The order of
features in each graph is as follows: 1) 18 colour features: mean, variance
and skewness of the hue (H) layer, followed by those of the S, V, HS, SV and
VH layers. 2) 12 texture features: entropy of the H, V and D first-level
decomposition, followed by the 2nd and 3rd levels. 3) 18 edge features: bins
starting from 0° in anticlockwise direction, each bin spanning 20°. (A colour
version of this figure is available on Page 129)
The tradeoff between the radius of the hypersphere and the number of outliers
can be controlled by the single parameter $\nu \in (0, 1)$. Using Lagrange
multipliers, the above can be written in the dual form as:

$\min_{\alpha} \quad \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i)$

subject to

$0 \leq \alpha_i \leq \frac{1}{\nu l}, \qquad \sum_i \alpha_i = 1$

The optimal $\alpha$'s can be computed with the help of QP optimization
algorithms. The decision function then has the form

$f(x) = \operatorname{sign}\Big( R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2 \sum_i \alpha_i k(x_i, x) - k(x, x) \Big)$

This function is positive for points inside the hypersphere and negative
outside (note that although we use the term hypersphere, the actual decision
boundary in the original space can be varied by choosing different kernel
functions; we use a Gaussian kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$,
with $\gamma$ and $\nu$ determined empirically through cross-validation).
Since we need a rank for each keyword in order to annotate the image, we
leave out the sign function, so that the results can be sorted on the basis
of their "positiveness". Furthermore, it was found that the results are biased
towards keywords whose training images are very dissimilar to each other,
i.e., the models for which the $R^2$ term is high. Compact models are
penalised, and therefore we use the following function instead for model
evaluation:

$g(x) = \frac{R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2 \sum_i \alpha_i k(x_i, x) - k(x, x)}{R^2}$

which can be interpreted as the normalized distance from the model boundary
in the transformed space.

4.2.4 Experiments and Discussion


We perform our experiments similarly to the ALIP system (Li and Wang, 2003)
to facilitate an objective comparison. The Corel database with 600 categories
is used.

Table 4.1: Sample annotation results from the system. The query images are
taken from the Corel collection but did not belong to the training set. It
can be observed that when the original category was not found, this was
usually because the categories often overlap; the top matches do contain very
similar categories, leading to robust annotation results. (A colour version
of this table is available on Page 125)

[Table body: six query images, each shown with its original Corel category
(africa, wl_ocean, 189000, holland, lizard1, yosemite), its top 8 category
matches, and the final annotation keywords derived from them.]
Each category is manually labelled² with a few descriptive keywords
(typically 3 to 5). Each category consists of 100 colour images of size
384 × 256, out of which we select 40 images randomly as training images.
Normally, for image annotation we would train a model for every annotation
keyword, and would annotate a query image with the keywords whose models
evaluate the query image most favourably. For this experiment, however, we
learn a model for every Corel category instead of for each annotation
keyword. Then, for the best k category matches (we experiment with
k = {5, 8, 11, 14}), the category keywords are combined and the keywords
least likely to have appeared by chance are taken for annotation, as in Li
and Wang (2003). This scheme favours infrequent words like waterfall and
asian over common ones like landscape and people.
To estimate the discriminative performance of the system, we perform a
classification task with the 600 categories. The system attains an accuracy
of 11.3%, compared to ALIP's 11.88%. However, as also pointed out in Li and
Wang (2003), many of the categories overlap (e.g. Africa and Kenya), and it
is not clear how much can be read from this classification performance.
Furthermore, we found that even when the best category match was incorrect
with respect to the category ground truth, it was often meaningful with
regard to the query image. We provide some annotation examples in Table 4.1.
For a more controlled test, we take 10 distinct Corel categories, namely
Africa, beach, buildings, buses, dinosaurs, elephants, flowers, food, horses
and mountains. The confusion matrix for this task is shown in Table 4.2.
Overall, the system attains a classification accuracy of 67.8%, compared to
the 63.6% attained by ALIP.
Computation Time: All experiments were performed on an Intel Pentium
IV 2.80 GHz single-CPU machine running Debian Linux. Calculation of im-
age features takes about 1.2 seconds per image. Model computation with 40
training vectors per model takes only about 20 msec per model. A new query
image needs about 4 seconds to be fully annotated (this includes computation
time for feature extraction, evaluation of 600 models, and decision on unlikely
keywords), as compared to 20 minutes for the HMM-based approach in ALIP.
This makes our system faster by a factor of 300 (or 100 taking the clock speed
of the ALIP system into account). The system scales linearly with the number
of models.

2 We thank James Wang for making the category annotation available for comparison

Table 4.2: Confusion matrix for the 10-category classification task

% Afr. bch. bld. bus. dns. elph. flow. hrs. mnt. food
Africa 66 6 12 0 0 2 2 2 6 4
beach 16 32 28 0 0 8 2 2 6 6
buildings 6 6 76 2 0 0 2 0 6 2
buses 0 0 30 64 0 0 0 6 0 0
dinosaurs 0 0 2 0 94 0 0 0 0 4
elephants 28 0 0 0 0 50 0 8 12 2
flowers 10 0 4 0 0 0 78 0 4 4
horses 6 2 6 0 0 2 2 72 10 0
mountains 4 4 10 0 0 0 6 0 70 6
food 0 2 14 0 2 0 2 0 4 76

4.2.5 Conclusion and Future Outlook


A feature weighting method and a modelling scheme based on the one-class SVM
for automatic image annotation were presented in this section. It is clear
that the power of the overall system depends heavily on the discriminative
power of the features used. Thus, more complex features should in general be
expected to lead to a performance improvement. Local features extracted
around interest points, e.g. Lowe (2004), have recently given excellent
results in the field of object recognition and could be plugged directly into
the system (at least those methods which return a single consolidated feature
vector per image, instead of a bag of vectors).
It was shown that the modelling scheme scales well to a larger number of
keywords, both in terms of annotation quality and speed of execution. The
system ran orders of magnitude faster than an MHMM-based scheme while giving
comparable or better results. The effectiveness of the feature weighting was
also demonstrated, as the small number of visual features used lent
themselves to direct interpretation.
A simplified view of the linguistic component of the annotation system was
taken, as it lies outside the scope of this work. Also, the system currently
does not check for mutually exclusive keywords or other inconsistencies, and
may end up annotating the same image with combinations like sunrise and
sunset, or England and Finland. Given sufficient training data, this can be
handled automatically to an extent by extracting conditional probabilities of
keywords given the presence or absence of other keywords.

4.3 Hierarchy Discovery in Image Databases


Labelled image databases typically use a flat identification scheme: each
image is either given exactly one of the available labels, or is annotated
with a certain number of available keywords. However, since image content is
in general subjective, not all annotators may agree with this selection of
labels or keywords, as labels may share similarities with each other. A user
looking for images with a particular label would usually also be interested
in those with similar content. However, generating a complete taxonomy over
all possible labels can be very time consuming, if not impossible, for a
human annotator. Therefore, we look at automatic means of approximately
achieving the same goal.
First, we define precisely what we wish to achieve. Given an image database D
in which each image I is labelled with one of the l labels $y_1, \ldots, y_l$,
the aim is to construct a hierarchy tree H, where the root node denotes the
whole database, there exists a unique leaf node for each label $y_i$, and all
intermediate nodes exist to group similar nodes from a lower level into a
single entity. The most critical question is how the similarity between two
nodes should be defined, and this will be investigated in this chapter.
Choosing an appropriate tree-construction algorithm is another important
design criterion once the similarity between nodes has been fixed. Here we
choose agglomerative clustering, which results in binary trees (i.e., a
maximum of two child nodes per node). Once such a tree is constructed, this
restriction may be relaxed by grouping nodes with small distances between
them as direct children of a single node.
Figure 4.5 shows two possible binary hierarchy trees. While the first one is
more or less balanced³, the second one is not. This property influences the
usefulness of the tree in practical applications, as a linear tree structure
indicates that no hierarchy was found in the database. Such a linear
structure is of little use for a user starting from a particular class $y_k$
and wishing to explore similar classes.
The image hierarchy tree can be used for a variety of purposes:

• In the case of image retrieval, for giving the user efficient access to the
database. For example, once the user starts with a query image, she can
immediately start exploring images which belong to classes similar to the
class of the query image. This has the potential to greatly enhance the user
experience and shorten the search time.
3 We define a tree to be node balanced if each node is symmetric, i.e. the left and the right
subtrees at each node are of the same size. A completely balanced binary tree would require
that the number of classes l be a power of 2.
[Figure 4.5: two dendrograms over eight classes x1–x8 with similarity values
on the vertical axis; the left tree is balanced, the right one degenerates
into a linear chain.]
Figure 4.5: Examples of balanced and unbalanced binary hierarchy trees


In the case of image classification, the binary hierarchy tree can offer several
benefits:

• Partial classification: The classification starts at the root node,
proceeding downwards until a base class (leaf node) has been reached. Such a
tree-based classification scheme has the benefit that the classification need
not be executed down to the leaf node, and can be stopped at any node if the
confidence level is not high enough. If the grouping during tree creation was
proper, this incomplete classification may still be useful information.

• Interpretation of the classification results: The importance of this step is


sometimes underestimated in the pattern recognition community. Tools
such as the ROC curves, or the confusion matrix between the classes
provide useful information about where the classifier is going wrong.
Following the path of the test samples along the classification tree can
greatly augment this information and help in planning improvement of
the classification results.

• Faster classification: Widely used classifiers such as the SVM are
originally designed as binary classifiers. Classifying l classes is usually
done using multiple binary classifiers, such as in the one-vs-one approach,
in which l(l − 1)/2 binary classifiers are trained and executed and voting is
performed to determine the winning class, or the one-vs-rest approach, in
which l classifiers are trained, each with one of the classes against all
others, and the winning class is the one with the most positive decision
function output. Hierarchy-tree based classification needs in the best case
only log₂(l) classifier evaluations, and is much faster in the average case.

• Better tuning possibilities: The hierarchy-tree based method offers more
opportunities for classifier tuning. For example, the leaf class nodes
grouped at a lower level will in general be very similar and might be
distinguished using only a few particular features. Feature selection could
thus be more effective in such a scenario. Furthermore, the higher nodes
might be better classified using a smaller value of the SVM cost parameter C
than the lower nodes, in order to achieve better generalizability.

4.3.1 Algorithm for hierarchy tree construction


A bottom-up clustering approach known as agglomerative hierarchical
clustering (AHC) is used. This naturally yields the desired tree structure,
as opposed to methods such as the k-means clustering algorithm, in which the
objects are merely assigned to one of k clusters, the number k having been
chosen in advance. The AHC algorithm starts by defining a dissimilarity value
between every pair of objects (in this case, the classes). The pair with the
smallest dissimilarity value is grouped and replaced by a single new node.
The dissimilarity of the new node to all remaining classes is calculated, and
the process of merging the best-matching pair is performed again. This
continues until only a single node remains, which becomes the root node of
the resulting tree structure.

Dissimilarity function

Clearly the most critical part of the proposed algorithm is the dissimilarity
function for a pair of classes $C_p$ and $C_q$. However, this requires
answering a question which is as much philosophical as mathematical:

Which properties make two classes similar to each other?

The question is philosophical partly because it is incomplete: the term
similar by itself does not mean much. Naturally, the (dis)similarity

function should take into account the final goal, which is a good visual
grouping of classes as well as good classification results using hierarchical
classification on the generated hierarchy tree. Simple functions, based on
the mean distance between the feature vectors contained in $C_p$ and $C_q$
(Lei and Govindaraju, 2005), or on the margin obtained by classifying $C_p$
against $C_q$ using an SVM (Chalasani et al., 2007), have been tried before.
We propose two different matching schemes, which we name feature similarity
and hyperplane similarity; both are described below. Both schemes operate
using an intermediate class $C_k$ for determining the dissimilarity. We
define $D^k_{pq}$ as the distance or dissimilarity between $C_p$ and $C_q$
using the intermediate class $C_k$. The overall distance $D_{pq}$ can thus be
defined over all intermediate classes, excluding the two classes in question:

$D_{pq} = \sum_{\substack{k=1 \\ k \neq p,q}}^{C} D^k_{pq}$   (4.4)

Other ways of combining the $D^k_{pq}$ are certainly possible but were not
investigated in this work. We now describe the methods to calculate the
$D^k_{pq}$, and the rationale behind them.

Feature Similarity

This method works by extracting the features essential for classifying $C_p$
against $C_k$, and those for classifying $C_q$ against $C_k$. The commonality
of these two sets of features is an indication of the similarity of the
classes $C_p$ and $C_q$. One way to obtain an estimate of feature importance
is to train a linear classifier (e.g. a linear SVM) and use the absolute
values of the normal vector coefficients as a measure of relative feature
importance.

Let the linear classification boundary between classes $C_p$ and $C_k$ be
given by

$\langle \omega_{pk}, x \rangle + b_{pk} = 0$   (4.5)

and the boundary between $C_q$ and $C_k$ by

$\langle \omega_{qk}, x \rangle + b_{qk} = 0$   (4.6)

We assume, without loss of generality, that the hyperplane normals
$\omega_{pk}$ and $\omega_{qk}$ are already normalised. The distance
$D^k_{pq}$ can then be defined as
$D^k_{pq} = \sum_{f=1}^{F} \left| \, |\omega^f_{pk}| - |\omega^f_{qk}| \, \right|$   (4.7)

In words, this is the L1 distance between the magnitudes of the normalised
hyperplane normal vectors. Although this measure gives good results, it is
not kernelizable, as it uses properties other than the dot product. Thus, in
cases where the hyperplane yielded by the linear classifier is very crude, it
may give poor results. The alternative method proposed in the next section is
easily kernelizable, and may thus be preferable in such cases.

Hyperplane similarity

Here we use the dot product between the SVM hyperplanes as the similarity
measure, i.e.

$D^k_{pq} = 1 - \frac{\langle \omega_{pk}, \omega_{qk} \rangle}{\|\omega_{pk}\| \, \|\omega_{qk}\|}$   (4.8)

The hyperplane is expressed as a sum of transformed training vectors, i.e.
$\omega_{pk} = \sum_i \alpha_i \Phi(x_i)$ and
$\omega_{qk} = \sum_j \beta_j \Phi(y_j)$. $\Phi$ is a (typically non-linear)
transformation to a new feature space, where the data might be better
separable using a linear classifier. The SVM algorithm needs access only to
dot products between feature vectors $\langle \Phi(x_1), \Phi(x_2) \rangle$,
which might be directly computable in the original feature space, thus
sparing the computational cost of the transformation $\Phi$. In some cases it
might not even be possible to compute $\Phi(x)$ explicitly. The dot product
in the transformed space is known as a kernel evaluation and is denoted by:

$K(x_1, x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle$   (4.9)

Thus, we can simplify:

$D^k_{pq} = 1 - \frac{\langle \omega_{pk}, \omega_{qk} \rangle}{\|\omega_{pk}\| \, \|\omega_{qk}\|}$

$= 1 - \frac{\langle \sum_i \alpha_i \Phi(x_i), \sum_j \beta_j \Phi(y_j) \rangle}{\langle \sum_i \alpha_i \Phi(x_i), \sum_j \alpha_j \Phi(x_j) \rangle^{\frac{1}{2}} \, \langle \sum_i \beta_i \Phi(y_i), \sum_j \beta_j \Phi(y_j) \rangle^{\frac{1}{2}}}$

$= 1 - \frac{\sum_i \sum_j \alpha_i \beta_j \langle \Phi(x_i), \Phi(y_j) \rangle}{\left( \sum_i \sum_j \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle \right)^{\frac{1}{2}} \left( \sum_i \sum_j \beta_i \beta_j \langle \Phi(y_i), \Phi(y_j) \rangle \right)^{\frac{1}{2}}}$

$= 1 - \frac{\sum_i \sum_j \alpha_i \beta_j K(x_i, y_j)}{\left( \sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j) \right)^{\frac{1}{2}} \left( \sum_i \sum_j \beta_i \beta_j K(y_i, y_j) \right)^{\frac{1}{2}}}$

which can be computed using kernel evaluations between the support vectors of
the two hyperplanes. It should be noted that the bias terms $b_{pk}$ and
$b_{qk}$ play no role in the calculation of $D^k_{pq}$, and it is not clear
that they should. After all, the role of the bias term is only to shift the
hyperplane along the normal vector, thus making the first class more
preferable (for positive bias) or the second class (for negative bias). At
this stage it is interesting to note that one would expect a bias value of 0
for a neutral classifier. This statement is meant in the sense of test points
x which are very dissimilar to all support vectors, i.e.
$K(x, x_i) \approx 0, \; \forall i = 1 \ldots N_{SV}$. A classifier with a
non-zero bias would assign all such points to exactly one of the classes,
which may not be the desired behaviour, especially for distance-based
kernels.

4.3.2 Experiments and Results


MNIST Collection

The MNIST database consists of digitized images of handwritten digits. The
images are 8-bit grayscale of size 28 × 28 pixels. For our purposes, we chose
to use the raw pixel values as features, so that the results can be
interpreted as those achievable with baseline features. No preprocessing,
such as thinning or orientation normalization, is performed. Feature weights
are calculated using the correlation coefficient method (Section 3.3.3) and
are shown in Figure 4.6. The first row and the first column show the mean
image of the corresponding class. Some feature weights can even be validated
visually: for example, the center pixels are important for differentiating an
'8' from a '0'.
The generated tree (dendrogram) is shown in Figure 4.7.

Figure 4.6: Feature weights between all pairs of classes from the MNIST
database. The gray value indicates the feature weight. The first
column and the first row show the classes, in the form of the mean
value of the images of the particular class.
[Figure 4.7: dendrogram over the ten digit classes, with class distance on
the vertical axis and the leaf order 4, 9, 7, 2, 3, 5, 8, 6, 1, 0.]

Figure 4.7: Binary classification tree generated for the MNIST digit database

One can interpret from the dendrogram that the class of the digit 5 is
similar to that of the digit 8, that the digit 4 is visually similar to the
digit 9, and so on.

Corel Collection

We include results on this database because it is a very common dataset in
the CBIR literature, and because basic visual features (colour, texture and
shape) by themselves perform reasonably well on it. The complete database
contains 600 categories, from which we select 27 for the experiments. This is
done for two reasons. Firstly, a tree containing 600 base nodes is
impractical on a printed page; secondly, we wish to exclude the so-called
"loaded" categories, i.e. categories whose content is so diverse that it is
impossible to learn from the pixel information alone. As an example, consider
a category Amsterdam: an image might be annotated with this keyword only
because it was taken inside an apartment in Amsterdam, which is virtually
indistinguishable from an apartment in another European town.
The following categories were selected: autumn, dogs, plants, texture1,
texture2, aviation, car_race, car_old1, children, fashion1, flower1, forests,
fractals, people1, wildcats, wl_afric, flora, peoplei, warplane, women,
mountain, ruins, churches, roses, harbors, beach, men. Examples from these
categories are shown in Figure 4.8.
The hierarchy tree generated for the Corel collection is shown in Figure 4.9.
As can be seen, visually similar categories were grouped together quite often.
Such a tree can be used to traverse a large database quickly, or to automatically
suggest similar categories to the user (Setia and Burkhardt, 2007).
Figure 4.8: Sample images from the Corel categories used for the experiments.
(A colour version of this figure is available on Page 130)

4.3.3 Computation Time


For the Corel collection, i.e. with 27 labels, each with 100 training vectors
of dimensionality 48, the construction of the taxonomy tree takes about a
minute on a standard modern PC (Pentium IV 2.8 GHz running Ubuntu Linux).
However, since real databases are expected to be much larger, it is
interesting to consider how the algorithm scales. The running time t is
proportional to the dimensionality F of the features, as the L1 and L2
similarity measures are both linear in F. The effect of the number of
training vectors per label, N, cannot be stated exactly, but empirical
studies (Platt et al., 2000) suggest that SVM training is superlinear in the
number of training vectors, perhaps even quadratic. Finally, the algorithm
scales cubically with the number of labels L: the number of label pairs grows
quadratically, and the calculation of the distance for a single label pair
grows linearly with L, since it sums over all intermediate classes. The
running time can thus be expressed in O notation as $O(L^3 \cdot N^2 \cdot F)$.
[Figure 4.9: taxonomy dendrogram over the 27 selected Corel classes, with
class distance on the vertical axis.]

Figure 4.9: Taxonomy for the 27 selected classes from the Corel collection
Chapter 5

Radiograph Annotation and


Retrieval

5.1 Overview
In Chapters 3 and 4 we described various algorithms for learning in image
retrieval. These methods were compared with state-of-the-art algorithms for
each specific task at hand. However, there remains a gap between the
description of the individual components and how they could fit together in a
real-life image retrieval or classification system, and this chapter intends
to bridge that gap.
For this purpose we chose the IRMA radiograph database from the University
Hospital Aachen, Germany¹. It consists of 10,000 fully classified radiographs
taken randomly from medical routine, of which 1,000 radiographs constitute
the test set. The aim is to find out how well current techniques can identify
certain image parameters, such as image modality, body orientation, body
region and the biological system examined, based on the content of the
images. There are various reasons for choosing this database. Firstly, the
number of classes (116) is high enough to make it a suitable candidate for
annotation tasks, while the query-by-example usage paradigm still remains
practical. Secondly, a fairly large number of reference results exist for the
database, partly because it was used in the ImageCLEF benchmark for three
consecutive years (2005 through 2007, albeit in modified forms). Last but not
least, it allows us to support the statement made earlier in the thesis about
the applicability of the algorithms to local features extracted at
interesting points in the images.

¹ We would like to thank Dr. T. M. Lehmann, Dept. of Medical Informatics,
RWTH Aachen, Germany, for making the database available for research
purposes.

5.2 Database and Objectives


The IRMA radiograph database uses the IRMA coding system (Lehmann et al.,
2003), which is a mono-hierarchical multi-axial classification code for medical
images. The coding system consists of four independent axes, with three to
four positions in the range {0, . . . , 9, a, . . . , z}. Here a 0 denotes the tag “un-
specified” to indicate the end of a path along the axis under current consid-
eration. We give a short description of the coding scheme below. For the full
listing, please refer to Lehmann et al. (2003). The four axes are:

Technical code (T) Defines the imaging modality used. This uses a maximum
of four positions, and contains a) The physical source (e.g. 1. x-ray or
2. ultrasound), b) Modality details (e.g. 12. fluoroscopy or 13. angiography),
c) Technique (e.g. 111. digital, 112. analog or 3. sterometry), and d) lastly
the subtechnique (e.g. 1111. tomography or 1114. parallel beam).

Directional code (D) Describes the orientation of the body during image
acquisition. This uses three positions indicating a) the common orientation
(e.g. 1. coronal or 2. sagittal), b) the detailed orientation (e.g.
11. posteroanterior), and c) lastly the functional orientation (e.g.
111. inspiration, 112. expiration or 113. valsalva).

Anatomical code (A) Indicates the body region examined. A total of nine ma-
jor regions are defined (e.g. 1. total body, 2. head/skull, 3. spine etc.). The
major region is followed by two hierarchical subcodes, (e.g. 3. spine,
31. cervical spine, 311. dens).

Biological code (B) Defines the biological system examined. This supplements
the anatomical code which is not precise enough. The first position iden-
tifies one of ten possible organ systems (e.g. 1. cerebrospinal system), and
the remaining positions help in identifying exactly the organ in question
(e.g. 11. central nervous system, 111. mesencephalon).

The final code is a character string of not more than thirteen (13)
characters: {TTTT − DDD − AAA − BBB}. Table 5.1 shows four sample images from
the database along with their codes in descriptive notation. There are three
different usage scenarios in which image processing and machine learning
techniques can be applied to this database.
Table 5.1: Sample images from the IRMA database, together with their class
number (1–116) and code description for the four independent axes

10015.png, Class #112
(T): x-ray, plain radiography, analog, high beam energy
(D): sagittal, lateral, right-left, inspiration
(A): chest

10020.png, Class #58
(T): x-ray, plain radiography, analog, overview image
(D): sagittal, lateral, right-left
(A): cranium, neuro cranium
(B): musculosceletal system

2889.png, Class #113
(T): x-ray, plain radiography, analog, low beam energy
(D): axial, craniocaudal
(A): breast (mamma), right breast
(B): reproductive system, female system, breast

10745.png, Class #85
(T): x-ray, plain radiography, analog, overview image
(D): sagittal, lateromedial
(A): upper extremity (arm), radio carpal joint, left carpal joint
(B): musculosceletal system

Query-By-Example A doctor or an assistant supplies the radiograph image


acquired from a patient to the system. The aim could be to find the case
history of another patient who was suffering from a similar or identical
disease. Note that detecting abnormalities such as bone fractures or in-
serted metallic rods would perhaps require extracting features which are
very specific to the task.

Meta-Data Assignment Radiograph images are typically tagged (e.g. using the
IRMA code described above) by the hospital staff at the time of acquisition.
This helps them to later find images matching certain criteria. The
assignment step can, however, be time consuming and error prone. According to
a study carried out at the University Hospital in Aachen and described in
Güld et al. (2002), over 15% of human-made assignments of even a single tag
during normal clinical routine were erroneous. This might be due to
overworked staff operating under a tight schedule. However, once a tag
assignment error creeps in, the image is practically as good as lost as far
as meta-data search is concerned. We show that the meta-data assignment step
can be fully automated, and in fact error rates much lower than 15% can
already be realized with the methods proposed in this thesis.

Flat Classification In this case, each possible code combination (i.e.
{TTTT − DDD − AAA − BBB}) is considered a base class. The rationale is that
although the number of code combinations is quite high
($\prod_{i=1}^{4} N_i$, where $N_i$ is the number of leaf nodes in the
hierarchy tree of the $i$-th axis), in practice only a fraction of them are
applicable or likely, and these can be assigned using standard multi-class
classification techniques. One positive side effect is that impossible code
combinations, which can occur during automatic meta-data assignment, are
ruled out in the flat classification scheme. However, a misclassification in
the flat scheme may lead to multiple tags being wrongly assigned, which might
have been averted by assigning tags on an axis-by-axis basis.

All of the above usage scenarios require robust and discriminative features.
We have developed for this purpose the so-called cluster cooccurrence ma-
trices, which use local features extracted around so-called interest points in
an image, and combine them using their spatial information to yield a single
global feature vector per image (Setia et al., 2008). We describe below the algo-
rithm to generate the cluster cooccurrence matrices.

5.3 Feature Extraction Algorithm

5.3.1 Related Work


The research area of general image classification has in recent times shifted to-
wards methods using local information extracted in various ways from areas
around the so-called interest points. Their main advantage is increased ro-
bustness towards occlusion by partial matching. Current object classification
systems differ in the way the local information of the patches is combined:
some use only the feature vectors extracted at the points, others also take their
(normalised) position information into account. Examples for effective object
classification systems neglecting the spatial layout of the interest points are
by Dance et al. (2004) or Deselaers et al. (2006a). They use histograms of patch
cluster memberships as features; for classification, an SVM and
discriminative training, respectively, are used. An example of combining
local appearance with positional information is the work of Weber et al.
(2000), further developed by Fergus et al. (2003). They introduced a
so-called "constellation model", i.e.
specific local image features in a probabilistic spatial arrangement, to decide
whether a certain object is present in a scene or not. The positional informa-
tion is also used by Agarwal and Roth (2002). They classify sub-windows in
images using binary vectors coding the occurrences and spatial relations of lo-
cal features. The reader can refer to Teynor et al. (2006) for a survey of common
patch based methods. In order to incorporate the spatial position of interest
points, we first proposed the use of co-occurrence relationships derived from
cluster memberships of relational features calculated over different distances
and orientations in Setia et al. (2006). The multi-dimensional co-occurrence
matrix captures the statistical properties of the joint distribution of cluster in-
dices, which describes the appearance and structure of the image, and various
classification and annotation strategies are applied on these features.
The major steps involved in the cluster cooccurrence matrix generation
algorithm can be summarized as follows. For a given image, do the following:

Preprocessing Convert image to grayscale if needed, normalize grayvalues


between 0 and 1.

Interest Points Apply an interest point detector (in our case, the Loupias
Salient-Point Detector). Sort the obtained saliency map, and take the Ns
points with the highest saliency values for further computation.

Local Relational Feature Generation Evaluate a number of relational functions
$\Re(x, y, r_1, r_2, \phi, n)$. Each function yields, for each interest
point, a sub-feature vector of length n. These are concatenated to obtain a
local feature vector for each interest point.

Clustering Take a random subset of local feature vectors from all training
images. Cluster these feature vectors into $N_c$ clusters according to some
optimization criterion. Save the cluster centers for later use with test
images.

Cluster Co-occurrence Matrix The nearest cluster is calculated for all local
feature vectors of the image. The complete local feature vectors are dis-
carded, and the only retained information is the index of the nearest clus-
ter. Consider all possible salient-point pairs. A cluster co-occurrence ma-
trix of size Nc × Nc is generated sector-wise (i.e. over radial and angle
ranges), yielding a 4-D feature matrix. This 4-D matrix is flattened and
used as the final feature vector for the image for use, for example, with
an independent classifier.

5.3.2 Interest Point Detection


The method used here is the salient point extraction algorithm introduced by
Loupias and Sebe (1999). We decided to use this salient point detector as it
was found to have higher information content than the well-known Harris
detector (Sebe and Lew, 2003). Our informal tests confirmed these findings
for the images taken from the IRMA database used here.
The assumption is that image points where high variations occur represent
important information in the image (areas of high relevance) and are thus
extracted. One can study the variations present in an image using wavelet
analysis, which allows for a multiresolution representation of a signal
(image). The algorithm represents the image in the wavelet domain and starts
from the coarsest resolution, always going back one step to a finer
resolution and choosing, from the set of available points, the one with the
highest wavelet coefficient at that level. This is applied until one ends up
picking one coefficient at level 1 (level 0 represents the original image).
This coefficient represents a number of points in the original image. Among
these points, the point with the maximum gradient is chosen and is given a
value representing its saliency. This saliency value is equal to the sum of
the absolute values of the wavelet coefficients along the whole track:

$s = \sum_{i=1}^{l} |c_i|$   (5.1)

Figure 5.1: 1000 Interest points with the highest saliency value for each of the
two images shown on the left. Although some interest points are
found in the non-discriminative parts of the images (for example,
the man-made object embedded in the chest or the text in the top-
right corner), local methods are still very robust to partial match-
ing.

The above procedure is repeated for every wavelet coefficient that exceeds a
certain threshold τ, which saves computation time by not investigating small
wavelet coefficients. We end up with a matrix (which we shall call the
"saliency map" here) representing the saliencies of the image pixels. The
saliency map is then sorted, and a fixed number of salient points ($N_s$) per
image is taken in this work. An alternative strategy is to fix a threshold
and select all points having a saliency above it. Pixels very near the image
boundary (up to 6 pixels in our case) are not considered candidates for an
interest point, as the local features cannot be accurately calculated there
without introducing artifacts. The detected interest points for two sample
images from the database are shown in Figure 5.1.

5.3.3 Relational Features


Relational features are motivated from the use of relational kernels in texture
classification, introduced by Schael (2004) based on the Local Binary Pattern
(LBP) texture features (Ojala et al., 2002) which map the relation between a
center pixel and the pixels in its neighborhood into a binary pattern.
Local Binary Pattern features are invariant against monotonic grayscale
transformations.

Figure 5.2: Calculation of a set of relational features. A feature is formed
by applying the relational operator to the gray-value difference of a pixel
pair lying at a specific distance and phase relative to the salient point in
question (i.e. the center of the circles).

They eliminate the effect of illumination by comparing the value of a center
pixel with the values of the pixels in its neighborhood, considering the sign
of the difference instead of the value itself. If the value of a neighboring
pixel is greater than or equal to the value of the center pixel, the
difference is mapped to 1, otherwise to 0. Applying this to all pixels in a
circular neighborhood of the center pixel, we end up with a binary pattern
which can be transformed into a unique number as follows (Ojala et al.,
2002):

$\mathrm{LBP} = \sum_{i=0}^{n-1} s(v_i - v_c) \, 2^i$   (5.2)

where

$s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$   (5.3)

and where $v_i$ and $v_c$ are the grayvalues at a neighboring pixel and at
the center pixel, respectively, and $n$ is the number of pixels in the
circular neighborhood of the center pixel. Since the signed difference
$(v_i - v_c)$ is considered, the effect of grayscale shifts is completely
eliminated. Invariance against scaling of the grayscale is achieved by the
$s$ operator, as only the sign of the difference is mapped to 0 or 1.
It is obvious that the main disadvantage of these features is the discontinu-
ity of the LBP operator (the s function), which makes them sensitive to noise;
a small disturbance in the image may cause a big deviation of the feature. To
overcome this problem, Schael (2004) has introduced an operator which ex-
tends the step function in Equation 5.3 to a ramp function giving values in the
range of [0, 1]:
$\mathrm{rel}(x) = \begin{cases} 1 & \text{if } x < -\epsilon \\ \frac{\epsilon - x}{2\epsilon} & \text{if } -\epsilon \leq x \leq \epsilon \\ 0 & \text{if } \epsilon < x \end{cases}$   (5.4)

where $\epsilon$ is a threshold parameter. This way, the features are much
more robust against noise, but we sacrifice exact invariance to monotonic
grayscale transformations (although the features remain robust to them). If
$\epsilon$ is set to zero, the rel function reduces to the simple LBP
operator $s$.
Based on the relational operator defined in Equation 5.4, we define a
relational function $\Re(x, y, r_1, r_2, \phi, n) \mapsto \mathbb{R}^n$,
calculated with a salient point $(x, y)$ of the image $I$ as center. To
simplify notation, let the individual output values of the function be given
by

$R_k = [\Re(x, y, r_1, r_2, \phi, n)]_k, \quad k = 1, \ldots, n$

Then

$R_k = \mathrm{rel}(I(x_2, y_2) - I(x_1, y_1))$

where

$(x_1, y_1) = (x + r_1 \cos(k \cdot 2\pi/n),\; y + r_1 \sin(k \cdot 2\pi/n))$

and

$(x_2, y_2) = (x + r_2 \cos(k \cdot 2\pi/n + \phi),\; y + r_2 \sin(k \cdot 2\pi/n + \phi))$
The process is illustrated in Figure 5.2. Bilinear interpolation is used for
points not lying exactly on the image grid. Based on different combinations
of $r_1$, $r_2$ and $\phi$, local information at different scales and
orientations can be captured. In this work we use three sets of parameters,
(0, 5, 0), (3, 6, π/2) and (2, 3, π), each with n = 12. The three subvectors
are concatenated to yield a local feature vector of length 36 at each salient
point. It is of interest to note that in applications where rotation
invariance is desired, a subvector can simply be summed up to yield a
rotation-invariant descriptor.
The ensemble of local feature vectors extracted from all training images is
clustered as explained in the next section. To remain computationally
feasible, the process is carried out on 18,000 randomly chosen local feature
vectors.

5.3.4 Clustering
Clustering can be understood as a grouping of similar objects. For vectorial
data, this process has also been extensively studied in the branch of vector
quantization.
[Figure 5.3: the local feature vectors of an image are assigned to the
nearest centroids from a database of cluster centroids, yielding a cluster
index image, from which the sector-wise co-occurrence matrix is built.]

Figure 5.3: Schematic diagram depicting how the final features are obtained.
A cluster index image is formed using the local feature vectors around the
salient points. For each sector of the ring, a cluster co-occurrence matrix
is formed by considering all pairs of salient points whose orientation with
respect to the center of the semi-circle matches that of the sector. Taking
the red-coloured point with cluster index 4 as an example, the two other
interest points would be considered in the co-occurrence matrix.

The main benefit of clustering is compaction (and thus implicitly
abstraction) of the data. In this respect, clustering can be more effective
than a uniformly spaced histogram whose bins have been derived more or less
independently of the training data.
There are many possible approaches to clustering; in this work we use one of
the most common, the k-means clustering algorithm. K-means is an iterative
algorithm which minimizes the sum, over all clusters, of the within-cluster
sums of point-to-centroid distances (see Section 4.1.1). The k in k-means
stands for the number of desired clusters and is an input to the algorithm.
Other decisions to be made include an appropriate distance measure, which we
simply take to be Euclidean, and the choice of initial clusters, which we
take to be randomly chosen local feature vectors.

The number of clusters is denoted in this work by $N_c$ and must be selected
carefully, as the size of the final feature vector increases quadratically
with $N_c$.

5.3.5 Building the Final Feature Vector


An open research problem with local features has been how to incorporate into
the image similarity definition both the similarity between local feature
vectors and the spatial configuration of the salient points where the local
feature vectors were extracted. In Deselaers et al. (2006b), for example, it
was found that even simply appending the raw (x, y) coordinates of the
salient point to its local feature vector improves classification performance
for tasks in which translation invariance is not required. In our case, the
local features after clustering are combined into a final feature vector
using accumulators. The following accumulators were tried and compared for
their effectiveness.

All-Invariant Accumulator

The simplest histogram that can be built from a cluster index image is a 1-D
accumulator counting the number of times each cluster occurs in a given
image. All spatial information regarding the interest points is lost in the
process. The feature can be written as:

$F(c) = \#\{\, s \mid I(s) = c \,\}, \quad c = 1, \ldots, N_c$

In our experiments, the cross-validation results (for the 116-category
database from ImageCLEF 2006) were never better than 60%. It is clear that
critical spatial information is lost in the process of histogram
construction.

Rotation Invariant Accumulator

We first incorporate spatial information by counting pairs of salient points
lying within a certain distance range and possessing particular cluster
indices, i.e. a 3-D accumulator defined by:

$F(c_1, c_2, d) = \#\{\, (s_1, s_2) \mid (I(s_1) = c_1) \wedge (I(s_2) = c_2) \wedge (D_d < \|s_1 - s_2\|_2 < D_{d+1}) \,\}$

where the extra dimension d runs from 1 to $N_d$ (the number of distance
bins). The cross-validation results improve to about 68%, but it should be
noted that this accumulator is rotation invariant (it depends only on the
distance between salient points), while the images are upright. Incorporating
unnecessary invariance leads to a loss of discriminative performance.
Especially in this task of radiograph classification, mirroring the positions
of interest points leads to a completely different class: for example, a left
hand becomes a right hand, and so on. We therefore incorporate orientation
information in the next section.

Orientation Variant Accumulator

These features can also be interpreted as a kind of cluster co-occurrence
matrix (CCM), i.e. as features measuring the joint probability of two kinds
of local regions occurring at a specific distance and orientation to each
other. The process of generating a cluster cooccurrence matrix is illustrated
in Figure 5.3. The algorithm for generating a CCM from a cluster index image
can be described as follows:

Let s be the location (x, y) of a salient point, and let I(s) denote the
cluster index assigned to the local relational feature vector extracted at s.
We define a vector of bin boundaries for the distance quantization in the
co-occurrence matrix, $D = (D_1, D_2, \ldots, D_{N_d+1})^T$, and another for
the angle quantization, $A = (A_1, A_2, \ldots, A_{N_\angle+1})^T$, where
$N_d$ and $N_\angle$ are the numbers of quantization bins used for the radial
and angular direction, respectively. The set of all salient point pairs
having cluster indices $c_1$ and $c_2$, respectively, and located at a
specific spatial orientation to each other is then given by:

$S(c_1, c_2, d, a) = \{\, (s_1, s_2) \mid I(s_1) = c_1 \,\wedge\, I(s_2) = c_2 \,\wedge\, D_d < \|s_1 - s_2\|_2 < D_{d+1} \,\wedge\, A_a < \angle(s_1, s_2) < A_{a+1} \,\}$
Figure 5.4: Top 8 nearest-neighbour results for different query radiographs.
The top left image in each box is the query image; results follow from left
to right, then top to bottom. Each image caption contains the image name and
its label in brackets, with the distance according to the L1 norm displayed
underneath.

The indices run c1 = 1, ..., Nc, c2 = 1, ..., Nc, d = 1, ..., Nd, and
a = 1, ..., N∠. Furthermore, ∠(s1, s2) is the angle in the range [0, 2π) made
by the vector s2 − s1 with the x-axis. The co-occurrence matrix M consists of
the cardinalities of the above sets:

M(c1, c2, d, a) = | S(c1, c2, d, a) |

The resulting CCM can be interpreted either as a single 4-D array, or as a
series of 2-D CCM matrices, one for each ring sector. To give an idea of the
values that can be used in practice, the distance bin boundaries selected in
our experiments are, for example,

D = ( 0, 15, 30, ..., 150 )T

measured in pixels (for comparison, the larger dimension of the images in the
database used was always 512), and the angle bin boundaries are, for example,

A = ( 0, π/4, π/2, 3π/4, π )T

measured in radians. Note that only the [0, π) angle range needs to be covered,
as each salient point pair (s1, s2) would otherwise be counted twice in the
matrix (with cluster bins swapped, and in an angle bin which lies π radians
from the other). A fuzzy accumulator could also be used to generate the matrix
M, but was not investigated in this work.
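
A corresponding sketch of the CCM computation (Python; illustrative names,
0-based indices, and the pair angle folded into [0, π) so that each pair is
counted once, as argued above):

    import numpy as np

    def cluster_cooccurrence_matrix(points, cluster_indices, n_clusters,
                                    dist_bins, angle_bins):
        # 4-D CCM M(c1, c2, d, a): counts salient-point pairs by their
        # cluster indices, distance bin and orientation bin.
        n_d, n_a = len(dist_bins) - 1, len(angle_bins) - 1
        M = np.zeros((n_clusters, n_clusters, n_d, n_a), dtype=int)
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                diff = points[j] - points[i]
                dist = np.linalg.norm(diff)
                angle = np.arctan2(diff[1], diff[0]) % np.pi  # fold to [0, pi)
                d = np.searchsorted(dist_bins, dist) - 1
                a = np.searchsorted(angle_bins, angle) - 1
                if 0 <= d < n_d and 0 <= a < n_a:
                    M[cluster_indices[i], cluster_indices[j], d, a] += 1
        return M

With the example boundaries above, dist_bins would be (0, 15, 30, ..., 150) and
angle_bins (0, π/4, π/2, 3π/4, π).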

5.4 Results
As mentioned in the introduction, we will apply these features to the IRMA
database in three different usage scenarios, namely retrieval using the query-
by-example (QBE) paradigm, classification into one of the given 116 base
classes, and the assignment of tags, namely the technical, directional,
anatomical and biological codes.

5.4.1 Query-by-Example Results


Figure 5.4 shows the nearest neighbour results for four different query radio-
graphs. As can be seen, the proposed features are mostly well suited for this task.
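
The underlying ranking is a plain nearest-neighbour search under the L1 norm; a
minimal sketch (Python; the row-per-image feature matrix is an assumption for
illustration):

    import numpy as np

    def rank_by_l1(query_feature, database_features, k=8):
        # Query-by-example: rank database images by the L1 distance
        # between their accumulator features and the query's.
        dists = np.abs(database_features - query_feature).sum(axis=1)
        order = np.argsort(dists)[:k]  # indices of the top-k matches
        return order, dists[order]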
Figure 5.5: Hierarchy tree for the 116-category IRMA 2006 database

Method                                               Group          ER (%)

This work
Cluster Co-occurrence Matrices w/ Rel. Features      Uni Freiburg
  - 20 × 20 × 10 × 4 CCM, 1000 SP*                                    8.1
  - 20 × 20 × 10 × 4 CCM, 600 SP                                      8.9
  - 15 × 15 × 10 × 4 CCM, 1000 SP                                     9.1
  (* SP = salient points)

Previous Best
Sparse Histograms w/ Position                        RWTH Aachen
(Deselaers et al., 2006b)
  - using maximum entropy classification                              9.3
  - using support vector machine                                     10.0

ImageCLEF 2005 Benchmark, Top-7 results
Image Distortion Model (Keysers et al., 2004)        RWTH Aachen     12.6
Image Distortion Model & Texture                     IRMA-Group      13.3
Patch-Based Classifier (Maximum Entropy)             RWTH Aachen     13.9
Patch-Based Classifier (Boosting)                    Uni Liège       14.1
Image Distortion Model & Texture                     IRMA-Group      14.6
Decision Trees (Marée et al., 2005)                  Uni Liège       14.7
GNU Image Finding Tool (GIFT)                        Uni Geneva      20.6

Baseline Results
32 × 32 images as feature, 1-NN/L2 classification    -               36.8

Table 5.2: Results for the IRMA 05 database. The comparison results are taken
from the ImageCLEF 2005 benchmark, and from a recent improvement we are
aware of.

5.4.2 Classification Results


Choice of SVM kernel

It is of utmost importance to choose an appropriate SVM kernel function, for it
encompasses the prior knowledge about pattern similarity. In line with obser-
vations made in Chapelle et al. (1999), kernels based on the L1 similarity
measure seem to give the best results with the features presented in Section
5.3, since they are accumulator based. It was shown in Barla et al. (2002) that
the histogram intersection kernel function defined by

               n
    k(x, y) = ∑ min(xi, yi)                                        (5.5)
              i=1

is positive definite. The proof is based on the idea that if each bin of the
histogram is transformed into a binary vector consisting of Nb 1s, where Nb is
the value of the histogram bin, followed by Nh 0s, where Nh is the sum of all
the bins in the histogram, then the expression min(xi, yi) can be conveniently
expressed as a scalar product of the corresponding binary vectors, thus satisfy-
ing Mercer's condition. Another important issue is that of normalization. The
co-occurrence features described here are not normalized, for the reason that
the total size of the histogram can potentially be an important indicator of
the class. In order to test this hypothesis, we normalize in different ways and
compare the performance. The normalization we propose can be performed
directly on the kernel matrix in case it has already been calculated for the
test and training data. In this way, repetitive computations can be avoided.
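
A compact sketch of this kernel for precomputed feature matrices (Python with
NumPy; the vectorised form is one possible implementation, not necessarily the
one used in our experiments):

    import numpy as np

    def histogram_intersection_kernel(X, Y):
        # K[i, j] = sum over bins of min(X[i], Y[j]) (Equation 5.5),
        # evaluated for all pairs of rows of X and Y at once.
        return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)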
Given a kernel matrix consisting of entries Kij = {K}ij, the following normal-
ization methods are applied:

a)
    K'ij = Kij / (Kii + Kjj)                                       (5.6)

which is equivalent to:

    K'ij = ∑l min(xl, yl) / (∑l xl + ∑l yl)                        (5.7)

and analogously:

b)
    K'ij = Kij / min(Kii, Kjj)                                     (5.8)

c)
    K'ij = Kij / √(Kii + Kjj)                                      (5.9)

Table 5.3: Effect of kernel normalization on system performance

    Normalization method      Cross-validation Results
    No normalization          86.0 %
    Equation 5.6              86.7 %
    Equation 5.8              42.9 %
    Equation 5.9              86.4 %

Table 5.4: Part of the results of the medical annotation task in the ImageCLEF
2007 benchmark (lower score is better)

    Method                                   Score
    Best Results (Tommasi et al., 2008)      26.84
    Flat Classification                      31.43
    Axis-wise Flat Classification            45.48
    Binary Classification Tree               47.94

The cross-validation results for the different normalization methods are shown
in Table 5.3. The normalization according to Equation 5.6 gave the best results
in the cross-validation tests.
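
The normalizations themselves can be sketched as follows (Python; shown for a
square kernel matrix, the simplest case; for a separate train/test block the
diagonal self-similarities of both sets would be needed instead):

    import numpy as np

    def normalize_kernel(K, method="sum"):
        # Normalize a precomputed kernel matrix instead of recomputing
        # the raw kernel values (cf. Equations 5.6, 5.8 and 5.9).
        d = np.diag(K)                    # self-similarities Kii
        if method == "sum":               # Equation 5.6
            return K / (d[:, None] + d[None, :])
        if method == "min":               # Equation 5.8
            return K / np.minimum(d[:, None], d[None, :])
        if method == "sqrt":              # Equation 5.9
            return K / np.sqrt(d[:, None] + d[None, :])
        return K                          # no normalization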

Results from ImageCLEF Benchmarks

Our group participated in the 2006 and 2007 editions of the ImageCLEF bench-
marks², while for the 2005 edition we performed our experiments at a later
date. The 2005 edition used a different version of the IRMA database than the
later years. As can be seen in Table 5.2, results better than those previously
known in the literature could be achieved.
For the 2007 edition, an error counting scheme as described in Deselaers
et al. (2008) was used. Table 5.4 shows a sampling of the overall results. The
binary classification tree used for some of the results is shown in Figure
5.5. Overall, it can be said that very competitive results were achieved in
this popular international benchmark.

2 http://imageclef.org
Chapter 6

Conclusion

In this thesis, we have attempted a holistic approach to the task of image re-
trieval. Acknowledging that no single methodology can serve all possible
usage scenarios, we have demonstrated the following possibilities.

Query by Example The classic image retrieval scenario. The user provides a
query image and the system provides the best matches as the results.
Due to the extremely limited data that the user provides (just the query
image), many simplifications have to be made. Results can be improved
if these simplifications take into account the prior knowledge available
about the database.

Relevance Feedback The system is interactive. The user provides feedback
on the quality of the individual results, and the system updates its un-
derstanding of what the user is looking for. This is a classical machine
learning situation; however, it is bound by computational and data con-
straints.

Image Annotation The image database is partially labelled (in the form of
“training images”). The system learns the correlation between the image
features and the keywords attached to the image. This is used to label the
other images present in the database. This can be performed offline, and
thus computational complexity is not as critical an issue as in relevance
feedback. Furthermore, the amount of training data is typically orders of
magnitude higher than that available during relevance feedback. Image
annotation also does not suffer from the zero-page problem, i.e. the prob-
lem of finding a suitable starting image, since the search can be started
by specifying a keyword.

Data Mining Apart from image annotation, we argued that various data min-
ing operations can be performed offline on an image database. The in-
formation extracted during the data mining process can be used intelli-
gently during retrieval, in order to reduce the search time for the user,
or to make the search process more intuitive and enjoyable. As an ex-
ample, we constructed a hierarchy tree for the various labels with which
a database was annotated. This can be used to auto-suggest related key-
words which the user might also find useful. It is interesting to note that
related keywords can also be derived without image content analysis
tools, but these are of a completely different nature, and even comple-
ment those suggested by content analysis. For example, a system might
suggest the keyword London as being related to England, simply because
this pair occurred together frequently in training data. The content anal-
ysis system, on the other hand, considers two keywords similar only if
their corresponding images are visually similar.

To conclude, we would like to share some thoughts on the future of image


retrieval in general.

• At an end-user level, where the aim is to manage private image collec-
tions, the query-by-example paradigm is of very limited use. Applica-
tions which allow users to supply keywords manually, and which group
images automatically based on the meta-data provided by modern digital
cameras, should grow in use.

• For specialized image collections, CBIR and relevance feedback have the
potential to grow, as increased computing power and advances in feature
extraction technology should lead to more satisfactory results. This can
include specialized medical image collections, which could aid hospital
staff in finding cases similar to the one at hand. The online consumer
market is another area in which CBIR could gain popularity: users can
search for products (clothing, jewellery, etc.) based on image search. An
important prerequisite for this, however, is that the images are collected
with this goal in mind. If all product images are collected in an incon-
sistent manner, then CBIR-based product search will not be as useful.

• In order to make searching in large databases practicable, indexing should
be performed. In contrast to the indexing performed by textual search
engines, this indexing is often only approximate, though the approxima-
tion error has an upper bound. For object search systems, it would be
interesting to see local features compared directly, instead of first being
combined into a single global vector.

• The user interface is a very important part of any interactive system, and
content based image retrieval is no exception. In this work we mentioned
several useful methods, such as multidimensional scaling and hierarchy
construction, which could be used to build an intuitive graphical user
interface. This is an area, however, where data collected from physiolog-
ical experiments on human beings should be used. Questions such as
how the image results should be optimally presented, or how many differ-
ent kinds of feedback a user can provide on an image, can only be
answered by experimentation on real subjects.
List of Figures

1.1 Information Retrieval Scenarios . . . . . . . . . . . . . . . . . . . 2

2.1 An image retrieval use case at its most abstract level . . . . . . . 6


2.2 Block diagram of a content-based image retrieval system . . . . 7
2.3 An example of the semantic gap problem. The two images pos-
sess very similar colour and texture characteristics, but differ
vastly as far as the semantics are concerned. (A colour version of
this figure is available on Page 126) . . . . . . . . . . . . . . . . . . 11
2.4 Sample images from a hypothetical visual class cars . . . . . . . 11
2.5 Sample images from a hypothetical semantic class cars . . . . . 12
2.6 Interpretation of Precision and Recall in set notation . . . . . . . 15
2.7 Ranking induced by similarity functions. The left image shows
the spherical ranking due to Euclidean distance, and the right
image shows that due to the Manhattan distance. The point (3,
3) is used as the query point. (A colour version of this figure is
available on Page 126) . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1 2-D image distribution obtained for the shown images by us-
ing simple visual features together with the MDS algorithm. (A
colour version of this figure is available on Page 127) . . . . . . . . . 24
3.2 Application of Rocchio's formula for different parameter val-
ues. As is apparent, in its default form this can lead to a magnifi-
cation of the feature vectors. In the figure, the crosses represent
the positive examples, while the circles represent the negative ones.
q0 is the initial query point, and q the query point after applying
Rocchio's algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 26


3.3 Feature selection using hill climbing and forward search. Each
node shows the selection mask and the goodness criteria for that
particular subset. In the shown example, the hill climbing algo-
rithm does not reach the optimum node with a goodness of 0.53
(shown in green). . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 In this toy example, feature x1 may be considered more impor-
tant than x2 because of the normal vector . . . . . . . . . . . . . 34
3.5 One-Class SVM applied on a 2-D toy data set. (a) Using a lin-
ear kernel. (b) and (c) Using a gaussian kernel with different ν
and γ values. As can be seen, the one class SVMs are quite flex-
ible based on their parameters. (A colour version of this figure is
available on Page 126) . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 A spiral which is a 1-manifold in 2-D space. The geodesic
distance between two points is the length of the arc segment
connecting them, whereas the Euclidean distance is the straight-
line distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.7 First round results for an image from class number 2. The first
row contains the query image. The next two rows show the top
results using the shape features, and the next two rows using
the colour features. The query images for the next round of rel-
evance feedback were automatically selected from this pool . . 44
3.8 Second round results using a two class SVM for relevance feed-
back. The +++ in the title indicates that the image was a positive
example, while −−− indicates that it was a negative one. . . . . 45
3.9 Second round results after feature selection using correlation co-
efficients. The best 20 features were used to generate the results.
(A colour version of this figure is available on Page 128) . . . . . . . 46
3.10 First round results for an image from class number 4. The first
row contains the query image. The next two rows show the top
results using the shape features, and the next two rows using
the colour features. The query images for the next round of rel-
evance feedback were automatically selected from this pool . . 47
3.11 Second round results using a two class SVM for relevance feed-
back. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.12 Second round results after feature selection using correlation co-
efficients. The best 20 features were used to generate the results 49
3.13 Feature weights obtained using (a) and (c): a linear SVM and (b)
and (d): correlation coefficients . . . . . . . . . . . . . . . . . . . 50

3.14 Screenshot of our CBIR system after labeling the first set of query
images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.15 Results after the initial query round . . . . . . . . . . . . . . . . 52
3.16 Results after the second query round . . . . . . . . . . . . . . . . 53
3.17 Precision Recall plots with two-class SVM . . . . . . . . . . . . . 54
3.18 Comparison of different retrieval methods . . . . . . . . . . . . 54

4.1 Sample points in two-dimensional space and their correspond-


ing dendrogram using single linkage . . . . . . . . . . . . . . . . 60
4.2 Image search using keywords extracted using automated analy-
sis of the webpage containing the image. The screenshot shows
the results for the query keyword ocean using the commercial
search engine GoogleTM . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 A tags example from the photo sharing website FlickrTM . a) An
example image, b) Labels supplied by users, and c) Tags auto-
matically added in the JPEG image’s EXIF section. . . . . . . . . 64
4.4 Feature weights for four sample Corel categories. Each row
plots the feature weights and a few sample images from the cat-
egory. The order of features in each graph is as follows 1) 18
colour features: mean, variance and skewness of the hue (H)
layer, followed by that of S, V, HS, SV and VH layers. 2) 12 tex-
ture features: Entropy of H, V and D first level decomposition,
followed by 2nd and 3rd levels 3) 18 edge features: Bins starting
from 0◦ degrees in anti-clockwise direction with each bin having
a span of 20◦ . (A colour version of this figure is available on Page 129) 68
4.5 Examples of balanced and unbalanced binary hierarchy trees . . 74
4.6 Feature weights between all pairs of classes from the MNIST
database. The gray value indicates the feature weight. The first
column and the first row show the classes, in the form of the
mean value of the images of the particular class. . . . . . . . . . 79
4.7 Binary classification tree generated for the MNIST digit database 80
4.8 Sample images from the Corel categories used for the experi-
ments. (A colour version of this figure is available on Page 130) . . . 82
4.9 Taxonomy for the 27 selected classes from the Corel collection . 84

5.1 1000 Interest points with the highest saliency value for each of
the two images shown on the left. Although some interest points
are found in the non-discriminative parts of the images (for ex-
ample, the man-made object embedded in the chest or the text
in the top-right corner), local methods are still very robust to
partial matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Calculation of a set of relational features. A feature is formed
by applying the relational function to the gray-value difference
of the pixels lying on specific distance and phase to the salient
point in question (i.e. center of the circles) . . . . . . . . . . . . . 92
5.3 Schematic diagram depicting how the final features are reached.
A cluster index image is formed using the local feature vectors
around the salient points. For each sector of the ring, a clus-
ter co-occurrence matrix is formed by considering all pairs of
salient points whose orientation is the same as that of the sec-
tor with respect to the center of the semi-circle. Taking the red-
coloured point with cluster index 4 as an example, the two other
interest points would be considered in the co-occurrence matrix. 94
5.4 Top 8 nearest neighbour results for different query radiographs.
The top left image in each box is the query image, results fol-
low from left to right, then top to bottom. The image caption
contains the image name and label in (brackets). Finally the dis-
tance according to the L1 norm is displayed underneath. . . . . 97
5.5 Hierarchy tree for the 116-category IRMA 2006 database . . . . 99

A.1 Figure 2.3 (Page 11): An example of the semantic gap problem.
The two images possess very similar colour and texture charac-
teristics, but differ vastly as far as the semantics are concerned. 126
A.2 Figure 2.7 (Page 16): Ranking induced by similarity functions.
The left image shows the spherical ranking due to Euclidean
distance, and the right image shows that due to the Manhattan
distance. The point (3, 3) is used as the query point. . . . . . . . 126
A.3 Figure 3.5 (Page 38): One-Class SVM applied on a 2-D toy data
set. (a) Using a linear kernel. (b) and (c) Using a gaussian kernel
with different ν and γ values. As can be seen, the one class
SVMs are quite flexible based on their parameters. . . . . . . . 126
A.4 Figure 3.1 (Page 24): 2-D image distribution obtained for the
shown images by using simple visual features together with the
MDS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

A.5 Figure 3.9 (Page 46): Second round results after feature selection
using correlation coefficients. The best 20 features were used to
generate the results . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.6 Figure 4.4 (Page 68): Feature weights for four sample Corel cat-
egories. Each row plots the feature weights and a few sample
images from the category. The order of features in each graph is
as follows 1) 18 colour features: mean, variance and skewness
of the hue (H) layer, followed by that of S, V, HS, SV and VH
layers. 2) 12 texture features: Entropy of H, V and D first level
decomposition, followed by 2nd and 3rd levels 3) 18 edge fea-
tures: Bins starting from 0◦ degrees in anti-clockwise direction
with each bin having a span of 20◦ . . . . . . . . . . . . . . . . . . 129
A.7 Figure 4.8 (Page 82): Sample images from the Corel categories
used for the experiments . . . . . . . . . . . . . . . . . . . . . . . 130
List of Tables

3.1 Shape ground truth used for the experiments, with a total of
429 images in 8 different categories. Sample images are shown
alongside the class name (Colour version of this table is available on
Page 124) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 SVM kernels for Image Retrieval . . . . . . . . . . . . . . . . . . 44

4.1 Sample annotation results from the system. The query images
are taken from the Corel Collection but did not belong to the
training set. It can be observed that while the original category
was sometimes not found, it was due to the fact that the cate-
gories often overlapped, as the top matches do indeed contain
very similar categories, leading to robust annotation results. (A
colour version of this table is available on Page 125) . . . . . . . . . . 70
4.2 Confusion matrix for the 10-category classification task . . . . . 72

5.1 Sample images from the IRMA database, together with their
class number (1-116), and code description for the four inde-
pendent axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Results for the IRMA 05 Database. The comparison results are
taken from the ImageCLEF 2005 Benchmark, and from a recent
improvement we are aware of. . . . . . . . . . . . . . . . . . . . 100
5.3 Effect of kernel normalization on system performance . . . . . . 102
5.4 Part of the results of the medical annotation task in the Image-
CLEF 2007 benchmark (lower score is better) . . . . . . . . . . . 102

A.1 Table 3.1 (Page 40): Shape ground truth used for the experi-
ments, with a total of 429 images in 8 different categories. Sam-
ple images are shown alongside the class name . . . . . . . . . . 124


A.2 Table 4.1 (Page 70): Sample annotation results from the system.
The query images are taken from the Corel Collection but did
not belong to the training set. It can be observed that while the
original category was sometimes not found, it was due to the
fact that the categories often overlapped, as the top matches do
indeed contain very similar categories, leading to robust anno-
tation results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Bibliography

S. Agarwal and D. Roth. Learning a sparse representation for object detec-


tion. In ECCV ’02: Proceedings of the 7th European Conference on Computer
Vision-Part IV, pages 113–130, London, UK, 2002. Springer-Verlag. ISBN:
3-540-43748-7. 89

E. Amaldi and V. Kann. On the approximability of minimizing nonzero


variables or unsatisfied relations in linear systems. Theoretical Com-
puter Science, 209(1–2):237–260, 1998. URL http://citeseer.ist.psu.edu/
amaldi97approximability.html. 31

A. Barla, E. Franceschi, F. Odone, and A. Verri. Image kernels. In SVM


’02: Proceedings of the First International Workshop on Pattern Recognition with
Support Vector Machines, pages 83–96, London, UK, 2002. Springer-Verlag.
ISBN:3-540-44016-X. 101

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan.


Matching words and pictures. J. Mach. Learn. Res., 3:1107–1135, 2003. ISSN:
1533-7928. 62

H. Burkhardt. Transformationen zur lageinvarianten Merkmalgewinnung.


Technical report, Universität Karlsruhe, 1979. Ersch. als Fortschrittbericht
(Reihe 10, Nr. 7) der VDI-Zeitschriften, VDI-Verlag. 7, 9

H. Burkhardt and S. Siggelkow. Invariant features in pattern recognition –


fundamentals and applications. In C. Kotropoulos and I. Pitas, editors, Non-
linear Model-Based Image/Video Processing and Analysis, pages 269–307. John
Wiley & Sons, 2001. ISBN:978-0471377351. 8

J. Cardoso and A. Sheth. Semantic Web Services, Processes and Applications.


Springer, 2006. ISBN:0-38730239-5. 61

T. K. Chalasani, A. M. Namboodiri, and C. Jawahar. Support vector ma-
chine based hierarchical classifiers for large class problems. In Proc. of the
Sixth International Conference on Advances in Pattern Recognition (ICAPR 2007),
Kolkata, India, 2007. 76

O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for


histogram-based image classification. Neural Networks, IEEE Transactions on,
10(5):1055–1064, 1999. doi:10.1109/72.788646. 55, 101

I. J. Cox, M. L. Miller, T. P. Minka, T. Papathomas, and P. N. Yianilos. The


bayesian image retrieval system, pichunter: Theory, implementation and
psychophysical experiments. IEEE Transactions on Image Processing, 9(1):20–
37, January 2000. 22

T. Cox and M. Cox. Multidimensional scaling. Chapman and Hall, London, 1994.
ISBN:978-1584880943. 23

D. Cremers, M. Rousson, and R. Deriche. A review of statistical approaches to


level set segmentation: integrating color, texture, motion and shape. Interna-
tional Journal of Computer Vision, 2006. URL http://www-cvpr.iai.uni-bonn.
de/. 5

M. Crucianu, M. Ferecatu, and N. Boujemaa. Relevance feedback for image
retrieval: a short review. In State of the Art in Audiovisual Content-Based
Retrieval, Information Universal Access and Interaction including Data Models
and Languages (DELOS2 Report), 2004. 19

C. Cusano, G. Ciocca, and R. Schettini. Image annotation using SVM. Internet


Imaging V, 5304(1):330–338, 2003. doi:10.1117/12.526746. URL http://link.
aip.org/link/?PSI/5304/330/1. 62

C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka. Visual categorization


with bags of keypoints. In ECCV International Workshop on Statistical Learning
in Computer Vision, 2004. URL http://www.xrce.xerox.com/Publications/
Attachments/2004-010/2004_010.pdf. 89

T. Deselaers, A. Hegerath, D. Keysers, and H. Ney. Sparse patch-histograms


for object classification in cluttered images. In Pattern Recognition, 28th
DAGM Symposium, Berlin, Germany, September 12-14, 2006, Proceedings, vol-
ume 4174 of Lecture Notes in Computer Science, pages 202–211, 2006a. ISBN:
3-540-44412-2. 89

T. Deselaers, A. Hegerath, D. Keysers, and H. Ney. Sparse patch-histograms for
object classification in cluttered images. In DAGM 2006, Pattern Recognition,
26th DAGM Symposium, Lecture Notes in Computer Science, page accepted
for publication, Berlin, Germany, September 2006b. ISBN:978-3-540-44412-1.
95, 100

T. Deselaers, H. Müller, and T. Deserno. Automatic Medical Image Annotation


in ImageCLEF 2007: Overview, Results, and Discussion. Pattern Recogni-
tion Letters, Special Issue on Medical Image Annotation in ImageCLEF 2007 (to
appear), 2008. 102

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience


Publication, 2000. ISBN:978-0471056690. 23

A. X. Falcão, P. A. Miranda, and A. Rocha. A linear-time approach for im-
age segmentation using graph-cut measures. In Advanced Concepts for In-
telligent Vision Systems (LNCS), volume 4179, pages 138–149, Antwerp, Bel-
gium, 2006. Springer Berlin / Heidelberg. URL http://www.ic.unicamp.br/
~rocha/pub/index.html. 5

D. Feng, W. Siu, and H. J. Zhang, editors. Multimedia Information Retrieval and


Management. Springer Verlag, 2003. ISBN:978-3-540-00244-4. 14, 17

R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsuper-


vised scale-invariant learning. In Computer Vision and Pattern Recognition,
2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages
II–264–II–271 vol.2, 2003. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?
arnumber=1211479. 89

M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani,
J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and
video content: The QBIC system. IEEE Computer, 28(9):23–32, 1995. 19

M. O. Güld, M. Kohnen, D. Keysers, H. Schubert, B. Wein, J. Bredno, and


T. Lehmann. Quality of DICOM header information for image categoriza-
tion. In SPIE Intl. Symposium on Medical Imaging, Proc. SPIE, pages 280–287,
San Diego, CA, Feb. 2002. 88

A. Guttman. R-trees: a dynamic index structure for spatial searching. In


SIGMOD ’84: Proceedings of the 1984 ACM SIGMOD international confer-
ence on Management of data, pages 47–57, New York, NY, USA, 1984. ACM.
ISBN:0-89791-128-8. doi:10.1145/602259.602266. 6

B. Haasdonk. Feature space interpretation of SVMs with indefinite kernels.


IEEE Trans. Pattern Anal. and Mach. Intell., 27(4):482–492, April 2005. doi:
10.1109/TPAMI.2005.78. 10

A. Halawani, A. Teynor, L. Setia, G. Brunner, and H. Burkhardt. Fundamentals


and applications of image retrieval: An overview. Datenbank-Spektrum, 18:
14–23, 2006. 6

A. Hurwitz. Über die Erzeugung der Invarianten durch Integration.


Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, pages 71–90,
1897. URL http://dz-srv1.sub.uni-goettingen.de/sub/digbib/loader?did=
D52776. 9

T. Ishioka. Extended k-means with an efficient estimation of the number of


clusters. In IDEAL ’00: Proceedings of the Second International Conference on
Intelligent Data Engineering and Automated Learning, Data Mining, Financial
Engineering, and Intelligent Agents, pages 17–22, London, UK, 2000. Springer-
Verlag. ISBN:3-540-41450-9. 58

A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc.,


Upper Saddle River, NJ, USA, 1988. ISBN:0-13-022278-X. 57

D. Keysers, C. Gollan, and H. Ney. Classification of medical images using non-


linear distortion models. In Bildverarbeitung für die Medizin, pages 366–370,
2004. ISBN:978-3-540-21059-7. 100

D. Keysers, T. Deselaers, C. Gollan, and H. Ney. Deformation models for image


recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29
(8):1422–1435, 2007. ISSN:0162-8828. doi:10.1109/TPAMI.2007.1153. 10

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artif. Intell.,
97(1-2):273–324, 1997. ISSN:0004-3702. doi:10.1016/S0004-3702(97)00043-X.
65

L. M. Krauss and G. D. Starkman. Universal limits on computation, 2004. URL


http://arxiv.org/abs/astro-ph/0404510. 1

J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a non-


parametric hypothesis. Psychometrika, 29:1–27, March 1964. doi:10.1007/
BF02289565. 23

T. Lehmann, H. Schubert, D. Keysers, M. Kohnen, and B. Wein. The IRMA


code for unique classification of medical images. In Proc. Medical Imaging
2003, Proc. SPIE, pages 109–117, San Diego, CA, May 2003. 86

H. Lei and V. Govindaraju. Half-against-half multi-class support vector ma-


chines. In Multiple Classifier Systems, volume 3541 of Lecture Notes in Com-
puter Science, pages 156–164. Springer, 2005. ISBN:3-540-26306-3. 76

J. Li and J. Z. Wang. Automatic linguistic indexing of pictures by a statistical


modeling approach. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1075–1088,
2003. ISSN:0162-8828. doi:10.1109/TPAMI.2003.1227984. 62, 69, 71
E. Loupias and N. Sebe. Wavelet-based salient points for image retrieval.
Technical Report Technical Report RR 99.11, Laboratoire Reconnaissance
de Formes et Vision, 1999. URL http://citeseer.comp.nus.edu.sg/article/
loupias99waveletbased.html. 90
D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int.
J. Comput. Vision, 60(2):91–110, 2004. ISSN:0920-5691. doi:10.1023/B:VISI.
0000029664.99615.94. 72
R. Marée, P. Geurts, J. Piater, and L. Wehenkel. Biomedical image classifica-
tion with random subwindows and decision trees. In Proc. ICCV workshop
on Computer Vision for Biomedical Image Applications (CVIBA 2005), volume
3765 of LNCS, pages 220–229. Springer-Verlag, oct 2005. URL http://www.
montefiore.ulg.ac.be/services/stochastic/pubs/2005/MGPW05b. 100
J. Mercer. Functions of positive and negative type, and their connection with
the theory of integral equations. Philosophical Transactions of the Royal Society
of London. Series A, Containing Papers of a Math. or Phys. Character (1896-1934),
209(-1):415–446, January 1909. doi:10.1098/rsta.1909.0016. 35
G. E. Moore. Cramming more components onto integrated circuits. Electronics,
38(8), April 1965. URL ftp://download.intel.com/museum/Moores_Law/
Articles-Press_Releases/Gordon_Moore_1965_Article.pdf. 1
T. Ojala, M. Pietikäinen, and T. Mäenpää. Multiresolution gray-scale and
rotation invariant texture classification with local binary patterns. IEEE
Trans. Pattern Anal. Mach. Intell., 24(7):971–987, 2002. ISSN:0162-8828. doi:
10.1109/TPAMI.2002.1017623. 91, 92
J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin dags for multiclass
classification. Advances in Neural Information Processing Systems, 2000. 83
S. Ray and R. Turi. Determination of number of clusters in k-means clustering
and application in colour image segmentation. In Proceedings of the 4th Inter-
national Conference on Advances in Pattern Recognition and Digital Techniques,
pages 137–143, 1999. 58
S. Rho, E. Hwang, and M. Kim. Music information retrieval using a ga-based
relevance feedback. In 2007 International Conference on Multimedia and Ubiq-
uitous Engineering (MUE 2007), 26-28 April 2007, Seoul, Korea, pages 739–744,
2007. ISBN:978-0-7695-2777-2. doi:10.1109/MUE.2007.161. 2

J. T. Robinson. The k-d-b-tree: a search structure for large multidimensional


dynamic indexes. In SIGMOD ’81: Proceedings of the 1981 ACM SIGMOD
international conference on Management of data, pages 10–18, New York, NY,
USA, 1981. ACM. ISBN:0-89791-040-0. doi:10.1145/582318.582321. 6

J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor,
The SMART Retrieval System: Experiments in Automatic Document Processing,
chapter 14, pages 313–323. Prentice Hall, 1971. 23

Y. Rubner and C. Tomasi. Perceptual Metrics for Image Database Navigation.


Kluwer Academic Publishers, Norwell, MA, USA, 2001. ISBN:0792372190.
23

S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-


Hall, Englewood Cliffs, NJ, 2nd edition edition, 2003. 66

M. Schael. Methoden zur Konstruktion invarianter Merkmale für die Texturanalyse.


PhD thesis, Albert-Ludwigs-Universität Freiburg, October 2004. 91, 92

B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson.


Estimating the support of a high-dimensional distribution. Neural Comput.,
13(7):1443–1471, 2001. ISSN:0899-7667. doi:10.1162/089976601750264965.
36, 67

H. Schulz-Mirbach. Invariant features for gray scale images. In G. Sagerer,


S. Posch, and F. Kummert, editors, 17. DAGM - Symposium “Mustererken-
nung”, pages 1–14, Bielefeld, 1995. Reihe Informatik aktuell, Springer. ISBN:
3-540-60293-3. DAGM-Preis. 43

N. Sebe and M. S. Lew. Comparing salient point detectors. Pattern Recogn.


Lett., 24(1-3):89–96, 2003. ISSN:0167-8655. doi:http://dx.doi.org/10.1016/
S0167-8655(02)00192-7. 90

L. Setia and H. Burkhardt. Learning taxonomies in large image databases. In


ACM SIGIR Workshop on Multimedia Information Retrieval, Amsterdam, Hol-
land, 2007. 81

L. Setia and H. Burkhardt. Feature selection for automatic image annotation. In


Proceedings of the 28th Pattern Recognition Symposium of the German Association
for Pattern Recognition (DAGM 2006), Berlin, Germany. LNCS, Springer, 2006.
62

L. Setia, A. Teynor, A. Halawani, and H. Burkhardt. Image classification using


cluster-cooccurrence matrices of local relational features. In Proceedings of
the 8th ACM International Workshop on Multimedia Information Retrieval, Santa
Barbara, CA, USA, 2006. 89

L. Setia, A. Teynor, A. Halawani, and H. Burkhardt. Grayscale medical image


annotation using local relational features. Pattern Recogn. Lett., 29(15):2039–
2045, 2008. ISSN:0167-8655. doi:10.1016/j.patrec.2008.05.018. 88

N. Shadbolt, T. Berners-Lee, and W. Hall. The semantic web revisited. IEEE In-
telligent Systems, 21(3):96–101, 2006. ISSN:1541-1672. doi:10.1109/MIS.2006.
62. 61

S. Siggelkow. Feature Histograms for Content-Based Image Retrieval. PhD thesis,


Albert-Ludwigs-Universität Freiburg, 2002. 43

A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-


based image retrieval at the end of the early years. IEEE Trans. Pattern Anal.
Mach. Intell., 22(12):1349–1380, 2000. ISSN:0162-8828. doi:10.1109/34.895972.
10

M. J. Swain and D. H. Ballard. Color indexing. Int. J. Comput. Vision, 7(1):11–32,


1991. ISSN:0920-5691. doi:10.1007/BF00130487. 16

H. Tamura, S. Mori, and T. Yamawaki. Texture features corresponding to visual


perception. IEEE Trans. Systems Man Cybernet, 1978. 63

J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric frame-


work for nonlinear dimensionality reduction. Science, 290(5500):2319–2323,
December 2000. ISSN:0036-8075. doi:10.1126/science.290.5500.2319. 39

A. Teynor, E. Rahtu, L. Setia, and H. Burkhardt. Properties of patch based ap-


proaches for the recognition of visual object classes. In Proceedings of the 28th
Pattern Recognition Symposium of the German Association for Pattern Recognition
(DAGM 2006), Berlin, Germany. LNCS, Springer, 2006. 89

T. Tommasi, F. Orabona, and B. Caputo. Discriminative cue integration for


medical image annotation. Pattern Recogn. Lett., 29(15):1996–2002, 2008.
ISSN:0167-8655. doi:10.1016/j.patrec.2008.03.009. 102

S. Tong and E. Chang. Support vector machine active learning for image
retrieval. In MULTIMEDIA ’01: Proceedings of the ninth ACM international
conference on Multimedia, pages 107–118, New York, NY, USA, 2001. ACM.
ISBN:1-58113-394-4. doi:10.1145/500141.500159. 35

A. Vailaya, A. Jain, and H. J. Zhang. On image classification: City vs. land-


scape. In CBAIVL ’98: Proceedings of the IEEE Workshop on Content - Based
Access of Image and Video Libraries, page 3, Washington, DC, USA, 1998. IEEE
Computer Society. ISBN:0-8186-8544-1. 63

V. N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York,


Inc., New York, NY, USA, 1995. ISBN:0387945598. URL http://portal.acm.
org/citation.cfm?id=211359. 21, 67

J. Vogel. Semantic Scene Modeling and Retrieval. Number 33 in Selected Read-


ings in Vision and Graphics. Hartung-Gorre Verlag Konstanz, 2004. ISBN:
9783896499677. 62

W. Wan Nural Jawahir. Content-based trademark retrieval using Zernike mo-
ments and color-spatial techniques. Master's thesis, Kolej Universiti Sains
dan Teknologi Malaysia, 2006. 41

R. Wang, M. Naphade, and T. Huang. Video retrieval and relevance feedback


in the context of a post-integration model. In IEEE Fourth Workshop on Mul-
timedia Signal Processing, pages 33–38, 2001. 2

M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for


recognition. In ECCV ’00: Proceedings of the 6th European Conference on
Computer Vision-Part I, pages 18–32, London, UK, 2000. Springer-Verlag.
ISBN:3-540-67685-6. 89

J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero norm


with linear models and kernel methods. J. Mach. Learn. Res., 3:1439–1461,
2003. ISSN:1533-7928. 31

X. S. Zhou and T. S. Huang. Relevance feedback in image retrieval: A


comprehensive review. Multimedia Systems, 8(6):536–544, April 2003. doi:
10.1007/s00530-002-0070-3. 19
Appendix A

Colour Images

Selected colour images are reproduced in the following pages in order to en-
able grayscale printing of the remaining work.


Table A.1: Table 3.1 (Page 40): Shape ground truth used for the experiments,
with a total of 429 images in 8 different categories. Sample images
are shown alongside the class name

    Class      # of members
    circle     263
    diamond    11
    shape-1    67
    shape-2    16
    shape-3    16
    shape-4    24
    shape-5    21
    triangle   11

Table A.2: Table 4.1 (Page 70): Sample annotation results from the system.
The query images are taken from the Corel Collection but did not
belong to the training set. It can be observed that while the original
category was sometimes not found, it was due to the fact that the
categories often overlapped, as the top matches do indeed contain
very similar categories, leading to robust annotation results.

[For each of six query images (original categories: africa, wl_ocean, 189000,
holland, lizard1, yosemite), the table lists the top 8 matching categories and
the final annotation; see Table 4.1 on Page 70.]

Figure A.1: Figure 2.3 (Page 11): An example of the semantic gap problem.
The two images possess very similar colour and texture charac-
teristics, but differ vastly as far as the semantics are concerned.


Figure A.2: Figure 2.7 (Page 16): Ranking induced by similarity functions.
The left image shows the spherical ranking due to Euclidean dis-
tance, and the right image shows that due to the Manhattan dis-
tance. The point (3, 3) is used as the query point.



Figure A.3: Figure 3.5 (Page 38): One-Class SVM applied on a 2-D toy data
set. (a) Using a linear kernel. (b) and (c) Using a gaussian kernel
with different ν and γ values. As can be seen, the one class SVMs
are quite flexible based on their parameters.

Figure A.4: Figure 3.1 (Page 24): 2-D image distribution obtained for the
shown images by using simple visual features together with the
MDS algorithm

Figure A.5: Figure 3.9 (Page 46): Second round results after feature selection
using correlation coefficients. The best 20 features were used to
generate the results

[Four panels plotting feature weight against feature dimension for the
categories churches, space, forests and flora, with the colour, texture and
shape feature groups marked.]
Figure A.6: Figure 4.4 (Page 68): Feature weights for four sample Corel cat-
egories. Each row plots the feature weights and a few sample
images from the category. The order of features in each graph is
as follows 1) 18 colour features: mean, variance and skewness of
the hue (H) layer, followed by that of S, V, HS, SV and VH layers.
2) 12 texture features: Entropy of H, V and D first level decom-
position, followed by 2nd and 3rd levels 3) 18 edge features: Bins
starting from 0◦ degrees in anti-clockwise direction with each bin
having a span of 20◦ .

Figure A.7: Figure 4.8 (Page 82): Sample images from the Corel categories
used for the experiments

[Categories shown: autumn, dogs, wildcats, wl_afric, plants, flora, texture1,
texture2, peoplei, aviation, warplane, car_race, women, car_old1, mountain,
children, ruins, churches, fashion1, roses, flower1, harbors, forests, beach,
fractals, men, people1]