
ADDIS ABABA UNIVERSITY

SCHOOL OF GRADUATE STUDIES

DESIGN AND IMPLEMENTATION OF AMHARIC

SEARCH ENGINE

By: Tessema Mindaye Mengistu

A Thesis Submitted to the School of Graduate Studies of the Addis Ababa

University in partial fulfillment for the Degree of Master of Science in

Computer Science

July, 2007
ACKNOWLEDGMENTS

Many people have contributed in one way or another to this thesis work. First of all, I would
like to express my gratitude to my advisor Dr. Solomon Atnafu for his guidance and
support. I would also like to thank Ato Abnet Tale for giving me reading materials that
helped a lot during the early stage of this research, Ato Alemayhu Gurmu, whose linguistic
expertise and advice kept me on the right track, and Mr. Daniel Yacob for making the document
collections available for crawling and for his fast and to-the-point responses. Dr. Kodama
Shigeaki and his team are acknowledged for providing me with the source code of their
language identification module. My family (my mom, my dad, Beshi, my sisters) has always
supported me in every way. I also want to express my appreciation to my cousins
(Addis (special thanks), Abiy, Mesay) and my aunt (Aster, as usual) for their material and
financial support. A special thanks goes to Tseday G/Selassie, a long-term friend who
always makes me feel that I am lucky to have her as a friend. I would also like to thank my
friend Solomon Nega for his support and help. Last but not least, a particular thanks goes to
my friends Hanna Shiferaw and Kefyalew Shiferaw.

Table of Contents

CHAPTER ONE

INTRODUCTION..............................................................................................1

1.1. BACKGROUND .........................................................................................1


1.2. MOTIVATION ...........................................................................................4
1.3. STATEMENT OF THE PROBLEM .................................................................6
1.4. OBJECTIVE ...............................................................................................7
1.5. METHODS ................................................................................................7
1.5.1. Literature Review................................................................................7
1.5.2. Development Environment and Programming Tools .........................8
1.5.3. Experimentation..................................................................................8
1.6. SCOPE OF THE STUDY ..............................................................................9
1.7. ORGANIZATION OF THE THESIS ................................................................9

CHAPTER TWO

LITERATURE REVIEW................................................................................10

2.1. INFORMATION RETRIEVAL .....................................................................10


2.2. INFORMATION RETRIEVAL ON THE WEB ................................................15
2.3. SEARCH ENGINES ..................................................................................18
2.3.1 The Crawler Component...................................................................20
2.3.2 The Indexer Component....................................................................26
2.3.3 The Query Engine Component..........................................................27
2.4. FOCUSED CRAWLER ..............................................................................29
2.4.1. Language Focused Crawling............................................................31
2.5. LUCENE .................................................................................................34

2.6. THE AMHARIC LANGUAGE .................................................................... 36
2.6.1 The Amharic Writing System ............................................................ 36
2.6.2 Problems of the Amharic Writing System......................................... 37
2.6.3 Representation of Ethiopic in a Computer ....................................... 38

CHAPTER THREE

RELATED WORKS........................................................................................ 39

3.1. SEARCH ENGINES FOR NON-ENGLISH LANGUAGES................................ 39


3.1.1 Search Engines for Indian Languages ............................................. 39
3.1.1.1 Tamil Search Engine.................................................................. 39
3.1.1.2 WebKhoj: Indian language IR from Multiple Character
Encodings ................................................................................................ 40
3.1.2 A Search Engine for Portuguese ...................................................... 41
3.1.3 A Search Engine for Indonesian Language...................................... 43
3.2. FOCUSED CRAWLER FOR THAI LANGUAGE............................................ 45
3.3. INFORMATION RETRIEVAL WORK ON AMHARIC LANGUAGE ................. 46
3.4. LESSONS LEARNED ................................................................................ 48

CHAPTER FOUR

DESIGN OF THE AMHARIC SEARCH ENGINE .................................... 50

4.1. DESIGN REQUIREMENTS ........................................................................ 50


4.2. ARCHITECTURE OF THE AMHARIC SEARCH ENGINE .............................. 51
4.3. THE CRAWLER ...................................................................................... 54
4.3.1 Crawler ............................................................................................. 55
4.3.2 Categorizer ....................................................................................... 56
4.3.3 The crawling process........................................................................ 56
4.4. THE INDEXER COMPONENT ................................................................... 60

4.4.1 The Normalization Component.........................................................60
4.5. QUERY ENGINE COMPONENT.................................................................63
4.5.1 Ranking .............................................................................................64

CHAPTER FIVE

THE “ ” SEARCH ENGINE ...................................................................65

5.1. DEVELOPMENT ENVIRONMENT ..............................................................65


5.2. SEARCH ENGINE ........................................................................66
5.3. THE AMHARIC CRAWLER ......................................................................67
5.3.1 Crawler .............................................................................................68
5.3.2 Categorizer........................................................................................68
5.3.3 The crawling Process........................................................................70
5.4. THE INDEXER.........................................................................................73
5.4.1 Pre-Processing..................................................................................73
5.4.1.1. Tokenization...............................................................................74
5.4.1.2. Short words ................................................................................74
5.4.1.3. Stopwords Removal ...................................................................75
5.4.1.4. Stemming ...................................................................................76
5.5. QUERY ENGINE COMPONENT.................................................................82
5.6. EXPERIMENTAL RESULTS ......................................................................84
5.7. LIMITATION ...........................................................................................91

CHAPTER SIX.................................................................................................93

CONCLUSIONS AND RECOMMENDATIONS.........................................93

6.1. CONCLUSIONS .......................................................................................93


6.2. RECOMMENDATIONS .............................................................................96

REFERENCES .................................................................................................98

LIST OF TABLES

Table 1-1 Results from search engines for selected Amharic queries (June,
2007) ............................................................................................................. 5
Table 5-1 Comparison of Stemmers................................................................... 78
Table 5-2 Identification Result of Unicode encoded pages............................... 85
Table 5-3 Identification Result of non- Unicode encoded pages ..................... 85
Table 5-4 Precision-Recall Evaluation Result .................................................. 86

LIST OF FIGURES

Figure 2-1 A Sample Inverted Index [Cas04]....................................................11


Figure 2-2 Schematic Representations of Search Engine Components ............19
Figure 2-3 Logica1 view of Lucene index..........................................................35
Figure 4-1 Architecture of Amharic Search Engine ..........................................53
Figure 4-2 Algorithm of the Crawler.................................................................58
Figure 4-3 Algorithm for Crawling Gap ...........................................................59
Figure 4-4 Algorithm of the Normalizer............................................................61
Figure 4-5 Algorithm of Character Replacer ....................................................62
Figure 4-6 Algorithm of Word Expander...........................................................63
Figure 5-1 Flow of the Crawler........................................................................72
Figure 5-2 Lucene Amharic Indexing Structure ................................................79
Figure 5-3 Snapshot of the index using Luke.....................................................82
Figure 5-4 Screen Shot of the Search Interface of Search Engine...........84
Figure 5-5 Screen Shot of a Result for the Query“ ...................................88
Figure 5-6 Screen Shot of the Result for the Query ” ............................89
Figure 5-7 Screen Shot of the Result for the Query “ ...............................90
Figure 5-8 Screen Shot of the Result for the Query ..................................90
Figure 5-9 Screen Shot of the Result for the PhraseQuery “ ” .....91

LIST OF APPENDICES

APPENDIX I : Amharic Alphabets..................................................................................... 106


APPENDIX II : Amharic Numerals .................................................................................... 107
APPENDIX III : Amharic Punctuation Marks [Haz55]...................................................... 107
APPENDIX IV : Sample log of our Crawler ....................................................................... 108
APPENDIX V : List of Classes........................................................................................... 110
APPENDIX VI : The Result of a Routine on the Variants of the Stopword ......... 113
APPENDIX VII :List of Short Words.................................................................................. 115

ABSTRACT
The World Wide Web, a major information publishing and retrieving mechanism, is one of
many applications that run over the Internet. The Web is a huge repository of information in
the form of text, image, audio, and video. Search engines, such as Google and Yahoo, are the
first port of call for the discovery of resources from this huge repository. These general
purpose search engines are designed and optimized for the English language. They fall short
when they are used for locating resources of interest in other languages on the Web. This is
mainly due to the specific features of those languages that are not considered by the search
engines. Amharic, a member of the Semitic language family, is the official language of the
federal government of Ethiopia. Currently, there is a significant number of Amharic
documents on the Web. These documents need a search tool that considers the typical
characteristics of the language.

In this study an attempt is made to design and implement a search engine for Amharic
language web documents. The research came up with a complete language-specific
(Amharic) search engine that has a crawler, an indexer, and a query engine component.
These components are optimized for the language they are designed for, Amharic.

The crawler (Language Specific Crawler) crawls the Web, collects Amharic web
documents with Unicode encoding, and stores them in a repository. The next component, the
Indexer, processes the documents and stores them in a structure that is efficient and
appropriate for searching. The Query Engine component provides an interface through which the user can
enter his/her information need in the Amharic language using the Ethiopic script. It then parses the
query, searches the index, selects the documents that are relevant to the query,
and returns the relevant documents according to their rank.

Two runs of crawling have been done to test the crawler. Moreover, to measure the
effectiveness of the system, retrieval experiments have been performed for some queries and
a Precision-Recall test is done. The system is tested with selected queries that reflect some
specific features of the language. The developed system considers the typical features of the
language and meets its design requirements.

Key words: Information Retrieval, Search Engine, Amharic language, World Wide Web.

Chapter One

Introduction

1.1. Background

Information, defined as a collection of facts from which conclusions may be drawn, is an
important ingredient of the modern world we live in. Information is the result of
processing, manipulating and organizing data in a way that adds to the knowledge of the
person or the organization receiving it [wiki06]. Nowadays, information is becoming an
important resource, like money and energy, which can be spent both in business and at
home. The prices of most industrial products that we consume daily include the so-called
information cost.

Information is stored in some physical media – print, film, magnetic, optical – and
disseminated through some channels – telephone, radio and TV, books and magazines, and
the Internet. Plenty of information exists in different formats in different corners of the world.
New information is created and added to the existing information resource every single day.
This growth is exponential in nature. Some studies [LV00, LV03] tried to answer the
question “How much information is created each year?”. One such study [LV00]
estimated that in 1999 the world produced between 1 and 2 exabytes (10^18 or 2^60 bytes) of unique
information. This new information was estimated to be 5 exabytes in 2002 [LV03]. This
shows that the amount of new information produced per year almost doubled in
three years’ time. According to the studies, among the electronic channels, the Internet is the
newest and fastest growing medium of all time.

The advent of the Internet, which is a network of networks, changed the way information is
published, distributed, and accessed. The Internet has its roots in research institutions, but it
has become the ubiquitous way of accessing information and a means of communication
worldwide. People are shifting from paper-based publishing to electronic publishing for their
information needs. The number of Internet users is increasing worldwide. As of 2002, there
was a worldwide Internet population of 580 million [SC03]; as of June 2007, this figure
increased to more than 1.1 billion [IWS07]. The importance of the Internet grows rapidly in all
fields of human life, including research and education, marketing and trade, health, entertainment
and hobbies, etc. Generally, it is used by people from all walks of life for different purposes.

The World Wide Web (the Web) is one of many applications that run over the Internet. From
its origin in 1991 as an organization-wide collaborative environment at CERN for sharing
research documents in nuclear physics, the Web has grown to encompass diverse
information resources: personal home pages, online digital libraries, virtual museums,
product and service catalogs, government information for public dissemination, research
publications, etc. [GRGG97]. Thus, the Internet, and especially the Web, have made major
collections of information accessible to everyone, anywhere.

The World Wide Web is a huge repository of information in the form of text, image, audio, and
video. The volume of information on the public web in 1999 was 20 to 50 terabytes (10^12 or
2^40 bytes); in 2002, it was measured at 167 terabytes – more than a threefold growth
[LV00, LV03]. This trend continued and is expected to continue in the future. The indexable web
(sometimes called the public web) is the part of the Web that is considered for indexing by
search engines. In contrast, the hidden web is the part of the Web that is beyond the reach
of current search engines. Research shows that the number of web pages on the Web counts
in the billions. As of January 2005, the indexable web was at least 11.5 billion pages [GS05]. On
the other hand, the hidden web is estimated to be 400 to 500 times the size of the indexable web
[RGM01].

Before the Internet came into existence, people searched a library catalog or asked the librarian when
they needed to find new information. The introduction of the Internet and the Web made
accessing information easier. However, due to the huge amount of information (in 2002 the
World Wide Web contained about 170 terabytes of information on its surface; in volume this is
seventeen times the size of the Library of Congress print collections [LV03]) and some inherent
properties of the Web, accessing information on the Web is not a trivial task. In the early days
of the Web, people usually followed the link structure of web documents in hopes of finding
the information they sought [Pin94]. This approach might have been enough when the Web was
small and its documents shared the same fundamental purpose. But it is definitely
inappropriate for the Web in its current state, which is huge in size and very
heterogeneous and diverse in content.

There are a few approaches to accessing information on the Web: the first is for the user
to know the precise address of the document s/he seeks; the second is to browse
pre-listed, human-made hierarchical directories such as Yahoo; the last is to use a search
engine. The first approach is prohibitive for finding new information in an information collection
as large as the Web. Human-maintained lists cover popular topics effectively but are
subjective, expensive to build and maintain, slow to improve, and cannot cover all topics.
Searching the Web using search engines is a feasible solution.

Search engines are software programs that retrieve information from the Web. To use search
engines, users usually submit a query, typically a list of keywords, and receive a list of web
pages that may be relevant to their query, usually pages that contain the keyword.

As more and more people depend on electronic media (especially the Internet) for their
information needs, language issues come to the surface. As information technology invades
more and more of the world, people are becoming increasingly unwilling to speak the
language of information technology – they want the technology to speak their language
[Gil02]. English was the lingua franca of the Web in the first five years of its history. In mid
2000, the combined content in all other languages exceeded English content for the first time
[BBH06]. The growth of this non-English content continues to outstrip the growth of English
content. As of June 2007, English speakers accounted for only 28.9% of Internet users [IWU07]. This shows
that non-English users dominate the Internet.

Search engines have become a first choice for finding new information on the Web. With the Web
threatening to become a ubiquitous channel for delivering information, the interfaces through
which users discover information have a major economic and political significance [VT04].
Major search engines on the Web are currently owned by a few private organizations, and
having a good search site is a business with multi-billion USD¹ revenue. These giant general
purpose search engines, such as Google and Yahoo, are optimized for the English language. The
general purpose search engines handle English queries more or less the same way, but their
handling of non-English queries is greatly different from how these queries are handled by
non-English search engines, i.e., engines that are designed for specific languages [Mou03].

¹ United States Dollar

Studies have been conducted to evaluate the performance of these general purpose search
engines for non-English languages [Mou03, BG03, MC05]. These studies compared the
performance of general purpose search engines with that of language-specific search engines
and concluded that general purpose search engines are not effective in handling non-English
queries and retrieving the relevant documents. In general, these studies show that having
a language-specific web portal facilitates the information retrieval process on the Web, which
in turn helps the user to get “relevant” information for his/her need.

Amharic, a member of the Semitic language family, is the official language of the federal
government of Ethiopia. Some estimates indicate that Amharic is the mother
tongue of around 15 to 30 million Ethiopians [AAG03]. Amharic uses the Ethiopic alphabet
(called Fidel), a script developed from the Ge’ez script.

With the recent development of general purpose search engines to support multiple languages,
accessing and retrieving Ethiopic documents on the Web through some general purpose
search engines is possible, since these search engines index documents that are written in
Ethiopic. Google (www.google.com), for example, provides an interface called Google Ethiopia
(www.google.com.et) that accepts Amharic queries written in a Unicode-based font and
returns relevant documents. Even though they do not provide an Amharic interface, other general purpose
search engines such as Yahoo and MSN do index Amharic and allow searching for
information from their index databases. However, there has been no means to evaluate their
performance so far. One of the reasons for this is the lack of local search engines, which take
the language’s special characteristics into consideration, to compare them with.

1.2. Motivation
According to [IWS07], Ethiopia accounted for 0.5% of Africa’s Internet users in 2007.
The statistics also show that the number of Internet users in Ethiopia grew by 1030%
during the years 2000-2005. Due to this increase in the Internet population within the country
and the large number of people who speak the language in the Diaspora, the number of web
documents that are written in the Amharic language and the Ethiopic script is increasing. However,
there is no search tool that can be used to search this increasing number of Amharic
documents on the Web.

Amharic is a Semitic language of the Afro-Asiatic language group that is related to
Hebrew, Arabic, and Syriac. Amharic, a syllabic language, uses a script which originated
from the Ge’ez alphabet [AAG03]. The language has 34 basic characters, each having 7
forms for the consonant-vowel combinations. Unlike English, Amharic is a morphologically
rich language. This has a strong impact on the effectiveness of the retrieval of the language’s
documents [NW03].

Most general purpose search engines originate from the USA and are optimized for the English
language. However, the World Wide Web is not only in English. Currently, the number of
Web pages in non-English languages is greater than that of English pages [BG03].

Considering how different the nature of the Amharic language is from that of English, and the fact that
most general purpose search engines are designed and developed with English in mind, it
is logical to ask questions such as “Do general purpose search engines handle Amharic
queries effectively?” and “Do general purpose search engines consider the specific
characteristics of the Amharic language?”. To answer these questions, the following
single and compound word queries were given to three general purpose search engines,
Google (www.google.com.et), MSN (www.msn.com), and Yahoo (www.yahoo.com); the results
are shown in Table 1-1. Unicode-based fonts were used.

Table 1-1 Results from search engines for selected Amharic queries (June, 2007)

Query Google MSN Yahoo


37,100 5,769 26,700
28,200 1,221 16,500
469 6 143
! 1,230 20 656
! 6 3 37
" # 24,800 335 5,050
" # 380 14 89
" #$% 55 4 6
12,200 511 2,790
& 1,060 6 28

The table above clearly shows that none of the evaluated general purpose search engines considers
the special characteristics of the Amharic language. The first three queries are
variants of the same word: one with a prefix, one with a suffix, and one without any affix. If proper
stemming had been applied, the results for these three words would have been comparable (if
not the same) in each search engine. In
Amharic, it is common to write some compound words in a short form using ‘/’. This being the
case, the two queries that contain the above-mentioned compound word, in its full and short forms,
should have given the same result. But all three
search engines considered in this test failed to support this feature of the
language. Among the 34 basic characters of Amharic, 7 do not have a unique sound
and can be used interchangeably [AAG03]. Two such characters have the
same pronunciation and can be used interchangeably in the natural language. Search engines
must take this special feature of those 7 characters into consideration in their retrieval. All the
general purpose search engines considered in this evaluation failed to meet
this requirement. The same holds true for the three related compound-word queries in the table: if
proper stemming were applied, the prefix and the suffix would be stripped off, giving
the same or comparable results for the three queries.

Therefore, from the above evaluation, and from the fact that there is no search tool designed
specifically for Amharic web documents, it is obvious that there must be a search
engine that properly handles the unique characteristics of the Amharic language in order to retrieve the
language’s documents from the Web.

1.3. Statement of the Problem


General purpose search engines are designed mainly for the English language. Non-English
queries submitted to these search engines usually are not handled properly [BG03, Mou03,
MC05]. This is mainly because of the languages’ special characteristics, which are not taken
into consideration by those search engines.

Amharic has many distinct features that affect the information retrieval of the
language’s documents on the Web. General purpose search engines do not handle Amharic
queries well, just as they do not handle other non-English queries such as Arabic or Chinese. Moreover,
there is no search engine that has been developed to handle Amharic web documents. Therefore, there
should be a local search engine that considers the special characteristics of the language and
the script it uses.

1.4. Objective
The general objective of this thesis work is to design and implement a search engine for
Amharic Web documents that are written in Unicode-based fonts. The system:

• should consider and incorporate the specific characteristics of the Amharic language
that affect the retrieval of the documents on the Web;
• should apply the necessary language-specific techniques that can affect the retrieval
process of the documents.

The specific objectives of the study are:

• To explore the typical characteristics of the Amharic language that can affect the
retrieval of the language’s documents on the Web.
• To design and implement a Web crawler algorithm that crawls the Web and
downloads Amharic documents that are written in Unicode-based fonts.
• To explore the existing techniques for identifying the language of a web document
and select one if it fits our need, or design and implement software that filters
those web pages that have Amharic content.
• To identify an indexing and searching scheme and adapt it to the specific features of
the Amharic language.
• To design an interface for accepting the user’s query in Amharic.
• To design and implement an algorithm for ranking the retrieved documents.
• To design an Amharic web search engine that incorporates all the necessary
components (crawler, indexer, and searcher).
• To demonstrate the effectiveness of the developed search engine.

1.5. Methods
The following methods have been followed to achieve the general and specific objectives of
this research work.

1.5.1. Literature Review


Literature reviews have been done on different areas pertinent to this thesis work. The Amharic
language has been studied with respect to its features that affect the information retrieval
process. This included the script the language uses, different problems the language has, the
representation of Ethiopic in the computer, etc.

Information retrieval has been studied to understand the concepts and techniques of the
retrieval process. Information retrieval on the Web in general, and search engines in particular,
were reviewed in detail. Web information retrieval work on other languages, as well as on the
Amharic language, has been reviewed in the hope of finding appropriate tools and techniques
that can be applied in our work. References are made to journals, books, research works and
electronic documents.

1.5.2. Development Environment and Programming Tools


The Java programming language is used for the programming part. Java is selected for its
suitability in developing web-based applications such as this one. The three components of the
system are fully written in this language. The crawler collects web document collections.
The collection is then subjected to different pre-processing operations. The processed data is
stored (indexed) in a structure that is appropriate for searching. Lucene, a Java-based IR
library, is used for the indexing and searching. Different third-party tools have been used to
prepare the data (document collection) in a format that is appropriate for Lucene, for example
text extraction tools such as NekoHtml and PDFBox.

Java Server Pages (Jsp) is used for creating the interface component. A form that accepts
user queries in Amharic and a page that displays the retrieved results are developed using Jsp.
Internet Explorer 6 (IE6) is used to browse the Web.

Java Servlets are used to access the data in the Lucene file system. Tomcat, a Jsp/Servlet
container, is used to integrate the work of the Jsp pages and the Servlet.

A single PC and a bandwidth capacity of 6MB/sec are used to develop and test the system.

1.5.3. Experimentation
Two runs of crawling have been done to test the crawler. The crawler started its operation
from a set of carefully selected seed URLs. The component that classifies a web document based
on its language is tested using 990 documents. Seventy-five web documents are chosen from
the document collection gathered during the second crawling run. The
documents are news articles collected from different Amharic newspapers. The
content, the title, and the URL address of each document in the collection are indexed.

Eleven queries are collected from experienced users of search engines. The collected
documents (75 in number) and the queries are used to measure the retrieval effectiveness of
the system using Precision and Recall. Moreover, some carefully selected queries are given
to the developed system in order to test whether the system meets its design requirements.

1.6. Scope of the Study


Though there are Amharic web documents on the Web with different kinds of fonts that use
different encodings, this work considers only Amharic web pages written in Unicode-based
fonts.

Although the information contained in a web page can be of different kinds (text, audio, video, and
image), this study takes only text documents into consideration.

1.7. Organization of the Thesis


This thesis is organized as follows. Chapter Two deals with the literature review that has
been done for this work. It includes Information Retrieval, Search Engines, the Amharic
Language, and other related topics that are required for a better understanding of the research
domain. Chapter Three, entitled “Related Works”, includes works related to the theme of the
thesis. It reviews works on search engines for non-English languages. It also includes work
in Amharic Information Retrieval that has been done so far. The design of our search
engine is explained in Chapter Four. Chapter Five presents the implementation of the designed
search engine and its effectiveness as measured by experimentation. Finally, conclusions are
drawn from the work and recommendations are given on possible future work in Chapter
Six.

Chapter Two

Literature Review

2.1. Information Retrieval


Nowadays, information is an ingredient that we consume in our daily life. For that, we seek
or search for information. Information-seeking behavior is a purposive seeking for
information as a consequence of a need to satisfy some goal [Wil00]. In the course of
seeking, we may interact with manual information systems (such as a newspaper or a
library) or with computer-based systems (such as the World Wide Web). The World Wide
Web is a very large distributed digital information space. This information is useful to
millions, if not billions. The Web is used by different people from different walks of life.
People search for information on the Web. Web search engines are one of the most publicly
visible realizations of information retrieval technology.

In this section we discuss Information Retrieval (IR) from the search engine perspective, i.e.
those parts of IR that are relevant to search engine technology.

Information retrieval, as defined by Salton [SM83], deals with the representation, storage,
organization of, and access to information items. Any information retrieval system works as
follows. The user first translates the information need into query term(s) which can be
processed by the IR system. This translation usually results in a set of keywords. Given the
query, the main task of the IR system is to retrieve “relevant” documents for the user from
the document collection. For this, the IR system performs similarity calculations and
ranking operations.

Before any retrieval task begins, the collection to be searched must exist. In traditional IR,
this is usually the logical representation of the documents in the collection [BYRN99].
Typically, documents in a collection are represented through a set of index terms
or keywords. In its general form, an index term is simply any word which appears in the text of
a document in the collection. But usually, an index term is a keyword which has some
meaning of its own (i.e., which usually has the semantics of a noun). There are a few indexing
structures; among them, inverted files, suffix arrays, and signature files are the most common
[BYRN99]. The inverted file, which is currently the best choice for most IR applications, is
discussed below.

Inverted files

An inverted file (or inverted index) is a word-oriented mechanism for indexing a collection
in order to speed up the searching task [BYRN99]. The inverted file structure is composed of
two elements: the vocabulary and the occurrences. The vocabulary is the set of all different
words in a text. For each word, a list of all the text positions where the word appears is
stored, called occurrences. Each of the occurrences uniquely identifies the document for
which the term is representative. Answering a given query amounts to searching the
inverted file for the query terms and selecting the documents using the occurrences of those
terms. Figure 2.1 shows this process schematically.

Figure 2-1 A Sample Inverted Index [Cas04]
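
To make the structure concrete, the following minimal Java sketch builds an in-memory inverted index that maps each term to the set of document identifiers in which it occurs. The document texts and identifiers are illustrative; a production index would also store term positions and be kept on disk.

```java
import java.util.*;

// Minimal in-memory inverted index: term -> sorted set of document IDs.
public class InvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<String, SortedSet<Integer>>();

    // Tokenize a document on non-letter characters and record an occurrence
    // of every term under the given document identifier.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (token.isEmpty()) continue;
            SortedSet<Integer> docs = postings.get(token);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(token, docs);
            }
            docs.add(docId);
        }
    }

    // Look up the occurrence list (here, just document IDs) for a query term.
    public SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase());
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.addDocument(1, "information retrieval on the web");
        index.addDocument(2, "web search engines retrieve information");
        System.out.println(index.lookup("information")); // [1, 2]
        System.out.println(index.lookup("search"));      // [2]
    }
}
```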


The space required by the vocabulary is small. While the vocabulary grows sub-linearly with
the collection size, the list of occurrences can be very large. The size of the vocabulary can
be decreased by applying some operations, such as stopword removal and stemming.

Stemming is a technique of reducing words to their grammatical roots. For example, the words
reduces, reducing, and reducer share the same root, reduce. The major goal of stemming is to
improve retrieval performance. Furthermore, it has the secondary effect of reducing the size of the
indexing structure. Baeza-Yates [BYRN99] distinguishes four kinds of stemming strategies:
affix removal, table lookup, successor variety, and n-grams. Table lookup consists simply of
looking up the stem of a word in a table. Successor variety is a technique that is based on the
identification of morpheme boundaries, using the grammatical rules of the language. Affix
removal works by removing suffixes, prefixes, and infixes from a word. N-gram
stemming is based on the identification of digrams and trigrams in a word.
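
As a rough illustration of the affix-removal strategy, the sketch below strips a few common English suffixes. The suffix list and minimum stem length are assumptions chosen only for demonstration; the Amharic stemmer actually used in this work is described in Chapter Five.

```java
// Illustrative affix-removal stemmer: strips a few common English suffixes.
// The suffix list and the minimum stem length are demonstration-only assumptions.
public class SimpleStemmer {
    private static final String[] SUFFIXES = { "ing", "ers", "er", "es", "s" };

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Keep at least a three-letter stem to avoid over-stripping short words.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("reducing")); // reduc
        System.out.println(stem("reduces"));  // reduc
        System.out.println(stem("reducer"));  // reduc
    }
}
```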

Stopwords are words that occur frequently in the text of a document and hence have low
discriminating value. Stopwords usually do not have meaning by themselves. According to
[BYRN99], 40-50% of the words in an English text are stopwords.
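
Stopword removal is simply a membership test against a precompiled list, as in the following sketch. The tiny English stopword set shown here is illustrative and is not the list used by this system.

```java
import java.util.*;

// Remove stopwords from a token list using a precompiled stopword set.
// The stopword set below is a small illustrative sample.
public class StopwordFilter {
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "of", "on", "and", "is"));

    public static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!STOPWORDS.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("retrieval", "of", "information", "on", "the", "web");
        System.out.println(removeStopwords(tokens)); // [retrieval, information, web]
    }
}
```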

IR Models

One issue regarding information retrieval systems is the issue of predicting which documents
are relevant to the user’s query. Such a decision depends on the ranking algorithm the system
uses, which orders the retrieved documents. A ranking algorithm operates
according to basic premises regarding the notion of document relevance [BYRN99]. The
set of premises a system adopts constitutes the IR model it uses. IR models can be classified into four
types: set theoretic, algebraic, probabilistic, and hybrid models [GRGG97]. These
models differ from each other in the representation of documents and queries, the matching
strategy for assessing the relevance of documents to a user query, and the method for ranking
query output. These models are discussed below.

Set theoretic models

The Boolean model is one of the models in this category. It is a simple retrieval model based on
set theory and Boolean algebra. The model represents documents by a set of index terms,
each of which is viewed as a Boolean variable and valued as true if it is present in a
document. No term weighting is allowed. Its retrieval strategy is based on a binary decision
(a document is predicted to be either relevant or non-relevant) without any notion of a grading
scale, which prevents good retrieval performance.
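
In set-theoretic terms, a Boolean AND query corresponds to intersecting the posting sets of its terms, and OR to taking their union; the sketch below illustrates this with hypothetical posting sets.

```java
import java.util.*;

// Boolean retrieval over an inverted index: AND = set intersection, OR = set union.
// The posting sets below are hypothetical examples.
public class BooleanModelDemo {
    public static void main(String[] args) {
        Set<Integer> amharic = new TreeSet<Integer>(Arrays.asList(1, 2, 5, 7));
        Set<Integer> search  = new TreeSet<Integer>(Arrays.asList(2, 3, 7, 9));

        Set<Integer> and = new TreeSet<Integer>(amharic);
        and.retainAll(search);                 // documents containing both terms
        Set<Integer> or = new TreeSet<Integer>(amharic);
        or.addAll(search);                     // documents containing either term

        System.out.println("amharic AND search: " + and); // [2, 7]
        System.out.println("amharic OR search:  " + or);  // [1, 2, 3, 5, 7, 9]
    }
}
```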

Another model in this category is the fuzzy-set model, which is based on fuzzy-set theory. This
model allows partial membership in a set, as compared with conventional set theory, which
does not. Nevertheless, IR systems based on the fuzzy-set model have proved nearly as
incapable of discriminating among the retrieved output as systems based on the Boolean
model [GRGG97]. The strict Boolean and fuzzy-set models are easy to implement and have
minimal computational requirements compared to other models.

Algebraic model

The vector-space model (or simply the vector model), the most famous and most widely used, falls into
this category. The vector model assigns non-binary weights to index terms in queries and in
documents. These weights are used to compute the similarity between each document stored
in the system and the user query. The vector model takes into consideration documents that
match the query terms only partially. In this model, each document is represented by
an n-dimensional vector, and a user query is similarly represented by an n-dimensional
vector, where n is the number of terms in the vocabulary.

Let the query vector $q$ be represented by:

$q = (w_{1,q}, w_{2,q}, w_{3,q}, \ldots, w_{t,q})$

where $t$ is the total number of index terms in the system and $w_{i,q}$ is the weight associated with the $i$-th term in query $q$.

Similarly, the vector of the document $d_j$ is represented as:

$d_j = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{t,j})$

Therefore, the similarity between a query $q$ and the document $d$ is quantified by calculating
the cosine of the angle between the two vectors:

$$Sim(q,d) = \frac{\sum_{i=1}^{t} w_{i,q} \times w_{i,d}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,d}^{2}}}$$

Since the term weights are greater than or equal to zero, $Sim(q,d)$ varies between 0 and 1.
Hence, instead of attempting to predict whether a document is relevant or not, the vector
model ranks documents according to their degree of similarity to the query, i.e. a document
that matches only partially can be retrieved.
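
The sketch below computes this cosine similarity for a query vector and a document vector that are assumed to be already aligned over the same vocabulary of t terms; the weight values in the example are hypothetical.

```java
// Cosine similarity between a query vector and a document vector,
// both expressed over the same vocabulary of t index terms.
public class CosineSimilarity {
    public static double sim(double[] q, double[] d) {
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot   += q[i] * d[i];
            qNorm += q[i] * q[i];
            dNorm += d[i] * d[i];
        }
        if (qNorm == 0.0 || dNorm == 0.0) return 0.0; // empty vector: no similarity
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        double[] query    = { 1.0, 0.0, 1.0 };   // hypothetical term weights
        double[] document = { 0.5, 0.8, 0.4 };
        System.out.println(sim(query, document)); // a value between 0 and 1
    }
}
```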

How can we get the values of the term weights $w_{i,j}$? There are various methods of calculating
term weights.

Term Frequency (TF) - The term frequency of a given term is the number of times that the
term appears in a given document. The TF is usually normalized with respect to document
length, that is, the frequency of term $t$ is divided by the frequency of the most frequent term in
document $d$. This helps to avoid long documents being ranked higher than short ones during
ranking. Therefore, the normalized TF is given as:

$$tf_{t,d} = \frac{freq_{t,d}}{\max_{l} freq_{l,d}}$$

where $freq_{t,d}$ is the frequency of term $t$ in document $d$.

Inverse Document Frequency (IDF) - Inverse document frequency reflects how frequent a
term is in the whole collection. The underlying principle is that a term that appears in a few
documents gives more information than a term that appears in many documents. This means
a term that appears in many documents has a weak discriminating power for selecting relevant
documents over a document collection. If $N$ is the number of documents and $n_t$ is the number
of documents containing the term $t$, then

$$idf_{t} = \log \frac{N}{n_{t}}$$

The most commonly used term weight is the composite of TF and IDF, which is called Term
Frequency-Inverse Document Frequency (TF-IDF):

$$tf\text{-}idf_{t,d} = tf_{t,d} \times \log \frac{N}{n_{t}}$$

The composite weight gives importance both to the distribution of a term across the documents in
the collection and to the frequency of occurrence of the term in the individual documents
[SM83].
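
As a minimal sketch, the composite weight can be computed directly from raw counts, assuming the term and document frequencies are already known; the counts in the example are hypothetical.

```java
// Compute the TF-IDF weight of a term in a document from raw counts.
public class TfIdf {
    // freqTd:       raw frequency of the term in the document
    // maxFreqD:     frequency of the most frequent term in the same document
    // totalDocs:    number of documents in the collection (N)
    // docsWithTerm: number of documents that contain the term (n_t)
    public static double weight(int freqTd, int maxFreqD, int totalDocs, int docsWithTerm) {
        double tf = (double) freqTd / maxFreqD;                    // normalized term frequency
        double idf = Math.log((double) totalDocs / docsWithTerm);  // natural log; the base only scales the weights
        return tf * idf;
    }

    public static void main(String[] args) {
        // Hypothetical counts: term occurs 3 times, the most frequent term 10 times,
        // collection of 1000 documents, 50 of which contain the term.
        System.out.println(weight(3, 10, 1000, 50)); // ~0.90
    }
}
```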

Probabilistic model

The probabilistic model attempts to capture the IR problem within a probabilistic framework
[GRGG97]. The vector-space model does not take the existing term relationships in a
document into account. The probabilistic model takes these term dependencies and
relationships into consideration. The whole idea is that, given a user query, there is an ideal set of
documents which contains exactly the relevant documents and no others.

Evaluation of Retrieval Process

Before the actual deployment of an information retrieval system, an evaluation of the system
must be carried out. IR systems can be evaluated with respect to efficiency (operational
issues like cost, time factor, space, etc.) as well as effectiveness (how well the retrieved
documents satisfy the user request) [SM83]. Recall and Precision are two measures for
evaluating the effectiveness of an IR system.

Recall: is the fraction of relevant documents which has been retrieved.

Precision: is the fraction of the retrieved documents which is relevant.

Recall indicates completeness, whereas precision refers to the system’s ability to find only
“relevant” documents. High precision means that the retrieved documents are highly
“relevant” to the subject of the query. The following example might shed some light on the
concept of Recall-Precision. There might be 100 relevant documents in a document
collection for a given query, but the system may only find 75 of them. Then, the recall of
the system is 75%. Similarly, if the system lists 75 documents found to match a query but
only 25 of them are relevant to the user, then the precision is 33.3%.
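
Written out with the standard definitions, the two figures in the example above are obtained as follows.

$$\mathrm{Recall} = \frac{\text{relevant documents retrieved}}{\text{relevant documents in the collection}} = \frac{75}{100} = 75\%$$

$$\mathrm{Precision} = \frac{\text{relevant documents retrieved}}{\text{documents retrieved}} = \frac{25}{75} \approx 33.3\%$$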

2.2. Information Retrieval on the Web

A critical goal of successful information retrieval on the Web (Web IR), as in any information
retrieval, is to identify which pages are of high quality and relevance to a user’s query
[SMBR04].

Although an important body of Information Retrieval techniques was published
before the invention of the World Wide Web, there are unique characteristics of the Web that
make them unsuitable or insufficient for information retrieval on the Web. According to
Baeza-Yates [BYRN99], these unique and inherent properties of the Web that make traditional IR
algorithms insufficient are the following:

• Distribution of Data: data spans many computers and platforms all over the
world.

• Volatile Data: due to the dynamic nature of the Internet, new computers and data can
be added or removed easily.
• Quality: data can be false, out of date, poorly written, and can contain errors, because
there is no single body that does editorial work.
• Heterogeneity: data exists in different formats, multiple media types, and different
languages.
• Unstructured and Redundant Data: each HTML page is not well structured; much
Web data is repeated or very similar.
• Large Volume: the Web is growing at an exponential rate.

Due to the above characteristics of the Web, there are many aspects of Web IR that
differentiate it from, and make it somewhat more challenging than, traditional IR problems, which are
exemplified by the TREC¹ competition.

Traditional IR algorithms were developed for relatively small and coherent collections such
as newspaper articles or book catalogs in a library. The Web, on the other hand, is massive,
much less coherent, changes more rapidly, and is spread over geographically distributed
computers [ACGP+01]. Inverted lists in the case of the Web contain a term and a location
that consists of a page identifier and the position of the term in the page. The index in Web
information retrieval may also contain some weights and some formatting attributes found in
the HTML structure of the page. Conceptually, building an inverted index involves
processing each page to extract postings, sorting the postings first on index terms and then
on locations, and finally writing out the sorted postings as a collection of inverted lists on
disk.

For relatively small and static collections, as in the environments traditionally targeted by
information retrieval systems, index build times are not very critical. However, when dealing
with Web-scale collections, index build schemes become unmanageable and require huge
resources, often taking days to complete. As a measure of comparison with traditional IR
systems, the authors of [ACGP+01] compared the 40 million pages of the WebBase repository, which
represent only about 4% of the publicly indexable Web but are already larger than the 100
GB very large TREC-7 collection, the benchmark for large IR systems. The following
paragraphs deal with information retrieval areas that need special attention when applied to
the Web.

User query

¹ Text REtrieval Conference – http://trec.nist.gov/
Information retrieval has been well studied for more than three decades. Despite its maturity, until
recently, IR was seen as a narrow area of interest mainly to librarians and information
experts [BYRN99]. During that period, since IR systems were used by experts such as
librarians, the queries were well formulated. Hence, it was easier to get relevant documents.
With the inception of the WWW, IR was pushed to people from all walks of life. People use
search engines and other search portals for their information needs, i.e. they retrieve
information. Without detailed knowledge of the collection make-up and the retrieval
environment of search sites, most users find it difficult to formulate queries which are
suitable for retrieval purposes. As observed with web search engines, users might need to
spend a large amount of time reformulating their queries; otherwise they would get junk
results for their information need.

Indexing

Text-based indexing alone is not enough in Web IR. Some structure of the document might be
useful. For example, search algorithms often make use of additional information about the
occurrence of terms in a web page. For instance, terms occurring in bold face (within <B>
tags), in section headings (within <H1> or <H2> tags), or as anchor text might carry extra
information to be indexed. Moreover, the link structure of the Web holds valuable information
[BP98].
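
As an illustration, with the Lucene library used later in this work, such structural information can be captured by indexing the title, URL, and body of a page as separate fields and boosting the more informative ones. The sketch below assumes the Lucene 2.x-era Field API; the field names and boost value are illustrative, and this is not the indexing scheme of the thesis itself, which is described in Chapters Four and Five.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Index a page as separate fields so that structurally important text
// (e.g. the title) can be weighted more heavily than the body.
public class StructuredPageIndexer {
    public static void index(String indexDir, String url, String title, String body) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));

        Field titleField = new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED);
        titleField.setBoost(2.0f); // give title terms extra weight at ranking time
        doc.add(titleField);

        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));

        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}
```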

Ranking

Ranking retrieved documents based only on term weights, as in traditional IR, is a major
problem in Web IR. It leads to search engine spamming: malicious attempts to get an
undeserved high ranking in the results. This has created a whole branch of Information
Retrieval called “adversarial IR”, which is concerned with retrieving information from collections
in which a subset of the collection has been manipulated to influence the algorithms [Cas04].
Some web page authors may insert commonly used query terms in their pages to get a
higher rank and be retrieved for queries that contain those terms, even if the content of the
page is irrelevant with respect to the query. In the extreme, a whole dictionary can be
inserted invisibly in the page so that it is retrieved by the search engine for every possible query.
Cloaking, the practice of sending different content to a search engine (crawler) than to a
normal visitor (browser) of a web site, is another form of spamming. A partial solution to
this is to use the link structure of the Web, i.e. use link-based ranking methods like PageRank
[PBMW98, PB98] or HITS [Kle99].
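
As a sketch of the idea behind PageRank, the power iteration below repeatedly distributes each page's score evenly over its out-links and mixes in a damping factor. The tiny link graph, iteration count, and damping value are illustrative assumptions, and dangling pages are handled in a simplified way.

```java
// Simplified PageRank by power iteration over a small adjacency list.
// links[i] lists the pages that page i links to; the graph is hypothetical.
public class PageRankDemo {
    public static double[] pageRank(int[][] links, int iterations, double damping) {
        int n = links.length;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                if (links[page].length == 0) continue; // dangling page: its score is simply dropped here
                double share = damping * rank[page] / links[page].length;
                for (int target : links[page]) {
                    next[target] += share; // each out-link receives an equal share of the page's score
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Page 0 -> 1,2 ; page 1 -> 2 ; page 2 -> 0
        int[][] links = { {1, 2}, {2}, {0} };
        double[] scores = pageRank(links, 50, 0.85);
        for (double s : scores) System.out.println(s);
    }
}
```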

2.3. Search Engines


A search engine is a program that searches documents for specified keyword(s) and returns a
list of documents where the keyword(s) were found. People use search engines to find new
information, search for a specific web resource, etc. As of March 2003, search sites
accounted for more than 13.4% of global Internet traffic [Sta03]. Among users looking for
information on the Web, 85% submit information requests to various Internet search engines
[Sav02]. According to [Cas04], 40% of users arriving at a web site for the first time
clicked a link from a search engine result.

Presently, there are a few popular general purpose search engines that give valuable service to
the Web community. Almost all of these search engines are owned by commercial companies
that reside in the USA, for example, Google (www.google.com), Yahoo (www.yahoo.com),
MSN (www.msn.com), etc. In addition to these general purpose search engines there are
many topic-specific and language-specific search engines as well as quite a number of site-
specific ones.

There are different kinds of search engines: Meta-Search Engines, Web Portals, Full Text
Search Engines, etc. The following discussion applies to full text search engines.

The Web search process has two main parts: off-line and on-line [Cas04].

• The off-line part is executed periodically by the search engine and consists of
downloading a subset of the web to build a collection of pages, which is then
transformed into a scalable and searchable index.

• The on-line part is executed every time a query is submitted. It uses the index to select
some candidate documents that are sorted according to an estimation of how relevant
they are for the user need.

Generally, search engines contain three components:

• Crawler component

• Indexer component

• Query Engine component

The typical design of search engines is a “cascade”, in which a web crawler creates a
collection which is indexed and searched. In this cascade model, operations are executed in
strict order: first crawling, then indexing, and finally searching. Figure 2.2 shows such an
engine schematically.

Figure 2-2 Schematic Representations of Search Engine Components


Every engine relies on the Crawler component to provide the grist for its operation (shown
on the left in Figure 2.2). Crawlers play a major role in this component. Crawlers are small
programs that browse the Web on the search engine's behalf, similarly to how a human user
would follow links to reach different pages. The programs are given a starting set of URLs,
called seed URLs, whose pages they retrieve from the Web. The crawlers extract URLs
appearing in the retrieved pages, and based on the extracted URLs they determine which links to
visit next (some of the functionality of the crawler component may be implemented by
additional sub-components). The crawlers also pass the retrieved pages into a page
repository which is used by another component. Crawlers continue visiting the Web until
local resources, such as storage, are exhausted or some conditions are met (e.g. a maximum
number of documents has been fetched).

The Indexer module extracts all the words from each page (parsing) and records the URL
where each word occurred. The result is a large lookup table that can provide all the URLs
that point to pages where a given word occurs (the search index in Figure 2.2). During a
crawling and indexing run, search engines must store the pages they retrieve from the Web
as well as the processed pages. The repository and the search index in the above figure represent
these storages.

The Query Engine module is responsible for receiving search requests from users. The
engine relies heavily on the indexes, and sometimes on the page repository. Because of the
Web's size, and the fact that users typically only enter one or two keywords, result sets are
usually very large. The query engine component has the task of sorting the results in such a way
that results near the top are the most likely ones to be what the user is looking for. The query
module is of special interest, because traditional information retrieval techniques have run
into selectivity problems when applied without modification to Web searching, as most
traditional techniques rely only on measuring the similarity of query texts with texts in a
collection's documents. The following sub-sections discuss the above components in detail.

2.3.1 The Crawler Component


The Crawler component retrieves pages from the Web for later analysis by the indexer
component. As discussed earlier, this component starts off with an initial set of URLs, S0. It
first places S0 in a queue, where all URLs to be retrieved are kept and prioritized. From this
queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in
the downloaded page, and puts the new URLs in the queue. This process is repeated until the
crawler decides to stop.

Crawlers are the main ingredient of the crawler component in particular and of the search
engine in general. Crawlers, sometimes called robots, wanderers, worms, etc., traverse the
World Wide Web information space by following hypertext links and retrieving web
documents via the standard HTTP protocol [FGMF+99]. Despite their names, crawlers simply
download a given page using an HTTP GET request. They typically exploit the Web's
hyperlinked structure to retrieve new pages by traversing links from previously retrieved
pages.

There are two basic ways that a given crawler traverses the Web: breadth first and depth
first. In the breadth-first case, the crawler first extracts (crawls) all links in a given page before it
moves to the next page, then extracts all links on the first link of the first page, and so on,
until each level of links has been exhausted. In the depth-first scheme, the crawler starts by
finding the first link in the first page. It then crawls the page associated with that link, finding
the first link in the new page, and so on until the end of the path has been reached.

The basic algorithm executed by any web crawler takes a list of seed URLs as its input and
repeatedly executes the following steps: Remove a URL from the URL list, determine the IP
address of its host name, download the corresponding document, and extract any links
contained in it. For each of the extracted links, ensure that it is an absolute URL, and add it
to the list of URLs to download, provided it has not been seen before. If desired, process the
downloaded document in other ways (e.g., index its content). This basic algorithm requires a
number of functional components:

• a component (called the URL frontier) for storing the list of URLs of web resources
to download.

• a component for determining whether a URL has been encountered before.

• a component for resolving host names into IP addresses (DNS resolution).

• a component for downloading documents using the HTTP protocol.

• a component for extracting links from HTML documents.
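
A minimal sketch of this basic algorithm is given below: it keeps a frontier queue and a set of seen URLs, downloads pages over HTTP, and extracts absolute links with a naive regular expression. The seed URL, page limit, and link pattern are illustrative assumptions, and politeness, robots exclusion, relative-URL resolution, and error handling (all essential in a real crawler such as the one described in Chapter Four) are omitted.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

// A minimal breadth-first crawler: frontier queue + set of seen URLs.
public class SimpleCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void crawl(List<String> seeds, int maxPages) throws Exception {
        Queue<String> frontier = new LinkedList<String>(seeds); // URL frontier
        Set<String> seen = new HashSet<String>(seeds);          // URLs already encountered
        int fetched = 0;

        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();
            StringBuilder page = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) page.append(line).append('\n');
            in.close();
            fetched++;
            // Here the downloaded page would be handed to the repository / indexer.

            Matcher m = LINK.matcher(page);
            while (m.find()) {
                String link = m.group(1);            // only absolute links are matched
                if (seen.add(link)) frontier.add(link);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        crawl(Arrays.asList("http://www.example.com/"), 10);
    }
}
```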

The goal of a general purpose search engine is to index a sizable portion of the web in its
database, independent of the topic and domain [Hen00]. Real crawlers are not able to
download the entire web due to the following reasons: [BYCMR05]

• Network bandwidth and disk space of a web crawler are neither infinite nor free.
• Pages change over time, and a web crawler's copy of the Web quickly becomes obsolete.

• The amount of information on the Web is finite, but the number of pages is infinite. If we define a "web page" as everything that has a URL, then the number of web pages is infinite due to dynamic pages, and the crawling process never ends.

For these reasons, no search engine indexes the entire Web. Lawrence and Giles [LG99] showed that no search engine indexes more than 16% of the Web. Apart from this, crawling is a costly process: according to a study conducted in 2004 [CCHM04], the network cost of crawling Google's three billion pages was 1.5 million USD. Since a crawler always downloads just a fraction of the Web, it is highly desirable that the downloaded fraction contains the most relevant pages.

Generally, there are two properties of the Web that make web crawling difficult: its huge size and its rate of change. The huge size implies that the crawler can only download a fraction of the Web within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted. Much research has been carried out on crawlers, and it is still an active research area. In general, research on web crawling can be classified into the following categories [CC03]:

• Speed and efficiency. In this category, researchers study different ways to increase the
harvest speed of a spider, i.e. how to download a maximum number of pages within a
specified time by applying program optimization techniques to operations such as I/O
procedures, IP address lookup, etc. Currently, sophisticated spiders can download
more than 10 million documents per day on a single workstation [BCSV02].

• Spidering policy. Research in this category studies the behavior of spiders and their impact on other individuals and the Web as a whole. A well-designed spider should avoid overloading Web servers. Also, Webmasters or web page authors should be able to specify whether they want to exclude particular spiders' access. There are two standard ways of restricting robot (crawler) access to a web site: using the Robots Exclusion Protocol and the Robots META Tag.

• Information retrieval. These studies investigate how different spidering algorithms and heuristics can be used so that spiders retrieve relevant information from the Web more effectively. Many of these studies apply to Web spiders techniques that have been shown to be effective in traditional information retrieval applications, e.g., the Vector Space Model.

In general, the behavior of a Web crawler is the outcome of a combination of policies [CC03]:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading Web sites.

Selection policy

Given the huge size of the Web, even giant commercial search engines cover only a portion of the publicly available content. As of 2005, out of an estimated 11.5 billion pages, the giant search engines Google, Yahoo, and MSN claimed to index a little more than 8, 4, and 5 billion pages respectively [GS05]. The Web has an exponential rate of growth. As the number of pages grows, it becomes increasingly important to focus on the most valuable pages, as no search engine will index the complete Web. This requires a metric of importance for prioritizing web pages. Designing a good selection policy is difficult because it must work with partial information, as the complete set of web pages is not known during crawling. Given a web page p, we can determine the importance of the page, I(p), in one of the following ways [CGMP98].

• Similarity to a Driving Query Q: I(p) is defined to be the textual similarity between p and Q. Similarity has been well studied in the IR community and has been applied to the Web environment, e.g. the vector space model.

• Backlink Count (IB(p)): The value of I(p) is the number of links to page p that
appear over the entire Web. Intuitively, a page p that is linked to by many pages is
more important than one that is seldom referenced. Note that evaluating IB(p)
requires counting backlinks over the entire Web. A crawler may estimate this value
with IB’(p), the number of links to p that have been seen so far.

• PageRank: The IB(p) metric treats all links equally. Thus, a link from the Yahoo home page counts the same as a link from some individual's home page. However, since the Yahoo home page is more important (it has a much higher IB count), it would make sense to value that link more highly. The PageRank backlink metric, IR(p), recursively defines the importance of a page to be the weighted sum of the importance of the pages that have backlinks to p. Such a metric has been found to be very useful in ranking results of user queries [BP98].

• Location Metric: The IL(p) importance of page p is a function of its location, not of
its contents. If URL u leads to p, then IL(p) is a function of u. For example, URLs
ending with “.com” may be considered more useful than URLs with other endings.

Re-visit policy

Another important topic related to Web crawling is the freshness of the locally stored documents with regard to the current version of each document on the original server. Crawling a fraction of the Web takes time, and by the time the crawler finishes its crawl many things may have happened: new pages added, existing pages modified or deleted, etc. The study by Lawrence and Giles on search engine performance [LG99] reported that on average 5.3% of the links returned by search engines point to deleted pages. The most used cost functions, introduced in [CGM00b], are freshness and age.

Freshness is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:

    F_p(t) = 1   if p is equal to the local copy at time t
    F_p(t) = 0   otherwise

Age is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:

    A_p(t) = 0                                if p is not modified at time t
    A_p(t) = t − (modification time of p)     otherwise

The objective of a crawler is to keep the average freshness of pages in its collection as high
as possible, or to keep the average age of pages as low as possible.
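
As a small worked illustration (the time values are assumed): suppose a page p is downloaded at time t = 10 and its copy on the origin server is next modified at time t = 25. For 10 ≤ t < 25 the local copy is still accurate, so F_p(t) = 1 and A_p(t) = 0. From t = 25 onwards the copy is stale, so F_p(t) = 0 and A_p(t) = t − 25; at t = 40, for instance, the local copy has an age of 15 time units. The crawler's objective can then be read as maximizing the average of F_p(t) (or minimizing the average of A_p(t)) over all pages p in the repository.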

Two simple re-visiting policies of crawlers were studied by Cho and Garcia-Molina for
keeping the search engine’s collection fresh [CGM03]:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency,
regardless of their rates of change.

Proportional policy: This involves re-visiting the pages that change more frequently more
often. In this policy, the visiting frequency is directly proportional to the (estimated) change
frequency.

Politeness policy

Crawlers use other parties' resources. For example, when they ask a web server for a certain page, they are using the server's CPU cycles. Web crawlers also require considerable bandwidth. Generally, they put a strain on network and processing resources all over the world. Poorly written crawlers can crash a server or a router (e.g. by rapid fire, requesting resources at a speed that a server cannot handle). Koster [Kos94] wrote a standard called the Robots Exclusion Protocol. This protocol is not a mandatory standard, but there is a general consensus that a "good" crawler should follow it. It is a standard that lets administrators indicate which parts of their Web servers should not be accessed by robots. The standard has the following format:

# comment

"<field>:<optionalspace><value><optionalspace>".

The field name is case insensitive. The record starts with one or more User-agent lines,
followed by one or more Disallow lines. The User-agent is the name of the crawler or
browser. For example, the following "/robots.txt" file specifies that no robots should visit
any URL starting with “/tmp/”, or a resource referenced by “/kk.html” :

# robots.txt for http://www.example.com/


User-agent: *
Disallow: /tmp/ # these will soon disappear
Disallow: /kk.html

The robots.txt file should be kept in the root directory of the server. When a "good" crawler visits a web server for the first time, it asks the server for the "robots.txt" file from the root directory and abides by its rules.

The other thing that a "good" crawler should consider is the interval between two consecutive requests to the same server. The first proposal for the interval between connections, given in [Kos94], was 60 seconds. However, according to [Cas04], if we download pages at this rate from a Web site with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download that entire Web site alone, which does not seem acceptable. Many developers therefore use different values; for example, the WIRE crawler [Cas04] uses 15 seconds, while the Mercator Web crawler [HN99, NH01] follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10×t seconds before downloading the next page.
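
The sketch below illustrates, under simplifying assumptions, how a crawler might combine the two politeness mechanisms just described: it reads only the "User-agent: *" record of robots.txt and applies the Mercator-style 10×t delay. It is not a complete robots.txt parser, and the class and method names are invented for illustration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class PolitenessHelper {
    private final HttpClient http = HttpClient.newHttpClient();
    private final List<String> disallowed = new ArrayList<>();
    private long lastDownloadMillis = 0;            // how long the previous download took

    // Fetch /robots.txt and remember the Disallow prefixes of the "User-agent: *" record.
    public void loadRobots(String host) throws Exception {
        HttpResponse<String> r = http.send(
                HttpRequest.newBuilder(URI.create("http://" + host + "/robots.txt")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        boolean anyAgent = false;
        for (String line : r.body().split("\n")) {
            line = line.split("#")[0].trim();       // strip comments
            if (line.toLowerCase().startsWith("user-agent:")) {
                anyAgent = line.substring(11).trim().equals("*");
            } else if (anyAgent && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    // A path is allowed when it starts with none of the recorded Disallow prefixes.
    public boolean allowed(String path) {
        return disallowed.stream().noneMatch(path::startsWith);
    }

    // Adaptive politeness: wait ten times as long as the previous download took.
    public String politeFetch(String url) throws Exception {
        Thread.sleep(10 * lastDownloadMillis);
        long start = System.currentTimeMillis();
        HttpResponse<String> r = http.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        lastDownloadMillis = System.currentTimeMillis() - start;
        return r.body();
    }
}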

2.3.2 The Indexer Component


In Information Retrieval terminology, an index is a representation of the documents in a collection: the documents are represented by index terms or keywords. The process of creating this logical representation is called indexing. Indexing can be done manually or automatically. For a document space as large as the Web, it is impossible to generate the index manually.

With very large collections, such as the WWW, even modern computers might have to reduce the set of representative keywords [BYRN99]. One method to accomplish this is the elimination of stopwords; the other is stemming. Both stopword removal and stemming are language dependent activities. Another operation that is usually applied during the indexing process is tokenization, which divides the stream of text into chunks of words. This operation must be performed before indexing takes place. Tokenization is easier for some languages such as English or Amharic (because words can be split using white space and punctuation marks); in other languages, especially oriental languages such as Chinese, it is relatively difficult to find separate words.

Unlike a traditional IR index, a Web index contains two indexes: a text (content) index and a structure (link) index.

Link index: uses the hypertext structure of the Web. To build a link index, the crawled portion of the Web is modeled as a graph with nodes and edges. Each node in the graph is a web page, and a directed edge from node A to node B represents a hypertext link in page A that points to page B. Often, the most common structural information used by search algorithms is neighborhood information, i.e., given a page P, retrieve the set of pages pointed to by P (outward links) or the set of pages pointing to P (incoming links).

Text index: text-based retrieval (i.e., searching for pages containing some keywords) is the
primary method for identifying pages relevant to a query. Indexes to support such text-based
retrieval can be implemented using any of the access methods traditionally used to search
over text document collections. Examples include suffix arrays, inverted files or inverted
indexes, and signature files [BYRN99]. Inverted indexes have traditionally been the index
structure of choice on the Web.
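
As a concrete (and deliberately simplified) illustration of an inverted index, the sketch below maps each term to the list of identifiers of the documents containing it; it is an assumption for exposition, not the on-disk format used by Lucene or by the system developed in this thesis.

import java.util.*;

public class InvertedIndex {
    // term -> postings list of the document ids in which the term occurs
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        // naive tokenization on whitespace and punctuation; stopword removal
        // and stemming would normally be applied here as well
        for (String token : text.toLowerCase().split("[\\s\\p{Punct}]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new ArrayList<>()).add(docId);
        }
    }

    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }
}

A real Web-scale text index would additionally store term frequencies and positions, compress the postings lists, and apply the tokenization, stopword removal and stemming operations mentioned above before terms are inserted.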

2.3.3 The Query Engine Component


This is the component that interacts with the actual user. The query engine component accepts the user query (usually one or two keywords), consults the index database, selects the documents containing the keywords, ranks the selected documents, and presents the result to the user. The search interface and the result presentation interface of a search facility are important factors for its success. The simplicity and neatness of the interface of the leading search engine of the current time, Google, which has a market share of 49.2% [Sul06], is a contributing factor to its success. Shneiderman writes [BYRN99]:

"…Well designed, effective computer systems generate positive feelings of success, competence, mastery, and clarity in the user community. When an interactive system is well-designed, the interface almost disappears, enabling users to concentrate on their work, exploration, or pleasure…"

Ranking is a central function that the query engine component supports. Text based ranking is the de facto ranking technique in traditional IR. Some of the reasons why traditional Information Retrieval techniques may not be effective enough for ranking query results in Web IR are the following. First, the Web is very large, with great variation in the amount, quality and type of information present in web pages; thus, many pages that contain the search terms may be of poor quality or not relevant. Second, many web pages are not sufficiently self-descriptive. Last but not least, text based ranking is easy to spam, so IR techniques that examine the contents of a page alone may not work well for Web IR. The link structure of the Web contains important information and can help in filtering or ranking web pages. PageRank [PBMW98, PB98] and HITS [Kle99] can be mentioned as ranking algorithms that use the link structure of the Web.

Generally we can classify the link based ranking algorithms in the Web as [Hen01]:

• Query-dependent

• Query-independent

Query-Independent Ranking

Query-independent ranking assigns a fixed score to each page in a collection without taking the query into consideration. The interconnection of web pages gives valuable information for ranking purposes. One such algorithm is PageRank, which tries to capture the notion of "importance" of a page [BP98]. The PageRank algorithm is currently an important part of the ranking function used by the Google search engine. The definition of PageRank is recursive, stating in simple terms that "a page with high PageRank is a page referenced by many pages with high PageRank". For instance, the Yahoo homepage is intuitively more important than the homepage of Addis Ababa University. This difference is reflected in the number of other pages that point to these two pages; that is, more pages point to the Yahoo homepage than to the Addis Ababa University homepage. Brin and Page [BP98], the developers of the algorithm, defined PageRank as follows:

We assume page A has pages T1……Tn which points to it (i.e. are citations).
The parameter d is a damping factor which can be set between 0 and 1. We
usually set d to 0.85. And C(A) is defined as the number of links going out of
page A. The page rank of a page A is given as follows

PR(A) = (1-d) + d(PR(T1)/C(T1)+…+PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so
the sum of all web pages’ PageRank will be one.

PageRank calculates the "global worth" of each page, i.e. its importance in the absence of any query [NW01].
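
The definition above can be evaluated with a simple iterative (power-method style) computation, sketched below. The link graph, the number of iterations, and the handling of pages without out-links are simplifying assumptions for illustration, not a production implementation.

public class PageRank {
    // links[i] holds the indices of the pages that page i points to
    public static double[] compute(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0);                 // initial scores
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, 1.0 - d);       // the (1 - d) term of the formula
            for (int i = 0; i < n; i++) {
                int outDegree = links[i].length;        // C(T) in the formula
                for (int target : links[i]) {
                    next[target] += d * pr[i] / outDegree;
                }
            }
            pr = next;
        }
        return pr;
    }
}

For example, compute(new int[][]{{1}, {0, 2}, {0}}, 0.85, 20) returns the scores of a three-page graph in which page 0 links to page 1, page 1 links to pages 0 and 2, and page 2 links back to page 0.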

Query-Dependent Ranking

In query-dependent ranking, the starting point is a set of pages that are expected to be relevant to the given query. HITS (Hypertext Induced Topic Search) is one example of this category.

The HITS algorithm was first proposed by Kleinberg in [Kle99]. In contrast to the PageRank technique, which assigns a global rank to every page, the HITS algorithm is a query dependent ranking technique. Moreover, instead of producing a single ranking score, the HITS algorithm produces two: the authority score and the hub score. Authority pages are those
pages that are most likely relevant to a particular query. For instance, the Addis Ababa
University homepage is an authority page for the query “Addis Ababa University”, while
some page that discusses the tourist attractions at Addis Ababa would be less so. The hub
pages are pages that are not necessarily authorities themselves but point to several authority
pages. For instance, the page “searchenginewatch.com” is likely to be a good hub page for
the query “search engine” since it points to several authorities, i.e., the homepages of search
engines.

The basic idea of the HITS algorithm is to identify a small sub-graph of the Web and apply link analysis on this sub-graph to locate the authorities and hubs for the given query. The sub-graph that is chosen depends on the user query and is called the focused sub-graph. The two scores, the "authority score" and the "hub score", have a mutually reinforcing relationship: a page with a high authority score is pointed to by many pages with a high hub score, and a page with a high hub score points to many pages with a high authority score.
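
This mutual reinforcement can be written as a short iteration, sketched below over an assumed boolean adjacency matrix of the focused sub-graph; the construction of the sub-graph itself is omitted, the update order is simplified, and the class and method names are assumptions.

public class Hits {
    // adj[i][j] is true when page i links to page j in the focused sub-graph
    public static double[][] compute(boolean[][] adj, int iterations) {
        int n = adj.length;
        double[] auth = new double[n];
        double[] hub = new double[n];
        java.util.Arrays.fill(auth, 1.0);
        java.util.Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            double[] newAuth = new double[n];
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (adj[i][j]) {
                        newAuth[j] += hub[i];   // a page pointed to by good hubs gains authority
                        newHub[i] += auth[j];   // a page pointing to good authorities gains hub score
                    }
                }
            }
            normalize(newAuth);
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][] { auth, hub };    // authority scores and hub scores
    }

    private static void normalize(double[] v) {
        double sum = 0;
        for (double x : v) sum += x * x;
        double norm = Math.sqrt(sum);
        if (norm > 0) for (int i = 0; i < v.length; i++) v[i] /= norm;
    }
}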

2.4. Focused Crawler


The volume of information on the Web is increasing at an exponential rate. Due to this huge size, general purpose search engines can no longer satisfy the needs of those who are searching for information, and it is becoming too difficult for a search engine to index the entire Web. As of 2005, Google, MSN, Ask/Teoma, and Yahoo appeared to index around 68.2%, 49.2%, 43.5%, and 59.1% of the surface Web respectively [GS05]. General purpose search engines and generic crawlers are like public libraries: they try to cater to everyone and do not specialize in specific areas [CMD99]. In addition to the information overload, these engines contain out-of-date indexes due to the rapid rate of change and growth of the Web. As of 1999, Lawrence and Giles [LG99] estimated that up to 14% of the links in search engines are broken. These problems are caused mainly by the general purpose search engines' "one-size-fits-all" philosophy.

One way to deal with the huge amount of Web content and out-of-date indexes is to build domain specific search engines, each focusing on one or a limited number of topics, such that they crawl the related hyperlinks and avoid traversing the irrelevant parts of the Web. A domain specific search engine is defined as an information access system that allows access to all the information on the Web that is relevant to a particular domain [BANP02].

Domain specific search engines, like general purpose search engines, need a crawler that provides them with grist for their operation. Such crawlers are called focused crawlers. The focused crawler was proposed by Soumen Chakrabarti [CMD99] as a solution to the above mentioned problems. He and his colleagues proposed to crawl only a specific part of the Web, i.e. focused crawling, since "covering a single galaxy can be more practical and useful than trying to cover the entire universe" [Gil98]. According to [CMD99], a focused crawler seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively narrow segment of the Web.

Links in the Web do not point to pages at random but reflect the page authors' ideas of what other relevant or interesting pages exist [Nov04]. When a page author creates a link from his page to another page, s/he is indicating that the target page contains information related to her/his page. This information can be exploited to collect more on-topic data by intelligently choosing which links to follow and which pages to discard. Topical locality is the major assumption behind focused crawling. Focused crawlers can be used in search engines that specialize in particular web site(s), topic(s), language(s), file type(s), etc. Because their size is more manageable (much smaller than the entire web), domain specific search engines usually provide more precise results and more customizable functions [CC03]. A focused crawler contains three main components [CMD99]: a classifier, which makes relevance judgments on crawled pages to decide on link expansion; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; and a crawler, which crawls the Web with dynamically configurable priority controls governed by the classifier and distiller.

According to [CMD99], central to a focused crawler is a canonical topic taxonomy with examples. To start the crawling, the user has to select and/or refine specific topic nodes in the taxonomy, and can provide additional example URLs.

When the system is built, the classifier is pre-trained with a canonical taxonomy (such as Yahoo, the Open Directory Project (http://Dmoz.org), etc.) and a corresponding set of examples. The canonical (coarse-grained) classification tree is part of the initial system. The user then collects URLs that are examples of her/his interest and submits them to the system. The system proposes the most common classes where the examples fit best. The user can modify the taxonomy, i.e. refine some categories and move documents from one category to another. The classifier integrates the refinements made by the user into its statistical class models. After this, the system starts crawling (resource discovery) and the distiller starts its work. The distiller applies a modified version of Kleinberg's algorithm [Kle99] to find topical hubs: the system runs a topic distillation algorithm to identify pages containing large numbers of relevant resource links, called hubs. The system reports the most popular sites and resource lists, and the user can give feedback by marking them as useful or not. This feedback goes back to the classifier and distiller. Two modes of operation are possible with the classifier:

Hard focused: if an already crawled page d has been marked good, URLs are extracted from page d; otherwise the crawl is pruned at d.

Soft focused: the relevance score of a crawled page is used to score the unvisited URLs extracted from it. The scored URLs are then added to the frontier, and the crawler then picks the best URL to crawl next.
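
One way to picture the soft focused mode is as a priority queue of unvisited URLs keyed by the relevance score inherited from the page they were extracted from. The sketch below is an illustrative assumption of such a frontier, not Chakrabarti's implementation; class and method names are invented.

import java.util.Comparator;
import java.util.PriorityQueue;

public class SoftFocusedFrontier {
    // a URL together with the relevance score inherited from its parent page
    record ScoredUrl(String url, double score) {}

    private final PriorityQueue<ScoredUrl> queue =
            new PriorityQueue<>(Comparator.comparingDouble(ScoredUrl::score).reversed());

    // URLs extracted from a crawled page inherit that page's relevance score
    public void addLinks(Iterable<String> urls, double parentRelevance) {
        for (String url : urls) {
            queue.add(new ScoredUrl(url, parentRelevance));
        }
    }

    // the crawler always picks the most promising URL to visit next
    public String next() {
        ScoredUrl best = queue.poll();
        return best == null ? null : best.url();
    }
}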

2.4.1. Language Focused Crawling


In order to develop a language specific search engine, we need a language focused crawler. In language focused crawling, a given page is considered "relevant" if it is written in the target language. Such a crawler will selectively seek out web pages that are written in the specified language by following the URLs with the highest probability of leading to other relevant pages. The probability that a URL leads to other relevant pages is assigned based on attributes of the parent pages (e.g. the language of the parent page), attributes of the URL itself (e.g. its server and domain), etc. To develop a language focused crawler, language locality on the Web is usually assumed. To determine the relevancy of a page for such a focused crawler, the language of the page must be detected. The following methods are usually used for detecting the language of a page.

1. Language attribute of HTML tag

Section 8.1 of the HTML 4.01 standard defines a way of announcing the natural language of the whole or a part of an HTML document using the lang attribute within tags [RHJ99]. The lang attribute is usually placed in the HTML start tag, <HTML>, declaring the language for the whole document. Below are some examples:

<HTML lang="en"> <!-- English -->


<HTML lang="en_US"> <!-- American English -->
<HTML lang="ja"> <!-- Japanese -->

2. Content-Language HTTP header

RFC (Request For Comment) 2616, in Section 14.12 [FGMF+99], defines an HTTP header called Content-Language as:

"The Content-Language entity-header field describes the natural language(s) of the intended audience for the enclosed entity."

Below are some usage samples of the Content-Language entity-header:

Content-Language: en

Content-Language: en_US

Content-Language: ja

3. Content-Language in META Tag

As is the case for the Content-Language header, the META tag and the “http-equiv” attribute
in HTML specification can be used to substitute the real headers in HTTP communication.
Below are some examples:

<META HTTP-EQUIV="Content-Language" CONTENT="en">

<META HTTP-EQUIV="Content-Language" CONTENT="en_US">

<META HTTP-EQUIV="Content-Language" CONTENT="ja">

Techniques 1 and 3 require parsing the page to determine the language. Besides the above three, there are two other widely used statistical automatic language identification techniques: the word based method and the n-gram based method [Lan02].

4. Word based method

In the word based method, words found in the document are compared with frequency lists of words for the supported languages, and matches are counted. The language with the highest match score, exceeding a predefined threshold, is identified. The word based technique is only applicable if words are marked in the text; it cannot be used for languages that do not systematically mark word boundaries with white space or punctuation marks, such as Japanese, Chinese or Korean. One limitation of this technique is that building or obtaining representative lexicons (the collections of words against which document words are compared) is not necessarily easy, especially for some of the lesser-used languages [Lan02].
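
A minimal sketch of the word based method is given below. The tiny word lists (a handful of common English and Amharic words), the threshold handling and the class name are placeholders assumed for illustration; a real identifier would use frequency lists of thousands of words per supported language.

import java.util.Map;
import java.util.Set;

public class WordBasedLanguageId {
    // small illustrative word lists; real systems use large frequency lists
    private static final Map<String, Set<String>> LEXICONS = Map.of(
            "english", Set.of("the", "and", "of", "to", "in"),
            "amharic", Set.of("ነው", "እና", "ላይ", "ውስጥ", "ነበር"));

    public static String identify(String text, int threshold) {
        String best = "unknown";
        int bestScore = 0;
        for (var entry : LEXICONS.entrySet()) {
            int score = 0;
            for (String word : text.split("\\s+")) {
                if (entry.getValue().contains(word.toLowerCase())) score++;   // count matches
            }
            if (score > bestScore && score >= threshold) {                    // must exceed the threshold
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }
}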

5. N-gram method

N-gram analysis is a widely used approach for language identification of documents, especially web pages. An n-gram is an n-unit slice of a longer string [CT94]. N-grams are sequences of units, where the units can be words, phonemes, characters, morphemes, syllables, etc. The n-grams can be formed by considering n adjacent or non-adjacent units extracted from the source. The value of n can vary from 1 upwards (usually not larger than 7 or 8), giving bi-grams (sometimes called di-grams), tri-grams, tetra-grams, etc. For example, in the sentence "my name is Tessema", the word bi-grams are "my name", "name is", "is Tessema". In the word "Tessema", the character bi-grams are "te", "es", "ss", "se", "em", "ma". Padding characters are usually used in forming the n-grams: if n is the size of the n-gram, then n−1 padding characters are used both at the beginning and at the end of a word before the n-grams are formed. Padding results in a more distinct representation of words that would otherwise share identical n-grams [CT94]. For example, the word "TEXT" would be composed of the following n-grams with padding:

bi-grams: _T, TE, EX, XT, T_

tri-grams: _TE, TEX, EXT, XT_

The basic idea of using n-gram method in language identification of a web page is to identify
n-grams whose occurrence in a document gives strong evidence for or against identification
of a text as belonging to a particular language. This method gives as much as 99.8%
accuracy [CT94].
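
The generation of padded character n-grams described above can be sketched as follows (the padding character '_' and the class and method names are assumptions for illustration):

import java.util.ArrayList;
import java.util.List;

public class NGramExtractor {
    // Returns the padded character n-grams of a word; for example nGrams("TEXT", 2)
    // yields [_T, TE, EX, XT, T_] when '_' is used as the padding character.
    public static List<String> nGrams(String word, int n) {
        String pad = "_".repeat(n - 1);
        String padded = pad + word + pad;           // n-1 padding characters at each end
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }
}

A language profile is then typically built from the most frequent n-grams of training text in each language, and a new document is assigned to the language whose profile best matches the ranking of the document's own n-grams [CT94].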

2.5. Lucene
Lucene (http://lucene.apache.org) is an open-source, high-performance, scalable Information Retrieval library, fully implemented in Java [HG05]. It is a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License, and it adds indexing and searching capabilities to applications.
The library stores indexes in a well-known index structure called an inverted index. The index contains the indexed terms and statistics about each term: for each term in the dictionary, the number of documents that contain the term and the frequency of the term in each document. It also incorporates term proximity data, i.e. the positions at which each term occurs in each document. The information retrieval model Lucene uses is the vector space model.
Lucene has simple but powerful APIs for indexing. The fundamental classes used for indexing are IndexWriter, Analyzer, Document, Field, and Directory. Using these classes of Lucene and a few others, we can create an index. Lucene does not care about the source of the data, its format, or even its language, as long as the data can be converted to text [HG05]. However, in creating an index for non-English languages in Lucene, the following points have to be taken into consideration:
• Know the encoding of the documents that are going to be indexed.
• Identify the Analyzer that will be used, or write a new one if none exists that fits the purpose. Analyzing text is a language dependent process; an analyzer developed for one language will generally not work for another, since each language has its own set of stopwords, stemming rules, etc.
Lucene supports incremental indexing, i.e. new documents can be added to an already existing index. The text content from the document collection is indexed by Lucene and stored on the file system as a set of index files.

Lucene accepts Document objects that represent a single piece of content, such as a web page or a PDF file. Each document is composed of one or more Field objects, which consist of name and value pairs. Each field should correspond to a piece of information that needs to be either queried against or displayed in the search results. For instance, if the title is to be shown in the search results, it should be added to the Document object as a Field. Fields can be either indexed or not indexed, and the original data can optionally be stored in the index.
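
A minimal indexing sketch using these classes is shown below. It follows a recent Lucene API, so constructors and field classes may differ from the Lucene version available at the time of this work, and the directory name, field names and sample content are assumptions.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexingExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("amharic-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            // the page URL is stored but not tokenized; the content is analyzed and indexed
            doc.add(new StringField("url", "http://example.com/page.html", Field.Store.YES));
            doc.add(new TextField("content", "text extracted from the web page", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
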
A Lucene index contains one or more segments. Each segment contains one or more documents, and documents are populated with fields, which are name-value pairs. Figure 2.3 shows the logical view of Lucene's index.

Figure 2-3 Logical view of a Lucene index (an index consists of segments; each segment contains documents, and each document contains fields FD1, FD2, FD3, ...)

The other powerful feature of Lucene is searching. The basic search classes that the library provides are IndexSearcher, Query, Hits, and QueryParser. Lucene calculates the similarity of the query to each document using the Vector Space Model (VSM) with TF-IDF based weights. It uses the Boolean model first to narrow down the documents that need to be scored, based on the Boolean logic in the query specification. In Lucene, the objects being scored are Documents. After the scores are calculated, the retrieved documents are returned ordered by their score through the Hits class. Query terms listed without an operator use an implicit operator, which by default is OR, but Lucene also supports operators such as AND, NOT, +, and -.
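
A corresponding search sketch is shown below. Note that recent Lucene versions return results through TopDocs and ScoreDoc objects rather than the Hits class mentioned above; the index location, field name and query string are assumptions.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LuceneSearchExample {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("amharic-index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("content", new StandardAnalyzer()).parse("Ethiopia OR ኢትዮጵያ");
        TopDocs results = searcher.search(query, 10);        // the ten best-scoring documents
        for (ScoreDoc hit : results.scoreDocs) {
            Document doc = searcher.doc(hit.doc);             // fetch the stored fields of a hit
            System.out.println(hit.score + "  " + doc.get("url"));
        }
        reader.close();
    }
}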

2.6. The Amharic Language


Like most African nations, Ethiopia is a linguistically diverse country where more than 80 languages are used in day-to-day communication. Although many languages are spoken in Ethiopia, Amharic is dominant in that it is spoken as a mother tongue by a substantial segment of the population and is the most commonly learned second language throughout the country [BSC76]. Amharic is the official language of the Federal government of the country and of some regional governments (such as the Amhara, Benishangul-Gumuz, and Gambella regional states). According to the 1998 census of the country [ECSA98], Amharic is the first language of more than 17 million people and a second language for more than 5 million people.

2.6.1 The Amharic Writing System


According to [BSC76], three writing systems are in use in Ethiopia: the Amharic syllabary, the Roman alphabet, and the Arabic script. The Amharic syllabary, which is derived from the writing system of ancient South Arabian inscriptions, is used for Ge'ez, Amharic, and Tigrigna, with slight modifications. It is a uniquely Ethiopian writing system, used nowhere else in the world except Eritrea (which was once part of Ethiopia) and Israel (by Ethiopian Jews). The Amharic writing system is similar to that of some Semitic languages like Arabic in having vowel marks added to basically consonantal letters. The present writing system of Amharic is taken from Ge'ez, which in turn took its script from the ancient Arabian language mainly attested in inscriptions in the Sabean dialect [BSC76]. Currently, Ge'ez is only the language of religious writings and worship for Ethiopian churches, mainly the Ethiopian Orthodox Church. In the course of its adoption, the script underwent many changes in the shape and number of its symbols.

The original Sabaean alphabet is said to have had 29 symbols. When Ge'ez became the spoken and written language in common use in northern Ethiopia, it took only 24 of the 29 Sabaean symbols and modified most of them. In addition, two new symbols, ጰ and ፐ, were added to represent sounds of Greek and Latin loanwords not found in Ge'ez. The direction of writing was also changed to left-to-right. By the time Ge'ez ceased to be a living spoken and written language and was replaced by Amharic, Tigrigna and other languages, further changes took place. Amharic did not discriminate in adopting the Ge'ez fidel: it took all of the symbols [Yem87] and added some new ones that represent sounds not found in Ge'ez, namely ሸ, ቸ, ኘ, ዠ, ጀ, ጨ, ኸ and ቨ.

Currently, the language's writing system contains 34 base characters, each of which occurs in a basic form and six other forms known as orders. The seven orders represent syllable combinations consisting of a consonant followed by a vowel. This is why the Amharic writing system is often called syllabic rather than alphabetic, even though there is some opposition to this view [Yem87]. The 34 basic characters and their orders give 238 distinct symbols. In addition to the 238 characters, there are forty others which contain a special feature, usually representing labialization. In Amharic there is no capital/lower-case distinction. There are also punctuation marks (often omitted these days) and a numeration system for one to ten, for multiples of ten (twenty to ninety), for hundred and for a thousand. The numerical characters are derived from Greek letters, some modified to look like Amharic fidels, and each has a horizontal stroke above and below [BSC76]. The punctuation marks indicate word boundaries, sentence boundaries, etc. For example, the punctuation mark ፡ (hulet netib) demarcates words, whereas ። (arat netib) marks the end of a sentence. See Appendix I, II, and III for the Amharic alphabet, numerals, and punctuation marks respectively.

2.6.2 Problems of the Amharic Writing System


There is a process of change in any language in many of its aspects: change of meaning, change of syntax, phonetic change, etc. [Hai67]. The case of Amharic is no different; in particular, the script underwent changes when it was borrowed from Ge'ez. Through this adaptation process and other factors, the Amharic writing system acquired some problems.

The first problem is the presence of "unnecessary" fidels in the language's writing system. These fidels (letters) have the same pronunciation but different symbols. Although these different fidels gave words different meanings in Ge'ez, in Amharic they have been used interchangeably. These fidels are ሀ, ሐ and ኀ; ሰ and ሠ; አ and ዐ; and ጸ and ፀ. For example, the word for "sun" can be written as ፀሐይ, ፀሀይ, ጸሐይ, ጸሀይ, etc., all meaning the same thing although written differently. Several solutions have been proposed; one of them is to choose one letter from each set (e.g. one of ጸ or ፀ) and eliminate the unnecessary ones [Hai67].
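
One common way of coping with these interchangeable fidels during indexing and searching is to normalize them to a single representative character before terms are compared. The sketch below is an illustrative assumption (which member of each set is kept is an arbitrary choice, and only the first orders are shown), not the normalization rule of any particular system.

import java.util.Map;

public class FidelNormalizer {
    // map the homophone fidels (first orders shown; the other orders would be handled
    // analogously) to one representative so that variant spellings index to the same term
    private static final Map<Character, Character> MAP = Map.of(
            'ሐ', 'ሀ',
            'ኀ', 'ሀ',
            'ሠ', 'ሰ',
            'ዐ', 'አ',
            'ፀ', 'ጸ');

    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            sb.append(MAP.getOrDefault(c, c));   // leave all other characters unchanged
        }
        return sb.toString();
    }
}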

The other problem is in the formation of compound words. Compound words are sometimes written as two separate words and sometimes as a single word. For example, the word for "kitchen" can be written as ወጥቤት or ወጥ ቤት. There are many such compound words, and some effort is needed to establish a standard way of forming them.

2.6.3 Representation of Ethiopic in a Computer


The global era and the advent of the Information Society demonstrate the need for Ethiopians not only to develop their information and communication infrastructure but also to create an enabling environment for information and knowledge sharing in their own alphabet [Yay04]. Although there have been many indigenous endeavors to computerize the Ethiopic alphabet, the work done so far is not yet entirely suited for real information processing. Today, there are over 40 Ethiopic word processing software products available, each with its own character set, encoding system, typeface names, and keyboard layout [Yay04].

There is an obvious lack of standards in the computerization of the Ethiopic alphabet, despite the attempts of several associations to harmonize the Ethiopic character sets in order to have a unique encoding system. Nevertheless, the effort that has been exerted on the Amharic language is exemplary for other languages that have less representation in the Information Technology world [UNESCO05]. One encouraging development is the inclusion of Ethiopic in the Unicode 3.0 standard released in the year 2000. Unicode provides a unique number for every character, no matter what the platform, the program, or the language, and Ethiopic has secured its own block in the Unicode code table. A number of word processing software vendors have adapted their fonts to comply with the Unicode standard (e.g. Power Geez, Visual Geez, Afenegus, etc.).

Chapter Three

Related Works

3.1. Search Engines for non-English Languages

The one-size-fits-all philosophy of the major general purpose search engines creates problems in their performance on queries in non-English languages [BG03, MC05]. Many language specific search engines have been developed to remedy the problems created by the specific nature of the language in use. This chapter reviews some of those language specific search engines and other related topics.

3.1.1 Search Engines for Indian Languages


India is a multi-language, multi-script country with 22 official languages and 11 written script forms [PJV06]. More than a billion people in India use one or another of these languages as their first language. English, the most common technical language, is the lingua franca of commerce, government, and the court system, but is not widely understood beyond the middle class and those who can afford formal, foreign-language education [PJV06]. All Indian languages are phonetic in nature because the scripts they use originated from the ancient Brahmi script, which is itself phonetic [Pra99]. Two such languages are Tamil and Hindi; Hindi is spoken by 30% of the population [BGMP98]. A significant number of web pages are available in different Indian languages, especially Hindi and Tamil. The following two sections summarize the development of search engines for these two languages.

3.1.1.1 Tamil Search Engine


Tamil is one of the many languages spoken in India. There are many Tamil sites on the Web, using a variety of fonts and encodings. Tamil is a highly inflectional language: every word takes on many forms due to the addition of suffixes indicating person, gender, tense, number, case, etc. [DPG03]. According to [DPG03], like many search engines, the Tamil search engine contains a crawler, a database system and a search system.

The crawler component is responsible for retrieving web pages and handing them to the database module. Seed URLs are given to start the crawling process. The crawler is an incremental crawler: it crawls the web continuously for updates. The downloaded documents are processed for index words, which can be multilingual (English and Tamil); the search engine stores the Tamil and the English words separately. The search engine converts all the different fonts to the TAB font (a standard font enforced by the Tamilnadu government) using a generic font converter, which takes care of many popular fonts.

Stemming and stopword removal are applied to the extracted words before the index is stored in a database management module. The search engine has a graphical user interface. The search module accepts the user query, applies stemming and stopword removal, and searches the database for the words. If a match is found, it is ranked; the ranking is based on the frequency count of the word. The Boolean AND operator is the default operator, followed by OR. The search engine also supports phrase search if the user query is placed between quotation marks. The user interface contains two kinds of keyboard: a Tamil keyboard for experienced users and a phonetic keyboard for novice users, i.e. users who know English but have never used a Tamil keyboard. The search engine ranks retrieved documents based on text similarity only.

The literature [DPG03, San02] that describes the architecture of the Tamil search engine does not explain how the search engine identifies whether a page is Tamil or not. The search engine also does not consider the link structure of the Web in its ranking operation; by not doing so, it loses important information that could have been used for that purpose.

3.1.1.2 WebKhoj: Indian Language IR from Multiple Character Encodings
WebKhoj is a search engine which gives users the opportunity to search in the top 10 Indian languages (by number of speakers). The engine supports Hindi, Telugu, Tamil, Malayalam, Marathi, Kannada, Bengali, Punjabi, Gujarati, and Oriya [PJV06]. According to [PJV06], the plurality of encodings in Indian language web pages creates a problem for searching these languages' documents on the Web. Therefore, in order to search or process Indian language websites, WebKhoj transliterates all the encodings into one standard encoding, accepts the user's queries in the same encoding, and builds the search index on it. The developers of the search engine use Unicode as the common standard encoding.

The conversion of encodings involves several tasks. First, all the various encodings of Indian languages on the web must be identified. Since these encodings are non-standard, there is no comprehensive list of possible encodings; the developers therefore had to identify such encodings and classify them into the existing types. The second step is to build a transliteration mapping from each given encoding into a standard encoding, which is UTF-8, and hence convert any page into the standard before indexing it. The third step is to accept the user's queries in the same standard as the transliterated documents, i.e. UTF-8.

As the developers' goal was to search web sites of specific (i.e. Indian) languages, they were looking for a relatively narrow segment of the Web. While their crawler is very similar to the one described in [CMD99], they used a language identification module instead of a classifier and hence call the approach language focused crawling. The language identification module returns the name of the language of a given web page and is aware of all the proprietary encodings. It labels a web page as Indian only if the number of words of that language in the page is above a given threshold value, i.e. they used the word based method for language identification. The engine uses a TF-IDF based retrieval technique.

WebKhoj provides a soft keyboard which displays the UTF-8 character set of the language on the screen, so the user can type the query using the on-screen keyboard. Soft keyboards for the 10 Indian languages are provided in the search interface, and the user can switch the keyboard to a different language by clicking on the desired language hyperlink displayed on the interface. The engine also has a component that maps multiple spelling variants of a word to a standard form, i.e. word normalization. The literature that discusses WebKhoj gives no explanation of how the search engine ranks its results.

3.1.2 A Search Engine for Portuguese


Tumba! is a search engine specially crafted to archive and provide search services to a community Web formed by those interested in subjects related to Portugal and the Portuguese people. Tumba! has been offered as a public service since November 2002 [Sil03]. The developers claim that, although the search engine has an architecture similar to that of global search engines, it has better knowledge of the location and organization of Portuguese Web sites. The search engine classifies a page as part of the "Portuguese Web" if the page satisfies one of the following conditions:

• Hosted on a site under the ".PT" domain, which is the ccTLD (country code Top Level Domain) of Portugal.

• Hosted on a site under the “.com”, “.net”, “.org” domains, written in Portuguese and
with at least one incoming link originating in a web page hosted under a “.PT”
domain.

Tumba!'s repository is built on a framework for parallel and incremental harvesting of web data. The developers claim that it can provide better rankings than global search engines, as it makes use of context information based on local knowledge of the Portuguese Web and its handling of the Portuguese language. The architecture of Tumba! follows the pipelined (tiered) model of high performance information systems.

Information flows from Web publishing sources to end users through successive stages. At
each stage a different sub-system performs a transformation on the Web data [Sil03]:

• In the first stage, crawlers harvest web data referenced from an initial set of domain
names and/or web site addresses, extract references to new URLs contained in that
data, and hand the contents of each located URL to the Web repository. The software
of tumba! can crawl and index not only HTML files, but also the most popular data
types containing text information on the Web: Adobe PDF and PostScript, Microsoft
Office and Macromedia Flash.
• The Web repository is a specialized database management system that provides
mechanisms for maximizing concurrency in parallel data processing applications,
such as web crawling and indexing. It can partition the tasks required to crawl a Web
space into quasi-independent working units and then resolve conflicts if two URLs
are retrieved concurrently.

• The indexing engine has a set of services that return a list of references to pages in
the repository, matching a given list of user-supplied keywords. Creation of indexes
is performed in a parallel process that generates a set of inverted files, each providing
the list of pages where a range of the search terms can be found. The lists are pre-
sorted by the relative importance of each page.
• The ranking engine sorts the list of references produced by the indexing engine by the
perceived relevance of the list of pages.
• The presentation engine receives search results in a device-independent format and
formats them to suit multiple output alternatives, including Web browsers, mobile
phones, PDAs (Personal Digital Assistants) and Web Services. The presentation engine can also cluster search
results in multiple ways, giving users the possibility of seeing results organized by
topic, instead of based only on the relevance of each matched page.

The search engine provides many services such as related page search, clustering of search
results, a query spell checker, links to Portuguese language documents, etc.

3.1.3 A Search Engine for Indonesian Language


According to [VB01], among the several hundred regional languages and dialects used in the Republic of Indonesia, the Indonesian language is spoken by an estimated 200 million people, and a further 20 million Malay speakers can understand it. The Indonesian language, often referred to as Bahasa Indonesia, is the official language of the country.

The Indonesian language has officially adopted the Roman alphabet, and the large majority of Indonesian documents on the Web use the ASCII character set. The language contains almost no diacritics, except in some rare words assimilated from foreign languages. The dash "-", the numeral two "2", and the squared symbol "²" require special handling. Plurals in Indonesian are expressed by repeating the noun (e.g. "buku-buku" = books), where the repeated noun can be adjoined by a dash. However, it is also common practice to put the number 2 or the squared symbol behind the word to denote repetition in writing (e.g. "buku2" or "buku²"). The repeated forms have also evolved to indicate repetitive action (e.g. "jalan-jalan" = walking around) or other miscellaneous meanings (e.g. "mata-mata" = spy). To further complicate the matter, it is common to affix a repeated word or to repeat affixed words. Thus, a certain mechanism needs to be developed to cater to this language feature.

The language is morphologically rich. There are around 35 standard affixes (prefixes, suffixes, and some infixes inherited from Javanese). Affixes can be attached to virtually any word, and they can be combined iteratively. The developers constructed two algorithms for stemming the standard and extended sets of affixes by defining two sets of grammar rules corresponding to the derivation and inflection laws of the Indonesian language, respectively. Their algorithms are based on morphological rules only (i.e. without a dictionary). The developers stated that affixes in Indonesian are used to form derivations (conceptual variations) rather than inflections (grammatical variations).

The developers claimed that they engaged in the design and development of a search engine
for the Indonesian Web, i.e. the Web of documents written in Indonesian, with two
objectives in mind. On the one hand, they aimed at deploying their own search engine. On
the other hand, they aimed at identifying and solving issues pertaining to the design and
development of non-English language search tools and to create results of interest for other
information retrieval or computational linguistic projects.

The search engine is composed of three main units: the web crawler, the indexing and
retrieval modules, and the user interface. The crawler gathers documents across the Web. It
filters the Indonesian documents based on the language identification algorithm. The
indexing module processes the fetched documents: documents are segmented into words,
stopwords are removed, and remaining words are stemmed. The resulting terms are used to
index the reference of the document (URL). Queries are processed similarly.

The retrieval module retrieves a ranked list of possibly relevant documents by comparing the queries with the documents. In the current implementation they use the SMART (Salton's Magic Automatic Retriever of Text) Information Retrieval system, although they claim that they ultimately aim at implementing their own retrieval engine. The information retrieval model used is the Vector Space Model.

The developers devised an n-gram based language identification algorithm, using a technique based on the frequency of tri-grams in Indonesian words. The system first learns from a list of Indonesian words; then a weighted sum evaluates the similarity between the frequencies of the tri-grams in the reference set and those in a candidate new document. A notion of penalty is also introduced to weight down tri-grams that have never been seen before; in their experiments the developers introduced three kinds of penalties.

3.2. Focused Crawler for Thai Language

As discussed in Section 2.4.1, language focused crawlers bring the document collection from the Web. One such crawler, developed for the Thai language, is discussed below. According to [STK05], the crawler has two components, a crawler and a classifier. The crawler is responsible for the basic functionality of web crawling, e.g. downloading, link extraction, URL queue management, and imposing a crawling strategy. The classifier determines the relevance of crawled pages. The relevance of a given page is determined from its language: if a web page is written in the target language (Thai), its relevance score is 1 (relevant); otherwise its relevance score is 0 (irrelevant).
The classifier uses a combination of the HTML META tag's lang attribute and an n-gram based language classification method based on TextCat (an n-gram based language identification software written in C++). They made some customizations to TextCat so that their language classifier can detect Thai web pages encoded in both UTF-8 and non-UTF-8 character sets. The classifier was tested using 2000 test documents and achieved an accuracy of 94%.
The Language Specific Web Crawling (LSWC) crawler for Thai web pages, which incorporates the Thai language classifier and knowledge about the Thai Web graph structure, was evaluated using a web crawling simulator [STK06]. After developing the crawler, they performed some analysis on the simulator and drew some statistics. The dataset used in their analyses and experiments was created by crawling a portion of the Thai Web and its neighborhood with a limit on the distance from the starting seed URLs. According to the evaluation results, the LSWC strategy achieves the highest harvest rate and comparatively good crawl coverage. They crawled 14 million pages starting from 3 popular web sites in Thailand. From the crawl and the analysis, they found that most Thai web pages (65%) are outside the ".th" domain (the national country domain of Thailand).

3.3. Information Retrieval Work on Amharic Language

Not much has been done on information retrieval for Ethiopic documents in general, and for Amharic documents on the Web in particular. Two such efforts are summarized below.

Saba Amsalu [Ams01] tried to explore the possibilities of applying information retrieval
techniques for Amharic documents on the Web. To meet her objective, she did some
research on Amharic language. Those features of the language pertaining to the purpose of
the research and the language’s writing system were studied. Moreover, different
information retrieval techniques and algorithms used for other languages were reviewed to
determine the possibilities of applying them to Amharic documents on the Web.

Saba developed a prototype system that accepts an Amharic query, searches the database containing the indexed web pages for the terms in the query, and returns those documents that are relevant. The system used an extended Boolean IR model with the operators "AND" and "OR". She designed a database for the storage of the contents of the collected Web pages, a suffix list and index files. An Amharic query interface was also developed for entering Amharic queries, as well as a web page submission form for the automatic indexing and storage of a web page into the database. Stemming and indexing algorithms were developed for the work.

In the prototype system, two inverted index files are generated. One of the files contains full words as index terms, while the other contains index terms with prefixes and suffixes removed using the stemming algorithm she developed. In addition to the contents of web pages, the database also contains a list of around 70 concatenated and individual suffixes. The list of stopwords was obtained using term frequency (TF) with a threshold technique. She took a sample of 313 web pages for testing the prototype, containing articles on social, economic and political news from Walta Information Center.

To evaluate the performance of the system, 42 queries were carefully selected by two journalists, and the average precision-recall value was calculated for each query. From her research she found that using terms with suffixes and prefixes removed resulted in better performance than using full words. Finally, conclusions were drawn based on the results obtained, and recommendations were given as to what further research could be done for the development of an Amharic information retrieval system on the Web.

Even though Saba's work is the first of its kind with respect to Web information retrieval for Amharic documents, it does not consider many things. First, it does not have a web crawler; without a Web crawler one cannot deal with the dynamic nature of the Web. It also operated on a homogeneous collection of documents (only news articles), which does not reflect the heterogeneity of the Web, and the collected documents were very few. In addition, the work did not consider typical characteristics of the Amharic language, such as the need to handle alternative letters for the same word in the language's writing system.

Other work has been done on Amharic information retrieval, such as the development of a stemming algorithm for Amharic information retrieval [NW02]. In this work, Alemayehu and Willett present a stemmer for processing documents and query words to facilitate searching a database of Amharic text. The paper [NW02] describes the Amharic language and its morphology; the authors discuss the inflectional and derivational morphology together with the corresponding affixes of the language in detail. Other issues, such as repetitive verbs and the complications they pose for the morphology of the language, are also discussed.

For the development of the stemmer, the authors first identified a set of stopwords using six
text corpora, selecting stopwords manually from the high-frequency words. For the detection of
some non-content-bearing words that can appear in many forms, an automatic method was
developed. In all, 148 stopwords were identified. The stemmer removes affixes by iterative
procedures that employ a minimum stem length, recoding, and context-sensitive rules, with
prefixes removed prior to suffixes. A word was stemmed only if it had more than two Amharic
consonants.

A routine that handles those fidels of Amharic that have the same sound but different shapes
was developed for the work. For reiterative verbs, if the first order of a character comes
after its fourth order, then the fourth order is removed. For example, the word C DE stems to
CDE. The stemmer follows a context-aware approach: it has five context conditions with two
actions. Affixes that do not need any associated condition(s) are removed as found; otherwise
the five conditions are checked and one of the two actions is taken.

The algorithm was implemented in the Pascal programming language. A word that is going to be
stemmed is first checked against the stopword list. If it is not in the stopword list, it is
passed to the prefix removal routine. Then the routine that handles reiterative verbs takes
its turn. Finally, the word passes through the suffix removal routine.

A sample of 1221 words was selected to determine the accuracy of the stemming algorithm. The
manual assessment revealed that more than 95% of the stemmed words were linguistically
meaningful, with a 4.1% error rate (2.7% over-stemming and 1.4% under-stemming). The authors
mention that it would be possible to add further conditions and rules to increase the success
rate of the current version of the algorithm; however, they refrained from doing so in order
to balance simplicity and effectiveness.

For morphologically rich languages, stemming has a clear effect on retrieval. A study conducted
by Alemayehu and Willett [NW03] on the effectiveness of stemming for IR in Amharic showed that
stemming is necessary for effective retrieval of the language's documents. In their research
they compared word-based, stem-based, and root-based searches of an Amharic document test
collection and found that stem-based and root-based searches perform significantly better than
a word-based system.

3.4. Lessons Learned


Almost all search engines are more or less similar architecturally. However, in the course of
developing a search engine, especially a language-specific one, many factors must be taken into
consideration. In such an effort, there is a need to pinpoint the features of the language that
can affect the retrieval of the language's documents. Apart from the language's features, the
different encodings of the language's electronic documents play a vital role. The different
search engines reviewed in this chapter show that these two factors (language features and
encoding) were given serious thought in their development.
The development of a language-specific search engine needs a language-focused crawler for its
operation. The main hurdle in developing such crawlers is the identification of language and
encoding; different developers used different techniques and heuristics for the task. In the
indexing process, the language's typical features that can affect the retrieval process must be
taken into consideration. This chapter also showed what has been done and what remains to be
done regarding Web information retrieval of Amharic web documents.

Chapter Four

Design of the Amharic Search Engine

"…Small is beautiful…"; this saying is much more meaningful when we come to information
retrieval on the Web. Locating resources of interest on the Web is, in the general case, at
best a low-precision activity owing to the large number of pages on the Web [Hug06]. One of
the approaches used to overcome the problems caused by the large volume of the Web is to
develop topic-specific search portals. These topic-specific search engines index only documents
of interest, i.e. only documents relevant to the topic. In addition to increasing the precision
of the retrieval activity, indexing only some portion of the Web (only web pages of a specific
language, in our case) has other advantages, such as:
• being able to obtain web pages related to a country, language, or topic from anywhere
on the Web
• making it easier to maintain the freshness of the index
• saving computational resources, etc.
This chapter discusses the design of the Amharic search engine. It states which typical
features of the Amharic language are included in the design process. It also explains the
details of the different parts of the search engine and other related issues.

4.1. Design Requirements


In designing a search engine, especially a language-specific one, the typical features of the
language play a pivotal role. The design and realization of every information retrieval system
must consider the features of the language it is intended for. Having efficient and effective
indexing and searching mechanisms is one of the requirements in designing every IR system. In
addition, our search engine takes the following typical features of the Amharic language, which
affect the retrieval of the language's documents from the Web, into consideration.
Morphological variants of words
Amharic is a morphologically rich language in which up to 120 words can be conflated to a
single stem [NW02]. This clearly shows that stemming has a profound effect on the retrieval
process. A morphological variant of a word may differ only in tense, case, plurality, etc., or
may have a wholly different meaning or class. A stemmer developed by Nega Alemayhu and Peter
Willett [NW02] is adapted for this work with some adjustments. The original stemmer is an
aggressive one that tries to stem a word for both derivational and inflectional morphology.
This work stems only inflectional morphology, i.e. it stems a word for tense, number, case,
and gender.
Short forms of compound words
In the Amharic language, it is common to write some words in a shorter form using "/" or ".".
For example, the word FGH IJK can be written as F(I(. A search engine designed for the Amharic
language must consider this feature of the language.
Repetitive alphabets (fidels)
As stated in section 2.6.2 of Chapter Two, Amharic has some repetitive alphabets that can be
used interchangeably in writing without a change in semantics. An IR system for the language's
documents must give special attention to these alphabets. In our system, a user can enter
queries using these fidels interchangeably and gets the same result for those queries.
Encoding issue
As outlined in section 2.6.3, there are different encodings for representing Ethiopic documents
in a computer. Unicode seems to be the choice for representing Amharic documents in the future,
as the world moves in that direction. This work considers only Amharic documents written using
fonts with Unicode encoding.
Boolean operators
A user may want to search for a single keyword or for more than one keyword. He/she may want to
combine these keywords using different Boolean operators. The search engine must give the user
the opportunity to use these Boolean operators as needed.

4.2. Architecture of the Amharic Search Engine


Our search engine is designed to index and search only Amharic-language documents available on
the Web. Like many other search engines, it has three components:
The Crawler Component
The Indexer Component
The Query Engine Component

These components have sub-components that give special attention to the typical characteristics
of the language they are designed for. As mentioned in previous chapters, Amharic has unique
characteristics that affect the retrieval of the language's documents from the Web.
The Crawler is responsible for collecting only Unicode-encoded Amharic-language documents from
the Web. The Indexer component builds indexes from the documents that it gets from the Crawler.
It tokenizes the text, removes stopwords, and applies stemming to the words before indexing
them. This index is used by the Query Engine to satisfy user needs. The detailed architectural
design of the Amharic search engine is shown schematically in Figure 4.1.

[Figure 4.1 shows the three components and their sub-components: the Crawler (Frontier,
Downloader, Head Verifier, Robots.txt downloader, Link Extractor, Link Canonicalizer, Text
Extractor, Categorizer, the HistURL and AmhURL queues, and the Repository), the Indexer
(Parser, Stop Remover, Normalizer, Stemmer, Indexer, and the Index), and the Query Engine
(Search Interface, Query Parser, Normalizer, Stop Remover, Stemmer, Query Processor, Rank,
and Result Displayer).]

Figure 4-1 Architecture of Amharic Search Engine

The following sections describe the details of each component of the search engine.

4.3. The Crawler

Traversing a whole country domain (ccTLD) is a common way to get web pages of a certain
language; this approach was used in [Sil03]. However, due to internationalization pressures and
economic reasons, web pages relating to a given language or country are frequently put on web
servers outside the national domain [STK06]. Thus, traversing the ccTLD to get web pages of a
certain language or about a certain country will miss those pages that are under gTLDs1 such as
".com", ".net", ".org", etc. This is much more pronounced for a country like Ethiopia, which
has a significant diaspora population. There are many web pages in different languages of the
country that are created by this diaspora population and hosted under domains other than ".et",
the national domain name of Ethiopia. These important web resources should be fetched
systematically for our purpose, which is developing an Amharic search engine.

The Crawler component of our search engine traverses the Web and fetches those web pages that
have Unicode encoding and Amharic content. Among the different encodings that exist today for
Amharic-language web documents, Unicode (UTF-8) is selected because more and more Internet
standard protocols designate Unicode as the default encoding, and a significant shift towards
the use of this encoding on web pages is expected [LM01]. There are two approaches to get those
pages (i.e. both Amharic and Unicode):

1. Use HTTP request header fields (such as "Accept", "Accept-Charset", "Accept-Language", etc.)
and request those pages whose character encoding is Unicode, whose characters fall in the range
of the Ethiopic block of Unicode (U+1200 - U+137F), and whose language is Amharic (because
there are other languages, such as Tigrigna, that use the same script).
2. Simply download the page corresponding to the given URL and filter those pages with the
required properties.

1 gTLD – generic Top Level Domain

Crawling is a bandwidth-intensive process, and bandwidth is a limited resource that should be
used wisely [Pat04]. In the course of developing a search engine, the optimal use of limited
resources such as bandwidth is always preferable. Therefore, the first option is a good
approach to save this limited resource. We can use the HTTP request header fields like this:
Accept: text/plain, text/html …
Accept-Language: am-ET
Accept-Charset: UTF-8
But this approach has two problems:
• Most servers and web pages either do not have a language attribute, or their response to a
request about the attribute is unreliable. This results in faulty HTTP responses for the
requested attribute.
• HTTP does not support a Unicode block range request/response method. This prevents us from
requesting only those pages whose content falls in the Ethiopic block, i.e. pages whose
character set falls in the Ethiopic block.

The solution to these problems is to download all pages and filter those that are written in
Amharic with Unicode encoding. In light of this, our Crawler is a focused one (a
language-focused crawler) with two sub-components:
• Crawler
• Categorizer

4.3.1 Crawler
The Crawler is responsible for downloading a page. This sub-component first uses the HTTP Head
method before sending the Get method for the actual download. The HTTP Head method is identical
to the Get method except that the server must not return a message body in the response;
instead it returns meta-information about the requested resource. This helps to make sure that
the page specified by the URL meets the specified attribute criteria. The main reason for using
the Head method before the Get method in our crawler is to save bandwidth by not downloading
unwanted pages. This sub-component asks the server whether the page (specified by the URL) is
text using the HTTP Head method (via the Accept attribute). For example:

HEAD / HTTP/1.1
Connection: keep-alive
User-Agent: the-crawler
Host: <host part of the URL>
Accept: text/html, text/plain, …

If the server response status code is OK (200), then the crawler will use the Get method to fetch
the page.

4.3.2 Categorizer
After a page is fetched, we have to determine whether it is important for our purpose, i.e.
whether it is "relevant" or not. If the page is an Amharic page, it is relevant; otherwise it
is irrelevant. Pages that are Amharic are processed further; the rest are discarded. The
Categorizer sub-component classifies a page based on the language it is written in (i.e.
whether the page is Amharic or not). The further processing includes link extraction, indexing,
etc. Discarding non-Amharic pages serves as a heuristic to avoid unimportant pages: links are
not extracted from pages that most likely point to other non-Amharic pages. This tallies with
the assumption of language locality on the Web that language-focused crawlers are usually based
on, and it is the same approach that hard-focused crawlers use.

4.3.3 The crawling process


To start the crawling process, seed URLs must be selected carefully. Web sites that are
frequently visited and that contain many Amharic documents must be selected for this purpose.
The crawler has the following major data structures:
• Frontier: stores URLs to be visited
• History queue: stores URLs that have already been visited
• Amharic queue: stores URLs of downloaded pages that are Amharic
• Repository: stores downloaded pages
It has the following major components:
• A component that sends the Head request and does the actual downloading if the response is
HTTP OK (200)
• A component that checks whether the page is relevant or not (Amharic or not) - the Categorizer
• A component that parses the downloaded page for words and links
• A component that canonicalizes the URLs:
o Checks whether the URL starts with http://
o Changes the URL to lower case
o Removes the "www" part
o Adds a trailing "/"
o Changes relative URLs to absolute ones
• A component that checks whether a parsed link has been visited before
• A DNS resolver
• A component that downloads and parses Robots.txt and guides the crawling (downloading)
process accordingly
• A component that controls the crawling process:
o Stopping/resuming the process
o Checkpoint taking
• A component that enforces the politeness policy and other policies:
o Limiting the time interval between two requests to the same host
o Determining the timeouts for delayed network connections
• A component that determines the ordering of the crawling
The algorithm of the crawler is shown in Figure 4.2.

1. Populate (initialize) the frontier with seed URLs


2. Initialize history queue to null
3. Initialize Amharicqueue to null
While frontier is not empty
4. Pick a URL from the frontier
(The ordering component takes part here)
5. Send request to the server using Head method
If response OK
Go to 6
Else
Go to 4
6. Check the Robots.txt cache for the specified host
If found
6.1 Check the URL in the list of URLs not to be crawled
If found
go to 4
Else // if URL is not found in the cache
Send request to the server using the Get method
Add the URL to the history queue
Save the downloaded page in the Repository
Parse the page for text using text extracting tools
Send the parsed text to the classifier
If Amharic
Add URL to Amharicqueue
Extract links in the page
Canonicalize the URL
Check if the URL is in the history queue
If not found
Check if it is in the frontier
If not found
Add to the frontier
(the ordering component perform its work here too)
Else //it is found in the frontier
Discard
Else // found in the history queue
Discard
Else //if the robots.txt is not in the cache
Ask for the robots.txt from the server’s root directory and
Add list of URLs in the cache
Go to 6.1
7. Go to 4

Figure 4-2 Algorithm of the Crawler

This process continues until some threshold number of pages has been downloaded or until
resources (storage or memory) are exhausted. The crawling order is breadth-first, as this
ordering yields important pages at the beginning of the crawl [NW01].

The refreshing process is performed by turning the Amharic queue into the frontier and
beginning the crawling process again. This refresh policy is a uniform one, i.e. it is applied
to all pages with equal frequency. The uniform policy has superior performance to the
proportional policy, which re-crawls pages based on their rate or frequency of change [CGM03].

Before the crawler downloads a page using the Get method, it must first ask the server for the
standard for robot exclusion (Robots.txt) file from the server's root directory. If the server
sends a response, the file is parsed and the crawler acts accordingly.
Poorly written crawlers can crash servers. One cause of this is "rapid fire": sending more
requests than the web server's request-handling capacity. Hence, there must be some gap between
two consecutive requests to the same server. Usually, an interval of less than 15 seconds
between requests is considered impolite [Cas04]; we use 20 seconds. The following algorithm
(Figure 4.3) shows this process.

While (frontier is not empty)


1. Pick a URL from the frontier
o Get the host part from the URL
2. Pick a URL from the history queue
o Get the host part from the URL
3. Compare the two host names for equality
o If equal
Wait for 20 sec before sending the download request
o Else
Download

Figure 4-3 Algorithm for Crawling Gap
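
As a concrete illustration, the following is a minimal Java sketch (not the thesis's actual
implementation) of how such a per-host gap could be enforced; the class name, the 20-second
constant, and the map of last request times are assumptions made for the example.

import java.util.HashMap;
import java.util.Map;

// A minimal sketch of a politeness gate: before a URL's host is contacted,
// the crawler waits until at least CRAWL_GAP_MS have passed since the last
// request to that same host.
public class PolitenessGate {

    private static final long CRAWL_GAP_MS = 20 * 1000; // the 20-second gap used in this work
    private final Map<String, Long> lastRequestTime = new HashMap<String, Long>();

    public synchronized void waitForTurn(String host) throws InterruptedException {
        Long last = lastRequestTime.get(host);
        if (last != null) {
            long elapsed = System.currentTimeMillis() - last;
            if (elapsed < CRAWL_GAP_MS) {
                // Same host was contacted recently: sleep for the remaining gap.
                Thread.sleep(CRAWL_GAP_MS - elapsed);
            }
        }
        // Record the time of this request for the next check on the same host.
        lastRequestTime.put(host, System.currentTimeMillis());
    }
}

The crawler would call waitForTurn() with the host part of the URL just before issuing the
download request.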

4.4. The Indexer Component

The output of the Crawler component, mainly the repository, is the input for the Indexer
component. The Indexer component is responsible for creating the logical representation of the
documents in the repository. In this component, the parsed words are stored in a format that is
appropriate for the searching process.
The Indexer component has the following sub-components:
• A component that stores (indexes) a word (term) with some necessary information such as term
frequency (TF), inverse document frequency (IDF), etc. This is represented as the Index in
Figure 4.1
• A normalization component (the Normalizer in Figure 4.1), which:
o Replaces Amharic alphabets (fidels) with their common replaceable fidels
o Deals with compound words (separated by "/", ".")
• A stopword removal component
• A stemming component
Of these, the normalization component, the stopword removal component, and the stemming
component are the ones given special attention in this research work, because these three
components address the unique characteristics of the Amharic language with respect to the
retrieval of the language's documents from the Web. The following sections explain the
sub-components of the Indexer component in detail.

4.4.1 The Normalization Component


This component handles most of the Amharic-language writing issues that are discussed in
section 2.6.2 and other issues related to the language. It handles:
• The tokenization of words based on white space, the hulet netib (፡), the arat netib (።), the
netela serez (፣), the dereb serez (፤), the kaleagano, and other Amharic punctuation marks
• The replacement of Amharic alphabets that have the same pronunciation and use, but different
representations, with a common alphabet (fidel)
• The shorter forms of words that are usually written using a forward slash ("/") or period
(".")

The algorithm for the normalization component is depicted in Figure 4.4.
While (! End of text)
1. Set a buffer to empty
2. Read a character
i. If the character is one of ; < & : 9 8 or their orders
call a component that can handle their replacement.
1. add the character to the buffer
2. go to step 2
ii. Else if the character is "/" or "."
1. call a component that can handle such characters
2. return the string as a word
3. go to step 1
iii. Else if the character is a white space or one of the punctuation marks
1. return what is in the buffer as a word
2. go to step 1
iv. Else
1. add the character to the buffer
2. go to step 2
Figure 4-4 Algorithm of the Normalizer

The algorithm for replacing the alphabets (fidels) is shown in Figure 4.5 below.
1. Read a character
• If the character is one of ; , or < replace them with ;
(The same applies for the orders, i.e. the orders of and <
will be replaced by the corresponding orders of ; )
Return the replaced character
• If the character is one of or & replace them with
(The same applies for the orders)
Return the replaced character
• If the character is one of 9 or : replace them with 9
(The same applies for the orders)
Return the replaced character
• If the character is one of or 8 replace them with
(The same applies for the orders)
Return the replaced character
• If the character is P replace it with Q
Return the replaced character
• if the character is R replace it with S
Return the replaced character
• if the character is T replace it with
Return the replaced character

Figure 4-5 Algorithm of Character Replacer
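
Since the fidels in the figure above may not render in all environments, the following is a
minimal Java sketch of the replacement idea, assuming the commonly cited homophone families
ሀ/ሐ/ኀ, ሰ/ሠ, አ/ዐ and ጸ/ፀ and their code points in the Unicode Ethiopic block; the class and
method names are illustrative, not the thesis's actual implementation, and the choice of which
family member is treated as canonical is an assumption.

// A minimal sketch of fidel normalization. Each Ethiopic consonant occupies a
// run of consecutive code points holding its orders, so a character is mapped
// to the corresponding order of the canonical family by preserving its offset.
public class FidelNormalizer {

    // Remaps a whole family of orders: [from, from+count) -> [to, to+count).
    private static char remap(char c, int from, int to, int count) {
        return (c >= from && c < from + count) ? (char) (to + (c - from)) : c;
    }

    public static char normalize(char c) {
        c = remap(c, 0x1210, 0x1200, 7); // ሐ-family -> ሀ-family
        c = remap(c, 0x1280, 0x1200, 7); // ኀ-family -> ሀ-family
        c = remap(c, 0x1220, 0x1230, 7); // ሠ-family -> ሰ-family
        c = remap(c, 0x12D0, 0x12A0, 7); // ዐ-family -> አ-family
        c = remap(c, 0x1340, 0x1338, 7); // ፀ-family -> ጸ-family
        return c;
    }

    public static String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            sb.append(normalize(word.charAt(i)));
        }
        return sb.toString();
    }
}

Applying the same normalization to both indexed text and queries is what lets a user enter any
of the interchangeable fidels and retrieve the same documents.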


The algorithm that handles shorter forms of a word (those that use a forward slash ("/") or
period (".")) works as follows. Words that are commonly written in shorter form using such
constructs are stored in a structure. For example, IJ B is written in a shorter form as B. To
expand B as IJ B, there is first a data structure that stores the expanded form IJ. Based on
this, the algorithm is shown below (Figure 4.6).

1. Read a character before ”/” or “.”
2. Search the storage for that character
If found
Return the corresponding expanded word
Else
Return the original word

Figure 4-6 Algorithm of Word Expander


After a word is processed, i.e. normalized, checked against the stopword list, and stemmed, it
is stored in the index (Figure 4.1). The index should be a structure that is suitable for fast
search. The inverted index is the structure of choice for many IR systems, including search
engines. Therefore, our search engine indexes words in an inverted index. Some statistics, such
as TF, IDF, etc., are also stored together with the processed term.

4.5. Query Engine Component


The main function of the Query Engine component is to accept user query(ies), send them to the
Indexer component for searching, and display the relevant document set, returned by the Indexer
component, for the user's information need. The component has the following sub-components:
• A facility that lets the user enter query(ies) in Amharic
• A component that parses the user query
• A component that applies normalization to the query
• A component that applies stopword removal to the query
• A component that applies stemming to the query
• A component that calculates the similarity of the documents in the index with that of the
query
• A component that ranks the results
• A component that displays the results to the user in an appropriate format
The normalizer, stopword remover, and stemmer applied to the query(ies) are the same as the
ones applied in the Indexer component. The sub-component that calculates the similarity of the
documents in the index with the query embodies the retrieval model that the system uses. Among
the different models discussed in section 2.1, the vector space model is chosen for our search
engine, due to its popularity among different search engines and its ability to handle partial
matching. The composite of TF and IDF is used to weigh terms in the query and in a document.

4.5.1 Ranking
Since users usually look only at the very first pages returned by a web search engine, it is
very important to effectively rank the results returned for the submitted queries [SHMM98]. A
combination of text-based ranking and link-based ranking is used in our search engine to rank
the "relevant" documents that are retrieved.
In this work, backlink-count-based ranking is used in combination with text-based ranking. The
backlink count works as follows: after the crawler downloads all the pages, the links in all
pages are parsed and analyzed, and the number of incoming Amharic links that a page has becomes
the link rank of that page. The text-based similarity is calculated using the vector space
model. The two ranks are added to give the final rank of a given page.

Chapter Five

The “ ” Search Engine


This chapter describes the implementation of the algorithms designed for the Amharic search
engine, whose details are given in the previous chapter. Our search engine is named (Habesha).
The search engine has three components, as described in section 4.2. The Crawler component
crawls the Web and collects web pages that have Amharic content with Unicode encoding. It
stores the downloaded Amharic web documents in a file structure (repository) organized by
document type.
The Indexer component processes the documents in the repository to create the index. Character
streams are tokenized into words based on Amharic punctuation marks and white space. Further
processing is done on the tokenized streams, such as substitution of repetitive Amharic
alphabets and expansion of short words. Stopwords are removed from the tokenized stream in
order to eliminate words with little or no semantic value. The remaining words are stemmed to
reduce different morphological variants of a word to the same stem. After all the processing,
the words are indexed in an inverted index structure, together with some statistics, using
Lucene.
The Query Engine component accepts the user query in Amharic and searches the Lucene-based
index for the parsed query. After parsing the user query(ies) and applying pre-processing, the
Query Engine component first calculates the text-based similarity of the query with that of the
documents in the index. It then combines this similarity with link-based ranking and presents
the results to the user in descending order.
The following sections describe each of these components and other issues pertinent to the
development process in detail.

5.1. Development Environment


Developing a full-fledged, functional search engine is a resource-intensive task that requires
a lot of time and money. Computer hardware and network bandwidth are the two most important
resources needed to develop such a system. Crawling a part of the Web is a bandwidth-intensive
process, while indexing and searching need a powerful CPU and fast access to hard disks.
Crawling is usually done using many computers simultaneously, downloading millions of pages per
day. Indexes can be stored on many servers located at different locations; these servers may
hold the same copy of the whole index or just a segment of it. Serving results to user requests
requires accessing these indexes that are distributed at different places. This process
includes selecting the relevant documents for the given keyword(s) from the indexes, ranking
them according to their relevance to the user query, and serving the user with ranked results
in an appropriate format within a fraction of a second. This whole process indeed needs huge
computational resources in addition to efficient and robust algorithms.

Our system was developed and tested on a single PC with an Intel® Pentium® IV CPU at 2.00GHz,
256 MB of RAM, and a 40GB hard disk, running the Microsoft Windows XP Professional operating
system. We used a bandwidth of 6MB/sec that is shared among thousands of members of the Addis
Ababa University community.

We searched for an appropriate programming language for the implementation of the search engine
algorithms and selected Java, for the following reasons:
• it has good networking capabilities
• many reusable components are available on the Web
• it is platform independent
• prior experience with the language
Different third-party tools have been used in the development process, for example text
extraction tools. Lucene is used for indexing and searching. Lucene was selected for its index
data structures (inverted index), its ability to consider partial matching (vector model), its
support for different Boolean operators in queries, and its support for different kinds of
queries.

5.2. Search Engine


As described in the chapter on the design of the Amharic search engine, our search engine
contains three major components: the Crawler, the Indexer, and the Query Engine. Each of these
components incorporates the typical features of the language it is designed and implemented for
into its functionality. The following sections explain the development of each of these
components in detail.

5.3. The Amharic Crawler
The Crawler component is the component that brings in the document collection that the whole
system relies on. There are different open-source and closed-source, high-quality, ready-to-use
crawlers available on the Web. These crawlers are written in different programming languages
and range from specific-use to general-purpose crawlers. Why not use one of these ready-to-use
crawlers, as it is or with some customization, instead of developing a new one from scratch? In
light of this question, we reviewed some of the popular open-source crawlers in the hope of
using them for our purpose. Two of the crawlers that we reviewed were SPHINX1 and ht://Dig.

SPHINX is a Java-based toolkit and interactive development environment for Web crawlers. SPHINX
explicitly supports crawlers that are site-specific, personal, and relocatable [MB98]. As
discussed in the FAQ2 section of the software's documentation, the crawler is designed for
advanced web users and Java programmers who want to crawl a small part of the Web (such as a
single web site) automatically. Our system needs a crawler that can scale to a large portion of
the Web; hence, SPHINX does not meet this requirement. Furthermore, the classifier that the
toolkit uses does not specify how to classify a page based on its language.

As stated in the documentation section of its web site (www.htdig.org), the ht://Dig system is
a complete indexing and searching system for a domain or intranet. The system is developed in
C++. It has its own crawler, which is suitable only for a single site or intranet. Moreover,
the crawler does not have a mechanism to classify a page based on language.

The above two crawlers fell short of our requirements, so we decided to develop our own.
Developing a crawler from scratch gives us the opportunity to add all the functionality that
our crawler needs and gives us finer control over its operation. As discussed in section 4.3,
our crawler is a language-focused one: it searches the Web for web pages that have Amharic
content with Unicode-based fonts.

The Crawler component contains two parts (as specified in Chapter Four): Crawler and
Categorizer.

1 Specific Processors for HTML INformation eXtraction
2 Frequently Asked Questions

5.3.1 Crawler
The main function of this sub-component is to download web pages and hand them to the next
sub-component for further processing. It mainly uses the HTTP Get and Head methods. The Head
method is used to check whether the page is a text page. The Get method is used to download the
Robots.txt file and the page content, if the response to the Head method is OK (200). For this
work the crawler considers only certain file types (HTML, PDF, MS-Word, and text files). After
downloading the pages, it saves the contents into different folders based on their file
extension (i.e. if a page is a PDF page, it is stored in a folder that stores PDF files only).
The storage of web pages by document type is applied only to facilitate the task of the text
extraction tools.

The next step is to determine the language of the document, i.e. to determine whether the page
is Amharic or not. For this we need to separate the text of the document from other structures,
for example from the different presentation tags in HTML. We used the following open-source
tools to extract the textual content of the pages; they were selected because they are Java
based and open source.

PDFBox1: Java-based software that parses PDF documents and extracts useful information,
including the body text and title. Version 0.7.3 is used in this work.
NekoHTML2: a Java-based program that parses HTML documents and extracts useful information,
including the body text, keywords, and title. Version 0.9.2 is used in this work.
WordExtract3: a Java-based program that parses Microsoft Word documents and extracts useful
information, including the body text. This work uses version 0.4.

1 http://www.pdfbox.org
2 http://people.apache.org/~andyc/
3 http://www.textminning.org

After the content of a page is separated from its presentation, the text is fed to the
Categorizer, which determines the natural language of the document. The following section
explains the internal working of the Categorizer.

5.3.2 Categorizer
The Categorizer is the heart of a language-focused crawler and is what distinguishes it from a
generic crawler. This sub-component identifies the language of a given document. A decision to
discard
or to process a downloaded page further is based on the output of the Categorizer. We searched
for software that could do the identification for us. We reviewed two programs, both based on
n-gram techniques, and selected the one appropriate for our purpose. An n-gram based technique
was chosen for language identification due to its high identification accuracy, its resilience
to typographical errors, and its minimal data requirement for training [CT94].

The program that we selected is called the Language Identification Module (LIM). LIM is
Java-based software developed by Japanese researchers for their Language Observatory Project1
(LOP). LIM can identify the Language, Script, and Encoding of a given document simultaneously.
The identification process is based on n-gram statistics of documents.

LIM contains two parts. First, the training component accumulates sets of n-grams from the
training data. The n-grams, called shift-codons, are any three bytes of a byte sequence
[SMO02]. Shift-codons are extracted starting from the first position, the second position, …
up to the (n-2)th position of the training data, where n is the length of the training data.
The set of shift-codons thus created is stored with LSE2 tags in the reference database. The
training data for the original LIM are mainly translations of the Universal Declaration of
Human Rights (UDHR) provided by the Office of the United Nations High Commissioner for Human
Rights.

The second component, the identification component, produces the shift-codons of the document
whose language is to be classified and compares them with all sets of shift-codons stored in
the reference database. After the comparison, the component calculates the matching ratio of
the shift-codons of the target text (the number of matched codons divided by the total number
of codons in the target document) and returns the LSE with the highest ratio as the result.
The software is trained for around 280 different languages.
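
To make the idea concrete, the following is a minimal Java sketch of the shift-codon matching
ratio, under the assumption that a shift-codon is simply every overlapping three-byte window of
the text; the class and method names are illustrative and are not part of LIM.

import java.util.HashSet;
import java.util.Set;

// A minimal sketch of n-gram (shift-codon) matching: collect every overlapping
// 3-byte window of the training text, then score a target text by the fraction
// of its own 3-byte windows that appear in the trained set.
public class ShiftCodonMatcher {

    // Extract all 3-byte shift-codons from a byte sequence.
    public static Set<String> codons(byte[] data) {
        Set<String> set = new HashSet<String>();
        for (int i = 0; i + 3 <= data.length; i++) {
            set.add("" + data[i] + "," + data[i + 1] + "," + data[i + 2]);
        }
        return set;
    }

    // Matching ratio: matched codons of the target divided by its total codons.
    public static double matchingRatio(Set<String> trained, byte[] target) {
        int total = 0, matched = 0;
        for (int i = 0; i + 3 <= target.length; i++) {
            total++;
            String codon = "" + target[i] + "," + target[i + 1] + "," + target[i + 2];
            if (trained.contains(codon)) {
                matched++;
            }
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }
}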

We made some modifications to the software to make it suitable for our purpose. First, we
trained the program. Since our work considers only Unicode-encoded pages, we trained LIM with
an Amharic document of size 34.8KB in "utf-8" encoding; "Ethiopic" is used for the script and
"utf-8" for the encoding. After training the program, we tried to test it on some documents.
Since we were running our tests on a workstation with 256MB of RAM, we faced a problem of
memory overrun. We alleviated the problem by deleting some trained languages from the database.
Second, we modified the program so that it only returns the first match instead of the default
result set (i.e. the list of all matching documents).

1 http://gii2.nagaokaut.ac.jp
2 LSE – Language Script Encoding

Our LIM-based classifier classifies a page as Amharic if the language is Amharic, the encoding
is "utf-8", the script is "Ethiopic", and the matching ratio is greater than or equal to 60%. A
threshold of 60% is selected so as to include multilingual documents in which Amharic content
occupies the greater proportion.
After classifying the page, one of two actions is taken:
Action 1: discard it, if the page is not Amharic
Action 2: process it further, if it is Amharic
The further processing of a page includes the extraction of links, canonicalization of the
extracted links, applying the URL-seen test, and adding unseen links to the frontier. If the
page is not Amharic, it is deleted from the repository. See Figure 5.1 for the flow of the
crawler.

5.3.3 The crawling Process


The crawler starts its crawling operation with seed URLs. Getting Amharic seed URLs was one of
the problems that we faced during the course of developing the search engine, mainly for two
reasons. The first and major one is the blocking, for political reasons, of many Amharic web
sites hosted outside the ".et" domain by the incumbent government. The other problem was
finding Amharic web sites that contain many Unicode-encoded Amharic documents. The sites that
were selected as seed URLs were:
• am.wikitionary.com
• www.ena.gov.et – the web site of the Ethiopian News Agency, a government-owned news agency
• www.waltainfo.com – the web site of the Walta Information Center, a government-owned news
agency
• www2.dw-world.de/Amharic – the Deutsche Welle web site for the Amharic language
• www.ethiopiawakeu.net, and a few others
These sites were selected either because they are among the most widely used or because they
contain many Amharic documents. The frontier is initialized with the above seed URLs. The
crawler continues retrieving pages until its queue is empty, unless the programmer cancels it
by calling stop(), or the crawl exceeds some predefined limit (number of pages visited).

As mentioned before, only the HTTP protocol is used for the downloading process. Although the
java.net package provides basic functionality for accessing resources via HTTP, it doesn't
provide the full flexibility or functionality that we need for our crawler. For example, it is
difficult to reliably set the network timeout for faulty connections or the read timeout for
slow connections. So, we used Apache Jakarta's Commons HttpClient1. HttpClient is an
open-source, standards-based, pure Java implementation of HTTP versions 1.0 and 1.1. Some of
the useful features of the package are:

• the ability to set connection timeouts
• request output streams that avoid buffering the content body by streaming directly to the
socket of the server
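
The following is a minimal sketch of how the Head-then-Get sequence of section 4.3.1 could look
with Commons HttpClient 3.x; the timeout values, the accepted MIME-type list, and the helper
class name are assumptions made for illustration, not the crawler's actual code.

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.HeadMethod;

// A minimal sketch of the Head-then-Get download step.
public class PageFetcher {

    private final HttpClient client = new HttpClient();

    public PageFetcher() {
        // Assumed timeout settings to cope with faulty or slow connections.
        client.getHttpConnectionManager().getParams().setConnectionTimeout(10 * 1000);
        client.getHttpConnectionManager().getParams().setSoTimeout(30 * 1000);
    }

    // Returns the raw page bytes (to be saved in the repository) or null if skipped.
    public byte[] fetchIfAccepted(String url) throws Exception {
        HeadMethod head = new HeadMethod(url);
        head.setRequestHeader("Accept", "text/html, text/plain, application/pdf, application/msword");
        try {
            if (client.executeMethod(head) != HttpStatus.SC_OK) {
                return null; // server refused or page missing: skip it
            }
            Header type = head.getResponseHeader("Content-Type");
            // Assumed check: only the document types handled by the indexer are fetched.
            if (type == null || !(type.getValue().startsWith("text")
                    || type.getValue().startsWith("application/pdf")
                    || type.getValue().startsWith("application/msword"))) {
                return null;
            }
        } finally {
            head.releaseConnection();
        }
        GetMethod get = new GetMethod(url);
        try {
            if (client.executeMethod(get) == HttpStatus.SC_OK) {
                return get.getResponseBody(); // bytes saved to the repository for later text extraction
            }
            return null;
        } finally {
            get.releaseConnection();
        }
    }
}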

1 http://jakarta.apache.org/commons/httpclient

[Figure 5.1 depicts the flow of the Crawler: the frontier is initialized with seed URLs; until
the termination condition is fulfilled, a URL is picked from the frontier, its Head response is
verified, Robots.txt is checked, the page is fetched and checked for Amharic content, the page
is stored and parsed, and URLs that pass the URL-seen test are added to the frontier.]

Figure 5-1 Flow of the Crawler

5.4. The Indexer
The Indexer component creates the representation of the document collection gathered by our
crawler. The Indexer extracts text from the downloaded pages, and the extracted text passes
through different processes before it is stored in the index. The character streams are
tokenized into words, taking white space and Amharic punctuation marks as word demarcations.
Shorter forms of words are replaced by their expanded forms. Then the words are checked against
the stopword list; if a word is a stopword or a variant of a stopword, it is removed. The
remaining non-stopwords are stemmed to reduce them to a common form. The final result of all
these processes is stored as an index in a structure that is appropriate for fast searching and
further processing.

Lucene plays a pivotal role in this component. The indexing capability of the library is used
for creating the logical representation of the crawled pages. The main task in this component
of our search engine is therefore to integrate the Amharic language features with Lucene.

5.4.1 Pre-Processing
After the crawler collects web pages, it stores them in the repository according to their
document type. The document types considered in this work are PDF, MS-Word, HTML, and text
files. Lucene can index and make searchable any data that can be converted into textual format
[HG05]. In order for Lucene to index the above document types, they have to be converted to
text format, i.e. only the text part of the documents must be extracted. Different third-party
text extraction tools, discussed in section 5.3.1, are used to pull the text out of Word, PDF,
and HTML files. The tools can extract the raw text (stream of characters) and metadata such as
the Title, Author, etc. of the document.

The extraction of the text from the downloaded pages is then followed by different
pre-processing steps on the text. The following sections describe the pre-processing applied to
the text before it is indexed.

5.4.1.1. Tokenization
Tokenization breaks the stream of characters into raw terms, or tokens. This process detects
the word boundaries of a written text. Different issues must be considered when tokenizing a
text of a certain language:

• The encoding of the text must be known before tokenizing the stream of characters. Different
encodings encode a given character differently; some use a single byte, some use two or more
bytes.
• Different languages have different mechanisms for demarcating words. In some languages,
detecting word boundaries is trivial: it is a matter of detecting certain punctuation marks or
white space. Other languages, such as the oriental languages, use mechanisms other than
punctuation marks or white space to demarcate words.

The Amharic language has its own punctuation marks that demarcate words, sentences, etc. These
punctuation marks are discussed in section 2.6.1. Nowadays, it is not uncommon to see the
language's writings (electronic or paper based) use white space to demarcate words instead of
punctuation marks.

Internally, Java stores characters as 16-bit Unicode characters. This complies with the 16-bit
representation of Amharic characters in Unicode. Since this work considers only documents with
Unicode encoding, specifically utf-8, identifying individual characters is a trivial task.

The tokenization component in this work demarcates words using Amharic punctuation marks and
white space. Hyphenated words such as @U' V BH'W XY are tokenized as they are, i.e. as a single
word. All the punctuation marks in Appendix III are considered in the operation of the
Tokenizer. Special consideration is given to the forward slash ("/") and the period ("."):
these two characters are used for writing words in a shorter form, as discussed in section 4.4,
and the forward slash is also used for writing dates. The next section states the handling of
these two characters.
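
The following is a minimal Java sketch of such a tokenizer, assuming the Ethiopic punctuation
marks occupy the code points U+1361 to U+1368 of the Ethiopic block; the class name and the
exact delimiter set are illustrative rather than the thesis's implementation.

import java.util.ArrayList;
import java.util.List;

// A minimal sketch of an Amharic tokenizer: a word ends at white space or at
// one of the Ethiopic punctuation marks (U+1361 word space, U+1362 full stop,
// U+1363 comma, U+1364 semicolon, ..., U+1368 paragraph separator).
public class SimpleAmharicTokenizer {

    private static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || (c >= 0x1361 && c <= 0x1368);
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isDelimiter(c)) {
                if (buffer.length() > 0) {   // close the current word
                    tokens.add(buffer.toString());
                    buffer.setLength(0);
                }
            } else {
                buffer.append(c);            // still inside a word
            }
        }
        if (buffer.length() > 0) {
            tokens.add(buffer.toString());
        }
        return tokens;
    }
}

Note that "/" and "." are deliberately not treated as delimiters in this sketch, so that short
forms of words survive tokenization and can be expanded afterwards.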

5.4.1.2. Short words


In Amharic, compound words can be written in a short form using "/" or ".", as described in
section 4.4. The short form of a word can be expanded as a single word or as a combination of
two words; G, for example, is a short form of the single word GIJ. If two words are written in
a shorter form using "/" or ".", they usually go together, i.e. they form a phrase. Hence, if
the shorter form of a word is given as a query to an information retrieval system, the expanded
form should be treated as a phrase: a user who enters such a short form is interested in the
phrase, not in either of the constituent words alone. This means the query should be treated as
a phrase query.

Around 40 shorter forms of single and compound words are considered for this work. This number
would have been higher if there were a comprehensive list of such words for the Amharic
language. The mentioned words were collected from different literature with the consent of a
language expert.

Here is how the module that handles such words works. The collected shorter forms of words are
stored in a data structure (a HashMap). The HashMap stores the beginning of the shorter form as
a key, with the corresponding expanded form as the value. For example, the word ZB is stored
with Z as the key and Z " as the value, because Z B is expanded as Z " B. Some shorter forms
have the same key but different values. For instance, @ [\ @ [] @^V and @ _ have the shorter
forms @ \ @] @ and @ _ respectively; these words have the same key @. In such cases the values
are stored in an ArrayList.

After storing the words, the module accepts a word and checks it for "/" or ".". If the word
contains "/" or ".", the character just before the forward slash (or period) is parsed, and the
module checks the HashMap for the corresponding expanded word, taking the parsed character as
the key. If a word is found, it is returned; otherwise the original word is returned. Words
with the same key but different values are expanded as follows: the characters before and after
the "/" (or ".") are checked, and if there is a stored expansion that begins with the character
before the "/" and ends with the character after it, that expansion is returned. Otherwise,
other conditions are tested.
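
As a minimal sketch of this lookup, the following Java fragment stores expansions in a HashMap
keyed by the text before the "/" or "." and disambiguates using the text after it; the Amharic
entries shown (for example ት/ቤት for ትምህርት ቤት, "school") are common abbreviations used purely
for illustration and are not necessarily the ones in the thesis's list.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the short-word expander: the key is the text before the
// "/" or ".", the value is a list of candidate expansions for that key.
public class ShortWordExpander {

    private final Map<String, List<String>> expansions = new HashMap<String, List<String>>();

    public ShortWordExpander() {
        // Illustrative entries only; the thesis uses a list of around 40 forms.
        add("ት", "ትምህርት ቤት");   // ት/ቤት -> "school"
        add("ዶ", "ዶክተር");        // ዶ/ር  -> "doctor"
    }

    private void add(String key, String expansion) {
        List<String> list = expansions.get(key);
        if (list == null) {
            list = new ArrayList<String>();
            expansions.put(key, list);
        }
        list.add(expansion);
    }

    // Returns an expansion of a short form such as "ት/ቤት", or the word itself.
    public String expand(String word) {
        int mark = word.indexOf('/');
        if (mark < 0) mark = word.indexOf('.');
        if (mark <= 0) return word;                  // no short-form marker
        String key = word.substring(0, mark);        // text before "/" or "."
        String tail = word.substring(mark + 1);      // text after "/" or "."
        List<String> candidates = expansions.get(key);
        if (candidates == null) return word;
        for (String candidate : candidates) {
            // Pick the candidate that ends with the character(s) after the marker.
            if (tail.length() == 0 || candidate.endsWith(tail)) {
                return candidate;
            }
        }
        return word;
    }
}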

5.4.1.3. Stopwords Removal


As mentioned in chapter two, stopwords have little or no discriminating value for IR purposes
and should therefore be removed. Every language has its own list of stopwords. There are
different techniques for stopword removal; one of them is a dictionary lookup against a
stopword list. This technique is easy for well-studied languages like English that have a
standard list of such words, but Amharic does not have such a standard list.

For this work, a list of 77 Amharic stopwords was selected in consultation with a linguist (Ato
Alemayhu Gurmu, Semitic languages expert at the Ethiopian Languages Study Center, Addis Ababa
University). In Amharic, some stopwords exist in different forms due to the addition of
affixes; some of these are mentioned in [NW02]. For example, the stopword A (meaning "in") may
appear as: A ` A A a ` a a a b. This would make the list much longer, so a routine is included
to handle such behavior. See Appendix VI for the result of this routine applied to the
different forms of this stopword.

5.4.1.4. Stemming
As discussed in section 4.1, Amharic is a morphologically rich language. This is the main
justification for incorporating a stemming operation into our search engine. Amharic can create
many words by attaching different affixes to a stem. Stemming can be applied to both
derivational and inflectional morphology, or to either of the two. Derivational morphology
usually results in a change of word class, which in turn may result in some loss of semantics
(i.e. a change in meaning). This semantic loss may have a negative effect on the performance of
an information retrieval system. For example, in English the words work and worker may both
reduce to work during stemming if the stemmer is too aggressive, yet the two words have quite
different meanings: a user who is looking for a work might not be interested in the history of
the workers. The same applies to Amharic; for instance the words cd (a judge or an arbiter) and
ce (the profession of a judge) have the same stem ce, but different meanings. In the Amharic
language, some prefixes, and combinations of some prefixes and suffixes, create a negative
meaning when they are applied to a given stem or root. This kind of semantic loss is not
usually witnessed in inflectional morphology, which usually involves grammatical features such
as singular/plural, tense, and case.

For this thesis work, the stemming algorithm developed by Alemayehu and Willett [NW02] is
adopted, with some re-working so as to fit it to our purpose.

The original algorithm is an aggressive one that tries to stem a given Amharic word for both
inflectional and derivational morphology. The adaptation of the stemmer was done as follows.

First, we only want to stem the inflectional morphology of the language; in particular, we want
to stem number (singular/plural), tense, gender, and case differences. We carefully selected
those affixes that are usually used for inflectional morphology. Some of them are the
following:

Number: ')% like in f% → + )%


'f% like in gf% → g + f%
Case: ' like in I!c → I!c +
'J like in CK J → CK + J
Tense: ' like in → + future
'h like in i→ + 'h past

We selected and used 33 suffixes and 17 prefixes for this work.

From the literature that discusses the algorithm [NW02], and from a face-to-face discussion
with one of the authors (Dr. Nega Alemayhu), we found that the original stemmer is based on the
idea that the Amharic writing system is alphabetic, i.e. vowels and consonants are written
separately. The suffix and prefix removal algorithms assume that vowels are represented in
writing. As stated in section 2.6.1, there is a different school of thought that holds that the
Amharic writing system is a syllabary, i.e. vowels are not represented separately in the
language's writing. To the best of my knowledge, the second view is the one manifested in the
written documents of the language, whether electronic or not. The Unicode consortium assigned
codes to the Ethiopic alphabet based on the second view [Gil02]. Since our system is
implemented in Java, which represents characters as Unicode characters, implementing the
original algorithm creates a problem.

How can we remove vowels that are not even represented in the writings in our collection? One
solution would be to convert the documents into a format appropriate for the algorithm, i.e. to
convert the syllabic writing into an alphabetic writing system. This is clearly a tedious and
time-consuming task. The approach that we followed to solve this problem is stated below.

After selecting the suffixes and prefixes that we are going to use in our work, we identified
those that contain vowels and studied the effect of removing them from a word. From our
examination, suffixes with vowels do create problems for our purpose. In the original
algorithm, when a suffix with a vowel is removed, the last character of the word is changed to
sades (the sixth order of a character). Based on this, we took the suffixes with vowels,
removed the vowels, and kept only the consonant as the suffix; for example, the affix 'j%!
becomes '%! after removing the vowel j. In light of this, the modified algorithm works as
follows: if the word contains one of these suffixes (originally suffixes with vowels), the
suffix is removed and the last character of the word (after removing the affix) is changed to
sades. The following table (Table 5.1) shows the effect of the original stemmer and the
modified one on some words.

Table 5-1 Comparison of Stemmers


                     word     Stem + suffix    After removing suffix    After changing to sades
Our Stemmer          ck%      ck + %           ck                       ce
Original Stemmer     ck%      ce + )%          ce                       Not Applicable
Our Stemmer          Z %!     Z + %!           Z                        Z
Original Stemmer     Z %!     Z + j%!          Z                        Not Applicable

From the above table we can clearly see that the output of the two approaches is the same.
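
The sades conversion itself can be done with simple code-point arithmetic, since most Ethiopic
syllable families are laid out in groups of eight consecutive code points with the sixth order
(sades) at offset 5; the following Java fragment is an illustrative sketch of that step under
this assumption, not the adopted stemmer itself.

// A minimal sketch of the "change the last character to sades" step used by
// the modified stemmer. In the Unicode Ethiopic block most syllable families
// occupy eight consecutive code points, with the sixth order (sades) at
// offset 5 within the family.
public class SadesConverter {

    public static char toSades(char c) {
        if (c < 0x1200 || c > 0x135A) {
            return c;                       // outside the Ethiopic syllable range
        }
        int offsetInFamily = (c - 0x1200) % 8;
        return (char) (c - offsetInFamily + 5);
    }

    // Removes a (vowel-less) suffix and forces the new final character to sades.
    public static String stripSuffix(String word, String suffix) {
        if (!word.endsWith(suffix) || word.length() <= suffix.length()) {
            return word;
        }
        String stem = word.substring(0, word.length() - suffix.length());
        char last = toSades(stem.charAt(stem.length() - 1));
        return stem.substring(0, stem.length() - 1) + last;
    }
}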
Indexing

All the pre-processing operations (tokenization, stopword removal, stemming, etc.) are applied
to the text before it is indexed in the Lucene index. These pre-processing steps are called
analysis in Lucene terminology. Analysis, in Lucene, is the process of converting text into its
most fundamental indexed representation, terms [HG05]. Terms are used to determine what
documents match a query during searches. Analyzers are the encapsulation of the analysis
process.

Lucene has built-in analyzers, most of them for the English language. It also has analyzers for
non-English languages such as German and Russian, and there are contributed analyzers for other
languages, for example Brazilian Portuguese, Chinese, etc. Since there is no analyzer for
Amharic, we developed our own. Figure 5.2 shows the Amharic indexing structure with Lucene.

[Figure 5.2 shows the Lucene indexing structure: PDF, HTML, Microsoft Word, and text documents
are passed through their respective parsers; the extracted text content of each document type
is fed to the AmharicAnalyzer, whose output is written to the Index.]

Figure 5-2 Lucene Amharic Indexing Structure


Here is the code of the AmharicAnalyzer class.

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AmharicAnalyzer extends Analyzer {

    public AmharicAnalyzer() {
    }

    public final TokenStream tokenStream(String fieldName, Reader reader) {
        // First pass: tokenize the raw character stream and expand short
        // (abbreviated) words into their full forms.
        TokenStream result = new AmharicTokenizer(reader);
        StringBuffer sb = new StringBuffer();
        try {
            for (Token token = result.next(); token != null; token = result.next()) {
                String term = token.termText();
                shortword expand = new shortword();
                term = expand.Expander(term);
                sb.append(term + " ");
            }
        } catch (Exception e) {
            // Ignore tokenization errors and index whatever was recovered.
        }
        // Second pass: re-tokenize the expanded text (an expansion may have
        // produced more than one word), then remove stopwords and stem.
        String sentence = sb.toString();
        TokenStream results = new AmharicTokenizer(new StringReader(sentence));
        results = new AmharicStopFilter(results);
        results = new AmharicStemFilter(results);
        return results;
    }
}

The Analyzer first tokenizes the stream of characters and handles the repetitive alphabets; the
tokenized stream is then checked for shorter forms of words. After replacing a shorter form
with its expanded word, the resulting token stream is fed back to the Tokenizer, because the
expansion of a word may yield two words; if these were not tokenized again they would be
treated as a single word. After tokenization, the tokens are passed through the
AmharicStopFilter class, which filters stopwords from the token stream. Finally, the remaining
tokens are stemmed and are then ready to be indexed. The order of processing is critically
important during the analysis: each step relies on the work of the previous step.

After the analyzer does its work, the resulting stream contains the processed tokens. Each
token carries a text value, a start offset, an end offset, a position increment, and a token
type. The start offset is the character position in the original text where the token text
begins, and the end offset is the position just after the last character of the token text. The
token type is a String. The position relative to the previous token is recorded as the position
increment value.

When tokens are posted to the index, the text value and the position increment are the only
information carried through to the index. Position increments are used in processing phrase
queries.

When Lucene's Directory class was described in chapter two, it was pointed out that one of its
concrete subclasses, FSDirectory, stores the index in a file-system directory. Internally, a
Lucene index is a group of files: it consists of one or more segments, and each segment is made
up of several index files. The index contains one segments file, which stores the names of all
existing index segments; before accessing any files in the index directory, Lucene consults
this file to figure out which index files to open and read.

As discussed in chapter two, Lucene stores its index in an inverted index structure. All the
terms (the dictionary) of a segment are stored in the term dictionary file. There are also term
frequency and term position files, which store the frequency of each term in each document and
the positions of each term within a document, respectively. Our index contains three fields:
content, title, and url. Content holds the text contained in the document, title is the title
of the document extracted by the text extraction tools, and url is the URL address of the
document. The first two fields are analyzed, i.e. they pass through the AmharicAnalyzer; the
url field is stored as it is, i.e. as a keyword. The following figure (Figure 5.3) shows a
snapshot of our index in Luke, a tool that allows displaying and modifying a Lucene index.

Figure 5-3 Snapshot of the index using Luke
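A minimal sketch of how one page could be turned into such a three-field document is shown
below. This is illustrative only: the index path and variable names are assumptions, and the
Field constructor shown here is the one offered by Lucene 1.9/2.x, so the exact calls may
differ from the thesis code.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexSketch {
    public static void index(String title, String body, String url) throws Exception {
        // create (or overwrite) an index analyzed with the AmharicAnalyzer
        IndexWriter writer = new IndexWriter("/path/to/index", new AmharicAnalyzer(), true);

        Document doc = new Document();
        // title and content are analyzed: tokenized, expanded, stop-filtered, stemmed
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("content", body, Field.Store.YES, Field.Index.TOKENIZED));
        // the url is stored as a single keyword, exactly as it is
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));

        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}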

5.5. Query Engine Component


This component lets the user enter queries in Amharic and retrieve the “relevant” documents
for the given query. It has a result displaying sub-component that displays the retrieved
“relevant” documents.

The interface that accepts the user’s query is developed using Java Server Pages (JSP). It has a
text box that lets the user enter his/her query in Amharic and a CYm (submit) button.
Figure 5.4 shows the screen shot of the search interface. For the typing task we need a
keyboard layout that handles the mapping of Roman alphabets into the corresponding Amharic
fidels. We examined the existing keyboard layouts and tried the layout that is used by Power
Ge’ez (a text processing software). However, when it is used for our purpose (entering a
query), it repeats all orders of Amharic fidels except the first order during typing. For example,
to type the word ` K it types ` K( and if we want to type no it types `n oK ( We
therefore used a keyboard layout called Tavultesoft Keyman 6.0, developed and supported by
Tavultesoft (http://www.tavultesoft.com), which solves the above-mentioned problem. The
query is a free form query, i.e., the user can enter a natural language query.

Lucene plays an important role in this component too. The library has APIs that parse the
user query, search the already indexed documents for the parsed query, calculate the
similarity of each document with the query, and return the retrieved documents sorted by
rank according to the calculated similarity.

Lucene’s built-in query parser is used to parse the user query. One of the subclasses of the
QueryParser class, MultiFieldQueryParser, is used to parse user entered queries. This class
parses the query with respect to the fields that are given as a parameter. Since we want to
search the given query in the title and content fields, we give these two fields to the parser as
a parameter. If a query word is found in the title field, it is given higher weight than if it is
found in the content field. The query is analyzed with the same analyzer with which the index
is built. This is mainly to have the same kind of processing applied on the text in the index and
in the query, so as to facilitate searching.

After the query is processed, the analyzed query is searched for matches in the index. To
search for matches the engine uses Lucene’s IndexSearcher class, which accepts the location
of the index as a parameter, and its search() method with the query as a parameter. After
calculating the similarity, it returns the scored results (documents) as a Hits object.
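The parsing and searching steps described above could look roughly as follows. The sketch
uses the Lucene 2.x form of MultiFieldQueryParser and an assumed index location; it is not
the thesis code.

import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        // parse the user query against the title and content fields,
        // using the same analyzer that built the index
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                new String[] { "title", "content" }, new AmharicAnalyzer());
        Query query = parser.parse(args[0]);

        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(query);      // scored, rank-ordered results
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "  " + hits.doc(i).get("url"));
        }
        searcher.close();
    }
}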

Since Lucene uses only text based scoring, we added a link based score in addition to the
text based similarity. The rank is calculated as follows. The link score of each document in the
document collection is first calculated using the algorithm specified in section 4.5.1 and
stored; each URL is stored with its corresponding rank in a data structure. After Lucene
retrieves the relevant documents, we read the URL of each result from the url field in the
index and fetch the link based rank of that URL. Finally, we add the link based rank to
Lucene’s score and rearrange the results in descending order of the combined score. The
arranged result is then given to the result displaying component.
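The combination step can be pictured with the following sketch. The backLinkRank map
(URL to precomputed backlink score) and the helper class are assumptions introduced for
illustration; they are not the thesis code.

import java.util.Arrays;
import java.util.Map;
import org.apache.lucene.search.Hits;

public class CombinedRanking {
    public static ScoredPage[] combine(Hits hits, Map backLinkRank) throws Exception {
        ScoredPage[] pages = new ScoredPage[hits.length()];
        for (int i = 0; i < hits.length(); i++) {
            String url = hits.doc(i).get("url");
            Number link = (Number) backLinkRank.get(url);
            double linkScore = (link == null) ? 0.0 : link.doubleValue();
            // add the precomputed link based rank to Lucene's text based score
            pages[i] = new ScoredPage(url, hits.score(i) + linkScore);
        }
        Arrays.sort(pages);   // descending order of the combined score
        return pages;
    }
}

class ScoredPage implements Comparable {
    String url;
    double score;
    ScoredPage(String url, double score) { this.url = url; this.score = score; }
    public int compareTo(Object o) {
        return Double.compare(((ScoredPage) o).score, this.score);
    }
}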


The result is displayed in an HTML page; again JSP is used to implement the result page. For
each relevant document, the result page contains the title of the document, its URL, and an
excerpt from the original document. The excerpt contains two lines from the original
document with the query words highlighted (in bold). Lucene’s Highlighter class, together
with supporting classes, is used to compute and highlight the excerpt. The user can click on
the title of a result to connect to the original document on the Web. The results are displayed
10 results per page, and a button is placed at the bottom of the page to see the next 10
results.
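The excerpt for a single result could be produced along these lines, using the Highlighter
classes from Lucene’s highlighter package (the surrounding method and variable names are
illustrative, not the thesis code):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class ExcerptSketch {
    public static String excerpt(Query query, String content) throws Exception {
        Highlighter highlighter = new Highlighter(
                new SimpleHTMLFormatter("<b>", "</b>"),   // query words in bold
                new QueryScorer(query));
        TokenStream tokens = new AmharicAnalyzer()
                .tokenStream("content", new StringReader(content));
        // join the two best fragments into one short excerpt
        return highlighter.getBestFragments(tokens, content, 2, " ... ");
    }
}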

Figure 5-4 Screen Shot of the Search Interface of Search Engine

5.6. Experimental Results


As discussed in chapter two, an IR system must be evaluated to test its effectiveness. Our
Web search system is evaluated using precision and recall. It is also tested with selected
queries to check whether it meets the design requirements of this thesis work. Moreover, the
crawler is also tested individually.

The Crawler

The first run of the crawler started its crawling operation on Thursday May 24th, 2007 and
finished on Monday May 28th, 2007. It processed more than 7500 pages and selected 1803
Amharic pages. There were two electric power outages at the time of crawling. During its
processing, manual intervention was involved in order to avoid unnecessary processing of
Web pages, i.e., deleting non-Amharic pages from the frontier. It stopped crawling when the
frontier became empty.

The second run of the crawling started on Friday June 29th, 2007 and stopped on Thursday
July 5th, 2007. The crawling was done on a single host (http://archives.ethiozena.net). In this
crawl no language identification was done, because we already knew that the pages on the
server are Amharic pages. The crawler processed 22,276 pages and found 12,276 pages with
Amharic content; the rest of the pages were just link pages. We communicated with the
administrator of the server and found out that our crawler was “polite” during its operation:
it asked for Robots.txt before downloading, and there was a minimum gap of 20 seconds
between two consecutive downloads (a gap to avoid rapid fire). It downloaded 2.3 pages per
minute on average. The crawling was faster at night than during day time for obvious
reasons; the downloading rate at night was twice as fast as during day time. See Appendix IV
for a sample log of our crawling operation that is fetched from the server.
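The 20-second gap can be enforced with per-host bookkeeping along the following lines; this
is only a sketch, and the class and method names are illustrative rather than those of the
thesis's Sleep class.

import java.util.HashMap;
import java.util.Map;

public class PoliteDelay {
    private static final long GAP_MS = 20 * 1000L;    // minimum gap between downloads
    private final Map lastAccess = new HashMap();      // host -> time of last download

    public synchronized void waitForTurn(String host) throws InterruptedException {
        Long last = (Long) lastAccess.get(host);
        if (last != null) {
            long elapsed = System.currentTimeMillis() - last.longValue();
            if (elapsed < GAP_MS) {
                Thread.sleep(GAP_MS - elapsed);         // sleep off the remaining gap
            }
        }
        lastAccess.put(host, new Long(System.currentTimeMillis()));
    }
}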

Categorizer

To test the accuracy of the categorizer, we tested LIM on 990 Amharic news documents. Out
of the 990 documents, 675 had UTF-8 encoding and the rest had non-Unicode encoding. The
following tables show the results of the test.

Table 5-2 Identification Result of Unicode encoded pages

Total Amharic + Unicode encoded pages     Identified as Unicode + Amharic     Accuracy
675                                       671                                 99.41%

Table 5-3 Identification Result of non-Unicode encoded pages

Total Amharic + non-Unicode encoded pages     Identified as Unicode + Amharic     Accuracy
315                                           0                                   100%
The four Unicode encoded pages that were identified as non-Amharic pages are HTML pages
without a proper DTD (Document Type Definition). This causes the parser to parse them
incorrectly, which in turn results in faulty classification.


Precision-Recall
IR systems are usually evaluated using precision and recall. These two evaluation measures
are usually used in a controlled environment. To evaluate our system with these two
measures, we selected 75 news documents, and 11 queries were collected from Computer
Science post-graduate students who are regular followers of news and have experience in
using web search engines. The queries and the relevance of the document collection were
evaluated by Ato Ermias Wubshet, who is a journalist by profession. The following table
(Table 5.4) shows the precision and recall of each query.

Table 5-4 Precision-Recall Evaluation Result

               OR                                         AND
Query          T-Rel   Ret   R-Ret   T-10 P   P     R     Ret   R-Ret   P     R
Query 1        10      24    10      .90      .42   1     9     8       .89   .90
Query 2        12      16    12      .90      .75   1     1     1       1     .08
Query 3        7       9     7       .78      .78   1     1     1       1     .14
Query 4        8       10    8       .80      .80   1     2     2       1     .25
Query 5        4       18    4       .40      .22   1     3     3       1     .75
Query 6        15      19    14      .90      .74   .93   7     7       1     .47
Query 7        4       4     4       1        1     1     3     3       1     .75
Query 8        5       6     5       .83      .83   1     4     4       1     .80
Query 9        5       2     4       .50      .50   .80   1     1       1     .20
Query 10       4       4     4       1        1     1     4     4       1     1
Query 11       3       14    2       .20      .14   .67   1     1       1     .33

Where T-Rel – total number of documents in the collection that are relevant to the query
Ret – total number of documents retrieved by the system
R-Ret – total number of returned documents that are relevant to the query
T-10 P – precision with respect to the top 10 relevant documents retrieved
P – Precision
R – Recall
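For each query, precision and recall are computed from the counts in the table as
P = R-Ret / Ret and R = R-Ret / T-Rel. For example, for the first query under the OR
operator, P = 10/24 ≈ 0.42 and R = 10/10 = 1, which are the values shown in the first row.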

From the above table, the average precision and average recall for the “OR” operator are 65%
and 95% respectively. For the “AND” operator, the average precision and recall are 99% and
52% respectively. The average precision for the top 10 results is 75%, which is higher than
the precision with respect to the total relevant documents returned. This shows that displaying
only 10 results has an effect on the precision, which results in better user “satisfaction”. It also
shows that the AND operator gives greater precision than the OR operator.
Meeting Design Requirements
The motivation of this work is the failure of general purpose search engines to handle queries
in Amharic properly. As stated in chapter one, these search engines fall short when they are
given queries that reflect the typical features of the language. Our search engine tries to
incorporate the typical features of the language both in its design and in its implementation.
Some of these features, and how the search engine behaves for them, are discussed below.
Morphological variants of a word
A user can enter a morphological variant (inflectional morphology) of a word and get the
same result. For instance, the queries B$% (houses), B$•- ! (their houses) and
B€! (his house) must give the same results as the query B (house), which is the stem of
these words. The following screen shot (Figure 5.5) shows the result for the query. From the
result we can see (in bold) that inflectional variants of the query, such as ! (case),
(case) and f% (number), are also included in the result.

Figure 5-5 Screen Shot of a Result for the Query“
Interchangeable Alphabets (fidels)
Interchangeable fidels are mentioned as one of the problems of the Amharic writing system
in chapter two, and a solution for this problem is described in chapter four. Our search engine
incorporates the solution for the above-mentioned problem in its internal working. Below
(Figure 5.6) is the result for the query (meaning work). From the result we can see that both
and • are included in the set of relevant documents.

Figure 5-6 Screen Shot of the Result for the Query


Short Words

Words in Amharic can be written in a shorter form using “/” or ”.”. Earlier in this chapter, it
was mentioned that if these shorter forms of a word are expanded, they should be treated as a
phrase. The following screen shots (Figure 5.7, Figure 5.8, and Figure 5.9) show the results
for a pair of queries, where the first is the shorter form of the second, and for the
corresponding phrase query.


Figure 5-7 Screen Shot of the Result for the Query “

Figure 5-8 Screen Shot of the Result for the Query

Figure 5-9 Screen Shot of the Result for the PhraseQuery “ ”
From the above screen shots we can clearly see that our search engine indeed meets the
design requirements discussed in section 4.1.
Ranking
As discussed in sections 4.5.1 and 5.5, we used both a link based score (backlink count) and a
text based score (Lucene score) to rank the retrieved documents. From the experiment, we
noticed that the link based ranking is not given equal weight with the text based ranking. The
link based rank goes as high as 1084, whereas the text based score ranges from 0 to 1. This
gives much more weight to the link based rank. Some normalization must be applied on the
link based rank so as to give the two ranks equal weight.
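One possible remedy, sketched below under the assumption that the largest backlink count in
the repository is known, is to scale the link based rank into [0, 1] before adding it to Lucene's
text based score:

// maxBackLinks is assumed to be the largest backlink count in the repository
public static double combinedScore(double luceneScore, long backLinks, long maxBackLinks) {
    double normalizedLink = (maxBackLinks > 0) ? (double) backLinks / maxBackLinks : 0.0;
    return luceneScore + normalizedLink;    // both terms now lie between 0 and 1
}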

5.7. Limitation
There were some limitations during the course of developing our search engine. The
computational limitations and the limited bandwidth that we used made our crawler slower.
Moreover, due to time limitation, the effectiveness of the refreshing policy is not tested.
Short words that are written using more than one “/” or “.”, for example W(X( ‚
etc., are not considered in this work. The relevance judgment during the evaluation process is
done by a domain expert (a journalist) instead of the users of the system. This may create a
problem due to the subjective nature of the notion of relevancy.

Chapter Six

Conclusions and Recommendations

6.1. Conclusions
Since its invention, the WWW has become a vital means of facilitating global
communication, publishing and e-commerce, and a huge repository of information of any
kind. Generally, the Web is the ultimate source of any information, and making this wealth of
information accessible to the user is what information retrieval tools try to do.

Many languages are represented in cyberspace and others are still struggling to join this
universe. Amharic, a member of the Semitic language family, is one of the languages that has
a representation in the global information space. Even though there are no exact statistics
about the number of web documents in the language, there is a significant number of them
on the Web. The increasing number of Internet users in Ethiopia, the presence of a significant
number of Ethiopians in the Diaspora and the inclusion of the language’s alphabets in
Unicode will foster the creation of web documents using the language’s script in the future.

With the Web becoming an omnipresent channel for delivering information, the interfaces
through which users discover that information have a major economic and political
significance [MC05]. Major general purpose search engines are moving in a direction to
cater to the needs of their users in different languages. However, these search engines fall
short when they deal with languages that are different from English, the language they were
originally aimed at. In the case of Amharic, these engines only perform pattern matching
during searching, without considering the features of the language.

This research attempts to design and develop a web search tool for Amharic web documents.
The tool incorporates the typical features of the language in its design as well as in its
implementation.

The research came up with a complete language specific search engine that has a crawler, an
indexer and a query engine component. These components are optimized for the language
they are designed for, Amharic.

The crawler crawls the Web and collects Amharic web documents with Unicode encoding. It
uses an n-gram based categorizer to detect each document’s language, encoding and script.
Our crawler tries to follow the standards that a “polite” crawler follows: it requests
“Robots.txt” from the root directory of a server before downloading a resource from that
server, and it tries to avoid “rapid fire”. Two runs of crawling have been done to test the
crawler. In the first run, 1803 Amharic web documents were downloaded after processing
more than 7500 pages. In the second run, more than 22,000 pages were downloaded and more
than 12,000 Amharic web documents were found on a single site. From the two crawls, we
found that our crawler was “polite” enough.

Communicating with an environment that is as huge, as dynamic, as heterogeneous in both
format and content, and as unstructured as the Web needs a robust and flexible approach.
From the experiments on our crawler, we found that it is robust enough to interact with the
Web. Crawling is a resource intensive task; it needs a lot of bandwidth and hardware
resources. Our crawler is a single threaded one which was tested in an environment with
limited bandwidth. Current world class crawlers can download up to 10 million pages per
day; our crawler downloads only a little more than 2,500 pages per day.

Our crawler needs to decide on the fate of a given URL before it tries to download the next
URL in the frontier. This means it waits for the categorizer to finish its work, which is one of
the reasons that make our crawler slower. Making the two components independent may
help in increasing the downloading rate of the crawler.
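One way to make the two components independent, sketched below with assumed class and
variable names (not the thesis code), is a producer-consumer setup in which the downloader
puts fetched pages on a queue and the categorizer consumes them in its own thread:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DecoupledPipeline {
    public static void main(String[] args) {
        final BlockingQueue pages = new LinkedBlockingQueue();

        Thread categorizer = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        Object page = pages.take();   // blocks until a page is available
                        // language/encoding classification of "page" goes here
                    }
                } catch (InterruptedException e) {
                    // shut down
                }
            }
        });
        categorizer.start();

        // the downloader thread no longer waits for classification:
        // pages.put(downloadedPage);
    }
}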

The categorizer was tested to evaluate its accuracy. From the evaluation we found that it is
accurate enough for our purpose, with more than 99% accuracy. Taking regular checkpoints
of the operation of the crawler was a solution for the frequent power outages that we faced
during the two runs of the crawling.

After the crawler collects the web pages, it stores them in a repository. The next component,
the indexer, processes the documents and stores them in a structure that is efficient and
appropriate for searching.

Lucene was a good choice for the indexer component. Its efficient algorithms and data
structures and its simple but powerful APIs spared us the time and effort that would otherwise
have been spent developing the indexer from scratch. The features of the language that affect
the information retrieval process are easily incorporated in Lucene using the abstract classes
that the library offers for this purpose. Tokenization, expansion of shorter forms of words,
stopword removal, and stemming are the pre-processing steps that are applied on the raw text.

The lack of a standard stopword list and of a comprehensive list of shorter forms of words in
the Amharic language were some of the problems that we faced in developing this component.

Lucene also plays a major role in the query engine component. This component is the one
that interacts with the actual user. It gives an interface in which the user can enter his/her
information need in Amharic using the Ethiopic script. It uses a handful of Lucene classes to
parse the query, search the index, select the documents that are relevant to the query,
calculate the similarity of each document with the query, and return the relevant documents
in descending order of the calculated similarity.

The Lucene based rank of the returned documents is integrated with the link based ranking.
The documents are re-arranged based on the integrated rank and sent to the result display
component. This ranking proved to be inefficient due to the disproportionate weight given to
the link based and text based ranks; some normalization must be applied on the link based
rank to remedy the problem. The result display component displays the relevant documents
10 at a time, with a button to display the next 10 results if the result set contains more than 10
documents. It shows the total number of retrieved documents for the given query and the
numbering of the results that are displayed on the current page. For each relevant document,
the title, the URL address and an excerpt from the content are displayed. Java Servlets and
Java Server Pages are used for handling the interaction between the indexer and query engine
components as well as for the interfaces. See Appendix V for the list of classes that are used
in our search engine.

The default operator, which is OR, is used for combining queries with more than one word,
if other operators are not specified. The system supports Phrase queries.

The system is evaluated using precision and recall. A document collection of news articles
was carefully selected, and queries were collected from experienced users of web search
tools. The relevance of the documents in the collection to the queries was judged by a domain
expert (a journalist). Precision and recall were calculated for each query, and the average
measures showed an acceptable level of precision and recall. The precision-recall measure
shows that displaying 10 relevant results per page has an effect on precision (a 10% increase).
From the experiment, we found that our search engine incorporates the typical features of the
Amharic language that general purpose search engines fail to handle.

The output of this research work can be applied in different areas. Searching an Intranet is
one area in which this search tool can be applied. Moreover, the crawler component of this
project can be used in Web archiving of Amharic documents, in order to preserve the
informational and cultural heritage of the Amharic documents of the World Wide Web.

6.2. Recommendations
An effort has been exerted in this research work to develop a search engine for Amharic web
documents. Developing a fully functional and efficient search engine is a team effort that
needs a lot of coordinated work from a linguist, a computer science professional, an
information science expert, etc. Additional features can be added, or the existing components
of our search tool can be modified, in order to increase the performance of the tool. In light
of this, the following recommendations are made for further research and improvement.
• A significant proportion of Amharic web documents have non-Unicode encodings.
These documents must be made available for searching in the engine’s index.
• The indexing and searching scheme can be changed to see if it brings an
improvement in the performance of the search engine.
• The crawling order can be changed to test if it brings an improvement in the
performance of the crawler.
• The effect of using a thesaurus, which can expand the query using synonyms, should
be investigated.
• There are different ways of writing a given word in Amharic as a result of regional
differences and other factors. For example, the word “morning” can be written as
Aƒ ƒ aƒ etc. If there were a component that handles this property of the
language, the performance might be better.

• Typographical errors are common when entering a query. These typographical
errors should be detected and handled appropriately.
• The ranking formula can be changed to test if it gives better results.

REFERENCES
[AAG03] Atelach Alemu, Lars Asker, and Mesfin Getachew. Natural Language
Processing for Amharic: Overview and Suggestions for a Way Forward. In
Traitement Automatique des Langues Naturelles, Batz-sur-Mer, June 2003.
[ACGP+01] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and
Sriram Raghavan. Searching the Web. ACM Transactions on Internet
Technology (TOIT), 1(1):2–43, August 2001.
[Ams01] Saba Amsalu. The application of Information Retrieval Techniques to
Amharic Documents on the Web. Masters Thesis, Addis Ababa University.
Addis Ababa, Ethiopia, 2001.
[BANP02] A. Abdollahzadeh Barfourosh, M. L. Anderson, H. R. Motahary Nezhad and
D. Perlis. Information Retrieval on the World Wide Web and Active Logic: A
Survey and Problem Definition. 2002.
[BBH06] Timothy Baldwin, Steven Bird, Baden Hughes. An Intelligent Search
Infrastructure for Language Resources on the Web. January 2006.
Available at: http://lt.csse.unimelb.edu.au/projects/langsearch.
[BCSV02] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna.
Ubicrawler: A Scalable Fully Distributed Web Crawler. In Proceedings of the
8th Australian WWW Conference, Australia, 2002.
[BG03] Judit Bar-Ilan, and Tatyana Gutman. How do Search Engines Handle Non-
English Queries? - A case study. In Proceedings of the Alternate Papers Track
of the 12th International WWW Conference, Budapest, Hungary, 2003.
[BGMP98] G. E. Burkhart, S. E. Goodman, A. Mehta, and L. Press. The Internet in India:
Better times ahead? Commun. ACM, 41(11):21-26, 1998.
[BP94] P. M. E. De Bra and R. D. J. Post. Information Retrieval in the World Wide
Web: Making Client-based Searching Feasible. In Proceedings of the 1st
International WWW Conference, Geneva, Switzerland, 1994.
[BP98] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual
Web Search Engine. In Proceedings of the 7th International WWW
Conference, Brisbane, Australia, April 1998.

[BSC76] Marvin L. Bender, Head W. Sydeny, and Roger Cowley. The Ethiopian
Writing System. In Bender et al (Eds.) Language in Ethiopia. London, Oxford
University press, 1976.
[BYCM+05] Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, and Andrea Rodriguez.
Crawling a Country: Better Strategies than Breadth-First for Web Page
Ordering Proceedings of the 14th international conference on WWW, Cuba,
2005.
[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information
Retrieval. ACM Press / Addison-Wesley, 1999.
[Cas03] Carlos Castillo. Cooperation Schemes between a Web Server and a Web
Search Engine. In Proceedings of Latin American Conference on World Wide
Web (LA-WEB), pages 212–213, Santiago, Chile, 2003.
[Cas04] Carlos Castillo. Effective Web Crawling. Ph.D. Thesis, University of Chile.
Chile, November 2004.
[CC03] Michael Chau and Hsinchun Chen. Personalized and Focused Web Spiders.
2003.
Available at: http://www.business.hku.hk/~mchau/papers/WebSpiders.pdf
[CCHM04] Nick Craswell, Francis Crimmins, David Hawking, and Alistair Moffat.
Performance and Cost Tradeoffs in Web Search. In Proceedings of the 15th
Australian Database Conference, pages 161–169, Dunedin, New Zealand,
January 2004.
[CDKR+99] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A.
Tomkins, D. Gibson, and J. Kleinberg. Mining the Web's Link Structure.
Computer, 32(8):60-67, August 1999.
[CGM00a] Junghoo Cho and Hector Garcia-Molina. The evolution of the Web and
Implications for an Incremental Crawler. In Proceedings of the 26th
International Conference on Very Large Databases (VLDB 2000), Cairo,
Egypt, 2000.
[CGM00b] Junghoo Cho and Hector Garcia-Molina. Synchronizing a Database to
Improve Freshness. In Proceedings of the ACM International Conference on
Management of Data (SIGMOD), pages 117-128, Dallas, Texas, USA, May 2000.

[CGM03] Junghoo Cho and Hector Garcia-Molina. Effective Page Refresh Policies for
Web Crawlers. ACM Transactions on Database Systems, 28(4), December
2003.
[CGMP98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient
Crawling through URL Ordering. In Proceedings of the seventh conference
on World Wide Web, Brisbane, Australia, April 1998.
[CMD99] Soumen Chakrabarti, van den Berg, M., and Byron Dom. Focused Crawling:
A New Approach to Topic-Specific Web Resource Discovery. In Proceedings
of the 8th International WWW Conference, Toronto, Canada, May 1999.
[CMRB04] Carlos Castillo, Mauricio Marin, Andrea Rodriguez, and Ricardo Baeza-
Yates. Scheduling Algorithms for Web Crawling. In Latin American Web
Conference (WebMedia/LA-WEB), pages 10–17, Riberao Preto, Brazil,
October 2004.
[CT94] W. B. Cavnar and J. M. Trenkle. N-gram-based Text Categorization. In
Proceedings of SDAIR-94, the 3rd Annual Symposium on Document
Analysis and Information Retrieval, pages 161-175, Las Vegas, Nevada,
U.S.A, 1994.
[Dav00] Davison B. D. Topical Locality in the Web. In Proceedings of the 23rd Annual
International Conference on Research and Development in Information
Retrieval (SIGIR 2000), Athens, Greece, 2000.
[DMPS04] Michelangelo Diligenti, Marco Maggini, Filippo Maria Pucci, and Franco
Scarselli. Design of a Crawler with Bounded Bandwidth. In Alternate track
papers & posters of the 13th international conference on World Wide Web,
pages 292–293. ACM Press, 2004.
[DPG03] J. Deepa Devi, Ranjani Parthasarathi and T.V. Geetha. Tamil Search Engine
Sixth Tamil Internet conference, Chennai, Tamilnadu, August 2003.
[ECSA98] Ethiopian Central Statistical Authority (ECSA). The 1994 Population and
Housing Census of Ethiopia: Results at Country Level. Vol.1, Statistical
Report 44, Addis Ababa, Ethiopia, 1998.
[FGMF+99] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and Tim
Berners-Lee. RFC 2616 - HTTP/1.1, the Hypertext Transfer Protocol. 1999.
Available at: http://www.w3.org/Protocols/rfc2616/rfc2616.html
[Gil98] Dan Gillmor. Small Portals Prove that Size Matters. December, 1998.
Available at:
http://www.cse.iitb.ac.in/~soumen/focus/DanGillmor19981206.htm
[Gil02] Richard Gillam. Unicode Demystified: A Practical Programmer's Guide to
the Encoding Standard. Addison-Wesley, 2002.
[GS05] A. Gulli and A. Signorini. The Indexable Web is More than 11.5 billion
pages. In Proceedings of the 14th International World Wide Web Conference,
Chiba, Japan , 2005.
[GRGG97] Venkat N. Gudivada, Vijay V. Rahavan, William I. Grosky, Rajesh K. Gottu.
Information Retrieval on the World Wide Web. IEEE Internet Computing,
Volume 1, Issue 5, p. 58-68, September-October 1997.
[Hai67] Getachew Haile. The Problems of Amharic Writing System. Unpublished,
1967.
[Haz55] u ^6G „6<[!6@DV6… † 66 w d6 ƒ 66 ‡n6wH
6B M ˆ( ‰Š‹ŒM 6 M•Ž•• ‘ F(I(
[Hen00] Monika R. Henzinger. Web Information Retrieval an Algorithmic
Perspective. European Symposium on Algorithms, Germany, 2000.
[Hen01] Monika R. Henzinger. Hyperlink Analysis for the Web. IEEE Internet
Computing , Volume 5, Issue 1, p 45-50, January 2001.
[HG05] Erik Hatcher and Otis Gospodnetic. Lucene in Action. Manning Publications
Co, 2005.
[HN99] Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web
Crawler. World Wide Web Conference, 2(4):219–229, April 1999.
[Hug06] Baden Hughes. Towards a Web Search Service for Minority Language
Communities. Open Road, 2006.
[IWS07] Internet World Stats, Usage and Population Statistics 2007.
Available at: http://www.InternetworldStats.com/stats.htm.
[IWU07] Internet World Users by Language. Available at:
http://www.internetworldstats.com/stats7.htm.

[IYA02] Izumi Suzuki, Yoshiki Mikami, Ario Ohsato. A Language and Character Set
Determination Method Based on N-gram Statistics. ACM Transactions on
Asian Language Information Processing, Vol. 1, No. 3, p 269-278, September
2002.
[Kle99] J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of
the ACM, 46(5):604-632, 1999.
[kos94] Martin Koster. A Standard for Robot Exclusion. 1994.
Available at: http://www.robotstxt.org/wc/norobots.html.
[kos95] Martin Koster. Robots in the web: threat or treat ? ConneXions, 9(4), April
1995.
[Kos96] Martin Koster. Evaluation of the Standard for Robots Exclusion. 1996.
Available at: http://www.robotstxt.org/wc/eval.html
[Lan02] Stefan Langer. Natural languages and the World Wide Web.
Available at:
http://www.cis.uni-enchen.de/people/langer/veroeffentlichungen/bulag.pdf
[LG99] Steve Lawrence and C. Lee Giles. Accessibility of Information on the Web.
Nature, 400, 107-109, 1999.
[LM01] Shanjian Li, Katsuhiko Momoi. A Composite Approach to
Language/Encoding Detection. In Proceedings of the 19th International
Unicode Conference, San Jose, USA, September 2001.
[LV00] Peter Lyman and Hal R. Varian. How Much Information. 2000.
Available at: http://www.sims.berkeley.edu/howmuch-info/
[LV03] Peter Lyman and Hal R. Varian. How Much Information. 2003.
Available at: http://www.sims.berkeley.edu/howmuch-info-2003
[MB98] Robert C. Miller, Krishn Bharat. SPHINIX: A Framework for Creating
Personal, Site-Specific Web Crawlers. Proceedings of the Seventh
International World Wide Web Conference (WWW7), Brisbane, Australia,
April 1998.
[MC05] H. Moukdad and H. Cui (2005). How Do Search Engines Handle Chinese
Queries? Webology, 2 (3), Article 17.
Available at: http://www.webology.ir/2005/v2n3/a17.html

[Mou03] H. Moukdad. Lost In Cyberspace: How Do Search Engines Handle Arabic
Queries? The 12th International World Wide Web Conference, Budapest,
Hungary, May 2003.
[NH01] Marc Najork and Allan Heydon. High-performance Web Crawling.
Handbook of Massive Data Sets. Kluwer Academic Publishers, Inc., 2001.
[Nov04] Blaz Novak. A Survey of Focused Web Crawling Algorithms. SIKDD 2004 at
multiconference IS 2004, 12-15, Ljubljana, Slovenia, October 2004.
[NW01] Najork, M. and Wiener, J. L.. Breadth-first Search Crawling Yields High-
quality Pages. In Proceedings of the 10th International World Wide Web
Conference, Hong Kong, May 2001.
[NW02] Nega Alemaehu and Willet P. Stemming of Amharic Words for Information
Retrieval. In Literary and Linguistic Computing. Oxford, Oxford University
press, Vol. 17, No.1, pp 1-17, 2002.
[NW03] Nega Alemayehu and Willet P. The Effectiveness of Stemming for Information
Retrieval in Amharic. Electronic library and information systems, Vol. 37,
No. 4, pp 254-259, 2003.
[Pat04] Anna Patterson. Why Writing Your Own Search Engine is Hard. ACM Queue,
pages 49 – 53, April 2004.
[PBMW98] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The
Pagerank Citation Ranking: Bringing Order to the Web. Technical report,
Stanford Digital Library Technologies Project, Stanford University, Stanford,
CA, USA, 1998.
[Pin94] Brain Pinkerton. What People Want: Experiences with the WebCrawler. In
Proceedings of the 2nd International World Wide Web Conference, Chicago,
Illinois, USA, 1994.
[PJV06] P. Pingali, J. Jagarlamudi, and V. Varma. 2006. WebKhoj: Indian language
IR from multiple character encodings. In Proceedings of the 15th
International Conference on World Wide Web Edinburgh, Scotland, May
2006.
[Pra99] S. Prasannas. A Web Interface to Indian Languages. Masters Thesis, Indian
Institute of Technology. Kanpur, India, 1999.

[RGM01] Sriram Raghavan and Hector Garcia-Molina. Crawling the Hidden Web. In
Proceedings of the Twenty-seventh International Conference on Very Large
Databases (VLDB), pages 129–138, Rome, Italy, 2001.
[RHJ99] Dave Raggett, Arnaud Le Hors, and Ian Jacobs. HTML 4.01 Specification.
W3C recommendation December 1999.
Available at: http://www.w3.org/TR/1999/REC-html401-19991224
[San02] Baskaran Sankaran. Tamil Search Engine. Tamil Internet 2002 Conference
, Foster City, CA, USA, September 2002.
[Sav02] Jacques Savoy. Information Retrieval on the Web: a new Paradigm.
UPGRADE Vol. III, No. 3, June 2002.
[SC03] Peter Steyn, Eliza Chan. Global Internet Population Grows an Average of
Four Percent Year-Over-Year. February 2003.
Available at: http://www.netratings.com/pr/pr_030220_hk.pdf
[SHMM98] C.Silverstein, M.Henzinger, H. Marais, and M. Moricz. Analysis of a very
Large AltaVista Query Log. Technical report, Digital Inc., October 1998.
SRC Technical note 1998-14.
[Sil03] Mario J. Silva. The Case for Portuguese Search Engine. DI/FCUL TR,
03.03, Department of Informatics,University of Lisbon. March, 2003
[SMO02] Izumi Suzuki, Yoshiki Mikami, Ario Ohsato. A Language and
Character Set Determination Method Based on N-gram Statistics. ACM
Transactions on Asian Language Information Processing, Vol 1, No.3
P 269-278, September, 2002.
[SM83] Salton, G. and Micheal J. McGill. Introduction to Modern Information
Retrieval, New York: McGraw-Hill Book Company, 1983.
[SMBR04] Mehran Sahami, Vibhu Mittal, Shumeed Baluja, and Henery Rowley. The
Happy Searcher: Challenges in Web Information Retrieval. In Trends in
Artificial Intelligence, 8th Pacific Rim International Conference on Artificial
Intelligence (PRICAI), Auckland, New Zealand, 2004.
[SR96] Penelope Sibun and Jeffery C. Reynar. Language Identification: Examining
the issues. In Proceedings of SDAIR-96, the 5th Symposium on Document
Analysis and Information Retrieval, pages 125-135, Las Vegas, USA, 1996.

[Sta03] StatMarket. Search Engine Referrals nearly Double Worldwide. 2003.
Available at: http://websidestory.com/pressroom/pressrelease.html?id=181
[STK05] K. Somboonviwat, T. Tamura, and M. Kitsuregawa. Simulation Study of
Language Specific Web Crawling. icdew, p. 1254.In Proceedings of the
SWOD’05, 2005.
[STK06] Kulwasww Sombonviwat, Takayuk Tamura, and Masaru Kitsuregawa.
Finding Thai Web Pages in Foreign Web Spaces. 2006.
Available at: http://www.db.soc.i.koyoto-u.ac.jp/DEWS2006/3A-i6.pdf
[Sul06] Danny Sullivan. Nielsen NetRatings- Search Engine Ratings. August 2006.
Available at:
http://searchenginewatch.com/showPage.html?page=2156451
[UNESCO05] Measuring the Linguistic Diversity on the Internet. UNESCO Publications
for the World summit on the Information Society, 2005.
[VB01] Vinsensius Berlian Vega, Sephane Bressan . Indexing Indonesian Web:
Language Identification and Miscellaneous Issues. Presented at Tenth
International World Wide Web Conference, Hong Kong, 2001.
[VT04] Liwen Vaughan, Mike Thelwall, Search Engine Coverage Bias: Evidence and
Possible Causes. Information Processing & Management, 2004.
[Wiki06] WIKIPIDIA the Free encyclopedia , 2006
Available at: http://en.wikipedia.org/wiki/Information
[Wil00] T.D. Wilson. Human Information Behavior. Informing science, Vol. 3 No. 2,
2000.
[Yay04] Kitaw Yayeh-Yirad. Cultural Identity and Local Content Development on the
World Wide Web, a Cyber Ethiopia Initiative. Paper presented at INFOCOM
2004 International Information and Communications Technology (ICT)
Exhibition and Forum, Addis Ababa, Ethiopia July 2004.
[Yem87] Baye Yemam. w dM ƒ 66 (G(w(w("( M 1987 F(I(66

APPENDIX I : Amharic Alphabets

’ “”•“– —˜™š˜›šœ•”
–ž Ÿ” “”
• Œ ‰ž¡ Šž¡ ž¡
¢ £ ž¡

; ¤ ¥ = ¦ J
D § ¨ © Y | ª
« ¬ - ® ¯ °
G s w ± I q ²
& ³ ´ µ ¶ • · ¸
K i ] g \ ¹
º r » _ ¼ ½
2 ¾ ¿ À p Á Â
à ˆ … z Ä † Å Æ Ç È É
o Ê B u ~ Ë
H € ‡ ^ { $ Ì
- Í Î • Ï % Ð Ñ
< Ò Ó > Ô Õ Ö × Ø Ù Ú Û
Ü Ý t ! Þ ß
0 à á d â e k ã
h j X W )
@ T ä ƒ å f
8 æ ç F „ è é
` ê ë ì í n S R î ï 5 ð
3 ñ ò ó ô õ ö
[ ÷ ø ù ú U û ü
4 ý þ

E ‚ v m Q P
V c "
/ #
a A
.
9 ! " # $ % & '
: ( ) y * + ,
* - . / 0 1 2
C 3 4 5 6 Z 7 8
+ 9 } : ; x <
1 = > ? @ A B

APPENDIX II : Amharic Numerals

C D E F G H I Ž

• Œ ‰ Š ¢ £ J ‹

• K L • M ‘ N O P • Q

10 20 30 40 50 60 70 80 90 100 1000

APPENDIX III : Amharic Punctuation Marks [Haz55]


6 - (hulet) netib – Amharic word space
R - (mulu) Arat netib – Amharic Full stop
M - Netela serez – Amharic comma
N - dereb serez – Amharic semicolon
“ “ - temiherte Tikes – Amharic Quotation mark
! - temiherte anekero – Amharic Exclamation mark
() - qenef – Amharic Bracket
_ - cheret – Amharic Underscore
- - neues chret – Amharic Hyphen
… - netebetab – Amharic etc.
? - timehrete teyaqie – Amharic Question mark
. - yzet (netib) – Amharic Dot
S ' timehrte silaq

APPENDIX IV : Sample log of our Crawler
213.55.95.4 - - [05/Jul/2007:05:44:28 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/09/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910709.5.html
HTTP/1.0" 200 36006
213.55.95.4 - - [05/Jul/2007:05:44:51 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/09/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910709.6.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:44:54 +0300] "GET /robots.txt HTTP/1.0"
404 304
213.55.95.4 - - [05/Jul/2007:05:44:54 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/09/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910709.6.html
HTTP/1.0" 200 1869
213.55.95.4 - - [05/Jul/2007:05:45:15 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/16/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910716.1.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:45:15 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/16/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910716.1.html
HTTP/1.0" 200 26644
213.55.95.4 - - [05/Jul/2007:05:45:37 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/16/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910716.2.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:45:37 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/16/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910716.2.html
HTTP/1.0" 200 11538
213.55.95.4 - - [05/Jul/2007:05:45:59 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/16/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910716.3.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:45:59 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/16/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910716.3.html
HTTP/1.0" 200 8845
213.55.95.4 - - [05/Jul/2007:05:46:20 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.1.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:46:20 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.1.html
HTTP/1.0" 200 40824
213.55.95.4 - - [05/Jul/2007:05:46:42 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.2.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:46:49 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.2.html
HTTP/1.0" 200 8299
213.55.95.4 - - [05/Jul/2007:05:47:11 +0300] "HEAD

/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.3.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:47:12 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.3.html
HTTP/1.0" 200 7380
213.55.95.4 - - [05/Jul/2007:05:47:33 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.4.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:47:36 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.4.html
HTTP/1.0" 200 1573
213.55.95.4 - - [05/Jul/2007:05:48:04 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.5.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:48:08 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.5.html
HTTP/1.0" 200 1896
213.55.95.4 - - [05/Jul/2007:05:48:28 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.6.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:48:29 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/23/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910723.6.html
HTTP/1.0" 200 1469
213.55.95.4 - - [05/Jul/2007:05:48:50 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/30/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910730.1.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:48:50 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/30/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910730.1.html
HTTP/1.0" 200 27601
213.55.95.4 - - [05/Jul/2007:05:49:12 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/30/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910730.2.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:49:13 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/30/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910730.2.html
HTTP/1.0" 200 5550
213.55.95.4 - - [05/Jul/2007:05:49:34 +0300] "HEAD
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/30/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910730.3.html
HTTP/1.0" 200 -
213.55.95.4 - - [05/Jul/2007:05:49:34 +0300] "GET
/%e1%8c%a6%e1%89%a2%e1%8b%ab/%e1%88%9b%e1%8b%8d%e1%8c%ab/1991/%e1%88%98%e
1%8c%8b%e1%89%a2%e1%89%b5/30/%e1%8c%a6%e1%89%a2%e1%8b%ab.19910730.3.html

APPENDIX V : List of Classes
CRAWLER

Classifier – a class that identifies the language, script, and encoding of a page. It is used for
determining whether a page is Amharic or not. It uses the Language Identification
Module (LIM).
Crawler – the main class that does the crawling, using all the other classes in this component.
FileSaver – a class that saves similar files in one folder. The similarity is based on file
extension.
Headok- a class that checks whether the page corresponding to the given URL is ok to be
downloaded. It uses HTTP’s Head method.
LinkExtracter – a class that extracts all the outgoing links from a downloaded page. It uses
Java's regular expressions for the extraction.
RobotsDownLoader – this class checks if the given URL is allowed to be downloaded by
the crawler, by checking Robots.txt and using the Headok class. If it is
allowed, it downloads the page. It also enforces connection and read
timeouts using the HttpClient package.
Sleep – enforces a 20 second gap between two consecutive downloads from the same host.
StopCrawling – a class that gives the administrator a utility that can stop the crawling by
taking the checkpoint of the crawling status.

INDEXER & QUERY ENGINE

Package engine.index

AmharicTokenizer – tokenizes streams of characters. It uses Amharic punctuation marks
and white space to demarcate words. It extends Lucene's CharTokenizer class.
AmharicStemmer – stems a given Amharic word. It only stems the inflectional morphology of
a given word.
StopStemmer – it stems stop words. It tries to stem a given word for both inflectional and
derivational morphology. It is used to handle different variants of a stopword.

AmharicStopFilter- it returns tokens that are not stopwords. It removes stopwords using
StopStemmer and dictionary lookup. It extends Lucene’s TokenFilter class.
AmharicStemFilter – extends Lucene’s TokenFilter class. It creates tokens that are stemmed
using AmharicStemmer.
AmharicAnalyzer – this is the class that encapsulates all the pre-processing operations in
creating an index. It extends Lucene’s Analyzer class. It incorporates
tokenization, stopword removal, stemming, etc and uses AmharicTokenizer,
AmharicStopFilter, and AmharicStemFilter classes.
HtmlHandler – extracts text from HTML pages.
PDFHandler – extracts text from PDF pages.
WORDHandler – extracts text from Ms-Word pages.
IndexManager – an important class that creates the actual index. It uses different lucene
classes and AmharicAnalyzer plus the text extraction classes.
shortword – handles the expansion of Amharic compound words that are written using “/”or
“.”.
Package engine.search

SearchManger – the main class that accepts a query, searches the already existing index and
returns "relevant" documents for the given query. It uses different Lucene
classes and the IndexManger, AmharicAnalyzer, and SearchResultBean classes.
Its Search() method is the main method that does most of the work of the class.
SearchResultBean – a Java Bean class with set and get methods for the different results from
SearchManager.

package engine.servlet

SearchController – extends HttpServlet class. It accepts the user query and other information
from the interface (searchs.jsp) using post() or get() methods. It invokes
SearchManger’s Search () method and dispatches the results to the
results.jsp.

Package engine.util

Ranking – a class that calculates the link based rank of the URLs in the repository.
Utils – a class that is used for testing the correctness of the analyzer and for making sure the
query is parsed correctly.

INTERFACE

Searchs – a Java Server Page that lets the user enter his/her information need in the
Amharic language. It implements the interface of the system.
Results – a Java Server Page that displays the results for a given query. It displays the title of
the page, an excerpt from the page and the URL of the page. It displays only 10
results at a time if the result set is greater than ten, and it has a button for the
next 10 results.

APPENDIX VI : The Result of a Routine on the Variants of the Stopword

T žUVW U“” X W U“” ˜Yž•“ T žUVW U“” X W U“” ˜Yž•“


X VV›ZšŸ[ ž¡• “U\žšŸ• X VV›ZšŸ[ ž¡• “U\žšŸ•

! !
! I !
Þ ! I !
! !
` ! D %I !
! D % !
Ü ! W!" ! !
` Ü ! W!V !
Ü ! D ! !
! D Ü !
G ! ! DI ! !
G Ü ! W! ! !
G ß ! W! Ü !
¨ ! ¨J G Ü! !
r ! ! Ü !
r Ü ! !
¨ Ü ¨ ! !
Ê ! ÊJ Ü !
! Ê !I ÊJ
Ü ! ìY !
% ! ìY Ü !
` % ! D ! !
% ! D Ü !
% ! D !
`G ! ! D Ü !
`G Ü ! D !
`G ß ! D Ü !
` ! D I !
! D ÜI !
D % ! ! !
` Ü ! Ü ! !
Ü ! W!V ! !
` % ! W!V Ü !
% ! % !
% ! Ü !
ìY % ! Ü !
DI ! ! ! !
DG Ü ! Ü ! !

T žUVW U“” X W U“” ˜Yž•“
X VV›ZšŸ[ ž¡• “U\žšŸ•

% !
DG ! !
Y ! !
` Ü !
` !
`I !
G - !
W!V Ü !
W!V % !
% ! !
I ! !

APPENDIX VII : List of Short Words
B

nZY
= Dz
= _
V[
V ^~
G
GB
G Dz
n `Hw
n ;E
@
@\
@]
@ _
Z _
ZB
%B
r
x
!

EE
Bn !
I
IB
H = wÞ

SY
±/ Y
u/ Y
© SDtY
¨ G!

GIJ
x
F(I
F(F
(

Declaration
I, the undersigned, declare that this thesis is my original work and has not been presented
for a degree in any other university, and that all source of materials used for the thesis
have been duly acknowledged.

Declared by:

Name: ______________________________________

Signature: ______________________________________

Date: ______________________________________

Confirmed by advisor:

Name: ______________________________________

Signature: ______________________________________

Date: ______________________________________

Place and date of submission: Addis Ababa, July 2007.
