Information Retrieval: A Perspective For Assamese Language


Moinuddin Ahmed

10/13/12

Information Retrieval:
A Perspective For Assamese

Moinuddin Ahmed, 100302005, IST, Gauhati University


Content

1. Introduction


INTRODUCTION

Tremendous growth of digital data

The introduction of web search engines has boosted the need for very large scale retrieval systems even further.

The Internet is a great repository of information. With state-of-the-art search engines like Google, Yahoo, etc., the information is easily available.

The use of digital methods for storing and retrieving information has led to the phenomenon of digital obsolescence.


Introduction (contd.)
But there is still a problem. Most of the information on the web is in English (56.4%). So, for a person who does not know English, this vast information is simply not available. This problem is more pronounced in a multicultural and multilingual country like India. Cross-language information retrieval enables users to retrieve documents originally created in a language different from the one used for the query.


What is Information Retrieval (IR)?

Definition given by Manning:

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Art of presentation, storage, organization.

Art because the quality of the IR system depends on experience of the human user.

The goal is to satisfy the user's information need.


Basic Components in an IR system


[Figure: components of an IR system: Set of URLs to be crawled, Document Collection, Crawling, Indexing, Query, Search, Ranking, Result, Relevance Feedback]

1. The system browses the document collection and fetches documents. (Crawling)
2. The system builds an index of the documents. (Indexing)
3. The user gives the query.
4. The system retrieves documents that are relevant to the query from the index and displays them to the user.
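The four steps of this slide can be sketched in a few lines of Python. This is only a minimal illustration: the document collection is a hypothetical in-memory dictionary, "crawling" simply reads it, and ranking counts matching query terms, which is far simpler than what a real engine does.

```python
from collections import defaultdict

# Hypothetical in-memory "document collection"; a real system would fetch pages.
DOCUMENTS = {
    "doc1": "information retrieval finds relevant documents",
    "doc2": "search engines index documents from the web",
    "doc3": "assamese language documents on the web",
}

def crawl(collection):
    """Step 1: fetch (doc_id, text) pairs from the collection."""
    return list(collection.items())

def build_index(fetched):
    """Step 2: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in fetched:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Steps 3-4: score documents by the number of query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(crawl(DOCUMENTS))
results = search(index, "web documents")
```

Here `results` lists the documents that contain both query terms before the one that contains only one of them.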


Basics of Information retrieval


Boolean Model

Documents are modeled as a bag of index terms: $\{w_{11}, w_{12}, \ldots, w_{1t}\}$

$t$ = number of index terms
$w_{ij}$ = 0 or 1, depending on whether term $j$ appears in document $i$ or not

The query is a Boolean algebra expression over the index terms.
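A small sketch may make the Boolean model concrete. The vocabulary, the three documents and the query below are all hypothetical; the code only illustrates binary term weights and set-based evaluation of a Boolean query.

```python
# Hypothetical vocabulary of t = 4 index terms.
VOCABULARY = ["assamese", "retrieval", "search", "web"]

def to_binary_vector(text):
    """Weight is 1 if the index term occurs in the document, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in VOCABULARY]

# Hypothetical documents.
docs = {
    "d1": "assamese text retrieval",
    "d2": "assamese web search",
    "d3": "english web retrieval",
}
vectors = {doc_id: to_binary_vector(text) for doc_id, text in docs.items()}

def has(term):
    """Set of documents whose weight for `term` is 1."""
    position = VOCABULARY.index(term)
    return {doc_id for doc_id, vec in vectors.items() if vec[position] == 1}

# Query: assamese AND (search OR retrieval) AND NOT web
all_docs = set(vectors)
matches = has("assamese") & (has("search") | has("retrieval")) & (all_docs - has("web"))
```

AND, OR and NOT map directly onto set intersection, union and complement, which is why a document either matches exactly or is not returned at all.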


Boolean Model - Comments

Advantages:

- Simple, efficient, easy to implement
- Precise: get exactly what is specified
- The earliest retrieval model to be implemented
- Still widely used in small scale retrieval (email search)

Disadvantages:

- Partial match is not possible
- Very difficult to form a query



Vector Based Model

Document and query are both represented as vectors of index terms.

t = total number of index terms in the vocabulary

Each index term is given a term weight according to how important it is in representing the meaning of the document.

Two widely used parameters are used for fixing the term weights.
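The slide text does not name the two weighting parameters. A common choice, assumed here, is term frequency (tf) and inverse document frequency (idf). The sketch below uses raw counts for tf and log(N/df) for idf; other variants exist, and the two documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return (vocabulary, {doc_id: weights}) with weight = tf * log(N / df)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    vocabulary = sorted({w for words in tokenized.values() for w in words})
    n_docs = len(docs)
    # df[w] = number of documents in which term w occurs at least once.
    df = Counter(w for words in tokenized.values() for w in set(words))
    vectors = {}
    for d, words in tokenized.items():
        tf = Counter(words)  # raw count of each term in this document
        vectors[d] = [tf[w] * math.log(n_docs / df[w]) for w in vocabulary]
    return vocabulary, vectors

# Hypothetical two-document collection.
vocabulary, vectors = tf_idf_vectors({
    "d1": "assamese news assamese",
    "d2": "english news",
})
```

A term that occurs in every document (here "news") gets weight zero, while a term that is frequent in one document and rare in the collection gets a high weight.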


Vector Based Model

Documents are retrieved and ranked based on the similarity between the document vector and the query vector.

Similarity can be calculated using the cosine similarity measure.
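As a minimal sketch of this ranking step, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The three document vectors and the query vector below are hypothetical weights chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return document ids ordered by decreasing similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical weights over a three-term vocabulary.
doc_vectors = {
    "d1": [1.0, 1.0, 0.0],
    "d2": [0.0, 1.0, 1.0],
    "d3": [0.0, 0.0, 1.0],
}
query_vector = [1.0, 0.0, 0.0]
ranking = rank(query_vector, doc_vectors)
```

Only `d1` shares a term with the query, so it is ranked first; the other two documents score zero and are ordered by id.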


Vector based model


Advantage:

- Cosine similarity measure returns a value in the range 0 to 1. Hence, partial matching is possible.
- Ranking is possible.
- Works well with general collections.

Disadvantage:

- Index terms are considered to be mutually independent.


Probabilistic Model

The basic idea is to retrieve the documents according to the probability of the document being relevant.

Ranks according to the probability of relevance given the document and the query, or,

$P(R = 1 \mid d, q)$

The document is termed relevant if its probability of being relevant is greater than its probability of being non-relevant.

Probabilities are estimated as accurately as possible.
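The relevance criterion can be written out in one short step. This assumes relevance is binary ($R \in \{0, 1\}$), so that the two probabilities sum to one:

```latex
\begin{align*}
P(R = 1 \mid d, q) &> P(R = 0 \mid d, q) \\
P(R = 1 \mid d, q) &> 1 - P(R = 1 \mid d, q) \\
P(R = 1 \mid d, q) &> \tfrac{1}{2}
\end{align*}
```

Under that assumption, a document is termed relevant exactly when its estimated probability of relevance exceeds one half, and ranking by $P(R = 1 \mid d, q)$ puts such documents first.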


Probabilistic model
Advantage:

- Strong ranking according to probability of relevance is possible.

Disadvantage:

- Implemented only in small scale IR tasks like library catalog search.
- Generally outperformed by the vector based model.


Architecture of Sandhan

The CLIA system can be broadly categorized into:

1) Offline Processing
2) Online Processing


Offline processing
- Carried out in offline mode
- Before the actual query

It consists of the following modules:

- Crawling
- Document Processing
- Indexing


Online processing
- Carried out in online mode
- After the query has been fired by the user

It consists of the following modules:

- Input Processing Subsystem
- Search & Retrieval
- Ranking
- Snippet generation
- Summary generation

Assamese Monolingual Search Engine


Resources used:

- Nutch
- Lucene
- Eclipse Galileo
- Lukeall


NUTCH

Nutch is an open source web search engine coded entirely in the Java programming language, but data is written in language-independent formats.

It has a highly modular architecture, allowing developers to create plug-ins.

The crawler has been written specifically for this project.


Crawling in Nutch


Crawling was set at a depth of 2 and topN 50
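To illustrate what the two settings control, here is a toy round-based crawl over a hypothetical in-memory link graph. It is not Nutch's implementation: `depth` is modelled as the number of fetch rounds and `top_n` as the maximum number of pages fetched per round, and candidates beyond `top_n` are simply dropped rather than kept for later rounds.

```python
def crawl(link_graph, seeds, depth, top_n):
    """Fetch pages in rounds: at most `depth` rounds, at most `top_n` pages each."""
    fetched = []
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        batch = frontier[:top_n]  # only the first top_n candidates are fetched
        if not batch:
            break
        fetched.extend(batch)
        next_frontier = []
        for url in batch:
            for link in link_graph.get(url, []):
                if link not in seen:  # never queue the same URL twice
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched

# Hypothetical link graph: each page maps to the pages it links to.
graph = {"seed": ["a", "b", "c"], "a": ["d"], "b": ["e"]}

limited = crawl(graph, ["seed"], depth=2, top_n=2)
as_on_slide = crawl(graph, ["seed"], depth=2, top_n=50)
```

With a depth of 2 only the seed and the pages it links to are fetched; pages discovered in the second round (`d`, `e`) are never visited. A small `top_n` additionally cuts each round short.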


Configuration files (XML)


Required user parameters:

- http.agent.name
- http.agent.description
- http.agent.url
- http.username
- http.password
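As a sketch of what such a configuration file looks like, the snippet below generates the XML with Python's standard library. It assumes the Hadoop-style `configuration` / `property` / `name` / `value` layout used by Nutch's configuration files; the property values are hypothetical placeholders, not the ones used in this project.

```python
import xml.etree.ElementTree as ET

def build_site_config(properties):
    """Serialize a {name: value} mapping as a <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical values for three of the parameters listed on the slide.
settings = {
    "http.agent.name": "ExampleCrawler",
    "http.agent.description": "Example crawler for Assamese web pages",
    "http.agent.url": "http://example.org/crawler",
}
xml_text = build_site_config(settings)
```

Each parameter becomes one `<property>` element holding a `<name>` and a `<value>` child.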


Crawl URL Filter


Regular expression to filter URLs during crawling

E.g.:

To ignore files with a certain suffix:

-\.(gif|exe|zip|ico)$

To accept hosts in a certain domain:

+^http://([a-z0-9]*\.)*apache.org/
+^http://([a-z0-9]*\.)*gauhati.ac.in/
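The effect of these rules can be sketched with Python's `re` module. The sketch assumes that rules are tried in order, that the first matching pattern decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected; the test URLs are hypothetical.

```python
import re

# The three rules from the slide, in order: sign followed by pattern.
RULES = [
    ("-", re.compile(r"\.(gif|exe|zip|ico)$")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*apache.org/")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*gauhati.ac.in/")),
]

def accepts(url):
    """Return True if the first matching rule is a '+' rule (assumed semantics)."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: URLs that match no rule are ignored

page_ok = accepts("http://www.gauhati.ac.in/index.html")
image_ok = accepts("http://www.gauhati.ac.in/logo.gif")
other_ok = accepts("http://example.com/")
```

Because the suffix rule comes first, an image on an accepted host is still rejected, which is why the order of the rules matters.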


Lucene
- Free and open source
- Written in Java, supported by the Apache Software Foundation
- Lucene is a mature and high-performance search library
- Highly flexible


Architecture of Lucene:


Resources

- Made 2 simple search engines that are able to retrieve Assamese documents.
- Assamese stemmer and stop word list by GU.
- Crawling was set at a depth of 2 and topN 50.
- Lukeall 1.1 was used to verify the indexing procedure.
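How a stop word list and a stemmer fit into indexing can be sketched as a small analysis function. The GU stemmer and stop word list are not reproduced in these slides, so the sketch uses a simple longest-suffix-stripping rule and English placeholder data; both are assumptions for illustration only.

```python
def analyze(text, stop_words, suffixes):
    """Tokenize on whitespace, drop stop words, strip the longest matching suffix."""
    terms = []
    for token in text.split():
        if token in stop_words:
            continue
        # Try longer suffixes first so the longest match wins.
        for suffix in sorted(suffixes, key=len, reverse=True):
            if token.endswith(suffix) and len(token) > len(suffix):
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms

# Placeholder resources (hypothetical, English only).
STOP_WORDS = {"the", "of"}
SUFFIXES = ["s", "ing"]

terms = analyze("the indexing of documents", STOP_WORDS, SUFFIXES)
```

The same function would be applied to both documents and queries so that inflected forms of a word map to the same index term.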


A list of approximately 500 documents was crawled and indexed successfully.


The GUI: Nutch & Lucene


Result


The GUI: Nutch & Solr


Result


Future & conclusion


In the Indian context the need for such a system becomes more evident: India being a multilingual country, the people here are familiar with more than one language.

Only 21.09% of the total population of India can speak English.

Compared to urban areas,