Information Retrieval: A Perspective For Assamese Language

Moinuddin Ahmed
10/13/12
Information Retrieval:
A Perspective For Assamese
Moinuddin Ahmed 100302005 IST,Gauhati University
Content
1. Introduction
INTRODUCTION
Tremendous growth of digital data
The introduction of web search engines has boosted the need for very large scale retrieval systems even further.
The Internet is a great repository of information. With state-of-the-art search engines like Google, Yahoo, etc., the information is easily available.
The use of digital methods for storing and retrieving information has led to the phenomenon of digital obsolescence.
Introduction (contd..)
But there is still a problem. Most of the information on the web is in English (56.4%). So, for a person who does not know English, this vast information is simply not available. This problem is especially relevant for a multicultural and multilingual country like India.
Cross-language information retrieval enables users to retrieve documents originally created in
What is Information Retrieval (IR)?
Definition given by Manning:
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
IR is the art of presentation, storage, and organization of information.
It is an art because the quality of the IR system depends on the experience of the human user.
The goal is to satisfy the user's information need.
Basic Components in an IR system

[Figure: components of an IR system: Set of URLs to be crawled, Document Collection, Crawling, Indexing, Query, Searching, Ranking, Result, Relevance Feedback]

1. The system browses the document collection and fetches documents. (Crawling)
2. The system builds an index of the documents. (Indexing)
3. The user gives the query.
4. The system retrieves documents that are relevant to the query from the index and displays them to the user.
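
The four steps above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the pages, URLs, and function names (`crawl`, `build_index`, `search`) are hypothetical, crawling is simulated by a dictionary lookup, and ranking simply counts matching query terms.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text (stands in for real crawling).
PAGES = {
    "http://example.org/a": "assamese language search",
    "http://example.org/b": "information retrieval for assamese",
    "http://example.org/c": "web search engines",
}

def crawl(seed_urls):
    """Step 1: fetch each URL in the seed list (here: a dictionary lookup)."""
    return {url: PAGES[url] for url in seed_urls if url in PAGES}

def build_index(documents):
    """Step 2: map every term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Steps 3 and 4: rank URLs by how many query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, ()):
            scores[url] += 1
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(crawl(list(PAGES)))
results = search(index, "assamese search")
```

Real systems such as Nutch and Lucene replace each of these functions with far more elaborate components, but the data flow is the same.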
Basics of Information retrieval
Boolean Model
Documents are modeled as a bag of index terms: {w11, w12, ..., w1t}

t = number of index terms
wij = 0 or 1, depending on whether the term appears in the document or not
The query is a Boolean algebra expression
e.g.
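
As a minimal sketch of how such a query could be evaluated, the snippet below builds binary term weights for a hypothetical three-document collection and tests one hypothetical query against them. The documents, the query, and the names `DOCS`, `VOCAB`, `w`, and `matches` are all assumptions made for illustration.

```python
# Hypothetical toy collection; its terms form the vocabulary of index terms.
DOCS = {
    "d1": "assamese search engine",
    "d2": "english search engine",
    "d3": "assamese information retrieval",
}
VOCAB = sorted({term for text in DOCS.values() for term in text.split()})

# Binary weights: w[d][term] is 1 if the term appears in document d, else 0.
w = {
    d: {term: int(term in text.split()) for term in VOCAB}
    for d, text in DOCS.items()
}

def matches(d):
    """Hypothetical query: assamese AND (search OR retrieval) AND NOT english."""
    row = w[d]
    return bool(
        row["assamese"]
        and (row["search"] or row["retrieval"])
        and not row["english"]
    )

# A document is either returned or not; there is no ranking and no partial match.
hits = [d for d in DOCS if matches(d)]
```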
Boolean Model - Comments
Advantages:
- Simple, efficient, easy to implement
- Precise: you get exactly what is specified
- The earliest retrieval model to be implemented; still widely used in small-scale retrieval (e.g. email search)

Disadvantages:
- Partial match is not possible
- Very difficult to form a query

Vector Based Model
Document and query are both represented as vectors of index terms.
t = total number of index terms in the vocabulary
Each index term is given a term weight according to how important it is for representing the meaning of the document.
Two widely used parameters for fixing
Vector Based Model
Documents are retrieved and ranked based on the similarity between the document vector and the query vector.
Similarity can be calculated using the cosine similarity measure.
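
A minimal sketch of this ranking follows, using the standard cosine formula cos(d, q) = (d . q) / (|d| |q|). The term weights here are raw term counts, which is an assumption made to keep the example short; the documents, the query, and the helper names `to_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Represent text as a vector of raw term counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy collection and query.
DOCS = {
    "d1": "assamese search engine",
    "d2": "english search engine",
    "d3": "assamese assamese retrieval",
}
query = "assamese search"

vocabulary = sorted({t for text in list(DOCS.values()) + [query] for t in text.split()})
q_vec = to_vector(query, vocabulary)
scores = {d: cosine(to_vector(text, vocabulary), q_vec) for d, text in DOCS.items()}

# Most similar document first; ties are broken by document name.
ranking = sorted(scores, key=lambda d: (-scores[d], d))
```

Because the weights are non-negative, every score lies between 0 and 1, so a document that shares only some query terms still receives a non-zero score.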
Vector based model

Advantages:
- The cosine similarity measure returns a value in the range 0 to 1. Hence, partial matching is possible.
- Ranking is possible.
- Works well with general collections.

Disadvantage:
- Index terms are considered to be mutually independent.
Probabilistic Model
The basic idea is to retrieve documents according to their probability of being relevant.
Documents are ranked according to the probability of relevance given the document and the query, or P(R = 1 | d, q).
A document is termed relevant if its probability of being relevant is greater than its probability of being non-relevant.
Probabilities are estimated as accurately as possible.
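
The ranking and the relevance rule can be illustrated with a short sketch. The probability values below are made up for the example; how P(R = 1 | d, q) is actually estimated is not covered here, and the names `p_relevant`, `ranking`, and `relevant` are hypothetical.

```python
# Hypothetical estimates of P(R = 1 | d, q) for one query. In a real system
# these would come from an estimation procedure that is not specified here.
p_relevant = {"d1": 0.82, "d2": 0.35, "d3": 0.51}

# Rank documents by decreasing probability of relevance.
ranking = sorted(p_relevant, key=p_relevant.get, reverse=True)

# P(R = 0 | d, q) = 1 - P(R = 1 | d, q), so a document is termed relevant
# when its probability of relevance exceeds its probability of non-relevance.
relevant = [d for d in ranking if p_relevant[d] > 1 - p_relevant[d]]
```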
Probabilistic model
Advantage:
- Strong ranking according to probability of relevance is possible.

Disadvantages:
- Implemented only in small-scale IR tasks like library catalog search.
- Generally outperformed by the vector based model.
Architecture of Sandhan
The CLIA system can be broadly categorized into:
1) Offline Processing
2) Online Processing
Offline processing
- carried out in offline mode
- before the actual query
It consists of the following modules:
- Crawling
- Document Processing
- Indexing
Online processing
- carried out in online mode
- after the query has been fired by the user
It consists of the following modules:
- Input Processing Subsystem
- Search & Retrieval
- Ranking
- Snippet generation
- Summary generation
Assamese Monolingual Search Engine

Resources used:
- Nutch
- Lucene
- Eclipse Galileo
- Lukeall
NUTCH
Nutch is an open source web search engine coded entirely in the Java programming language, but data is written in language-independent formats.
It has a highly modular architecture, allowing developers to create plug-ins.
The crawler has been written specifically for
Crawling in Nutch
Crawling was set at a depth of 2 and topN 50
Configuration files (XML)

Required user parameters:
- http.agent.name
- http.agent.description
- http.agent.url
- http.username
- http.password
Crawl URL Filter

Regular expressions are used to filter URLs during crawling.
E.g.:
To ignore files with a certain suffix:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
+^http://([a-z0-9]*\.)*gauhati.ac.in/
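
To see how these rules behave, the sketch below applies them to a few URLs in Python rather than inside Nutch. It assumes that rules are tried in order, that the first rule whose pattern is found in the URL decides ('+' accepts, '-' rejects), and that a URL matching no rule is rejected. The function name `accept` and the sample URLs are hypothetical; the patterns are copied unchanged from the rules above.

```python
import re

# Rules in the order given above: '-' rejects, '+' accepts.
RULES = [
    ("-", re.compile(r"\.(gif|exe|zip|ico)$")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*apache.org/")),
    ("+", re.compile(r"^http://([a-z0-9]*\.)*gauhati.ac.in/")),
]

def accept(url):
    """Assumed semantics: the first matching rule decides; no match means reject."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

# Hypothetical URLs used only to exercise the rules.
samples = [
    "http://www.gauhati.ac.in/index.html",
    "http://www.gauhati.ac.in/logo.gif",
    "http://lucene.apache.org/",
    "http://example.com/",
]
decisions = {url: accept(url) for url in samples}
```

Under these assumptions, rule order matters: the suffix rule comes first, so an image on an accepted host is still skipped.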
Lucene
- Free and open source
- Written in Java, supported by the Apache Software Foundation
- Lucene is mature, high-performance, and highly flexible
Architecture of Lucene:
Resources
- Made 2 simple search engines that are able to retrieve Assamese documents.
- Assamese stemmer and stop word list by GU.
- Crawling was set at a depth of 2 and topN 50.
- Lukeall 1.1 was used to verify the indexing procedure.
A list of approximately 500 documents was crawled and indexed successfully.
The GUI: Nutch & Lucene
Result
The GUI: Nutch & Solr
Result
Future & conclusion

In the Indian context, the need for such a system becomes more evident: India being a multilingual country, the people here are familiar with more than one language.
Only 21.09% of the total population of India can speak English.
Compared to urban areas,
