Information Retrieval: A Perspective For Assamese Language
10/13/12
Moinuddin Ahmed
Content
1. Introduction
INTRODUCTION
The introduction of web search engines has boosted the need for very large scale retrieval systems even further.
The Internet is a great repository of information. With state-of-the-art search engines such as Google and Yahoo, this information is easily available.
The use of digital methods for storing and retrieving information has led to the phenomenon of digital obsolescence.
Introduction (contd..)
But there is still a problem. Most of the information on the web is in English (56.4%), so for a person who does not know English this vast information is simply not available. This problem is especially relevant for a multicultural and multilingual country like India. Cross-language information retrieval enables users to retrieve documents originally created in
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
IR is also an art, because the quality of an IR system depends on the experience of the human user.
[Figure: the IR process. Recoverable labels: Set of URLs to be crawled, Crawling, Indexing, Result, Relevance Feedback]
1. The system browses the document collection and fetches documents (crawling).
2. The system builds an index of the documents (indexing).
3. The user gives the query.
4. The system retrieves documents that are relevant to the query from the index and displays them to the user.
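The four steps can be sketched in a few lines of Python. This is only an illustrative sketch: the already-fetched documents stand in for the crawler's output, and the names `build_index` and `retrieve` are hypothetical, not part of any system described in these slides.

```python
from collections import defaultdict


def build_index(documents):
    """Step 2: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def retrieve(index, query):
    """Steps 3 and 4: return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Step 1 (crawling) is assumed to have produced these documents already.
fetched = {
    "d1": "Assamese news portal",
    "d2": "Assamese language resources",
    "d3": "English news",
}
index = build_index(fetched)
hits = retrieve(index, "assamese news")
```

A real system would add ranking and relevance feedback on top of this lookup; the sketch only shows how the index connects crawling to query answering.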
Boolean Model
t = number of index terms
w_ij = 0 or 1, depending on whether the term appears in the document or not
e.g.
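A minimal sketch of Boolean retrieval over 0/1 term weights, assuming a tiny made-up collection (the documents, the `has` helper and the queries are all hypothetical, chosen only to illustrate the model):

```python
# Hypothetical three-document collection.
DOCS = {
    "d1": "information retrieval for assamese language",
    "d2": "web search engines crawl and index documents",
    "d3": "assamese documents on the web",
}

# The vocabulary holds the t index terms.
VOCAB = sorted({term for text in DOCS.values() for term in text.split()})


def incidence(text):
    """Binary weights: 1 if the index term appears in the document, else 0."""
    terms = set(text.split())
    return [1 if term in terms else 0 for term in VOCAB]


WEIGHTS = {doc_id: incidence(text) for doc_id, text in DOCS.items()}


def has(term):
    """Set of documents whose weight for the given term is 1."""
    if term not in VOCAB:
        return set()
    i = VOCAB.index(term)
    return {doc_id for doc_id, w in WEIGHTS.items() if w[i] == 1}


result_and = has("assamese") & has("web")   # assamese AND web
result_or = has("assamese") | has("web")    # assamese OR web
result_not = has("web") - has("assamese")   # web AND NOT assamese
```

Each query returns an unranked set: a document either satisfies the Boolean expression or it does not.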
Advantages:
- Simple, efficient, and easy to implement
- Precise: you get exactly what is specified
- The earliest retrieval model to be implemented
- Still widely used in small-scale retrieval (e.g. email search)
Disadvantages:
Vector Space Model
- Document and query are both represented as vectors of index terms.
- t = total number of index terms in the vocabulary.
- Each index term is given a term weight according to how important it is in representing the meaning of the document.
Documents are retrieved and ranked based on the similarity between the document vector and the query vector.
Similarity can be calculated using the cosine similarity measure.
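As a small sketch of this ranking step, the cosine measure is the dot product of the two vectors divided by the product of their lengths. The weight vectors below are hypothetical values over a three-term vocabulary, and the function names are illustrative only.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm = math.sqrt(sum(d * d for d in doc_vec)) * math.sqrt(
        sum(q * q for q in query_vec)
    )
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0


def rank(doc_vectors, query_vec):
    """Document ids ordered by decreasing similarity to the query."""
    scored = [
        (cosine_similarity(vec, query_vec), doc_id)
        for doc_id, vec in doc_vectors.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: p[0], reverse=True)]


# Hypothetical weights over a three-term vocabulary.
doc_vectors = {"d1": [1, 1, 0], "d2": [0, 1, 1], "d3": [0, 0, 1]}
query = [1, 1, 0]
ranking = rank(doc_vectors, query)
```

Here d2 shares only one of the two query terms, so it scores between the full match d1 and the non-matching d3, which is the partial matching the Boolean model cannot express.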
Advantages:
- The cosine similarity measure returns a value in the range 0 to 1. Hence, partial matching is possible.
- Ranking is possible.
- Works well with general collections.
Disadvantage:
Index
Probabilistic Model
The basic idea is to retrieve documents according to the probability of the document being relevant.
Documents are ranked according to the probability of relevance given the document and the query, or P(R = 1 | d, q).
A document is termed relevant if its probability of being relevant is greater than its probability of being non-relevant.
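The relevance rule can be written out as a short worked derivation. It assumes only that relevance is binary, so the two probabilities sum to one; the odds symbol O is introduced here for illustration.

```latex
\begin{align*}
P(R=1 \mid d, q) &> P(R=0 \mid d, q) \\
\iff\; P(R=1 \mid d, q) &> 1 - P(R=1 \mid d, q) \\
\iff\; P(R=1 \mid d, q) &> \tfrac{1}{2} \\
\iff\; O(R \mid d, q) = \frac{P(R=1 \mid d, q)}{P(R=0 \mid d, q)} &> 1
\end{align*}
```

Because the odds grow monotonically with the probability, ranking documents by either quantity gives the same order.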
Probabilistic model
Advantage:
- Ranking
Disadvantage:
- Implemented only in small-scale IR tasks such as library catalog search
- Generally outperformed by the vector space model
Architecture of Sandhan
The CLIA system can be broadly categorized into:
1) Offline Processing
2) Online Processing
Offline processing
- Carried out in offline mode, before the actual query
Online processing
- Carried out in online mode
- Input Processing Subsystem
- Search & Retrieval
- Ranking
- Snippet generation
- Summary generation
used
Galileo
NUTCH
- Nutch is an open-source web search engine coded entirely in the Java programming language, but its data is written in language-independent formats.
- It has a highly modular architecture, allowing developers to create plug-ins.
- The crawler has been written specifically for
Crawling in Nutch
user parameters
http.agent.name
E.g. :
-\.(gif|exe|zip|ico)$
To
+^http://([a-z0-9]*\.)*apache.org/
+^http://([a-z0-9]*\.)*gauhati.ac.in/
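How such rules select URLs can be illustrated with a short Python sketch. It is not Nutch's own code. It assumes, following the usual description of the regex URL filter, that rules are tried in order, that the first matching rule decides ('+' accepts, '-' rejects), and that a URL matching no rule is rejected. The function names and test URLs are hypothetical.

```python
import re

# Rules copied from the slide: '-' excludes, '+' includes.
RULES = [
    r"-\.(gif|exe|zip|ico)$",
    r"+^http://([a-z0-9]*\.)*apache.org/",
    r"+^http://([a-z0-9]*\.)*gauhati.ac.in/",
]


def compile_rules(rules):
    """Split each rule into its sign and a compiled regular expression."""
    return [(rule[0] == "+", re.compile(rule[1:])) for rule in rules]


def accepts(url, compiled):
    """First matching rule wins; URLs matching no rule are rejected (assumed)."""
    for include, pattern in compiled:
        if pattern.search(url):
            return include
    return False


compiled = compile_rules(RULES)
ok_page = accepts("http://www.gauhati.ac.in/index.html", compiled)
skipped_image = accepts("http://www.gauhati.ac.in/logo.gif", compiled)
other_site = accepts("http://example.com/", compiled)
```

Because the exclusion rule comes first, an image on an allowed domain is still rejected, which is why rule order matters in the filter file.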
Lucene
Free
Written
Apache
Lucene
highly
Architecture of Lucene:
Resources
- A simple search engine was made that is able to retrieve Assamese documents.
- Assamese stemmer and stop word list by GU.
- Crawling was set at a depth of 2 and a topN of 50.
- Lukeall 1.1 was used to verify the indexing procedure.
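The effect of the two crawl settings can be illustrated with a toy sketch. It assumes the usual reading of these parameters: depth is the number of fetch rounds starting from the seed URLs, and topN is the maximum number of pages fetched in each round. The in-memory link graph and the `crawl` function are hypothetical stand-ins, and pages are taken in discovery order rather than by score.

```python
def crawl(seeds, links, depth=2, top_n=50):
    """Fetch pages in rounds over an in-memory link graph (a stand-in for the web)."""
    fetched = []
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        # Fetch at most top_n pages this round; the rest wait for a later round.
        batch, frontier = frontier[:top_n], frontier[top_n:]
        for url in batch:
            fetched.append(url)
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    frontier.append(target)
    return fetched


# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["d", "e"],
}
pages = crawl(["seed"], LINKS, depth=2, top_n=50)
```

With depth 2 only the seed and the pages it links to directly are fetched; pages discovered in the second round (d and e here) would need a third round.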
Result
Result (contd.)
In the Indian context, the need for such a system becomes more evident: India being a multilingual country, people here are familiar with more than one language. Only 21.09% of the total population of India can speak English. Compared to urban areas,