Professional Documents
Culture Documents
Information Storage and Retrieval: Chapter One - Introduction
Information Storage and Retrieval: Chapter One - Introduction
Information Storage and Retrieval: Chapter One - Introduction
Retrieval
1
Outline
Examples of IR systems
Why is IR so hard?
Retrieval
What do we mean by “retrieval”?
Systems
How do computer systems fit into the human
4
What is Information?
What do you think?
There is no “correct” definition
Cookie Monster’s definition:
“news or facts about something”
“mis-information”
Be “about” something
Wisdom
Knowledge
Information
Data
7
Information Hierarchy
Data
The raw material of information
Information
Data organized and presented in a particular manner
Knowledge
“Justified true belief”
Wisdom
Distilled and integrated knowledge
8
“Retrieval?”
“Fetch something” that’s been stored
Recover a stored state of knowledge
Search through stored messages to find some
messages relevant to the task at hand
9
What types of information?
Text (Documents )
XML and structured documents
Images
Audio
Video
Source code
Applications/Web services
10
Information Retrieval Systems?
Document (Web page) retrieval in response to a query
Quite effective (at some things)
11
Examples of IR systems
Conventional (library catalog):
Search by keyword, title, author, etc.
Other:
Cross language information retrieval,
Music retrieval 12
Information Retrieval
Information retrieval (IR) is the process of finding
material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from
within large collections (usually stored on computers).
Information is organized into (a large number of) documents
Large collections of documents from various sources:
news articles, research papers, books, digital
libraries, Web pages, etc.
Example: Web Search Engines like Google claim to
index Trillions of pages
13
motivation
Nowadays the businesses, organizations,
and individuals all around the world
actively use websites on a daily basis.
Webpages
Website
Search engine
About 1 billion of websites
More than 85% are not active (e.g. have a
similar function)
About 10%-15% only active
Google indexed trillion of webpages
Google is currently the most visited
website 14
General Goal of Information Retrieval
To help users find useful information based on their
information needs (with a minimum effort) despite:
Increasing complexity of Information
15
16
Info Retrieval vs. Data Retrieval
Emphasis of IR is on the retrieval of information, rather than
on the retrieval of data
Data retrieval
Consists mainly of determining which documents contain a set of
keywords in the user query
Aims at retrieving all objects that satisfy well defined semantics
a single erroneous object among a thousand retrieved objects implies
failure
Mainly designed for structured databases
Information retrieval
Is concerned with retrieving information about a subject or topic than
retrieving data which satisfies a given query
semantics is frequently loose: the retrieved objects might be inaccurate
small errors are tolerated
17
Info Retrieval vs. Data Retrieval
Example of data retrieval system is a relational database
Data Retrieval Info Retrieval
Data organization Structured Unstructured
Fields Clear Semantics No fields (other
(ID, Name, age,) than text and images etc)
Query Language Artificial (defined, Free text (“natural
SQL) language”), Boolean
Matching Exact (results are Partial match, best match
always “correct”)
Query Complete Incomplete
specification
Items wanted Matching Relevant
Accuracy 100% < 50%
Error response Sensitive Insensitive 18
Why is IR so hard?
Traditionnel Information retrieval (IR) System attempt to find
relevant documents to respond to a user’s request.
Information retrieval problem: locating relevant
documents based on user input, such as keywords or example
documents
The real problem boils down to matching the language of
the query to the language of the document.
Simply matching on words is a very brittle (no elasticity)
approach. One word can have different semantic
meanings. Consider: Take
“take a place at the table”
“take a picture”
19
Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of documents
The User Task:
two user task – retrieval and browsing
Retrieval
DB
Browsing
USER
20
The User Task: Retrieval
It is the process of retrieving information whereby the
main objective is clearly defined from the onset of
searching process.
The user of a retrieval system has to translate his
information need into a query in the language provided by
the system.
In this context (i.e. by specifying a set of words), the user
searches for useful information executing a retrieval task
English Language Statement :
I want a book by J. K Rowling titled The Chamber of
Secrets
21
Browsing
It is the process of retrieving information, whereby the
main objective is not clearly defined from the beginning and
whose purpose might change during the interaction with the
system.
E.g. User might search for documents about ‘car racing’ .
Meanwhile he might find interesting documents about ‘car
manufacturers’. While reading about car manufacturers in
Addis, he might turn his attention to a document providing
‘direction to Addis’, and from this to documents which
cover ‘Tourism in Ethiopia’.
In this context, user is said to be browsing in the collection
and not searching, since a user may has an interest glancing
around 22
Logical View of Documents
Documents in a collection are frequently represented by a set
of index terms or keywords
Such keywords are mostly extracted directly from the text of the
document
These representative keywords provide a logical view of the
document
Docs Tokenization stop words stemming Indexing
Expensive
24
Structure of an IR System
An Information Retrieval System serves as a bridge between the
world of authors and the world of readers/users,
User Documents
Black box
Hits
26
Inside The IR Black Box
Query Documents
Representation Representation
Function Function
Comparison
Index
Function
Hits
27
The Central Problem in IR
Information Seeker Authors
Concepts Concepts
Languages models
29
The User in the Loop
Relevance Feedback [chapter six]
How do humans (and machines) modify queries based on
retrieved results?
User Interaction
Information retrieval meets computer-human interaction
manner?
What tools can systems provide to aid the user in
information seeking?
30
Structure of an IR System
To be effective in its attempt to satisfy information need of
users, the IR system must ‘interpret’ the contents of
documents in a collection and rank them according to their
degree of relevance to the user query.
Thus the notion of relevance is at the center of IR
The primary goal of an IR system is to retrieve all the
documents which are relevant to a user query while
retrieving as few non-relevant documents as possible
31
Structure of an IR System
Typical IR Task
Document
Given: corpus
A corpus of textual
natural-language
documents. Query IR
String System
A user query in the
form of a textual 1. Doc1
string. 2. Doc2
Ranked 3. Doc3
Find: Documents .
.
A ranked set of
documents that are
relevant to the query. 32
Web Search System
Web Spider Document
corpus
Query IR
String
System
1. Page1
2. Page2
3. Page3
Ranked
. Documents
.
33
Web Search Engines
Search Engine refers to a huge database of internet resources
such as web pages, programs, images etc.
Search Engine Components are:
1. Web Crawler
It is also known as spider or bots. [to gather
information.]
2. Database
consists of huge web resources.
3. Search Interfaces
component is an interface between user and the
database. It helps the user to search through the database.
34
Web Search Engines…
A Web Crawler is a software for downloading pages from the
Web.
It is also known as spider or bots.
Fetched documents
Crawling the entire Web would take a very large amount of time
Search engines typically cover only a part of the Web, not all of it
later
Indexing process also runs on multiple machines
Creates a new copy of index instead of modifying old index
Storage
Organization
Access
39
The Retrieval Process
It is necessary to define the text database before any of
the retrieval processes are initiated
This is usually done by the manager of the database and
includes specifying the following
The documents to be used
41
The Retrieval Process …
The user first specifies a user need which is then parsed
and transformed by the same text operation applied to the
text
Next the query operations is applied before the actual query,
which provides a system representation for the user need, is
generated
The query is then processed to obtain the retrieved
documents
Before the retrieved documents are sent to the user, the
Text Operations
logical view Logical view
Searching Index
46
Subsystems of an IR system
The two subsystems of an IR system:
Searching: is an online process of finding relevant documents
in the index list as per users query
Indexing: is an offline process of organizing documents using
keywords extracted from the collection
Indexing and searching: are unavoidably connected
you cannot search what was not first indexed
text document
Tokenize
IDs
tokens
Stop list
non-stoplist Stemming & Normalize
tokens
stemmed Term weighting
terms
terms with
weights Index
48
Searching Subsystem
query parse query
query tokens
ranked non-stoplist
document Stop list
tokens
set
ranking
Stemming & Normalize
relevant stemmed terms
document set
Similarity Query Term weighting
Measure terms
Index terms
Index
49
Check your knowledge
1. What are the two user task in IR ?
2. What are the two sub-system in IR describe them
3. Define the logical view of the document the steps in the
logical view of the documents
4. Define the four term that define an information retrieval?
5. Define the steps in the overview of retrieval process
6. What is the IR black box?
7. What is the goal of IR
8. Why IR is so hard?
50