Information Storage and Retrieval: Chapter One - Introduction

By Abdo A.

Outline
• Define Information Retrieval Systems
• Examples of IR systems
• General Goal of Information Retrieval
• Info Retrieval vs. Data Retrieval
• Why is IR so hard?
• Logical View of Documents
• Structure of an IR System (IR black box)

Statistical point of view
• According to www.factshunt.com (2013), there are:
  (1) 759 million websites on the Web in total,
  (2) 510 million live (active) websites,
  (3) 103 million websites added during 2013,
  (4) 14.3 trillion web pages live on the Internet,
  (5) 48 billion web pages indexed by Google Inc.,
  (6) 14 billion web pages indexed by Microsoft's Bing,
  (7) more than 120 billion web pages indexed by CUIL.com,
  (8) at least tens of billions of web pages indexed by each major search engine.

Information Retrieval Systems
• Information
  - What is “information”?
• Retrieval
  - What do we mean by “retrieval”?
  - What are the different types of information needs?
• Systems
  - How do computer systems fit into the human information-seeking process?

What is Information?
• What do you think?
• There is no “correct” definition
• Cookie Monster’s definition:
  - “news or facts about something”
• Oxford English Dictionary:
  - information: informing, telling; thing told, knowledge, items of knowledge, news
• Random House Dictionary:
  - information: knowledge communicated or received concerning a particular fact or circumstance; news

Intuitive Notions
• Information must:
  - Be something, although the exact nature (substance, energy, or abstract concept) is not clear;
  - Be “new”: repetition of previously received messages is not informative;
  - Be “true”: false or counterfactual information is “mis-information”;
  - Be “about” something.
Robert M. Losee. (1997). A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.

Information Hierarchy
[Figure: pyramid, from base to top: Data, Information, Knowledge, Wisdom; higher levels are more refined and abstract]

Information Hierarchy
• Data
  - The raw material of information
• Information
  - Data organized and presented in a particular manner
• Knowledge
  - “Justified true belief”
  - Information that can be acted upon
• Wisdom
  - Distilled and integrated knowledge
  - Demonstrative of high-level “understanding”

“Retrieval?”
• “Fetch something” that’s been stored
• Recover a stored state of knowledge
• Search through stored messages to find some messages relevant to the task at hand

What types of information?
• Text (documents)
• XML and structured documents
• Images
• Audio
• Video
• Source code
• Applications/Web services

Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
  - Quite effective (at some things)
  - Commercially successful (some of them)
• But what goes on behind the scenes?
  - How do they work?
  - What happens beyond the Web?

Examples of IR systems
• Conventional (library catalog):
  - Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST):
  - Search by keywords.
• Multimedia (IBM's QBIC, WebSeek, SaFe):
  - Search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus):
  - Search in (restricted) natural language
• Other:
  - Cross-language information retrieval
  - Music retrieval

Information Retrieval
• Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
• Information is organized into (a large number of) documents
• Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
• Example: Web search engines like Google claim to index trillions of pages

Motivation
• Nowadays, businesses, organizations, and individuals all around the world actively use websites on a daily basis.
  - Web pages
  - Websites
  - Search engines
• There are about 1 billion websites
  - More than 85% are not active (e.g. they have a similar function)
  - Only about 10%-15% are active
• Google has indexed trillions of web pages
• Google is currently the most visited website

General Goal of Information Retrieval
• To help users find useful information based on their information needs (with minimum effort) despite:
  - Increasing complexity of information
  - Changing needs of users
• To provide immediate random access to the document collection.
• Retrieval systems, such as Google and Yahoo, are developed with this aim.

Info Retrieval vs. Data Retrieval
The emphasis of IR is on the retrieval of information, rather than on the retrieval of data.
• Data retrieval
  - Consists mainly of determining which documents contain a set of keywords in the user query
  - Aims at retrieving all objects that satisfy well-defined semantics
  - A single erroneous object among a thousand retrieved objects implies failure
  - Mainly designed for structured databases
• Information retrieval
  - Is concerned with retrieving information about a subject or topic rather than retrieving data which satisfies a given query
  - Semantics is frequently loose: the retrieved objects might be inaccurate
  - Small errors are tolerated

Info Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database.

                      Data Retrieval                        Info Retrieval
Data organization     Structured                            Unstructured
Fields                Clear semantics (ID, Name, age)       No fields (other than text, images, etc.)
Query language        Artificial (defined, SQL)             Free text (“natural language”), Boolean
Matching              Exact (results are always “correct”)  Partial match, best match
Query specification   Complete                              Incomplete
Items wanted          Matching                              Relevant
Accuracy              100%                                  < 50%
Error response        Sensitive                             Insensitive

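The contrast between exact matching and best matching can be sketched in a few lines of Python. This is only an illustration: the records, the documents, and the function names `data_retrieval` and `best_match` are made up for this example, and counting shared words is just one simple way to score a partial match.

```python
# Hypothetical toy data; names and fields are illustrative only.
records = [
    {"id": 1, "name": "Abebe", "age": 30},
    {"id": 2, "name": "Sara", "age": 25},
]
documents = {
    "d1": "car racing results and car news",
    "d2": "tourism in Ethiopia",
    "d3": "car manufacturers in Addis",
}

def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [row for row in rows if row[field] == value]

def best_match(docs, query):
    """Partial match: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

print(data_retrieval(records, "age", 25))
print(best_match(documents, "car manufacturers"))
```

The data retrieval call returns only rows that satisfy the condition exactly, while the best-match call returns every document that shares at least one query word, ordered from most to least overlap.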
Why is IR so hard?
• Traditional information retrieval (IR) systems attempt to find documents relevant to a user’s request.
• Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
• The real problem boils down to matching the language of the query to the language of the document.
• Simply matching on words is a very brittle (inelastic) approach. One word can have several different meanings. Consider “take”:
  - “take a place at the table”
  - “take money to the bank”
  - “take a picture”

Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of Documents
The User Task:
Two user tasks: retrieval and browsing
[Figure: the USER interacts with the DB through two tasks, Retrieval and Browsing]

The User Task: Retrieval
• It is the process of retrieving information whereby the main objective is clearly defined from the onset of the searching process.
• The user of a retrieval system has to translate their information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
• English language statement:
  - I want a book by J. K. Rowling titled The Chamber of Secrets

Browsing
• It is the process of retrieving information whereby the main objective is not clearly defined from the beginning and whose purpose might change during the interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile, he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, he might turn his attention to a document providing ‘directions to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection rather than searching, since the user may simply be interested in glancing around.

Logical View of Documents
• Documents in a collection are frequently represented by a set of index terms or keywords
  - Such keywords are mostly extracted directly from the text of the document
  - These representative keywords provide a logical view of the document
[Figure: Docs → Tokenization → Stop words → Stemming → Indexing; the representation moves from full text to index terms]
• Document representation is viewed as a continuum, in which the logical view of documents might shift from full text to index terms

Logical view of documents
• If full text:
  - Each word in the text is a keyword
  - Most complex form
  - Expensive
• If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operation
• Text operations reduce the complexity of the document representation and allow moving the logical view from that of full text to a set of index terms

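The move from full text to index terms can be sketched as a small pipeline of text operations. This is a minimal sketch under stated assumptions: the stop list and suffix list are tiny made-up examples, and `crude_stem` is a deliberately naive stand-in for a real stemming algorithm, not the method used by any particular system.

```python
import re

# Illustrative stop list and suffix list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are"}
SUFFIXES = ("ing", "es", "s")

def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Full text in, sorted set of index terms out."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({crude_stem(t) for t in tokens})

print(index_terms("The user is searching the indexes of documents"))
# prints ['document', 'index', 'search', 'user']
```

Eight words of full text are reduced to four index terms, which is the reduction in complexity the slide describes.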
Structure of an IR System
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: User, Black box, Documents]
The black box is the information retrieval system.

The IR Black Box
[Figure: Query and Documents go into the black box; Hits come out]

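Inside The IR Black Box
[Figure: the Query passes through a Representation Function to give a Query Representation; the Documents pass through a Representation Function to give a Document Representation, which feeds the Index; a Comparison Function uses the Query Representation and the Index to produce Hits]
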
The Central Problem in IR
[Figure: the Information Seeker's Concepts are expressed as Query Terms; the Authors' Concepts are expressed as Document Terms]
Do these represent the same concepts?

Building the IR Black Box
• Different models of information retrieval [chapter four]
  - Boolean model
  - Vector space model
  - Language models
• Representing the meaning of documents [chapter two]
  - How do we capture the meaning of documents?
  - Is meaning just the sum of all terms?
• Indexing [chapter three]
  - How do we actually store all those words?
  - How do we access indexed terms quickly?

The User in the Loop
• Relevance Feedback [chapter six]
  - How do humans (and machines) modify queries based on retrieved results?
• User Interaction
  - Information retrieval meets computer-human interaction
  - How do we present search results to users in an effective manner?
  - What tools can systems provide to aid the user in information seeking?

• To be effective in its attempt to satisfy the information needs of users, the IR system must ‘interpret’ the contents of documents in a collection and rank them according to their degree of relevance to the user query.
• Thus the notion of relevance is at the center of IR
• The primary goal of an IR system is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible

Typical IR Task
• Given:
  - A corpus of textual natural-language documents.
  - A user query in the form of a textual string.
• Find:
  - A ranked set of documents that are relevant to the query.
[Figure: the Document corpus and the Query String go into the IR System, which outputs Ranked Documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]

Web Search System
[Figure: a Web Spider builds the Document corpus; the Document corpus and the Query String go into the IR System, which outputs Ranked Documents: 1. Page1, 2. Page2, 3. Page3, ...]

Web Search Engines
• A search engine refers to a huge database of Internet resources such as web pages, programs, images, etc.
• The components of a search engine are:
  1. Web crawler
     - Also known as a spider or bot; it gathers information.
  2. Database
     - Consists of a huge collection of web resources.
  3. Search interface
     - An interface between the user and the database; it helps the user search through the database.

Web Search Engines…
• A Web crawler is software for downloading pages from the Web.
  - It is also known as a spider or bot.
• It recursively follows hyperlinks present in known documents to find other documents
  - Starting from a seed set of documents
• Fetched documents
  - Are handed over to an indexing system
  - Can be discarded after indexing, or stored as a cached copy
• Crawling the entire Web would take a very large amount of time
  - Search engines typically cover only a part of the Web, not all of it
  - A single crawl can take months to perform

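The crawling loop described above (start from a seed set, fetch a page, follow its links to unseen pages) can be sketched without touching the network. Everything here is an assumption made for illustration: `LINKS` is a made-up link graph standing in for the Web, the `*.example` addresses are placeholders, and breadth-first order is only one possible crawl order.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit seeds, then pages reachable by following links."""
    frontier = deque(dict.fromkeys(seeds))  # links still to be crawled, no duplicates
    seen = set(frontier)                    # avoids fetching the same page twice
    fetched = []                            # order in which pages would reach the indexer
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

print(crawl(["a.example/"], lambda url: LINKS.get(url, [])))
```

The `max_pages` limit reflects the point that a crawler typically covers only part of the Web; a real crawler would replace the lambda with code that downloads the page and extracts its hyperlinks.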

Web Search Engines …
• Crawling is done by multiple processes on multiple machines, running in parallel
  - The set of links to be crawled is stored in a database
  - New links found in crawled pages are added to this set, to be crawled later
• The indexing process also runs on multiple machines
  - It creates a new copy of the index instead of modifying the old index
  - The old index is used to answer queries
  - After a crawl is “completed”, the new index becomes the “old” index
• Multiple machines are used to answer queries
  - Indices may be kept in memory
  - Queries may be routed to different machines for load balancing

Web Search Engines …
• What are the applications of a Web crawler?
  - Create an index covering broad topics (general Web search)
  - Create an index covering specific topics (vertical Web search)
  - Archive content (Web archival)
  - Analyze Web sites to extract aggregate statistics (Web characterization)
  - Keep copies of or replicate Web sites (Web mirroring)
  - Web site analysis
• What is the main problem of focused crawling?
  - Predicting the relevance of a page before downloading it
• What are the basic rules for Web crawler operation?
  - A Web crawler must identify itself as such, and must not pretend to be a regular Web user.
  - It must obey the robots exclusion protocol (robots.txt).
  - It must keep its bandwidth usage low on any given Web site.

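Checking the robots exclusion protocol before fetching a page can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleCourseBot`, and the `site.example` URLs are hypothetical; a real crawler would download robots.txt from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "ExampleCourseBot"  # hypothetical name the crawler identifies itself with

def allowed(url):
    """Return True if the rules permit this crawler to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)

print(allowed("http://site.example/index.html"))
print(allowed("http://site.example/private/data.html"))
```

Identifying the crawler by its own user-agent name, rather than a browser's, covers the first rule in the list; keeping bandwidth usage low is usually handled separately, for example by pausing between requests to the same site.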

What is Information Retrieval?
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999):
  “Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which he is interested”
• The definition incorporates all the important features of a good information retrieval system:
  - Representation
  - Storage
  - Organization
  - Access
• The focus is mainly on the user's information need

Overview of the Retrieval Process

The Retrieval Process
• It is necessary to define the text database before any of the retrieval processes are initiated
• This is usually done by the manager of the database and includes specifying the following:
  - The documents to be used
  - The operations to be performed on the text
  - The text model to be used (the text structure and what elements can be retrieved)
• The text operations transform the original documents and the information needs and generate a logical view of them

Retrieval Process …
• Once the logical view of the documents is defined, the database module builds an index of the text
  - An index is a critical data structure
  - It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular one is the inverted file
• Given that the document database is indexed, the retrieval process can be initiated

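An inverted file maps each term to the documents that contain it, which is what makes lookups fast. The sketch below is a minimal in-memory version under simple assumptions: the three documents are made up, terms are obtained by lowercasing and splitting on whitespace, and `build_inverted_index` is an illustrative name rather than part of any library.

```python
from collections import defaultdict

# Toy collection; the document ids and texts are illustrative only.
DOCS = {
    1: "information retrieval systems",
    2: "data retrieval in relational databases",
    3: "web search engines index web pages",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index(DOCS)
print(index["retrieval"])  # prints [1, 2]
print(index["web"])        # prints [3]
```

Answering a one-word query is now a single dictionary lookup instead of a scan over every document, which is why the index is built before retrieval starts.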
The Retrieval Process …
• The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
• Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated
• The query is then processed to obtain the retrieved documents
• Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance

The Retrieval Process …
• The user then examines the set of ranked documents in the search for useful information. Two choices for the user:
  - (i) reformulate the query and run it on the entire collection, or (ii) reformulate the query and run it on the result set
• At this point, s/he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
  - In such a cycle, the system uses the documents selected by the user to change the query formulation.
  - Hopefully, this modified query is a better representation of the real user need

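One way to picture the feedback cycle is a function that adds terms from the user-selected documents to the query. The slides do not say which feedback method is used, so the heuristic below (append the most frequent new terms) is an assumption chosen for simplicity, and `expand_query` and the sample texts are made up.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, extra_terms=2):
    """Add the most frequent new terms from the user-selected documents."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending frequency, then alphabetically, so results are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]

query = ["car", "racing"]
liked = ["formula racing car engine", "racing engine tuning"]
print(expand_query(query, liked, extra_terms=1))
# prints ['car', 'racing', 'engine']
```

The reformulated query can then be run again, either on the entire collection or on the previous result set.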
Detail view of the Retrieval Process
[Figure: block diagram with the labels User Interface, User need, Text, Text Operations, logical view, Query Operations, Indexing, DB Manager Module, User feedback, Query, Inverted file, Searching, Index, Retrieved docs, Text Database, Ranking, Ranked docs]

Issues that arise in IR
• Text representation [chapter two]
  - What makes a “good” representation?
  - How is a representation generated from text?
  - What are retrievable objects and how are they organized?
• Information needs representation [chapter seven]
  - What is an appropriate query language?
  - How can interactive query formulation and refinement be supported?
• Comparing representations (to identify relevant documents)
  - What weighting scheme and similarity measure should be used?
  - What is a “good” model of retrieval?
• Evaluating effectiveness of retrieval [chapter five]
  - What are good metrics/measurements?

Focus in IR System Design
Our focus during IR system design is:
• Improving the effectiveness of the system
  - Effectiveness of the system is measured in terms of precision, recall, …
  - Techniques: stemming, stop words, weighting schemes, matching algorithms
• Improving the efficiency of the system
  - The concern here is storage space usage, access time, searching time, data transfer time, …
  - Space-time tradeoffs are a concern
  - Techniques: compression, data/file structures, etc.

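Precision and recall can be computed directly from the set of retrieved documents and the set of relevant documents. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved. The document ids below are made up, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

Here precision is 2/4 and recall is 2/3: half of what was returned is useful, and one relevant document was missed.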
Subsystems of an IR system
• The two subsystems of an IR system:
  - Searching: an online process of finding relevant documents in the index list according to the user's query
  - Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing and searching are unavoidably connected
  - You cannot search what was not first indexed
  - Documents or objects are indexed in order to be searchable
  - To index, one needs an indexing language
  - There are many indexing languages
  - Even taking every word in a document is an indexing language

Indexing Subsystem
[Figure: documents → Assign document identifier → text and document IDs → Tokenize → tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Term weighting → terms with weights → Index]

Searching Subsystem
[Figure: query → parse query → query tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Query Term weighting → query terms → Similarity Measure (using index terms from the Index) → relevant document set → ranking → ranked document set]

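The weighting, similarity, and ranking stages of the searching subsystem can be sketched with TF-IDF weights and cosine similarity. The slides do not fix a weighting scheme or similarity measure, so both are assumptions here, as is the `log(N/df) + 1` form of IDF; the documents and function names are made up, and stop-word removal and stemming are left out to keep the sketch short.

```python
import math
from collections import Counter

# Toy collection; ids and texts are illustrative only.
DOCS = {
    "d1": "car racing news",
    "d2": "car manufacturers in addis",
    "d3": "tourism in ethiopia",
}

def tf_idf_vectors(docs):
    """Return per-document term weights (tf * idf) and the idf table."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, docs):
    """Weight the query terms, score every document, return ids best first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = {d: cosine(q, vec) for d, vec in vectors.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [d for d, score in ranked if score > 0]

print(search("car racing", DOCS))  # prints ['d1', 'd2']
```

Document d1 ranks first because it contains both query terms, d2 follows with one shared term, and d3 is left out because it shares none.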
Check your knowledge
1. What are the two user tasks in IR?
2. What are the two subsystems of an IR system? Describe them.
3. Define the logical view of a document and list the steps used to obtain it.
4. Define the four terms that define information retrieval.
5. Describe the steps in the overview of the retrieval process.
6. What is the IR black box?
7. What is the goal of IR?
8. Why is IR so hard?
