
Chapter One

An Introduction to Information Retrieval
Goal
• Information retrieval (IR) is a research field that aims to search for
information in text and multimedia documents effectively and efficiently.

Introduction
• Information Retrieval (IR): representation, storage, organization of, and
access to information items
– Focus is on the user information need

• Query: a set of keywords which summarizes the description of the user
information need
• The key goal of an IR system is to retrieve information which might be
useful or relevant to the user
– The emphasis is on the retrieval of information, not data

Information vs. Data Retrieval
• Data retrieval: which documents contain the
keywords in the user query?
– Clearly defined conditions
• such as a regular expression or a relational algebra expression
– A single erroneous object means total failure
– Well-defined structure and semantics
• Information retrieval: information about a subject or topic
– Not well structured and possibly semantically ambiguous
– Small errors are allowed
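
The contrast can be sketched in Python. The sample documents, the query, and the overlap-count score below are illustrative assumptions and do not come from the slides. Data retrieval applies a clearly defined condition (here a regular expression) and returns exactly the objects that satisfy it; information retrieval scores every document and still returns partial matches.

```python
import re

# Hypothetical toy collection, invented for illustration.
DOCS = {
    "d1": "tennis rackets and tennis balls for sale",
    "d2": "history of the Wimbledon tennis tournament",
    "d3": "how to restring a racket",
}

def data_retrieval(pattern, docs):
    """Return exactly the documents that satisfy a well-defined condition."""
    regex = re.compile(pattern)
    return sorted(doc_id for doc_id, text in docs.items() if regex.search(text))

def information_retrieval(query, docs):
    """Score each document by the number of query words it shares.

    Partial matches still count, so small errors are tolerated.
    """
    query_words = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        overlap = query_words & set(text.lower().split())
        if overlap:
            scores[doc_id] = len(overlap)
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

exact = data_retrieval(r"\btennis\b", DOCS)
ranked = information_retrieval("tennis racket tournament", DOCS)
print(exact)
print(ranked)
```

Here the regular expression either matches a document or it does not, whereas the scored search returns all three documents in order of estimated usefulness.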

• An IR system must interpret the contents of the information items and rank
them according to a degree of relevance to the user query
– Extracting syntactic and semantic information from the document text
– Using this information to match the user information need
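
One common way to turn "degree of relevance" into a number is cosine similarity between term-frequency vectors. The slides do not name a specific ranking model, so the following is only a minimal sketch under that assumption; the documents and the function names are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Extract a crude representation of a text: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (doc_id, score) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, round(score, 3)) for score, doc_id in ordered]

# Hypothetical documents, invented for illustration.
docs = {
    "a": "information retrieval systems rank documents",
    "b": "database systems store records",
    "c": "retrieval of information from text documents",
}
ranking = rank("information retrieval", docs)
print(ranking)
```

Document "b" shares no term with the query and therefore receives a score of zero, while "a" and "c" are ordered by how much of their content the query terms account for.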

Motivation
• IR at the center of the stage
– IR in the last 20 years
• classification and categorization
• systems and languages
• user interfaces and visualization
– Still, the area was seen as being of narrow interest
– Advent of the Web changed this perception
• universal repository of knowledge
• free (low cost) universal access
• no central editorial board
• many problems though: IR seen as key to finding the solutions!

Basic Concepts
• User task
– Retrieval vs. browsing
– Pull vs. push
• Logical view of documents
– Index terms or keywords
– Full text
• Text operations: elimination of stopwords, use of stemming, identification
of noun groups (more on this later)
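
Two of the text operations listed above, stopword elimination and stemming, can be sketched as follows. The stopword list and the suffix-stripping rule are deliberately naive assumptions made for illustration; the rule is not a real stemming algorithm such as Porter's, and noun-group identification is not covered.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}

# Suffixes tried in order; a crude stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Reduce a full text to index terms: tokenize, drop stopwords, stem."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOPWORDS]

terms = index_terms("The indexing of documents is used for searching.")
print(terms)
```

The sentence shrinks from eight words to four index terms, which illustrates how text operations move the logical view of a document from full text towards a smaller set of keywords.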

• Logical view of the documents
[Figure: docs pass through a sequence of text operations (accents and
spacing, stopwords, noun groups, stemming, manual indexing); the resulting
logical views are structure, full text, and index terms.]
• Document representation viewed as a continuum: logical view of docs might
shift

Past, Present, and Future
• Early developments
– Table of contents
– Index: a collection of selected words or concepts
– Categorization hierarchies
– Computer-centered vs. human-centered views

IR in the Library
• 1st generation
– Searching author name and title
• 2nd generation
– By subject headings
– By keywords
• 3rd generation
– Improved graphical interfaces
– Electronic forms
– Hypertext features
– Open system architectures

The Web and Digital Libraries
• The Web as a highly interactive medium
– Low cost
– Greater access
– Publishing freedom
• Despite high interactivity, it is still difficult for users to retrieve
information relevant to their information needs
– Retrieval of higher quality
– Quick response
– Better understanding of user behavior
Practical Issues
• Security
• Privacy
• Copyright
• Patent rights
• Others: scanning, optical character recognition (OCR), cross-language
retrieval, …

The Retrieval Process
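[Figure: the retrieval process.
• The user states a user need through the User Interface
• Text Operations turn the user need and the texts into logical views
• Query Operations produce the query and take user feedback into account
• Indexing, through the DB Manager Module, builds the Index (an inverted file)
over the Text Database
• Searching uses the query and the Index to return retrieved docs
• Ranking orders the retrieved docs into ranked docs for the user
Numeric labels in the original figure: User Interface 4, 10; Text Operations
6, 7; Query Operations 5; Indexing 8; Searching 8; Ranking 2.]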
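
The main stages of the retrieval process (text operations, indexing into an inverted file, searching, ranking) can be sketched end to end. This is a minimal sketch under stated assumptions: the three sample texts, the function names, and the summed term-frequency score are hypothetical choices made for illustration, and the user feedback loop is not modeled.

```python
from collections import defaultdict

def text_operations(text):
    """Produce a simple logical view of a text: lowercase word tokens."""
    tokens = (t.strip(".,;:!?").lower() for t in text.split())
    return [t for t in tokens if t]

def build_inverted_file(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Searching: find docs containing any query term, summing frequencies."""
    scores = defaultdict(int)
    for term in text_operations(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return scores

def rank(scores):
    """Ranking: highest score first, ties broken by document id."""
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical text database, invented for illustration.
text_database = {
    1: "Web search engines index the Web.",
    2: "Libraries index books by subject.",
    3: "Digital libraries and the Web.",
}
inverted_file = build_inverted_file(text_database)
ranked_docs = rank(search(inverted_file, "web index"))
print(ranked_docs)
```

The inverted file is built once over the text database, so each query only touches the postings of its own terms rather than scanning every document.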