
Chapter 1

Introduction to Information Retrieval System



What is IR?

What is the IR problem?

How is an IR system organized? (The main processes in IR)

Indexing

Retrieval

System evaluation
What is Information? The Nature, Growth and Characteristics of Information

There is no “correct” definition

Can involve philosophy, psychology, signal processing,
physics

Cookie Monster’s definition:
 “news or facts about something”

Oxford English Dictionary
 information: informing, telling; thing told, knowledge, items of
knowledge, news
 knowledge: knowing, familiarity gained by experience;
person’s range of information; a theoretical or practical
understanding of; the sum of what is known
Assignment 1

What is information, according to your
background or area of expertise?
Types of Information

Differentiation by form.

Differentiation by content.

Differentiation by quality.

Differentiation by associated
information.
Information Properties

Information can be communicated electronically
 Broadcasting
 Networking

Information can be easily duplicated and shared
 Problems of Ownership
 Problems of Control
Information Hierarchy

Wisdom

Knowledge

Information

Data
Information Hierarchy

Data
 The raw material of information

Information
 Data organized and presented by someone

Knowledge
 Information read, heard or seen and understood

Wisdom
 Distilled and integrated knowledge and
understanding
What kinds of information are there?

Text
 books, periodicals, WWW, memos, ads
 published/refereed

Film

Photos, other Images

Broadcast TV, Radio

Telephone Conversations

Databases
What is Information Retrieval?


Finding relevant information in large collections of data


In such a collection you may want to find:

“Give me information on the history of the Kennedys”
 An article about the Kennedys (text retrieval)

“What does a brain tumor look like on a CT scan?”
 A picture of a brain tumor (image retrieval)

“It goes like this: hmm hmm hahmmm ...”
 A certain song (music retrieval)
What to Retrieve?


“Retrieve that amount of knowledge which a
user needs in a specific situation for solving
his/her current problem” (Kuhlen 1991)
Motivation

IR: Deals with representation, storage,
organization of, and access to information
items

The focus is on the user's information need

Unfortunately, characterizing the user's information need is
not a simple problem. For example:
 Find all docs containing information on college tennis
teams which (1) are maintained by a USA university
and (2) participate in the NCAA tournament.
Motivation

This information need cannot be used directly to
request information from a Web search
engine.

The user must first translate the information need into a
query.

The translation yields a set of keywords (index
terms) that summarizes the user's information
need.
The Goal of IR

Goal = find documents relevant to an information need from
a large document set
[Figure: the user's info. need is expressed as a query; the IR retrieval system matches the query against the document collection and returns an answer list]
The Goal of IR

In fact the primary goal of an IR system is
to retrieve all the documents which are
relevant to a user query while retrieving as
few non-relevant documents as possible.
Main problems in IR

Document and query indexing
 How to best represent their contents?

Query evaluation (or retrieval process)
 To what extent does a document correspond to a
query?

System evaluation
 How good is a system?
 Are the retrieved documents relevant? (precision)
 Are all the relevant documents retrieved? (recall)
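
The two evaluation questions above correspond to the usual precision and recall measures: precision = |relevant and retrieved| / |retrieved|, and recall = |relevant and retrieved| / |relevant|. A minimal sketch follows; the helper name `precision_recall` and the document IDs are illustrative assumptions, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs returned by the system
    relevant:  document IDs judged relevant (the ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(
    ["d1", "d2", "d3", "d7"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(p, r)  # 0.75 0.5
```

Retrieving more documents tends to raise recall and lower precision, which is why the two are reported together.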
Information Retrieval versus Data Retrieval

Data retrieval
 which docs contain a set of keywords?
 well-defined structure (semantics)

Information retrieval
 information about a subject or topic
 semantics is frequently loose
 small errors are tolerated

IR system:
 interpret contents of information items
 generate a ranking which reflects relevance
 the concept of relevance is most important
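
The contrast between the two can be sketched in a few lines: data retrieval answers yes or no for each document, while an IR system orders documents by an estimate of relevance. The toy collection, the query, and the keyword-overlap score below are assumptions chosen for illustration; real systems use far richer relevance models.

```python
# Hypothetical three-document collection and keyword query.
docs = {
    "d1": "college tennis teams in the NCAA tournament",
    "d2": "tennis rackets on sale",
    "d3": "university football schedule",
}
query = ["college", "tennis", "ncaa"]


def exact_match(docs, query):
    """Data-retrieval style: keep only documents containing every keyword."""
    return sorted(
        doc_id for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in query)
    )


def ranked(docs, query):
    """IR style: order documents by how many query keywords they contain."""
    scores = {
        doc_id: sum(term in text.lower().split() for term in query)
        for doc_id, text in docs.items()
    }
    matching = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(matching, key=lambda doc_id: (-scores[doc_id], doc_id))


print(exact_match(docs, query))  # ['d1']
print(ranked(docs, query))       # ['d1', 'd2']
```

The ranked list still surfaces d2, a partial match, which is the sense in which small errors are tolerated.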
Data vs. Information Retrieval
Data Retrieval
 Precise description
 Well-structured data
 Precise results
 Yes-or-no results
 SQL defined

Information Retrieval
 Vague information need
 Natural language, images, unstructured ...
 Semantic interpretation
 Approximate results
 Relevance ranking
 Keywords & features
Basic Concepts

The User Task
[Figure: the user interacts with the database through either retrieval or browsing]

 Retrieval
 of information or data (in both cases we say that
the user searches for useful information or
data by executing a retrieval task)
 purposeful
 Purposeful
Basic Concepts
 Browsing
 a task whose objectives are not clearly defined at the
beginning and whose purpose might change during
the interaction with the system
 glancing around
 e.g. F1; cars, Le Mans, France, tourism
Basic Concepts

User task (search)
 Users can formulate what they need: Retrieval (classical)
 Users cannot (or do not know how to): Browsing (new to IR)
 The two are still not very well integrated


Logical view of docs (ordered from slow but good to fast but bad)
 Full text plus added linguistic info (not clear if it helps)
 Full text
 Text operations: reduce complexity to index terms
 Keywords, stopwords
 Stemming, noun groups (linguistic processing needed)
 Categories
Basic Concepts


Logical view of the documents: the documents are represented
by a set of index terms or keywords.

Full text logical view: the documents are represented by their
full set of words (feasible on modern computers).

 Text operations: reduce complexity to index terms


 Keywords, stop words
 Stemming, noun groups (linguistic processing needed)
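
As a rough sketch of these text operations, the snippet below tokenizes a text, drops stopwords, and applies a deliberately crude suffix stripper. The stopword list and the `crude_stem` rules are toy assumptions for illustration; a real system would typically use a curated stopword list and an established stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "which", "by"}


def crude_stem(word):
    """Toy stemmer: strip a trailing 'ing' or plural 's' from longer words."""
    if len(word) > 5 and word.endswith("ing"):
        return word[:-3]
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word


def index_terms(text):
    """Reduce a text to index terms: tokenize, remove stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]


print(index_terms("College tennis teams participating in the NCAA tournament"))
# ['college', 'tennis', 'team', 'participat', 'ncaa', 'tournament']
```

Stems such as "participat" need not be real words; they only need to map related word forms to the same index term.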
Basic Concepts

[Figure: logical view of a document. Docs -> accents, spacing -> stopwords -> noun groups -> stemming -> manual indexing; the representation moves from structure and full text to index terms]


Retrieval process

Database
 The documents to be used
 The operations to be performed on the text (logical view of the text)

Index (e.g., inverted file)

User query
 Query operations (users are not good at this!)

Retrieved docs
 Ranked by relevance

Feedback cycle
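
The indexing and ranking steps of this process can be sketched with a small inverted file. Everything below (the function names, the whitespace tokenizer, the three toy documents, and the count-of-matching-terms score) is an assumption made for illustration, not the method of any particular system; the feedback cycle is not modeled.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Simplest possible text operation: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of distinct query terms they contain."""
    scores = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    "d1": "history of information retrieval",
    "d2": "image retrieval for medical scans",
    "d3": "music and film archives",
}
index = build_inverted_index(docs)
print(search(index, "information retrieval history"))  # ['d1', 'd2']
```

The index is built once from the collection, so each query only touches the posting sets of its own terms rather than scanning every document.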
Information Access Methods


Ad-hoc retrieval
 One time queries (e.g. Web search)

Filtering/Routing
 Constant search profile (e.g. Spam filtering)
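
In filtering the roles are reversed compared with ad-hoc retrieval: the profile stays fixed and new documents stream past it. A minimal sketch follows, assuming a hypothetical keyword profile and a simple overlap threshold; real filters (spam filters, for example) usually rely on trained classifiers.

```python
def matches_profile(profile_terms, document, threshold=2):
    """Accept a document if it shares at least `threshold` terms with the profile."""
    words = set(document.lower().split())
    return len(profile_terms & words) >= threshold


# Hypothetical standing profile and incoming document stream.
profile = {"tennis", "ncaa", "college"}
stream = [
    "NCAA college tennis results announced",
    "stock markets fall sharply",
    "college admissions deadline",
]

routed = [doc for doc in stream if matches_profile(profile, doc)]
print(routed)  # ['NCAA college tennis results announced']
```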
The Information Retrieval Cycle
[Figure: source selection (resource) -> query formulation (query) -> search (ranked list) -> selection (documents) -> examination (documents) -> delivery. Feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection]
Supporting the Search Process
[Figure: the same cycle (source selection -> query formulation -> search -> selection -> examination -> delivery), with system-side support added: acquisition builds the collection, and indexing builds the index used by search]
History of IR


1950: Calvin N. Mooers coins the term “Information Retrieval”

1959: Luhn describes statistical retrieval

1960: Maron and Kuhns define a probabilistic model of IR

1966: Cranfield project defines evaluation measures

1968: Gerard Salton's first book about the SMART retrieval system

1972: Lockheed introduces DIALOG as a commercial online service

Late 1980s: First PC systems incorporate retrieval
History of IR

Early 1990s: Cheap disks lead to the information storage revolution

1992: Westlaw is the first large-scale information service using probabilistic retrieval

Mid 1990s: Multimedia databases

1994: The Internet and Web explosion

1995: IR techniques are incorporated in all kinds of information management applications
