Chapters 3, 4, 5 and 6


Information Retrieval

Chapter 3:
Indexing Structures
Indexing structure
• What is an index structure? A data structure that speeds up
search operations on a database.
• Querying using an index is faster than searching every row in
a database.
What is Indexing?
• Some definitions:
– The art of organizing information
– An association of descriptors (keywords, concepts) to
documents in view of future retrieval
– A process of constructing document surrogates by assigning
identifiers to text items
– The process of storing data in a particular way in order to locate
and retrieve the data


Indexing
File structures

 A fundamental decision in the design of IR systems is which
type of file structure to use for the underlying document
database.
• The file structures used in IR systems are: inverted files,
signature files, PAT trees (tries), flat files, and graphs.
• This chapter focuses on inverted files and tries.


The concept of the inverted file type of index is as follows:
• Assume a set of documents. Each document is assigned a list
of keywords or attributes, with optional relevance weights
associated with each keyword (attribute).
• An inverted file is then the sorted list (or index) of keywords
(attributes), with each keyword having links to the documents
containing that keyword.
Inverted Index
"Inverted index is generated to index words in files."

Term      Position (character offset in the sentence above)
file      47
generate  19
in        44
index     10, 32
invert    1
is        16
to        29
word      38
Index and inverted index

Index:
– In your cell phone: the list of contacts and which phone numbers
(cell, home, work) are associated with those contacts.
– DNS lookup: takes a host name and returns an IP address.
– Mapping from documents to their content.

Inverted index:
– What allows you to manually enter a phone number and, when you hit
'dial', see the person's name rather than the number: your phone has
taken the phone number and found the contact associated with it.
– DNS reverse lookup: takes an IP address and returns the host name.
– Storing a mapping from content, such as words or numbers, to its
location in a database file; a file or method of file organization in
which labels indicating the locations of all documents of a given
type are placed in a single record.
Cont…

• Inverted files are created as records are added to a database;
they are important for the process by which computers search
for the terms entered in a query.
• When a query is placed to an electronic database, what the
computer searches is actually the inverted file, not the
records themselves.
Cont….
• The inverted file extracts all the words from each
field for each record entered into the database, and
sorts them into alphabetical order.
Cont……
• An inverted file is a word-oriented indexing mechanism based
on a sorted list of keywords, with each keyword having links to
the documents containing it.
• Data to be held in the inverted file includes the list of index terms
and, for each term tj and document di:
– fj,i : number of occurrences of term tj in document di
– nj : number of documents containing tj
– mi : maximum frequency of any term in di
– N : total number of documents in the collection

An inverted file contains:
1. The vocabulary (list of terms)
2. The occurrences (location and frequency of terms in the document
collection)

The vocabulary is the set of all distinct words (index terms) in the
text collection.
Construction of Inverted file

• An inverted index consists of two files: a vocabulary file and a
posting file:
– A vocabulary file (word list)
– A posting file (inverted list)


A vocabulary file (Word List)

• Stores all of the distinct terms (keywords) that appear in any
of the documents, in lexicographical order, and for each word a
pointer to the posting file.
• The record kept for each term j in the word list contains the
following:
– Term j
– Number of documents in which term j occurs
– Total frequency of term j
– Pointer to the posting list of term j
Posting file (inverted list)

• For each distinct term in the vocabulary, stores a list of
pointers to the documents that contain that term.
• Each element in the inverted list is called a posting, i.e. the
occurrence of a term in a document.
Indexer steps
 Step 1: Token sequence — arrange the tokens (words) of the
given documents.
 Step 2: Sort by terms and give each a docID;
if two entries have the same term, sort them by docID.
 Step 3: Build the dictionary and postings (document frequency,
postings lists), as in the sketch below.
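The three steps can be expressed compactly in code. Below is a minimal
sketch in Python (function and variable names are illustrative, not from
the slides):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping docID -> document text.
    Returns dict mapping term -> sorted list of (docID, frequency)."""
    # Step 1: token sequence - collect (term, docID) pairs
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    # Step 2: sort by term, then by docID
    pairs.sort()
    # Step 3: merge duplicates into dictionary + postings
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if postings and postings[-1][0] == doc_id:
            postings[-1] = (doc_id, postings[-1][1] + 1)  # same doc: bump frequency
        else:
            postings.append((doc_id, 1))
    return dict(index)

docs = {1: "I did enact Julius Caesar", 2: "So let it be with Caesar"}
index = build_inverted_index(docs)
print(index["caesar"])   # [(1, 1), (2, 1)]
```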
Organization of Index File

The vocabulary (word list) points into the postings (inverted lists),
which in turn point to the documents:

Term    No of Docs   Tot freq   Pointer to posting
act     3            3          → inverted list
bus     3            4          → inverted list
pen     1            1          → inverted list
total   2            3          → inverted list
Example:
• Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.

Doc 1: "I did enact Julius Caesar I was killed i' the Capitol;
Brutus killed me."

Doc 2: "So let it be with Caesar. The noble Brutus hath told you
Caesar was ambitious."
Sorting the Vocabulary
• After all documents have been parsed, the inverted file is sorted by terms:

Before sorting (term, doc #):    After sorting (term, doc #):
I 1                              ambitious 2
did 1                            be 2
enact 1                          brutus 1
julius 1                         brutus 2
caesar 1                         capitol 1
I 1                              caesar 1
was 1                            caesar 2
killed 1                         caesar 2
i' 1                             did 1
the 1                            enact 1
capitol 1                        hath 2
brutus 1                         I 1
killed 1                         I 1
me 1                             i' 1
so 2                             it 2
let 2                            julius 1
it 2                             killed 1
be 2                             killed 1
with 2                           let 2
caesar 2                         me 1
the 2                            noble 2
noble 2                          so 2
brutus 2                         the 1
hath 2                           the 2
told 2                           told 2
you 2                            you 2
caesar 2                         was 1
was 2                            was 2
ambitious 2                      with 2
Remove duplicate terms & add frequency

• Multiple term entries in a single document are merged and
frequency information is added.
• Counting the number of occurrences of terms in the collection
helps to compute TF.

Term        Doc #   Freq
ambitious   2       1
be          2       1
brutus      1       1
brutus      2       1
capitol     1       1
caesar      1       1
caesar      2       2
did         1       1
enact       1       1
hath        2       1
I           1       2
i'          1       1
it          2       1
julius      1       1
killed      1       2
let         2       1
me          1       1
noble       2       1
so          2       1
the         1       1
the         2       1
told        2       1
you         2       1
was         1       1
was         2       1
with        2       1
Vocabulary and postings file
The file is commonly split into a Dictionary and a Postings file:

Dictionary                           Postings
Term        N docs   Tot freq       Doc #   Freq
ambitious   1        1              2       1
be          1        1              2       1
brutus      2        2              1       1
                                    2       1
capitol     1        1              1       1
caesar      2        3              1       1
                                    2       2
did         1        1              1       1
enact       1        1              1       1
hath        1        1              2       1
I           1        2              1       2
i'          1        1              1       1
it          1        1              2       1
julius      1        1              1       1
killed      1        2              1       2
let         1        1              2       1
me          1        1              1       1
noble       1        1              2       1
so          1        1              2       1
the         2        2              1       1
                                    2       1
told        1        1              2       1
you         1        1              2       1
was         2        2              1       1
                                    2       1
with        1        1              2       1

Each dictionary entry holds a pointer to the start of its postings list.
Tries
 A trie is a tree data structure, similar in shape to a binary tree.
 A trie stores data in a particular fashion, so that retrieval of
data becomes much faster and helps performance.
 The name "trie" is coined from the word retrieval.
Searching in a Trie

• Tries are useful for testing whether a given query string q is in
the list.
– Starting with the first character of q, we traverse the trie
along the branch defined by the next character of q.
– If this branch does not exist in the trie, then q cannot be one
of the set of strings.
Trie Example:
• Example 1: search for the string GOOD.
– We start at the root and follow the G edge, followed by
the O edge, another O edge and finally the D edge.
• Example 2: search for the string BAD.
– We start from the root, follow the B edge and find out that
there is no A edge, so BAD is not in the set.
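The trie and the search just described can be sketched as follows in
Python (a minimal illustration; the node layout and names are my own,
not from the slides):

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # symbol -> TrieNode
        self.terminal = False   # marks end of a stored string (the '$' role)

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def search(root, word):
    # Follow the branch defined by each character of the query.
    node = root
    for ch in word:
        if ch not in node.children:
            return False  # branch does not exist: word is not in the set
        node = node.children[ch]
    return node.terminal

root = TrieNode()
for w in ["BIG", "BIGGER", "BILL", "GOOD", "GOSH"]:
    insert(root, w)
print(search(root, "GOOD"))  # True
print(search(root, "BAD"))   # False: after B there is no A edge
```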
Tries construction
• A trie is a data structure that stores information about the
contents of each node in the path from the root to the
leaves.
Non compact tries
• A non-compact trie is one in which every edge of the
underlying tree represents a single symbol of the alphabet.
• Example: Assume that the symbols in our alphabet are the
capital letters A..Z plus the terminal symbol $. Construct the
trie for the following 5 strings: BIG, BIGGER, BILL, GOOD,
GOSH.
 In the figure the strings start with either B or G,
so the root of the trie is connected to the B and G edges;
a $ edge marks where each stored string ends.
Trie
Example 2:
cat, can’t, hey, hello, dog
Non compact tries

Drawback:
• This structure is rather wasteful of memory,
because each edge represents a single symbol.
For huge texts this design is an enormous waste of space.
Compact tries
• A compact trie trims (merges) unary nodes which lead to
leaves, compressing chains of single-child nodes into one edge.
• Example: {bear, bell, bid, bull, buy, sell, stock, stop}, compressed.
Why we use tries?
• To do a fast search in a large text. For example,
searching for an item in a dictionary which contains
several gigabytes of text.
– E.g. the Oxford English Dictionary.
• To support fast pattern matching.
– An example is an application where users type a
query and the system quickly comes up with a list of
words starting with what the user typed in.
Exercise 1
Draw the inverted index that would be built for the following
document collection.

Doc 1: new home sales top forecasts
Doc 2: home sales rise in July
Doc 3: increase in home sales in July
Doc 4: July new home sales rise
Exercise 2
• Construct the trie (non-compact and compact) for the
words (pot$, potato$, tattoo$, temp$).
Signature files

• The signature file approach works as follows:
 The documents are stored sequentially in the "text file".
Their "signatures" (hash-coded bit patterns) are stored in the
"signature file".
 When a query arrives, the signature file is scanned and many
non-qualifying documents are discarded.
 The rest are either checked (so that the "false drops" are
discarded) or they are returned to the user as they are.
Signature file

• A "signature" is created as an abstraction of a
document.
• All the signatures that represent the documents
in the collection are kept in a file called the
"signature file".
Cont…
Signature files have been used in the following environments:
1. PC-based, medium-size databases
2. WORMs (write once, read many)
3. Parallel machines
4. Distributed text databases (all documents stored on several servers,
but the database may be managed together or independently).
Example: the web can be managed independently.
End
Chapter 4

IR Models
Introduction to IR Models
At the end of this chapter every student must be able to:

 Define what a model is
 Describe why models are needed in information retrieval
 Differentiate different types of information retrieval models:
 Boolean model
 Vector space model
 Probabilistic model
 Know how to calculate and find the similarity of some
documents to a given query
 Identify term frequency, document frequency, inverse
document frequency, term weight and similarity
measurements
What is a model?
A model is an ideal abstraction of something
which is working in the real world.
There are 2 good reasons for having models of IR:
1. Models guide research and provide the means
for academic discussion.
2. Models can serve as a blueprint to implement an
actual retrieval system.
IR Models
• In IR, mathematical models are used to understand and reason
about some behavior or phenomena in the real world.
• A model of information retrieval predicts and explains what
a user will find relevant given the user query.
Retrieval model
• Thus, retrieval models are models that describe
the computational processes of retrieval
– e.g., how documents are ranked
– e.g., how similarities are measured
• Some models attempt to describe the
human process
– e.g., the information need, interaction
– Few do so meaningfully
• A retrieval model specifies the details of
– Document representation
– Query representation
– Retrieval function (matching function)
– Ranking
Retrieval Models
• A number of IR models have been proposed over the years to
retrieve information.
• The following are the major models developed to retrieve
information:
– Boolean model
• Exact match model
– Statistical models
• Vector space and probabilistic models are the major
statistical models
• Are "best match" models
– Linguistic and knowledge-based models
• Are "best match" models
What is the difference b/n best match and exact match?
Types of models
• The three classic information retrieval models are:

– Boolean retrieval models

– Vector space models

– Probabilistic models
1. Boolean model

 A document either matches a query, or does not.
 The Boolean retrieval model is a model for information
retrieval in which we can pose (create) any query which
is in the form of a Boolean expression of terms, that is,
in which terms are combined with the operators
AND, OR, and NOT.
…..cont
 The first model of information retrieval
 The most criticized model
 Based on Boolean algebra, developed by George Boole
• There are 3 basic operators:
AND
OR
NOT
Example
……cont

• Boolean relevance prediction (R)
– A document is predicted as relevant to a query iff it satisfies the query
expression.
– Each query term specifies a set of documents containing it:
• AND (∧): the intersection of two sets
• OR (∨): the union of two sets
• NOT (¬): set inverse, or really set difference
– A query, thus, searches a set of documents to determine their content.
– The search engine retrieves those documents satisfying the logical
constraints of the query.
….cont

• Most queries search for more than one term.
– Information need: find all documents containing "information"
and "retrieval".
Answer: only documents containing both "information" and "retrieval"
satisfy this query.
– Information need: find all documents containing "information"
or "retrieval".
Answer: satisfied by a document that contains either of the two words
or both.
– Information need: find all documents containing "information"
or "retrieval", but not both.
Boolean model

Consider a set of five docs and assume that they contain the terms
shown in the table:

Doc.   Terms
D1     algorithm, information, retrieval
D2     retrieval, science
D3     algorithm, information, science
D4     pattern, retrieval, science
D5     science, algorithm

Find the documents retrieved by the following expressions (see the
sketch below):
a. information AND retrieval
{d1,d3} ∩ {d1,d2,d4} = {d1}
b. information OR retrieval
{d1,d3} ∪ {d1,d2,d4} = {d1,d2,d3,d4}
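Set semantics make this easy to verify in code. A minimal sketch in
Python (the collection dictionary and names are just for illustration):

```python
# Term -> set of documents containing it (the inverted sets from the table)
posting_sets = {
    "algorithm":   {"d1", "d3", "d5"},
    "information": {"d1", "d3"},
    "retrieval":   {"d1", "d2", "d4"},
    "science":     {"d2", "d3", "d4", "d5"},
    "pattern":     {"d4"},
}

# AND is set intersection, OR is set union
print(posting_sets["information"] & posting_sets["retrieval"])  # {'d1'}
print(posting_sets["information"] | posting_sets["retrieval"])  # {'d1','d2','d3','d4'}

# NOT is set difference against the whole collection
all_docs = {"d1", "d2", "d3", "d4", "d5"}
print(posting_sets["science"] - posting_sets["retrieval"])  # science AND NOT retrieval -> {'d3','d5'}
```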
Advantages and disadvantages of Boolean
model
• Advantages of the Boolean model
 A very simple model based on sets (easy for
experts)
 Computationally efficient
 Expressiveness and clarity
 Still a dominant model in commercial
database systems
Disadvantages of Boolean model
• Disadvantages
 Needs to train users
 Very limited in expressing the user's information need in detail
 No weighting of index or query terms
 Based on exact matching: there may be a relevant
document that is only partially matched
Vector Space Model
 Suggested by Hans Peter Luhn and Gerard Salton
 A classic model of document retrieval based on
representing documents and queries as vectors
 Partial matching is possible
 Retrieval is based on the similarity between the query
vector and the document vectors
 The output documents are ranked according to this
similarity
….cont

• The similarity is based on the occurrences of the
keywords in the query and document.
• The angle between query and document is measured
using cosine similarity, since both
the document and the user's query are represented as vectors
in VSM.
…cont

• VSM assumes that if document vector V1 is closer to
the query than another document vector V2, then
the document represented by V1 is more
relevant than the one represented by V2.
In VSM, to decide the similarity of a document to a
given query, term weighting (tf*idf) is used. To
calculate tf*idf, first we have to calculate the following:
1. Term frequency (tf)
• Term frequency is the number of times a given
term appears in a document, normalized:

tf = (frequency of the term in the document) / (maximum term
frequency in that document)
2. Inverse document frequency (idf)

• idf is used to measure whether a term is
common or rare across all documents:

idf = log(N/df), where
N = total number of documents
df = document frequency (number of documents
containing the given term)

The base of the logarithm (e.g. log2 or log10) only rescales the idf
values; it does not change the ranking.
3. Term weighting (tf*idf)
Term weighting is the assignment of numerical values to
terms that represent their importance in a document in
order to improve retrieval effectiveness.

• Term weight = tf * idf, i.e.

w(i,j) = tf(i,j) * log(N/df(i))
4. Document length
Calculate the length of the document.
Document length normalization adjusts the term frequency or the
relevance score in order to normalize the effect of document
length on the document ranking.

Document length = the square root of the sum of the squared term
weights of the document.
5. Similarity
• At the end we need to calculate the similarity of the documents to
the query.
• The most widely used measure of similarity in the vector space
model is the cosine similarity.
• The cosine similarity between a document vector dj and a
query vector q is given by:

cos(dj, q) = (dj · q) / (|dj| * |q|)

• The denominator of the formula is the length of the document
times the length of the query.
Examples

• Example 1: If the following three documents are given with
one query, then which document must be ranked first?

D1: New york times
D2: New York post
D3: Los Angeles times

Query: new new times

Solution
• Step 1: calculate the inverse document frequency (idf) for each term.
• Step 2: calculate the term frequency (tf).
• Step 3: calculate the term weight (tw): tw = tf*idf.
• Step 4: calculate the document length and the length of the query.
• Step 5: calculate the similarity of each document
to the query (see the sketch below).

Therefore, since the similarity of D1 > D2 > D3, the documents must be
ranked as: D1, D2, D3.
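These five steps can be checked with a short script. Below is a minimal
sketch in Python, assuming the conventions above (tf normalized by the
maximum in-document frequency, idf = log10(N/df), cosine similarity);
names are illustrative:

```python
import math
from collections import Counter

docs = {"D1": "new york times", "D2": "new york post", "D3": "los angeles times"}
query = "new new times"

N = len(docs)
doc_tokens = {d: text.split() for d, text in docs.items()}

# idf = log10(N / df) for every term in the collection
df = Counter()
for tokens in doc_tokens.values():
    df.update(set(tokens))
idf = {t: math.log10(N / df[t]) for t in df}

def weights(tokens):
    """tf*idf vector; tf is frequency / max frequency in this text."""
    tf = Counter(tokens)
    max_f = max(tf.values())
    return {t: (f / max_f) * idf.get(t, 0.0) for t, f in tf.items()}

q_vec = weights(query.split())
q_len = math.sqrt(sum(w * w for w in q_vec.values()))

for d, tokens in doc_tokens.items():
    d_vec = weights(tokens)
    d_len = math.sqrt(sum(w * w for w in d_vec.values()))
    dot = sum(w * q_vec.get(t, 0.0) for t, w in d_vec.items())
    sim = dot / (d_len * q_len) if d_len and q_len else 0.0
    print(d, round(sim, 3))   # D1 scores highest, then D2, then D3
```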
Example 2
 Which document must be ranked first for the
following?

Doc1: Breakthrough drug for schizophrenia
Doc2: New schizophrenia drug
Doc3: New approach for treatment of schizophrenia
Doc4: New hopes for schizophrenia patients

Query: treatment of schizophrenia patients
Example 3
From the following documents, which one
must be ranked first?

D1: The health observances for march
D2: The health oriented calendar
D3: The awareness news for march awareness

Q: March health awareness
Advantages and Disadvantages of VSM
Latent semantic indexing
• Latent Semantic Indexing (LSI) is an extension of the
vector space retrieval method (Deerwester et al., 1990).
• LSI can retrieve relevant documents even when they do
not share any words with the query:
• if only a synonym of the keyword is present in a
document, the document will still be found relevant.
LSI/LSA

• Latent Semantic Indexing (LSI) (Deerwester et al.) tries to
overcome the problems of lexical matching by using
statistically derived conceptual indices instead of individual
words for retrieval.
• LSI is a technique that projects queries and
documents into a space with "latent" semantic dimensions.
• In the latent semantic space, a query and a document can
have high cosine similarity even if they do not share any
terms.
LSI

Classic IR might lead to poor retrieval due to:
• unrelated documents might be included in the answer set
• relevant documents that do not contain at least one index term
are not retrieved
• Reasoning: retrieval based on index terms is vague and noisy
• The user's information need is more related to concepts and ideas
than to index terms
• A document that shares concepts with another document known
to be relevant might be of interest
Probabilistic model
• The probabilistic model captures the IR problem using a
probabilistic framework
• Given a user query, there is an ideal answer set for this query
• Given a description of this ideal answer set, we could retrieve
the relevant documents
• Querying is seen as a specification of the properties of this
ideal answer set
Cont…
• Given a user query q and a document dj, the probabilistic model
tries to estimate the probability that the user will find the
document dj interesting (i.e., relevant).
– Estimate the probability that a given document is relevant to
the given query.
Why probabilities in IR?

[Diagram: the user's information need passes through query
understanding into a query representation, which is uncertain;
documents pass through document representation, giving an uncertain
guess of whether a document has relevant content. The matching step
must connect these two uncertain representations.]

In traditional IR systems, matching between each document and
query is attempted in a semantically imprecise space of index terms.
Probabilities provide a principled foundation for uncertain reasoning.
Can we use probabilities to quantify our uncertainties?
Probabilistic Model
Why introduce probabilities and probability theory in IR?
 As a process, retrieval is inherently uncertain:
 Understanding of the user's information need is uncertain:
are we sure the user mapped his need into a good query?
 Even if the query represents the need well, did we represent it well?
 Estimating document relevance for the query:
 Uncertainty from the selection of document representation
 Uncertainty from matching query and documents
Probability theory is a common framework for modeling
uncertainty.
An IR system is uncertain primarily about:
1. Understanding of the query
2. Whether a document satisfies the query

Probability theory
 Provides a principled foundation for reasoning under uncertainty
 Probabilistic information retrieval models estimate how likely it is
that a document is relevant for a query
Probabilistic IR models
 Classic probabilistic models (BIM, BM11, BM25)
 Bayesian networks for text retrieval
• Probabilistic IR models are among the oldest, but also among the
best-performing and most widely used IR models.
Probability ranking principle

• Ranked retrieval setup: given a collection of documents, the
user issues a query, and an ordered list of documents is
returned.
• Assume a binary notion of relevance: Rd,q is a random
dichotomous (yes/no) variable, such that
– Rd,q = 1 if document d is relevant w.r.t. query q
– Rd,q = 0 otherwise
• Probabilistic ranking orders documents decreasingly by their
estimated probability of relevance w.r.t. the query: P(R = 1|d, q)
Bayesian based probabilistic model
• Let x be a document in the collection.
• Let R represent relevance of a document with respect to a given
query, and let NR be non-relevance.
• We need to find P(R|x), the probability that document x is relevant.
By Bayes' theorem, P(R|x) = P(x|R) * P(R) / P(x), where:
P(R|x) = probability that document x is relevant
P(x|R) = probability that if a relevant document is retrieved, it is x
P(R) = probability of a relevant document in the collection
P(x) = probability that x is in the collection
Example 1
• Assume that the following is given for you:
P(R) = 0.6
P(x) = 0.5
P(x|R) = 0.7
Then what is P(R|x)?
• P(R|x) = P(x|R) * P(R) / P(x) = (0.7 × 0.6) / 0.5 = 0.84
Example 2
Assume that document 'y' is in the collection.
The probability that, if a non-relevant document is retrieved, it is 'y'
is P(y|NR) = 0.2.
The probability of non-relevant documents in the collection is P(NR) = 0.6.
The probability of 'y' in the collection is P(y) = 0.4.
a) What is the probability that y is a non-relevant document?
b) Is the document relevant or non-relevant?
Solution

a) P(NR|y) = P(y|NR) * P(NR) / P(y)
P(NR|y) = 0.2 × 0.6 / 0.4 = 0.3

b) P(R|y) + P(NR|y) = 1
P(R|y) = 1 − P(NR|y)
P(R|y) = 1 − 0.3
P(R|y) = 0.7, hence the document is relevant.
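These two examples reduce to one line of arithmetic each; a tiny Python
check (variable names are mine):

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|x) = P(x|H) * P(H) / P(x)."""
    return likelihood * prior / evidence

# Example 1: P(R|x)
print(posterior(0.7, 0.6, 0.5))        # 0.84

# Example 2: P(NR|y), then P(R|y) = 1 - P(NR|y)
p_nr_y = posterior(0.2, 0.6, 0.4)      # 0.3
print(1 - p_nr_y)                      # 0.7 -> relevant
```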


Binary independence model
• As the name implies, this model assumes that the index terms
exist independently in the documents, and we can then assign
binary (0,1) values to these index terms.
• For a further illustration of this model, consider a document
Dk in a collection, represented by a binary vector
t = (t1, t2, t3, …, tu), where u is the total number of terms in the
collection, ti = 1 indicates the presence of the i-th index term and
ti = 0 indicates its absence.
Binary independence model
• A decision rule can be formulated by which any document
can be assigned to either the relevant or non-relevant set of
documents for a particular query.
• The obvious rule is to assign a document to the relevant set if the
probability of the document being relevant given the
document representation is greater than the probability of it
being non-relevant, that is, if:

P(relevant|t) > P(non-relevant|t)

using Bayes' theorem.
BIM
• Binary: documents and queries are represented as vectors of
binary values.
• Independence: terms are independent of each other.
• Some of the assumptions of the BIM can be removed. For
example, we can remove the assumption that terms are
independent.
• A case that violates this assumption is term pairs like: Hong and
Kong, New York, Los Angeles, Addis Ababa, Arba Minch, Abba
Gada, Haadha Siinqee, which are strongly dependent on each other.

                        No of relevant   No of non-relevant   Total
                        documents        documents

No of documents         r                n-r                  n
containing term tk

No of documents         R-r              N-n-R+r              N-n
not containing term tk

Total                   R                N-R                  N

(R = total number of relevant documents; N = total number of
documents in the collection)
Definition
N = total number of documents in the collection
n = total number of documents containing the term tk
R = total number of relevant documents retrieved
r = total number of relevant documents retrieved containing the term tk

Based on this:

the odds of term tk appearing in a relevant document are given by
r / (R - r)

the odds of term tk appearing in an irrelevant document are given by
(n - r) / (N - n - R + r)

• If the odds of term tk appearing in relevant documents equal
those for non-relevant documents, then

r / (R - r) = (n - r) / (N - n - R + r)

Now we can calculate the relevance function as:

Relevance function (W) = [r (N - n - R + r)] / [(R - r)(n - r)]

But when we have no relevant document containing the term tk (r = 0),
this becomes zero, so we add 0.5 to each factor to avoid zero results.
Example
N = 20: total no of documents in the collection
n = 10: total no of documents containing the term tk
R = 15: total no of relevant documents retrieved
r = 5: total no of relevant documents retrieved containing term tk

Then calculate the relevance function of the term tk.

Solution
• Since the result can become zero when we have no relevant
document containing the term, we add 0.5 to each factor of

Relevance function (W) = [r (N - n - R + r)] / [(R - r)(n - r)]:

RF(W) = [(r + 0.5)(N - n - R + r + 0.5)] / [(R - r + 0.5)(n - r + 0.5)]

RF(W) = [(5 + 0.5)(20 - 10 - 15 + 5 + 0.5)] / [(15 - 5 + 0.5)(10 - 5 + 0.5)]
      = (5.5 × 0.5) / (10.5 × 5.5)
      ≈ 0.048 (the relevance weight of term tk)
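A quick numeric check of this smoothed relevance function in Python
(with these numbers the value works out to about 0.048):

```python
def relevance_weight(N, n, R, r, k=0.5):
    """Smoothed odds ratio for term tk: each factor gets +k to avoid zeros."""
    return ((r + k) * (N - n - R + r + k)) / ((R - r + k) * (n - r + k))

print(round(relevance_weight(N=20, n=10, R=15, r=5), 3))  # 0.048
```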
Information Retrieval

Chapter 5:
Retrieval Evaluation
IR Evaluation

• It is known that measuring or evaluating the performance
and accuracy of the system is very important after an IR
system is designed.
• According to (Singhal, 2001), there are two main things
to measure in an IR system: the effectiveness of the
system and its efficiency.
…cont

 To measure ad hoc information retrieval effectiveness in the
standard way, we need a test collection consisting of three
things:
1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment
of either relevant or non-relevant for each query-document
pair. Relevance judgments indicate which documents are
relevant to the information need of a user.
….cont

The standard approach to information retrieval system
evaluation revolves around the notion of relevant and
non-relevant documents.
With respect to a user information need, a document in
the test collection is given a binary classification as
either relevant or non-relevant.
This decision is referred to as the gold standard or
ground truth judgment of relevance.
Mind Break
A document is relevant if it addresses the stated
information need, not because it just happens to
contain all the words in the query.

How ?
Why System Evaluation?
 It provides the ability to measure the difference between IR systems:
 How well do our search engines work?
 Is system A better than B?
 Under what conditions?
 Evaluation drives what to research:
 Identify techniques that work and do not work
 There are many retrieval models/algorithms/systems;
which one is the best?
Types of Evaluation Strategies

•System-centered studies
– Given documents, queries, and relevance judgments
• Try several variations of the system
• Measure which system returns the “best” hit list

•User-centered studies
– Given several users, and at least two retrieval systems
• Have each user try the same task on both systems
• Measure which system works "best" for the user's information need
Evaluation Criteria
What are the main measures for evaluating an IR system's performance?

• Efficiency: time, space
 Speed in terms of retrieval time and indexing time
 Speed of query processing
 Index size: index/corpus size ratio

• Effectiveness
 How capable is a system of retrieving relevant documents from the collection?
 Is a system better than another one?
 User satisfaction: how "good" are the documents that are returned as a
response to a user query?
 "Relevance" of results to meet the information need of users

Performance measures (Recall, Precision, etc.)

• The two most frequent and basic measures of
information retrieval effectiveness are:
1. Precision and
2. Recall
Precision

Precision (P) is the fraction of retrieved documents that
are relevant:
– The ability to retrieve top-ranked
documents that are mostly relevant.
– Precision is the percentage of retrieved
documents that are relevant to the query.

Precision = (relevant documents retrieved) / (total documents retrieved)
Recall
Recall (R) is the fraction of relevant documents that are
retrieved:
– The ability of the search to find all of the
relevant items in the corpus.
– Recall is the percentage of relevant documents
retrieved from the database in response to a user's query.

Recall = (relevant documents retrieved) / (total relevant documents in the collection)
Accuracy

These notions can be made clear by examining the
following contingency table:

                 Relevant               Non-relevant
Retrieved        true positive (tp)     false positive (fp)
Not retrieved    false negative (fn)    true negative (tn)

Then:
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)

In terms of the contingency table above,
Accuracy = (tp + tn) / (tp + fp + fn + tn).
Examples
An IR system returns 8 relevant documents and 10
non-relevant documents. There are a total of 20
relevant documents in the collection. What is the
precision of the system on this search, and what is its
recall? (A quick check follows below.)
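A minimal check of this example in Python, using the contingency-table
formulas above:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp = 8            # relevant documents returned
fp = 10           # non-relevant documents returned
fn = 20 - tp      # relevant documents missed (20 relevant exist in total)

print(round(precision(tp, fp), 3))  # 0.444
print(round(recall(tp, fn), 3))     # 0.4
```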
Example
• A database contains 80 records on a particular topic. A search was
conducted on that topic and 60 records were retrieved; of the 60
records retrieved, 45 were relevant. The total number of relevant
records in the collection is 70. Based on this:

a) Calculate the ratio of relevant retrieved documents to the total
retrieved documents (precision).
b) Calculate the ratio of relevant retrieved documents to the total
relevant documents which exist in the collection (recall).
Example 2
• Assume there are a total of 14 relevant documents, compute
precision and recall?
R-Precision

Precision at the R-th position in the ranking of results
for a query, where R is the total number of relevant
documents:
– Calculate precision after R documents are seen.
– Can be averaged over all queries.

Example
What is the R-precision for the above example?
• R = # of relevant docs = 6
Answer:
R-Precision = 4/6 = 0.67
Example 2
• Let the total number of relevant documents = 6; compute recall and
precision at each cut-off point n (see the sketch below):

n    doc #   relevant   Recall   Precision
1    588     x          0.167    1
2    589     x          0.333    1
3    576
4    590     x          0.5      0.75
5    986
6    592     x          0.667    0.667
7    984
8    988
9    578
10   985
11   103
12   591
13   772     x          0.833    0.38
14   990
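The table can be reproduced by walking down the ranked list. A short
sketch in Python (the relevance flags come from the table above):

```python
# Ranked list as (doc_id, is_relevant) in rank order, from the table
ranked = [(588, True), (589, True), (576, False), (590, True), (986, False),
          (592, True), (984, False), (988, False), (578, False), (985, False),
          (103, False), (591, False), (772, True), (990, False)]
total_relevant = 6

hits = 0
for n, (doc, rel) in enumerate(ranked, start=1):
    if rel:
        hits += 1
        # Recall and precision at this cut-off point
        print(n, doc, round(hits / total_relevant, 3), round(hits / n, 3))
```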
Problems with both precision and recall
 Number of irrelevant documents in the collection is not
taken into account.
 Recall is undefined when there is no relevant document
in the collection.
 Precision is undefined when no document is retrieved.
Other measures
 Noise = retrieved irrelevant docs / retrieved docs
 Silence/Miss = non-retrieved relevant docs / relevant
docs

Noise = 1 – Precision; Silence = 1 – Recall


F-measure

• A single measure that trades off precision versus
recall is the F-measure, the weighted
harmonic mean of precision and recall.
• One measure of performance that takes into account
both recall and precision; the balanced harmonic mean of recall
and precision is:

F = 2PR / (P + R)
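For the precision/recall example above (P ≈ 0.444, R = 0.4), the
balanced F-measure can be computed as:

```python
def f_measure(p, r):
    """Balanced harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(round(f_measure(8 / 18, 8 / 20), 3))  # 0.421
```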
Difficulties in Evaluating IR Systems

 IR systems essentially facilitate communication between a
user and document collections.
 Relevance is a measure of the effectiveness of
communication:
– Effectiveness is related to the relevancy of retrieved items.
– Relevance relates to the problem, information need,
query and a document or surrogate.
……..cont

 Relevancy is not typically binary but continuous.
– Even if relevancy is binary, it is a difficult judgment to make.
 Relevance judgments are made by:
– The user who posed the retrieval problem
– An external judge
– Are the relevance judgments made by users and by an external
judge the same?
 Relevance judgment is usually:
……….cont

– Subjective: Depends upon a specific user’s judgment.


– Situational: Relates to user’s current needs.
– Cognitive: Depends on human perception and
behavior.
– Dynamic: Changes over time.
END OF

CHAPTER
5
Information Retrieval

Chapter 6:
Query Languages and Operations
Definitions

• A query is a request for data or information from a database table
or combination of tables.
• Query language (QL) refers to any computer programming
language that requests and retrieves data from database and
information systems by sending queries.
• It works on user-entered, structured, formal
command-based queries to find and extract data from host databases.
Keyword-based queries

 Queries are combinations of words.
 The document collection is searched for documents that contain
these words.
 The concept of a word must be defined:
a word is a sequence of letters terminated by a separator (period,
comma, space, etc).
Types of keyword based query
• Single-word queries
• Phrase queries
• Multiple word queries
• Proximity queries
• Boolean Queries
Single-word queries
• A query is a single word.
– Usually used for searching in document images.
• Simplest form of query.
• All documents that include this word are retrieved.
• Documents may be ranked by the frequency of this word in the
document.
Phrase queries
• A query is a sequence of words treated as a single unit.
– Also called a "literal string" or "exact phrase" query.
• The phrase is usually surrounded by quotation marks.
• All documents that include this phrase are retrieved.
• Usually, separators (commas, colons, etc.) and common words (e.g.,
"a", "the", "of", "for"…) in the phrase are ignored.
• In effect, this query is for a set of words that must appear in sequence.
– Allows users to specify a context and thus gain precision.
• Example: "Information Processing for Document Retrieval".
Multiple-word queries
• A query is a set of words (or phrases).
• Two options: a document is retrieved if it includes
– any of the query words, or
– each of the query words.
• Documents are ranked by the number of query words they contain:
– A document containing n query words is ranked higher than a
document containing m < n query words.
– Documents are ranked in decreasing order:
• those containing all the query words are ranked at the top, those
containing only one query word at the bottom.
– Frequency counts may be used to break ties among documents that
contain the same query words.
– Example: what is the result for the query "Red Bird"?
Proximity queries
• Restrict the distance within a document between two search terms.
• Important for large documents in which the two search words may appear in
different contexts.
• Proximity specifications limit the acceptable occurrences and hence increase the
precision of the search.
• General format: Word1 within m units of Word2.
– The unit may be a character, word, paragraph, etc.
• Examples:
– Information within 5 words of Retrieval:
finds documents that discuss "Information Processing for Document Retrieval"
but not "Information Processing and Searching for Relevant Document Retrieval".
– Nuclear within 0 paragraphs of science:
finds documents that discuss "Nuclear" and "science" in the same paragraph.
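A word-level proximity test is easy to sketch in Python (a simple
illustration; real engines use positional indexes rather than a scan):

```python
def within_words(text, w1, w2, m):
    """True if some occurrence of w1 is within m words of some occurrence of w2."""
    tokens = [t.strip('.,;:"').lower() for t in text.split()]
    pos1 = [i for i, t in enumerate(tokens) if t == w1.lower()]
    pos2 = [i for i, t in enumerate(tokens) if t == w2.lower()]
    return any(abs(i - j) <= m for i in pos1 for j in pos2)

doc = "Information Processing for Document Retrieval"
print(within_words(doc, "Information", "Retrieval", 5))   # True: 4 words apart

doc2 = "Information Processing and Searching for Relevant Document Retrieval"
print(within_words(doc2, "Information", "Retrieval", 5))  # False: 7 words apart
```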
Boolean queries
• Based on concepts from logic: AND, OR, NOT.
– A Boolean query describes the information needed by relating multiple
words with Boolean operators.
• Operators: AND, OR, NOT.
• Semantics: for each query word w, a corresponding set Dw is constructed that
includes the documents that contain w.
• The Boolean expression is then interpreted as an expression on the
corresponding document sets with corresponding set operators:
– AND: finds only documents containing all of the specified words or phrases.
– OR: finds documents containing at least one of the specified words or phrases.
– NOT: excludes documents containing the specified word or phrase.
Examples: Boolean queries
1. computer OR server
– Finds documents containing either computer, server, or both.
2. (computer OR server) NOT mainframe
– Select all documents that discuss computers or servers, and do not select
any documents that discuss mainframes.
3. computer NOT (server OR mainframe)
– Select all documents that discuss computers, and do not discuss either
servers or mainframes.
4. computer OR server NOT mainframe
– Select all documents that discuss computers, or documents that discuss
servers but do not discuss mainframes.
Pattern matching
Relevance Feedback &
Query Expansion

Relevance feedback

• After initial retrieval results are presented, allow the user to
provide feedback on the relevance of one or more of the
retrieved documents.
• Use this feedback information to reformulate the query.
• Produce new results based on the reformulated query.
• Allows a more interactive, multi-pass process.
Relevance Feedback Architecture

[Diagram: a query string is issued against the document corpus; the IR
system returns ranked documents (Doc1, Doc2, Doc3, …); the user marks
them relevant (+) or not (−); query reformulation produces a revised
query, which the IR system uses to return re-ranked documents (e.g.,
Doc2, Doc4, Doc5, …).]
Query expansion

• Revise the query to account for feedback:
– Query expansion: add new terms to the query from relevant
documents.
– Term reweighting: increase the weight of terms in relevant
documents and decrease the weight of terms in irrelevant
documents (see the sketch below).
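One standard way to implement both ideas at once is the Rocchio
formula, which moves the query vector toward relevant documents and
away from irrelevant ones. A minimal sketch in Python (the alpha,
beta, and gamma weights are conventional defaults, not from the slides):

```python
def rocchio(query_vec, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query reformulation over term -> weight dictionaries."""
    new_q = {t: alpha * w for t, w in query_vec.items()}
    for doc in relevant:          # pull the query toward relevant docs
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(relevant)
    for doc in irrelevant:        # push it away from irrelevant docs
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w / len(irrelevant)
    return {t: w for t, w in new_q.items() if w > 0}  # drop negative weights

q = {"schizophrenia": 1.0}
rel = [{"schizophrenia": 0.8, "drug": 0.6}]
irr = [{"calendar": 0.9}]
print(rocchio(q, rel, irr))  # query expanded with 'drug'; 'calendar' suppressed
```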
Research areas
1. Text annotation techniques
2. Cross-lingual IR
3. Web search engines for local languages
4. Intelligent IR (content-based understanding)
5. Application of NLP techniques for IR
6. Document classification
7. Document summarization
8. Multimedia retrieval
9. Image retrieval (content-based, document, etc.)
END OF

THE COURSE

GOODBYE
