
By Abdo Ababor

July, 2021
• Introduction
• Improving results
• Options for improving results…
• Global methods
• Local methods
Introduction
• It is difficult to formulate queries which are well
designed for retrieval purposes.
• Relevance feedback: user feedback on relevance of
docs in initial set of results.
– User issues a (short, simple) query
– The user marks some results as relevant or non-relevant.
– The system computes a better representation of the
information need based on feedback.
– Relevance feedback can go through one or more iterations.
• Idea: it may be difficult to formulate a good query when
you don’t know the collection well, so iterate.
Introduction
• With no detailed knowledge of the collection and the retrieval environment, it is
difficult to formulate queries that are well designed for retrieval.
• Effective retrieval usually needs several query formulations:
– The first formulation is often a naïve attempt to retrieve relevant information.
– The documents initially retrieved are examined for relevance information (by the
user, or automatically).
– The query formulation is then improved to retrieve additional relevant documents.
• Query reformulation consists of:
– Expanding the original query with new terms.
– Reweighting the terms in the expanded query.

Approaches for Query Operations
1. Approaches based on feedback from users
• User relevance feedback
2. Approaches based on information derived from the set of initially retrieved
documents
• Local analysis (pseudo-relevance feedback based on the local set of
documents); this is dynamic
3. Approaches based on global information derived from the document collection
• Global analysis (thesaurus)

Query Reformulation
• Two basic techniques to revise the query to account for feedback:
• Query Expansion: add new terms to the query, taken from relevant
documents.
• Term Reweighting: modify term weights based on user relevance
judgements; increase the weight of terms found in relevant documents and
decrease the weight of terms found in irrelevant documents.

• Several algorithms have been developed for query reformulation (a sketch of
one classic algorithm follows below).
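The slides do not name a specific algorithm, but the best-known is Rocchio's method, which performs expansion and reweighting in one step. Below is a minimal sketch, assuming tf-idf-style weight vectors over a shared vocabulary; the α, β, γ values are conventional defaults, not values prescribed by these slides.

```python
# Minimal Rocchio-style reformulation sketch (assumes numpy and
# tf-idf-style weight vectors over a shared vocabulary).
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant documents and
    away from the centroid of non-relevant documents."""
    q_new = alpha * query_vec
    if len(relevant) > 0:
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if len(nonrelevant) > 0:
        q_new = q_new - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0.0)  # negative term weights are usually clipped

# Toy example over a 4-term vocabulary:
q = np.array([1.0, 0.0, 0.0, 0.0])
rel = np.array([[0.9, 0.8, 0.0, 0.0],
                [0.7, 0.6, 0.1, 0.0]])     # judged relevant
nonrel = np.array([[0.0, 0.0, 0.9, 0.8]])  # judged non-relevant
print(rocchio(q, rel, nonrel))  # the second term gains weight; the last two clip to 0
```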
Relevance Feedback Architecture

[Figure: the relevance feedback loop. A query string is run against the document
corpus, producing ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …). The user marks
each result relevant (+) or non-relevant (−); this feedback drives query
reformulation, and the revised query yields re-ranked documents
(1. Doc2, 2. Doc4, 3. Doc5, …).]
Relevance Feedback
• The architecture shown above works as follows:
• After initial retrieval results are presented, allow the user to
provide feedback on the relevance of one or more of the
retrieved documents.

• Use this feedback information to reformulate the query.

• Produce new results based on reformulated query.

• This allows a more interactive, multi-pass process.

Relevance Feedback: Assumptions
• A1: The user has sufficient knowledge to formulate the initial query.
– Problem: the user often does not have sufficient initial knowledge.

• A2: Relevance prototypes are “well-behaved”.
– Term distribution in relevant documents will be similar.
– Term distribution in non-relevant documents will be different from that in
relevant documents.
– Either all relevant documents are tightly clustered around a single prototype,
or there are different prototypes with significant vocabulary overlap.
– Similarities between relevant and irrelevant documents are small.
– Problem: the relevant documents are often scattered instances of a general
concept rather than a single tight cluster.


User Relevance Feedback
• Most popular query reformulation strategy
• Cycle:
– The user is presented with a list of retrieved documents
– The user marks those which are relevant
(in practice, the top 10-20 ranked documents are examined; the process can
be incremental)
– Important terms are selected from the documents assessed relevant by the user
– The importance of these terms is enhanced in a new query
• Expected result: the new query moves towards the relevant documents and away
from the non-relevant documents
User Relevance Feedback

• Discussion: would you expect such a feature to increase the query volume
at a search engine?
Relevance feedback without user input
• Relevance feedback with user input:
– Clustering hypothesis: known relevant documents contain terms which can be
used to describe a larger cluster of relevant documents
– The cluster description is built interactively with user assistance

• Relevance feedback without user input:
– The cluster description is obtained automatically
– Identify terms related to the query terms, e.g. synonyms, stemming
variations, terms close to the query terms in the text
Thesaurus
• A thesaurus provides information on synonyms and
semantically related words and phrases.
• Example:
physician
syn: croaker, doc, doctor, MD, medical, mediciner, medico, sawbones
rel: medic, general practitioner, surgeon

Thesaurus-based Query Expansion
• For each term, t, in a query, expand the query with
synonyms and related words of t from the thesaurus.
• May weight added terms less than original query terms.
• Generally increases recall.
• May significantly decrease precision, particularly with
ambiguous terms.
• “interest rate” → “interest rate fascinate evaluate”

WordNet
• A more detailed database of semantic relationships between English
words.
• Developed by famous cognitive psychologist George Miller and a
team at Princeton University.
• About 144,000 English words.
• Nouns, adjectives, verbs, and adverbs grouped into about 109,000
synonym sets called synsets.

WordNet Synset Relationships
• Antonym: front ↔ back
• Attribute: benevolence → good (noun to adjective)
• Pertainym: alphabetical → alphabet (adjective to noun)
• Similar: unquestioning → absolute
• Cause: kill → die
• Entailment: breathe → inhale
• Holonym: chapter → text (part to whole)
• Meronym: computer → cpu (whole to part)
• Hyponym: plant → tree (specialization)
• Hypernym: apple → fruit (generalization)

WordNet Query Expansion
• Add synonyms from the same synset.
• Add hyponyms to add specialized terms.
• Add hypernyms to generalize the query.
• Add other related terms to expand the query (a sketch follows below).

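As an illustration, a minimal sketch of synset-based expansion using NLTK's WordNet interface; this assumes the nltk package is installed and the WordNet data has been fetched with nltk.download('wordnet'). The function name and the cap on added terms are arbitrary choices.

```python
# Sketch of WordNet-based query expansion via NLTK
# (requires: pip install nltk, then nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def expand_term(term, max_new=5):
    """Collect synonyms (same synset) and hypernym lemmas for `term`."""
    candidates = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():          # synonyms
            candidates.add(lemma.replace('_', ' '))
        for hyper in synset.hypernyms():            # generalizations
            candidates.update(l.replace('_', ' ') for l in hyper.lemma_names())
    candidates.discard(term)
    return sorted(candidates)[:max_new]

print(expand_term('physician'))  # e.g. synonyms like 'doctor', 'doc', 'MD', ...
```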
Statistical Thesaurus
• Existing human-developed thesauri are not easily available
in all languages.
• Human-built thesauri are limited in the type and range of synonymy and
semantic relations they represent.
• Semantically related terms can be discovered from
statistical analysis of corpora.

Pseudo Feedback
• Just assume the top m retrieved documents are relevant,
and use them to reformulate the query.
• Allows for query expansion that includes terms that are
correlated with the query terms.
• Two strategies:
– Local strategies
– Global strategies
• A sketch of the basic pseudo-feedback loop follows below.

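A minimal sketch of pseudo (blind) relevance feedback as just described: treat the top-m results as relevant and add their most frequent non-query terms to the query. The `search` and `tokenize` callables are hypothetical stand-ins for whatever retrieval engine and tokenizer are in use.

```python
# Pseudo-relevance feedback sketch: assume the top-m documents are
# relevant and add their k most frequent non-query terms to the query.
# `search` and `tokenize` are hypothetical stand-ins for a real engine.
from collections import Counter

def pseudo_feedback(query_terms, search, tokenize, m=10, k=5):
    top_docs = search(query_terms)[:m]             # initial retrieval
    counts = Counter()
    for doc in top_docs:                           # assumed relevant
        counts.update(t for t in tokenize(doc) if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(k)]
    return list(query_terms) + expansion           # reformulated query
```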
Local analysis (pseudo-relevance feedback)
• Examine the documents retrieved for the query to determine query
expansion terms
• No user assistance
• Use relevance feedback methods without explicit user input.
• Clustering techniques

Clusters
• Synonymy association: terms that frequently co-occur inside the local set of
documents $D_l$:

$$c_{i,j} = \sum_{d \in D_l} tf(t_i, d) \times tf(t_j, d) \qquad
m_{i,j} = \frac{c_{i,j}}{c_{i,i} + c_{j,j} - c_{i,j}}$$

• $c_{i,j}$ is the term-term (e.g., stem-stem) association matrix; $m_{i,j}$ is
its normalised form
• For term $t_i$: take the $n$ largest values $m_{i,j}$; the resulting terms
$t_j$ form the cluster for $t_i$
• For a query $q$: find clusters for the $|q|$ query terms, keep the clusters
small, and expand the original query (a sketch follows below)

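A minimal sketch of the cluster computation above, assuming `tf` is a |D_l| × |V| term-frequency matrix built from the locally retrieved documents:

```python
# Local synonymy-association clusters: c and m follow the formulas above.
import numpy as np

def association_clusters(tf, n=3):
    c = tf.T @ tf                        # c[i,j] = sum_d tf(t_i,d) * tf(t_j,d)
    diag = np.diag(c)
    m = c / (diag[:, None] + diag[None, :] - c)   # normalised association
    np.fill_diagonal(m, 0.0)             # a term is not its own cluster mate
    return np.argsort(-m, axis=1)[:, :n]  # per term: its n strongest associates

# Toy local set: 3 documents over a 4-term vocabulary
tf = np.array([[2., 1., 0., 1.],
               [1., 2., 1., 0.],
               [0., 1., 2., 2.]])
print(association_clusters(tf, n=2))     # cluster (term indices) for each term
```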
Global analysis
• Expand query using information from whole set of
documents in collection.
• A thesaurus-like structure is built using all documents, which requires:
– an approach to automatically build the thesaurus (e.g. a similarity
thesaurus based on co-occurrence frequency)
– an approach to select terms for query expansion

Automatic Global Analysis
• Determine term similarity through a pre-computed
statistical analysis of the complete corpus.
• Compute association matrices which quantify term
correlations in terms of how frequently they co-occur.
• Expand queries with statistically most similar terms.

Global analysis:
Thesaurus-based query expansion
• For each term, t, in a query, expand the query with synonyms
and related words of t from the thesaurus
• feline → feline cat
• May weight added terms less than original query terms.
• Generally increases recall
• Widely used in many science/engineering fields
• May significantly decrease precision, particularly with
ambiguous terms.
• “interest rate” → “interest rate fascinate evaluate”
• Manually producing a thesaurus is costly, and so is keeping it up to date
as a scientific field changes

Automatic Global analysis
(e.g. Automatic Thesaurus Generation)
• Attempt to generate a thesaurus automatically by analyzing the
collection of documents
• Fundamental notion: similarity between two words
• Definition 1: Two words are similar if they co-occur with
similar words.
• Definition 2: Two words are similar if they occur in a given
grammatical relation with the same words.
• You can harvest, peel, eat, prepare, etc. apples and pears, so
apples and pears must be similar.
• Co-occurrence-based similarity is more robust, while similarity based on
grammatical relations is more accurate.

Why?
Normalized Association Matrix
• Frequency based correlation factor favors more frequent
terms.
• Normalize the association scores:

$$s_{ij} = \frac{c_{ij}}{c_{ii} + c_{jj} - c_{ij}}$$

• The normalized score is 1 if two terms have the same frequency in all
documents (a short check follows below).

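The last claim follows directly from the definitions: if two terms have identical frequencies in every document, then

$$tf(t_i,d) = tf(t_j,d)\ \text{for all}\ d \;\Rightarrow\; c_{ij} = \sum_d tf(t_i,d)\,tf(t_j,d) = c_{ii} = c_{jj},$$

so

$$s_{ij} = \frac{c_{ij}}{c_{ii}+c_{jj}-c_{ij}} = \frac{c_{ij}}{2c_{ij}-c_{ij}} = 1.$$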
Query Expansion with Correlation Matrix
• For each term $i$ in the query, expand the query with the $n$ terms $j$ that
have the highest value of $c_{ij}$ (or $s_{ij}$).
• This adds semantically related terms in the “neighborhood”
of the query terms.

Problems with Global Analysis
• Term ambiguity may introduce irrelevant statistically correlated
terms.
• “Apple computer” → “Apple red fruit computer”
• Since terms are highly correlated anyway, expansion may not
retrieve many additional documents.

Automatic Local Analysis
• At query time, dynamically determine similar terms
based on analysis of top-ranked retrieved documents.
• Base correlation analysis on only the “local” set of retrieved
documents for a specific query.
• Avoids ambiguity by determining similar (correlated)
terms only within relevant documents.
• “Apple computer” → “Apple computer Powerbook laptop”

Global vs. Local Analysis
• Global analysis requires intensive term correlation
computation only once at system development time.
• Local analysis requires intensive term correlation
computation for every query at run time (although
number of terms and documents is less than in global
analysis).
• But local analysis gives better results.

Global Analysis Refinements
• Only expand query with terms that are similar to all terms
in the query.
$$sim(k_i, Q) = \sum_{k_j \in Q} c_{ij}$$

• “fruit” is not added to “Apple computer” since it is far from “computer”.
• “fruit” is added to “apple pie” since “fruit” is close to both “apple”
and “pie”.
• Use more sophisticated term weights (instead of raw frequency) when
computing term correlations.
• A sketch of this selection step follows below.

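A minimal sketch of this refinement, assuming a precomputed normalised term-term correlation matrix `c` as defined on the earlier slides; the function name is illustrative.

```python
# Expand only with terms similar to the query as a whole:
# sim(k_i, Q) = sum over k_j in Q of c[i, j].
import numpy as np

def expand_with_query_similarity(c, query_idx, n=2):
    sim = c[:, query_idx].sum(axis=1)  # summed correlation with all query terms
    sim[query_idx] = -np.inf           # never re-add the query's own terms
    return np.argsort(-sim)[:n]        # indices of the top-n expansion terms

c = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.4],
              [0.1, 0.4, 1.0]])
print(expand_with_query_similarity(c, [0, 1], n=1))  # -> [2]
```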
Query Expansion Conclusions
• Expansion of queries with related terms can improve
performance, particularly recall.
• However, must select similar terms very carefully to avoid
problems, such as loss of precision.

• Some search engines offer a similar/related-pages feature (a trivial form of
relevance feedback).

Quiz in 6 minutes
1. Explain the difference between relevance feedback with user input and
relevance feedback without user input. (2 pts.)
2. Explain the difference between local analysis feedback strategies and
global analysis feedback strategies. (2 pts.)
3. Which is better, local or global analysis feedback? (1 pt.)

