IR Module For MIS Rift
Prepared by:
1. Tolessa Desta (MSc)
April, 2023
Nekemte, Ethiopia
Preface
This module is designed for students of Management Information System who take the course
“Information Retrieval”.
The module is organized into units, each preceded by a unit outline and unit objectives, so that readers can glance at each unit's aims before going into the details of the reading.
Dear readers, you are encouraged to go through each unit and answer each activity question before you proceed to the next unit; doing so will help you achieve the general objective of the module.
Course Introduction
General Objectives of the Course
At the end of the course, students will be able to:
Explain the basic theories and principles of information storage and retrieval
Explain the retrieval process
Describe automatic text operations and automatic indexing
Explain the evaluation of information retrieval
Analyze the different retrieval models, such as the Boolean model, the vector-based retrieval model, and the probabilistic retrieval model
Express query languages, query operations, string manipulation, and search algorithms
Explain current issues in information retrieval
The course also aims to:
Introduce modern concepts of information retrieval systems
Acquaint students with the various indexing, matching, organizing, and evaluating strategies developed for information retrieval (IR) systems
Enable students to understand current research issues and trends in IR
In your own words, define Information Retrieval and give a possible example you can think of right now.
Information Retrieval (IR) can be defined as a software program that deals with the organization,
storage, retrieval, and evaluation of information from document repositories, particularly textual
information.
Information Retrieval is the activity of obtaining material, usually documents of an unstructured nature (usually text), that satisfies an information need from within large collections stored on computers. For example, information retrieval takes place when a user enters a query into the system.
Information retrieval has also been defined as the “discipline that deals with the structure, analysis, organization, storage, searching, and retrieval of information”.
Information retrieval (IR) is the process of finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need of the user from within large collections (usually
stored on computers). Information is organized into (a large number of) documents. Large collections
of documents from various sources:
news articles,
research papers,
books,
digital libraries, and so on.
Much IR research focuses more specifically on text retrieval. But there are many other interesting
areas: Cross-language retrieval, Audio (Speech & Music) retrieval, Question-answering, Image
retrieval, Video retrieval.
Information retrieval is defined as the process of accessing and retrieving the most appropriate
information from text based on a particular query given by the user, with the help of context-based
indexing or metadata. Google Search is the most famous example of information retrieval.
General Goal of Information Retrieval
To help users find useful/relevant information based on their information needs (with minimum effort), despite the challenges of:
Increasing complexity of information
Changing needs of users
What is the difference between data, information, data retrieval, and information retrieval?
Main objective of IR
Provide the users with effective access to and interaction with information resources.
Purpose/role of an IR system
An information retrieval system is designed to retrieve the documents or information required
by the user community.
It should make the right information available to the right user.
Thus, an information retrieval system aims at collecting and organizing information in one or
more subject areas in order to provide it to the user as soon as possible.
Thus it serves as a bridge between the world of creators or generators of information and the
users of that information.
Information retrieval (IR) is concerned with representing, searching, and manipulating large
collections of electronic text and other human-language data.
Web search engines — Google, Bing, and others — are by far the most popular and heavily
used IR services, providing access to up-to-date technical information, locating people and
organizations, summarizing news and events, and simplifying comparison shopping.
What is the main difference between Information Retrieval and an Information Retrieval System?
Write in your own words.
An Information Retrieval System consists of a software program that facilitates a user in finding the information the user needs. The system may use standard computer hardware or specialized hardware to support the search sub-function and to convert non-textual sources to a searchable medium (e.g., transcription of audio to text). Non-textual items are often linked for retrieval based upon a search of associated text, and techniques are beginning to emerge to search these other media types directly.
Kinds of information retrieval systems
Two broad categories of information retrieval system can be identified:
In-house: In-house information retrieval systems are set up by a particular library or information center to serve mainly the users within the organization. One particular type of in-house database is the library catalogue.
Online: Online IR retrieves data from websites, web pages, and servers, and may involve databases, images, text, tables, and other data types.
IR and Related Areas
1. Database Management
2. Library and Information Science
3. Artificial Intelligence
4. Natural Language Processing
5. Machine Learning
1. Database Management
Focused on structured data stored in relational tables rather than free-form text.
Focused on efficient processing of well-defined queries in a formal language (SQL).
Clearer semantics for both data and queries.
Recent move towards semi-structured data (XML) brings it closer to IR.
2. Library and Information Science
Focused on the human user aspects of information retrieval (human-computer interaction,
user interface, visualization).
Concerned with effective categorization of human knowledge.
Concerned with citation analysis and bibliometrics (structure of information).
Recent work on digital libraries brings it closer to CS & IR.
3. Artificial Intelligence
Focused on the representation of knowledge, reasoning, and intelligent action.
A user may move, for example, from a document providing “directions to Addis” to documents which cover “Tourism in Ethiopia”. In this context, the user is said to be browsing the collection and not searching, since the user may simply have an interest in glancing around.
Logical View of Documents
How do you understand the logical views of the documents in Information Retrieval?
Document representation viewed as a continuum, in which logical view of documents might shift from
full text to index terms
If full text:
Each word in the text is a keyword
The most complex form… why?
Expensive… why?
If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations.
Text operations reduce the complexity of the document representation and allow moving the logical view from that of a full text to a set of index terms.
Structure of an IR System
How do you interrelate user, documents, web, web crawler and information retrieval System?
An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users. That is, writers present a set of ideas in a document using a set of concepts; users then query the IR system for relevant documents that satisfy their information need.
It is necessary to define the text collection before any of the retrieval processes are initiated. This is usually done by the manager of the database and includes specifying the following:
The documents to be used
The operations to be performed on the text
The text model to be used (the text structure and what elements can be retrieved)
The text operations transform the original documents and the information needs and generate a logical
view of each document
Once the logical view of the documents is defined, the database module builds an index of the text
An index is a critical data structure
It allows fast searching over large volumes of data
Reduces the vocabulary of the collection
Different index structures might be used, but the most popular one is the inverted file
Given the document database is indexed, the retrieval process can be initiated
1. The user first specifies an information need using words, which are then parsed and transformed by the same text operations applied to the text
2. Next, the query operations are applied; then the actual query, which provides a system representation of the user need, is generated
3. The query is then processed (compared with the documents) to obtain the relevant documents. Before the retrieved documents are presented to the user, they are ranked according to the likelihood of relevance
4. The user then examines the list of ranked retrieved documents for useful information; if the user is not satisfied with the result, the user may:
i. reformulate the query and run it on the entire collection, or
ii. reformulate the query and run it on the displayed result set
5. At this point, the user might pinpoint a subset of the documents seen as definitely of interest and
initiate a user feedback cycle
In such a cycle, the system uses the documents selected by the user to change the query
formulation.
Hopefully, this modified query is a better representation of the real user need
Issues in IR
Text representation
What makes a “good” representation?
How is a representation generated from text?
What are retrievable objects and how are they organized?
Information needs representation
What is an appropriate query language?
How can interactive query formulation and refinement be supported?
Comparing representations to identify relevant documents
What weighting scheme and similarity measure to be used?
What is a “good” model of retrieval?
Evaluating effectiveness of retrieval
What are good metrics?
What constitutes a good experimental test bed?
(Figure: the Indexing Subsystem and the Searching Subsystem of an IR system.)
Briefly write the main differences between Zipf's law, Luhn's idea, and Heaps' law.
Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950),
attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words within
a text.
Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:
Law of Zipf: frequency * rank = constant
If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation: r * f = c. Is rank directly proportional to frequency?
For illustration, consider a 336,310-document collection containing 125,720,891 total words, of which 508,209 are unique; tabulating its most frequently occurring words exhibits this rank-frequency behavior.
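To see this law concretely, the following Python sketch prints rank * frequency for the most frequent words of any plain-text collection; the product should stay roughly constant. The file name corpus.txt is only a placeholder, not part of this module.

from collections import Counter

def zipf_table(text, top=10):
    # Count words, rank them by decreasing frequency,
    # and print rank * frequency for the top entries.
    freqs = Counter(text.lower().split())
    for rank, (word, f) in enumerate(freqs.most_common(top), start=1):
        print(f"{rank:>4} {word:<15} f={f:<8} rank*f={rank * f}")

# corpus.txt is a placeholder; point it at any large plain-text file.
zipf_table(open("corpus.txt", encoding="utf-8").read())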
Heaps' Law
Heaps' law describes the distribution of the size of the vocabulary: vocabulary size grows with the number of tokens as a power law, V = K * n^beta (a straight line on a log-log scale).
• Example: from 1,000,000,000 words, there may be 100,000 distinct words. Can you agree?
Example: We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpora:
For 100,000 words, there are 50,000 unique words
For 500,000 words, there are 150,000 unique words
Estimate the vocabulary size for the 1,000,000-word corpus.
How about for a corpus of 1,000,000,000 words? (See the sketch below.)
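A worked sketch of this estimation, assuming the power-law form of Heaps' law, V = K * n^beta, with the constants K and beta fitted from the two known corpora:

import math

n1, v1 = 100_000, 50_000      # 100,000 words -> 50,000 unique words
n2, v2 = 500_000, 150_000     # 500,000 words -> 150,000 unique words

# Fit V = K * n**beta through the two known points.
beta = math.log(v2 / v1) / math.log(n2 / n1)   # log(3)/log(5) ~ 0.68
K = v1 / n1 ** beta                            # ~ 19.3

for n in (1_000_000, 1_000_000_000):
    print(f"n = {n:>13,} -> estimated vocabulary ~ {K * n ** beta:,.0f}")
# ~ 240,000 distinct words for the 1,000,000-word corpus,
# ~ 27,000,000 for 1,000,000,000 words (extrapolating this far is rough).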
Preprocessing
Preprocessing is the process of controlling the size of the vocabulary or the number of distinct words
used as index terms.
Preprocessing can lead to an improvement in information retrieval performance. However, some search engines on the Web omit preprocessing; in that case, every word in the document is an index term.
Main Text Operations
5 main operations for selecting index terms, i.e. to choose words/stems (or groups of words) to
be used as indexing terms:
Tokenization of the text: generate a set of words from text collection
Elimination of stop words - filter out words which are not important in the retrieval
process
Normalization – resolving artificial differences among words
Stemming words - remove affixes (prefixes and suffixes) and group together word
variants with similar meaning
Construction of term categorization structures, such as a thesaurus, to capture term relationships and allow expansion of the original query with related terms
Generating Document Representatives
Text Processing System
Input text – full text, abstract or title
Output – a document representative adequate for use in an automatic retrieval system
The document representative consists of a list of class names, each name representing a class of
words occurring in the total input text.
A document will be indexed by a name if one of its significant words occurs as a member of
that class.
The logical view of the document is provided by representative keywords or index terms, which historically were used to represent documents in a collection. On modern computers, retrieval systems can adopt a full-text logical view of the document. However, with very large collections, the set of
representative keywords may have to be reduced. This process of reduction or compression of the set
of representative keywords is called text operations (or transformation).
Logical view of a document in text preprocessing
Text preprocessing analyzes information items to generate lists of index terms. The lexical analysis phase produces candidate
index terms that may be further processed and eventually added to indexes. Query processing is the
activity of analyzing a query and comparing it to indexes to find relevant items. Lexical analysis of
a query produces tokens that are parsed and turned into an internal representation suitable for
comparison with indexes.
Issues in Tokenization
The main objective of tokenization is the identification of words in the text document.
Tokenization is greatly dependent on how the concept of the word is defined.
The first decision that must be made in designing a lexical analyzer for an automatic indexing
system is: What counts as a word or token in the indexing scheme?
Is it a sequence of characters, of digits, or alphanumeric? A word is a sequence of letters terminated by a separator (period, comma, space, etc.).
The definition of letter and separator is flexible; e.g., a hyphen could be defined as a letter or as
a separator. Usually, common words (such as “a”, “the”, “of”, …) are ignored.
The standard tokenization approach is single-word tokenization, where the input is split into words using whitespace characters as delimiters, ignoring all other structure.
This approach introduces errors at an early stage because it ignores multi-word units, numbers,
hyphens, punctuation marks, and apostrophes.
Numbers/Digits
Most numbers are usually not good index terms – Without a surrounding context, they are
inherently vague
The preliminary approach is to remove all words containing sequences of digits unless specified
otherwise
The advanced approach is to perform date and number normalization to unify format
Hyphens
Breaking up hyphenated words seems to be useful
But some words include hyphens as an integral part: anti-virus, anti-war, …
Adopt a general rule to process hyphens and specify the possible exceptions
Punctuation marks
Removed entirely in the process of lexical analysis
But some are an integral part of the word, e.g., 510 B.C.
The case of letters
Not important for the identification of index terms
Convert all the text to either lower or upper case
But parts of the semantics will be lost due to case conversion
One word or multiple: How do you decide it is one token or two or more?
Hewlett-Packard: Hewlett and Packard as two tokens?
State-of-the-art: break up hyphenated sequence.
San Francisco, Los Angeles
lowercase, lower-case, lower case ?
data base, database, data-base
How to handle special cases involving apostrophes, hyphens etc? C++, C#, URLs, emails, …
Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a
meaningful part of a token.
However, frequently they are not.
Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens.
Generally, don't index numbers as text, but they are often very useful. Systems will often index “meta-data”, including creation date, format, etc., separately.
Issues of tokenization are language specific
Requires the language to be known
The application area for which the IR system is developed also dictates the nature of valid tokens
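The following minimal sketch (assuming English text) contrasts naive whitespace tokenization with a regular expression that keeps hyphenated words intact and separates numbers, illustrating the issues listed above:

import re

text = "State-of-the-art anti-virus software was released in 1999."

naive = text.split()   # whitespace only: punctuation stays glued to words
regex = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+", text)

print(naive)   # [..., 'released', 'in', '1999.']  (note the trailing period)
print(regex)   # hyphenated words kept whole, '1999' split off cleanly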
2. Normalization
• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
• Need to “normalize” terms in indexed text as well as query terms into the same form
• Example: We want to match U.S.A. and USA, by deleting periods in a term
Case Folding: Often best to lower case everything, since users will use lowercase regardless of
‘correct’ capitalization…
Republican vs. republican
Fasil vs. fasil vs. FASIL
Anti-discriminatory vs. antidiscriminatory
Car vs. Automobile?
Normalization issues
Good for:
Allow instances of Automobile at the beginning of a sentence to match with a query of
automobile
Helps a search engine when most users type ferrari while they are interested in a
Ferrari car
Not advisable for:
Proper names vs. common nouns
E.g. General Motors, Associated Press, Kebede…
Solution:
lowercase only words at the beginning of the sentence
In IR, lowercasing is most practical because of the way users issue their queries
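A small sketch of the trade-off just described: deleting periods and lowercasing makes U.S.A. match USA, but it also collapses Republican (the proper name) and republican (the adjective) into one index term.

def normalize(token):
    # Delete periods (U.S.A. -> USA) and fold case.
    return token.replace(".", "").lower()

for raw in ["U.S.A.", "USA", "Republican", "republican", "FASIL", "Fasil"]:
    print(f"{raw:<12} -> {normalize(raw)}")
# U.S.A. and USA now match, but the proper-name distinction is lost.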
3. Elimination of Stop words
Stop words
Words which are too frequent among the documents in the collection are not good discriminators
A word occurring in 80% of the documents in the collection is useless for purposes of retrieval
E.g., articles, prepositions, conjunctions, …
Filtering out stop words achieves a reduction of about 40% in the size of the indexing structure
The extreme approach: some verbs, adverbs, and adjectives could be treated as stop words
The stop word list – Usually contains hundreds of words
Stop words are extremely common words across document collections that have no discriminatory
power
They may occur in 80% of the documents in a collection.
They would appear to be of little value in helping select documents matching a user need
and need to be filtered out of the index list
Examples of stop words:
articles (a, an, the);
pronouns:(I, he, she, it, their, his)
prepositions (on, of, in, about, besides, against),
conjunctions/ connectors (and, but, for, nor, or, so, yet),
verbs (is, are, was, were),
adverbs (here, there, out, because, soon, after) and
adjectives (all, any, each, every, few, many, some)
Stop words are language dependent.
Intuition:
Stop words have little semantic content; it is typical to remove such high-frequency words
Stop words can take up around 50% of the text; hence, removing them reduces document size by 30-50%
Smaller indices for information retrieval
Good compression techniques for indices: The 30 most common words account for
30% of the tokens in written text
Better approximation of importance for classification, summarization, etc.
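A minimal sketch of stop word filtering; the stop list below is a tiny hand-made stand-in for the lists of hundreds of words that real systems use.

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "or"}

def remove_stop_words(tokens):
    # Keep only content-bearing tokens.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the retrieval of information is the goal of an IR system".split()
print(remove_stop_words(tokens))
# ['retrieval', 'information', 'goal', 'IR', 'system']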
Term conflation
One of the problems involved in the use of free text for indexing and retrieval is the variation in word
forms that is likely to be encountered.
The most common types of variations are
spelling errors (father, fathor)
alternative spellings i.e. locality or national usage (color vs colour, labor vs labour)
multi-word concepts (database, data base)
Affixes (dependent, independent, dependently)
Abbreviations (i.e. = that is).
Function of conflation in terms of IR
Reducing the total number of distinct terms, with a consequent reduction in dictionary size and fewer updating problems
Fewer terms to worry about
Bringing similar words with similar meanings to a common form, with the aim of increasing retrieval effectiveness.
More accurate statistics
4. Stemming/Morphological analysis
What is stemming? Write the main difference between stemming and lemmatization.
Stemming reduces tokens to their root word forms in order to recognize morphological variants.
The process involves removal of affixes (i.e. prefixes & suffixes) with the aim of reducing variants to
the same stem
Stemming is a process that removes the last few characters of a word, often leading to incorrect meanings and spellings. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to
break a word down to its root meaning to identify similarities. For example, a lemmatization algorithm
would reduce the word better to its root word, or lemma, good. Lemmatization usually refers to doing
things properly with the use of a vocabulary and morphological analysis of words, normally aiming to
remove inflectional endings only and to return the base or dictionary form of a word, which is known
as the lemma .
The process of lemmatization seeks to get rid of inflectional suffixes and prefixes for the purpose of
bringing out the word’s dictionary form.
Write the main difference between inflectional & derivational morphology of a word with
their examples.
Often removes inflectional & derivational morphology of a word
i. Inflectional morphology: vary the form of words in order to express grammatical features, such
as singular/plural or past/present tense. E.g. Boy → boys, cut → cutting.
ii. Derivational morphology: makes new words from old ones. E.g. creation is formed from create,
but they are two separate words. And also, destruction → destroy
Stemming is language dependent
Correct stemming is language specific and can be complex.
Compressed and compression are both accepted as equivalent to compress.
The final output from a conflation algorithm is a set of classes, one for each stem detected.
A Stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or
suffixes). Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
Thus, [automate, automatic, automation] all reduce to automat
A stem is used as index terms/keywords for document representations
Queries: Queries are handled in the same way.
Ways to implement stemming
The first approach is to use a dictionary (table lookup) that maps each word to its stem.
The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear.
The second approach is to use a set of rules that extract stems from words.
The advantages of this approach are that the code is typically small, and it can gracefully handle
new words; the disadvantage is that it occasionally makes mistakes.
But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the
rule-based approach is the one that is generally chosen.
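As a sketch of the rule-based approach, the Porter stemmer from the NLTK library (an external package, assumed installed; the module itself does not prescribe a tool) conflates the example groups given above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connected", "connecting", "connection", "connections",
             "automate", "automatic", "automation"]:
    print(f"{word:<12} -> {stemmer.stem(word)}")
# The connect* group conflates to 'connect',
# and the automat* group to 'automat'.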
Typical errors of stemming
Write the types of stemming errors with their examples.
There are three main types of errors found in stemming:
Over-stemming
Under-stemming
Mis-stemming
Over-stemming: occurs when too much is removed.
The error of taking off too much
Examples:
croûtons → croût (croûtons is the plural of croûton, so too much was removed)
‘wander’ → ‘wand’
‘news’ → ‘new’
‘universal’, ‘universe’, ‘universities’, and ‘university’ → ‘univers’
Under-stemming: occurs when too little is removed, so that words which should be stemmed to the same root are not.
The error of taking off too small a suffix
For example:
croulons → croulon, since croulons is a form of the verb crouler (the stem should be croul)
‘knavish’ → ‘knavish’
‘data’ → ‘dat’
‘datum’ → ‘datu’
Mis-stemming
Taking off what looks like an ending, but is really part of the stem
reply → rep
NOTE: Both data and datum have the same root, yet they stem to two separate forms.
Criteria for judging stemmers
Correctness
Overstemming: too much of a term is removed.
Can cause unrelated terms to be conflated → retrieval of non-relevant documents
Understemming: too little of a term is removed.
Prevents related terms from being conflated → relevant documents may not be retrieved
Thesauri
What is Thesaurus? How do you differentiate Thesaurus from dictionary?
Mostly full-text searching cannot be accurate, since different authors may select different words to
represent the same concept
Problem: The same meaning can be expressed using different terms that are synonyms, homonyms,
and related terms
How can it be ensured that, for the same meaning, identical terms are used in the index and in the query?
Thesaurus: The vocabulary of a controlled indexing language, formally organized so that a priori
relationships between concepts (for example as "broader" and “related") are made explicit.
A thesaurus contains terms and relationships between terms
IR thesauri typically rely upon the use of symbols such as USE/UF (UF = used for), BT (broader term), and RT (related term) to express inter-term relationships.
e.g., car = automobile, truck, bus, taxi, motor vehicle; color = colour, paint
Aim of Thesaurus
Thesaurus tries to control the use of the vocabulary by showing a set of related words to handle
synonyms and homonyms
The aim of thesaurus is therefore:
to provide a standard vocabulary for indexing and searching
to rewrite terms into equivalence classes, and index such equivalences
When the document contains automobile, index it under car as well (usually, also vice-versa)
to assist users with locating terms for proper query formulation: When the query contains
automobile, look under car as well for expanding query
to provide classified hierarchies that allow the broadening and narrowing of the current request
according to user needs
Thesaurus Construction
Example: a thesaurus entry built to assist IR when searching for cars and vehicles:
Term: Motor vehicles
UF: Automobiles; Cars; Trucks
BT: Vehicles
RT: Road engineering; Road transport
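A toy Python sketch of how such an entry could drive query expansion; the data structure and the choice to expand with UF and RT terms are illustrative assumptions, not a standard API.

THESAURUS = {
    "motor vehicles": {
        "UF": ["automobiles", "cars", "trucks"],        # used for (synonyms)
        "BT": ["vehicles"],                             # broader term
        "RT": ["road engineering", "road transport"],   # related terms
    }
}

def expand(term):
    # Expand a query term with its synonyms (UF) and related terms (RT).
    entry = THESAURUS.get(term.lower(), {})
    return [term] + entry.get("UF", []) + entry.get("RT", [])

print(expand("Motor vehicles"))
# ['Motor vehicles', 'automobiles', 'cars', 'trucks',
#  'road engineering', 'road transport']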
What is Indexing?
Indexing is an arrangement of index terms that permits fast searching while saving memory space; it is used to speed up access to desired information from the document collection as per the user's query
It enhances efficiency in terms of retrieval time: relevant documents are searched and retrieved quickly
Index file usually has index terms in a sorted order. Which list is easier to search?
Index files are much smaller than the original file. Remember Heaps' law: in a 1 GB text collection, the vocabulary might have a size of close to 5 MB. This size may be further reduced by text operations.
Indexing: Basic Concepts
A simple alternative is to search the whole text sequentially (online search)
Another option is to build data structures over the text (called indices) to speed up the search
Major Steps in Index Construction
Source file: a collection of text documents
A document is a collection of words/terms and other informational elements
Tokenize: identify words in a document, so that each document is represented by a list of keywords or
attributes
Index Terms Selection: apply text operations or preprocessing
Stop words removal: words with high frequency are non-content-bearing and need to be removed from the text collection
Stemming: reduce words with similar meaning into their stem/root word
Term weighting: Different index terms have varying relevance when used to describe document
contents. This effect is captured through the assignment of numerical weights to each index term of
a document. There are different index terms weighting methods: including TF, IDF, TF*IDF,…
Indexing structure: the set of index terms (the vocabulary) is organized in an index file so as to easily identify the documents in which each term occurs.
Basic Indexing Process
An index file is a list of search terms that are organized for associative look-up, i.e., to answer the user's query:
In which documents does a specific search term appear?
Where, within each document does each term appear? (There may be several occurrences of a
term in a document.)
For organizing an index file for a collection of documents, there are various options available:
Decide what data structure and/or file structure to use.
Is it sequential file, inverted file, suffix tree, etc. ?
Sequential File
The sequential file is the most primitive file structure.
It has neither a vocabulary (a unique list of words) nor linking pointers.
The records are generally arranged serially, one after another, but in lexicographic order
on the value of some key field.
A particular attribute is chosen as primary key whose value will determine the order of
the records.
When the first key fails to discriminate among records, a second key is chosen to give an
order.
Example: Given a collection of documents, they are parsed to extract words and these are saved with
the Document ID.
To access records we have to search serially: starting at the first record, read and investigate all the succeeding records until the required record is found or the end of the file is reached.
Its main advantages:
easy to implement;
Provides fast access to the next record using lexicographic order.
Can be searched quickly, using binary search, O(log n)
Question: what is the update option? Does the index need to be rebuilt, or is incremental update supported?
Its disadvantages:
No weights attached to terms. Individual words are treated independently
Random access is slow: since similar terms are indexed individually, we need to find all terms that
match with the query
Inverted file
A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it. Building and maintaining an inverted index is relatively low-cost and low-risk: on a text of n words, an inverted index can be built in O(n) time. The list is “inverted” from a list of terms in location order to a list of terms in alphabetical order.
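A minimal sketch of building an inverted file in a single O(n) pass over a toy collection (text operations such as stop word removal and stemming are omitted for brevity):

from collections import defaultdict

docs = {
    1: "information retrieval system",
    2: "database system design",
    3: "information system evaluation",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():        # simple whitespace tokenization
        inverted[term].add(doc_id)

for term in sorted(inverted):        # keywords kept in alphabetical order
    print(f"{term:<12} -> {sorted(inverted[term])}")
# e.g. 'information' -> [1, 3] and 'system' -> [1, 2, 3]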
Write some notes on Term frequency, Document frequency, Collection frequency, vocabulary
files and posting files
Suffix Tree
A suffix tree is an extension of suffix trie that construct a Trie of all the proper suffixes of S
The suffix tree is created by compacting unary nodes of the suffix TRIE.
Classical IR Models
These are the simplest and easiest-to-implement IR models, based on mathematical knowledge that is easily recognized and understood.
The following are examples of classical IR models:
Boolean models,
Vector models,
Probabilistic models.
In the Boolean model, terms are either present or absent. Thus w_ij ∈ {0, 1}, and sim(q, d_j) = 1 if the document satisfies the Boolean query, and 0 otherwise.
Exercise: What are the relevant documents retrieved for the query ((Caesar OR Milton) AND (Swift OR Shakespeare))?
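The exercise can be worked mechanically with set operations over the postings lists; the document IDs below are made up purely to illustrate the evaluation.

postings = {
    "caesar":      {1, 2, 4},     # hypothetical postings lists
    "milton":      {3, 5},
    "swift":       {2, 3},
    "shakespeare": {1, 5},
}

# ((Caesar OR Milton) AND (Swift OR Shakespeare))
result = (postings["caesar"] | postings["milton"]) & \
         (postings["swift"] | postings["shakespeare"])
print(sorted(result))   # -> [1, 2, 3, 5] for these invented postings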
Computing weights
How to compute weight for term i in document j (wij ) and weight for term i in query q (wiq)?
A good weight must take into account two effects:
Quantification of intra-document contents (similarity)
tf factor, the term frequency within a document
Quantification of inter-documents separation (dissimilarity)
idf factor, the inverse document frequency across documents
As a result of which most IR systems are using tf*idf weighting technique: wij = tf(i,j) * idf(i)
Let: N be the total number of documents in the collection
ni be the number of documents which contain ki
freq(i,j) be the raw frequency of ki within dj
A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms l which occur within the document dj
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every
document is different in length, it is possible that a term would appear much more times in long
documents than shorter ones. Thus, the term frequency is often divided by the document length (aka.
the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the
document).
The idf factor is computed as idf(i) = log (N/ni) the log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of information associated with the term ki.
The best term-weighting schemes use tf*idf weights which are given by
wij = tf(i,j) * log(N/ni)
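A short sketch computing these weights on a toy collection, using the normalized tf factor and idf(i) = log(N/ni) defined above (the base-10 logarithm is an arbitrary choice here):

import math
from collections import Counter

docs = ["cat sat on the mat", "cat and dog", "dog barks at the cat"]
N = len(docs)

doc_freq = Counter()                    # n_i: number of docs containing i
for d in docs:
    doc_freq.update(set(d.split()))

def weight(term, doc):
    freqs = Counter(doc.split())
    tf = freqs[term] / max(freqs.values())      # normalized tf factor
    idf = math.log10(N / doc_freq[term])        # idf(i) = log(N / n_i)
    return tf * idf

print(round(weight("dog", docs[2]), 3))     # 0.176: appears in 2 of 3 docs
print(round(weight("barks", docs[2]), 3))   # 0.477: rarer, weighted higher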
Advantages:
term-weighting improves quality of the answer set since it displays in ranked order
partial matching allows retrieval of documents that approximate the query conditions
cosine ranking formula sorts documents according to degree of similarity to the query
Disadvantages: assumes independence of index terms
Interpolation
It is a general form of precision/recall calculation, in which precision changes with respect to recall (it is not a fixed point).
It is an empirical fact that, on average, as recall increases, precision decreases.
Precision is interpolated at 11 standard recall levels: r_j ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, where j = 0, …, 10.
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and (j+1)-th levels: P(r_j) = max { P(r) : r_j ≤ r ≤ r_(j+1) }.
Mean Average Precision (MAP):
Often we have a number of queries to evaluate for a given system. For each query, we can calculate
average precision, and if we take average of those averages for a given system, it gives us Mean
Average Precision (MAP), which is a very popular measure to compare two systems.
R-precision: It is defined as precision after R documents retrieved, where R is the total number of
relevant documents for a given query.
Average precision and R-precision are shown to be highly correlated. In the previous example, since
the number of relevant documents (R) is 5, R-precision for both the rankings is 0.4 (value of precision
after 5 documents retrieved).
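A small sketch of average precision and MAP, assuming binary relevance judgments; the two rankings below are invented for illustration.

def average_precision(ranking, total_relevant):
    # ranking: 1/0 relevance flags of the retrieved docs, in ranked order.
    hits, precisions = 0, []
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)   # precision at each relevant doc
    return sum(precisions) / total_relevant

q1 = [1, 0, 1, 0, 0]                  # relevant docs at ranks 1 and 3
q2 = [0, 1, 0, 0, 1]                  # relevant docs at ranks 2 and 5
ap1 = average_precision(q1, 5)        # (1/1 + 2/3) / 5 ~ 0.333
ap2 = average_precision(q2, 5)        # (1/2 + 2/5) / 5 = 0.18
print((ap1 + ap2) / 2)                # MAP over the two queries ~ 0.257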
Chapter Five
Query Languages and Query Operation
Keyword-Based Querying
Queries are combinations of words. The document collection is searched for documents that contain these words. Word queries are intuitive, easy to express, and provide fast ranking.
The concept of word must be defined. A word is a sequence of letters terminated by a separator
(period, comma, space, etc). Definition of letter and separator is flexible; e.g., hyphen could be defined
as a letter or as a separator. Usually, common words (such as “a”, “the”, “of”…) are ignored.
Single-word queries
A query is a single word.
Usually used for searching in document images.
The simplest form of query.
All documents that include this word are retrieved.
Documents may be ranked by the frequency of this word in the document.
Phrase queries
A query is a sequence of words treated as a single unit.
Also called “literal string” or “exact phrase” query.
Phrase is usually surrounded by quotation marks.
All documents that include this phrase are retrieved.
Usually, separators (commas, colons, etc.) and common words (e.g., “a”, “the”, “of”, “for”…) in the phrase are ignored. In effect, this query is for a set of words that must appear in sequence.
Allows users to specify a context and thus gain precision.
Example: “Information Processing for Document Retrieval”.
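One common way to answer phrase queries (the module does not prescribe an implementation; this is a sketch) is a positional index: record the positions of each term and require consecutive positions.

doc = "information processing for document retrieval".split()

# Positional index: term -> list of positions in the document.
positions = {t: [i for i, w in enumerate(doc) if w == t] for t in set(doc)}

def phrase_match(phrase):
    words = phrase.lower().split()
    # The phrase matches if, starting at some position of its first word,
    # every following word appears at the next consecutive position.
    return any(all(p + k in positions.get(w, [])
                   for k, w in enumerate(words[1:], start=1))
               for p in positions.get(words[0], []))

print(phrase_match("document retrieval"))   # True: positions 3 and 4
print(phrase_match("retrieval document"))   # False: wrong order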
Multiple-word queries
A query is a set of words (or phrases)
Two options: A document is retrieved if it includes
any of the query words, or
each of the query words.
Documents are ranked by the number of query words they contain: A document containing n query
words is ranked higher than a document containing m < n query words.
Examples: Boolean queries
1. Computer OR server: finds documents containing either computer, server, or both.
2. (computer OR server) NOT mainframe: selects all documents that discuss computers or servers; does not select any documents that discuss mainframes.
3. Computer NOT (server OR mainframe): selects all documents that discuss computers and do not discuss either servers or mainframes.
4. Computer OR server NOT mainframe: selects all documents that discuss computers, or documents that discuss servers but do not discuss mainframes.
Natural language
Using natural language for querying is very easy and attractive for the user.
Example: “Find all the documents that discuss campaign finance reforms, including documents that
discuss violations of campaign financing regulations. Do not include documents that discuss campaign
contributions by the gun and the tobacco industries”. “Documents that contain information on bank, where bank is not related to rivers but to financial institutions”. Natural language queries are converted to a
formal language for processing against a set of documents. Such translation requires intelligence and is
still a challenge for IR systems.
Pseudo Natural Language processing: System scans the text and extracts recognized terms and
Boolean connectors. The grammaticality of the text is not important. Often used by search engines.
Problem: recognizing the negation in the search statement (“Do not include...”). Compromise: users enter natural language clauses connected with Boolean operators. In the above example:
“campaign finance reforms” or “violations of campaign financing regulations” and not “campaign contributions by the gun and the tobacco industries”.
Query Operations: - Relevance Feedback & Query Expansion
Problems with Keywords
Keywords may not retrieve relevant documents that include synonymous terms.
The goal of query expansion is to enrich the user’s query by finding additional search terms, either
automatically or semi-automatically, that represent the user's information need more accurately and
completely, thus avoiding, at least to an extent, the aforementioned problems, and increasing the
chances of matching the user’s query to the representations of relevant ideas in documents. Query
expansion techniques may be categorized by the following criteria:
Source of query expansion terms;
Techniques used for weighting query expansion terms;
Role and involvement of the user in the query expansion process.
Query expansion can be performed automatically or interactively. In automatic query expansion
(AQE), the system selects and adds terms to the user’s query, whereas in interactive query expansion
(IQE), the system selects candidate terms for query expansion, shows them to the user, and asks the
user to select (or deselect) terms that they want to include into (or exclude from) the query.
There are three main sources of QE terms: (i) hand-built knowledge resources such as dictionaries,
thesauri, and ontologies; (ii) the documents used in the retrieval process; (iii) external text collections
and resources (e.g., the WWW, Wikipedia).
Query expansion is a technique that modifies the original query of a user to retrieve more relevant
documents from a large collection of information. Relevance feedback is a process that allows the user
to indicate which documents are relevant or not, and then uses this information to refine the query
expansion.
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the query.
In query expansion, users give additional input (good/bad search term) on words or phrases.
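One standard way to carry out this reweighting is the Rocchio formula, q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant); the module does not name a specific method, and the coefficient values below are conventional defaults, not prescriptions.

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Vectors are dicts mapping term -> weight.
    new_q = {t: alpha * w for t, w in query.items()}
    for docs, coef in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for t, w in doc.items():
                new_q[t] = new_q.get(t, 0.0) + coef * w / len(docs)
    return {t: w for t, w in new_q.items() if w > 0}  # drop negative weights

query = {"apple": 1.0, "computer": 1.0}
relevant = [{"apple": 1.0, "computer": 1.0, "laptop": 1.0}]
non_relevant = [{"apple": 1.0, "fruit": 1.0}]
print(rocchio(query, relevant, non_relevant))
# 'laptop' is pulled into the query; 'fruit' ends up negative and is dropped.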
Examples of query expansion with relevance feedback
There are many examples of query expansion with relevance feedback in different domains and
applications. For instance, in web search, Google uses implicit feedback to personalize and refine the
search results based on the user's history and preferences. In academic search, Scopus uses explicit
feedback to allow the user to select the relevant fields, keywords, and sources for query expansion. In
image search, Pinterest uses both explicit and implicit feedback to suggest related images and keywords
based on the user's pins and interests.
B. Term reweighting in expanded query: Modify term weights based on user relevance judgments.
Increase weight of terms in relevant documents
Local analysis
Synonymy association: terms that frequently co-occur inside the local set of documents
At query time, dynamically determine similar terms based on analysis of top-ranked retrieved
documents. Base correlation analysis on only the “local” set of retrieved documents for a
specific query.
Avoids ambiguity by determining similar (correlated) terms only within relevant documents.
“Apple computer” vs. “Apple computer PowerBook laptop”
Global analysis
Expand query using information from whole set of documents in collection
Thesaurus-based: a controlled vocabulary maintained by editors (e.g., MEDLINE's manual thesaurus)
Approach to selecting terms for query expansion:
Determine term similarity through a pre-computed statistical correlation analysis of the
complete corpus.
Compute association matrices which quantify term correlations in terms of how frequently they
co-occur.
Expand queries with statistically most similar terms.
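A toy sketch of the pre-computed association matrix global analysis relies on, here reduced to document-level co-occurrence counts over an invented three-document corpus:

from collections import defaultdict
from itertools import combinations

docs = ["apple computer laptop", "apple fruit red", "computer laptop screen"]

cooc = defaultdict(int)
for d in docs:
    for t1, t2 in combinations(sorted(set(d.split())), 2):
        cooc[(t1, t2)] += 1          # number of docs where both terms occur

def most_similar(term, top=2):
    # Rank the other terms by how often they co-occur with `term`.
    scores = {(a if b == term else b): c
              for (a, b), c in cooc.items() if term in (a, b)}
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(most_similar("computer"))   # ['laptop', 'apple']: expansion candidates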
Problems with Global Analysis
Term ambiguity may introduce irrelevant statistically correlated terms.
“Apple computer” → “Apple red fruit computer”
Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
Global vs. Local Analysis
Global analysis requires intensive term correlation computation only once at system development
time. Local analysis requires intensive term correlation computation for every query at run time
(although number of terms and documents is less than in global analysis). But local analysis gives
better results.
Query Expansion Conclusions
Expansion of queries with related terms can improve performance, particularly recall. However, one must select similar terms very carefully to avoid problems such as loss of precision.