

Rift Valley University


Nekemte Campus
Department of Management Information System
A Module Prepared for the Course Information Retrieval

Prepared by:
1. Tolessa Desta (MSC.)

April, 2023
Nekemte, Ethiopia


Preface
This module is designed for students of Management Information System who take the course
“Information Retrieval”.
The module is organized into units, each beginning with a unit outline and unit objectives, so that readers
can get a quick view of each unit's aim before going into the details of the reading.
Dear reader, you are encouraged to go through each unit and work through the activity questions
before you proceed to the next unit; this will help you achieve the general objective of the
module.
Course Introduction
General Objectives of the Course
At the end of the course, students will be able to:
 Explain the basic theories and principles of information storage and retrieval
 Explain the retrieval process
 Describe automatic text operations and automatic indexing
 Explain the evaluation of information retrieval
 Analyze the different retrieval models, such as the Boolean model, the vector-based retrieval model, and
the probabilistic retrieval model
 Express query languages, query operations, string manipulation and search algorithms
 Explain current issues in information retrieval
 Describe modern concepts of information retrieval systems
 Apply the various indexing, matching, organizing and evaluating strategies
developed for information retrieval (IR) systems
 Understand current research issues and trends in IR


Chapter One: Introduction to Information Storage and Retrieval (ISR)


What is Information Retrieval?

 In your own words, define Information Retrieval and give a possible example that comes to
mind.
Information Retrieval (IR) can be defined as a software program that deals with the organization,
storage, retrieval, and evaluation of information from document repositories, particularly textual
information.
Information Retrieval is the activity of obtaining material, usually of an unstructured nature (i.e.
usually text), that satisfies an information need from within large collections stored on computers.
For example, information retrieval takes place when a user enters a query into the system. The process
of information retrieval (IR) can be understood with the help of the following diagram.

Information retrieval has also been defined as the "discipline that deals with the structure, analysis, organization, storage,
searching, and retrieval of information".
Information retrieval (IR) is the process of finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need of the user from within large collections (usually
stored on computers). Information is organized into (a large number of) documents. Large collections
of documents come from various sources:
 news articles,
 research papers,
 books,
 digital libraries,


 Web pages, etc.


Example: Google's index size: in 1998, Google had already indexed 26 million pages;
it has since reported on the order of 1 trillion (1,000,000,000,000) unique URLs.
Examples of IR systems:
 Conventional (library catalog): search by keyword, title, author, etc.
E.g.: you are probably familiar with the AAU library catalog.
 Text-based (Google, Yahoo, MSN, etc.): search by keywords, with limited search using queries in natural language.
 Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
 Question answering systems (Ask Jeeves, AnswerBus): search in (restricted) natural language.
 Other: cross-language information retrieval (uses multiple languages), music retrieval.
Information retrieval is the process of searching a large unstructured corpus for relevant documents
that satisfy a user's information need.
It is a tool that finds and selects, from a collection of items, a subset that serves the user's purpose.

Much IR research focuses more specifically on text retrieval. But there are many other interesting
areas: Cross-language retrieval, Audio (Speech & Music) retrieval, Question-answering, Image
retrieval, Video retrieval.
Information retrieval is defined as the process of accessing and retrieving the most appropriate
information from text based on a particular query given by the user, with the help of context-based
indexing or metadata. Google Search is the most famous example of information retrieval.
General Goal of Information Retrieval
 To help users find useful/relevant information based on their information needs (with
minimum effort), despite the challenges of:
 the increasing complexity of information
 the changing needs of users


 Provide immediate random access to the document collection.


 Retrieval systems, such as Google and Yahoo, are developed with this aim.
Information Retrieval vs. Data Retrieval

 What is the difference between data, information, data retrieval and information Retrieval?

 Emphasis of IR is on the retrieval of information, rather than on the retrieval of data


Data retrieval: consists mainly of determining which documents contain a set of keywords in the
user query (which is not enough to satisfy the user's information need)
 It aims at retrieving all objects that satisfy well-defined semantics
 A single erroneous object among a thousand retrieved objects implies failure
 An example of a data retrieval system is a relational database
Information retrieval is concerned with retrieving information about a subject or topic rather than retrieving
data which satisfies a given query.
 Its semantics is frequently loose: the retrieved objects might be inaccurate
 Small errors are tolerated
 “Information retrieval deals with representation, storage, organization of, and access to
information items.
 The organization and access of information items should provide the user with easy access to
the information in which he is interested”
 The definition incorporates all important features of a good information retrieval system
 Representation, Storage, Organization, Access,
 The focus is on the user information need rather than a precise query

 Write in your own words the main objective of IR.

Main objective of IR
 Provide the users with effective access to and interaction with information resources.
Purpose/role of an IR system
 An information retrieval system is designed to retrieve the documents or information required
by the user community.
 It should make the right information available to the right user.


 Thus, an information retrieval system aims at collecting and organizing information in one or
more subject areas in order to provide it to the user as soon as possible.
 Thus it serves as a bridge between the world of creators or generators of information and the
users of that information.
 Information retrieval (IR) is concerned with representing, searching, and manipulating large
collections of electronic text and other human-language data.
 Web search engines — Google, Bing, and others — are by far the most popular and heavily
used IR services, providing access to up-to-date technical information, locating people and
organizations, summarizing news and events, and simplifying comparison shopping.

 What is the main difference between Information Retrieval and an Information Retrieval System?
 Write it in your own words.

Information Retrieval System


An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of
information. Information in this context can be composed of text (including numeric and date data),
images, audio, video and other multi-media objects. Although the form of an object in an Information
Retrieval System is diverse, the text aspect has been the only data type that lent itself to fully functional
processing. The other data types have been treated as highly informative sources, but are primarily


linked for retrieval based upon a search of the text. Techniques are beginning to emerge to search these
other media types. An Information Retrieval System consists of a software program that facilitates a
user in finding the information the user needs. The system may use standard computer hardware or
specialized hardware to support the search sub-function and to convert non-textual sources into a
searchable medium (e.g., transcription of audio to text).
Kinds of information retrieval systems
Two broad categories of information retrieval system can be identified:
 In-house: in-house information retrieval systems are set up by a particular library or
information center to serve mainly the users within the organization. One particular type of in-
house database is the library catalogue.
 Online: online IR means retrieving data from web sites, web pages and servers, which may
include databases, images, text, tables, and other types of content.
IR and Related Areas
1. Database Management
2. Library and Information Science
3. Artificial Intelligence
4. Natural Language Processing
5. Machine Learning
1. Database Management
 Focused on structured data stored in relational tables rather than free-form text.
 Focused on efficient processing of well-defined queries in a formal language (SQL).
 Clearer semantics for both data and queries.
 Recent move towards semi-structured data (XML) brings it closer to IR.
2. Library and Information Science
 Focused on the human user aspects of information retrieval (human-computer interaction,
user interface, visualization).
 Concerned with effective categorization of human knowledge.
 Concerned with citation analysis and bibliometrics (structure of information).
 Recent work on digital libraries brings it closer to CS & IR.
3. Artificial Intelligence
 Focused on the representation of knowledge, reasoning, and intelligent action.


 Formalisms for representing knowledge and queries:


 First-order Predicate Logic
 Bayesian Networks
 Recent work on web ontologies and intelligent information agents brings it closer to IR.
4. Natural Language Processing
 Focused on the syntactic, semantic, and pragmatic analysis of natural language text and
discourse.
 Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on
meaning rather than keywords.
Natural Language Processing: IR Directions
 Methods for determining the sense of an ambiguous word based on context (word sense
disambiguation).
 Methods for identifying specific pieces of information in a document (information
extraction).
 Methods for answering specific NL questions from document corpora or structured data
like Freebase or Google's Knowledge Graph.
5. Machine Learning
 Focused on the development of computational systems that improve their performance with
experience.
 Automated classification of examples based on learning concepts from labeled training
examples (supervised learning).
 Automated methods for clustering unlabeled examples into meaningful groups (unsupervised
learning).
Machine Learning: IR Directions
 Text Categorization
 Automatic hierarchical classification (Yahoo).
 Adaptive filtering/routing/recommending.
 Automated spam filtering.
 Text Clustering
 Clustering of IR query results.
 Automatic formation of hierarchies (Yahoo).


 Learning for Information Extraction


 Text Mining
 Learning to Rank

 List some of the features of an Information Retrieval System.

Features of an information retrieval system


Liston and Schene suggest that an effective information retrieval system must have provisions for:
 Prompt dissemination of information
 Filtering of information
 The right amount of information at the right time
 Active switching of information
 Receiving information in an economical way
 Browsing
 Getting information in an economical way
 Current literature
 Access to other information systems
 Interpersonal communications, and
 Personalized help
Why is IR so hard?
 Information retrieval problem: locating relevant documents based on user input, such as
keywords or example documents
 The real problem boils down to matching the language of the query to the language of the
document.
 Simply matching on words is a very weak approach.
 One word can have different semantic meanings.
 Consider: Take
 “take a place at the table”
 “take money to the bank”
 “take a picture”
More Problems with IR
 You can’t even tell what part of speech a word has:


 “I saw her duck”


 Duck: the animal
 Duck: to lower one's head quickly
 A query that searches for "pictures of a duck" will find documents that contain:
 "I saw her duck away from the ball falling from the sky"
 Proper nouns often use regular old nouns
 Consider a document with "a man named Abraham owned a Lincoln"
 A word-matching query for "Abraham Lincoln" may well find the above document.
Basic Concepts in Information Retrieval
 Write the basic concepts of Information Retrieval in your own words.

(i) User Task and


(ii) Logical View of documents

The User Task


Retrieval is the process of retrieving information whereby the main objective is clearly defined from
the onset of the searching process. The user of a retrieval system has to translate his information need into
a query in the language provided by the system. In this context (i.e. by specifying a set of words), the
user searches for useful information by executing a retrieval task. English-language statement: I want a
book by J. K. Rowling titled The Chamber of Secrets.
Browsing is the process of retrieving information whereby the main objective is not clearly defined
from the beginning and whose purpose might change during the interaction with the system. E.g. a user
might search for documents about 'car racing'. Meanwhile he might find interesting documents about
'car manufacturers'. While reading about car manufacturers in Addis, he might turn his attention to a


document providing 'directions to Addis', and from there to documents which cover 'Tourism in
Ethiopia'. In this context, the user is said to be browsing in the collection and not searching, since the user
merely has an interest in glancing around.
Logical View of Documents
 How do you understand the logical view of documents in Information Retrieval?

 Documents in a collection are frequently represented by a set of index terms or keywords


 Such keywords are mostly extracted directly from the text of the document
 These representative keywords provide a logical view of the document

Document representation can be viewed as a continuum, in which the logical view of documents might shift from
the full text to a set of index terms.

 If full text:
 Each word in the text is a keyword
 This is the most complex form… why?


 Expensive….why?
 If the full text is too large, the set of representative keywords can be reduced through a
transformation process called:
 Text operations
 They reduce the complexity of the document representation and allow moving the logical view
from that of the full text to a set of index terms
Structure of an IR System

 How do you interrelate user, documents, web, web crawler and information retrieval System?

An Information Retrieval System serves as a bridge between the world of authors and the world of
readers/users. That is, writers present a set of ideas in a document using a set of concepts, and users
then turn to the IR system for relevant documents that satisfy their information need.

The black box is the information retrieval system.


To be effective in its attempt to satisfy information need of users, the IR system must somehow
‘interpret’ the contents of documents in a collection and rank them according to their degree of
relevance to the user query. Thus the notion of relevance is at the centre of IR. The primary goal of an
IR system is to retrieve all the documents which are relevant to a user query while retrieving as few
non-relevant documents as possible
Typical IR Task
Given:
1. A collection of textual natural-language documents
2. A user query in the form of a textual string
Process:
The various activities required to select the relevant documents
Result:
A ranked set of documents that are assumed to be relevant to the user query
Measures of Effectiveness:
The number of relevant documents among the retrieved set relative to the total number retrieved (precision), and
the number of relevant documents retrieved relative to all relevant documents in the whole collection (recall)
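As a minimal, illustrative sketch of these two measures (the document IDs below are hypothetical, not taken from the module), precision and recall can be computed directly from the retrieved and relevant sets:

# Minimal sketch: precision and recall for a single query (hypothetical doc IDs).
retrieved = {"d1", "d2", "d3", "d4"}     # documents the system returned
relevant = {"d1", "d3", "d7"}            # documents the user actually needed

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2/4 = 0.50
recall = len(hits) / len(relevant)       # 2/3 = 0.67

print(f"precision={precision:.2f} recall={recall:.2f}")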


Typical IR System Architecture

Web Search System (e.g.: Google)


Overview of the Retrieval process

 Write the process of retrieving information (documents) from the web.

It is necessary to define the text collection before any of the retrieval processes are initiated. This is
usually done by the manager of the database and includes specifying the following:
 The documents to be used
 The operations to be performed on the text
 The text model to be used (the text structure and what elements can be retrieved)
The text operations transform the original documents and the information needs and generate a logical
view of each document.
Once the logical view of the documents is defined, the database module builds an index of the text:
 An index is a critical data structure
 It allows fast searching over large volumes of data
 It reduces the vocabulary of the collection
Different index structures might be used, but the most popular one is the inverted file.
Once the document database is indexed, the retrieval process can be initiated:
1. The user first specifies an information need using words, which is then parsed and transformed by the
same text operations applied to the text
2. Next, the query operations are applied, and the actual query, which provides a system
representation of the user need, is generated
3. The query is then processed (compared with the documents) to obtain the relevant documents.
Before the retrieved documents are presented to the user, they are ranked
according to their likelihood of relevance
4. The user then examines the list of ranked retrieved documents for useful information; if the user is
not satisfied with the result, he or she can:
i. reformulate the query and run it on the entire collection, or
ii. reformulate the query and run it on the displayed result set
5. At this point, the user might pinpoint a subset of the documents seen as definitely of interest and
initiate a user feedback cycle
 In such a cycle, the system uses the documents selected by the user to change the query
formulation.


 Hopefully, this modified query is a better representation of the real user need

Issues in IR

 List and discuss the issues in Information Retrieval.

 Text representation
 What makes a “good” representation?
 How is a representation generated from text?
 What are retrievable objects and how are they organized?
 Information needs representation
 What is an appropriate query language?
 How can interactive query formulation and refinement be supported?
 Comparing representations to identify relevant documents
 What weighting scheme and similarity measure should be used?
 What is a "good" model of retrieval?
 Evaluating effectiveness of retrieval
 What are good metrics?
 What constitutes a good experimental test bed?


Focus in IR System Design


 What are the two focuses of IR system design?

 Our focus during IR system design is:


 Improving the performance effectiveness of the system
 Effectiveness of the system is measured in terms of:
 Precision,
 Recall, …
 Stemming, stop-word removal, weighting schemes and matching algorithms help to improve
the effectiveness of the IR system
 Improving performance efficiency
o The concern here is storage space usage, access time, searching time, data transfer time, …
o There are space-time tradeoffs!
o Use compression techniques, data/file structures, etc.
Subsystems of an IR system
 List and define in briefly the two subsystems of an IR system.

The two subsystems of an IR system:


 Indexing: is an offline process of organizing documents using keywords extracted from the
collection
 Searching: is an online process of finding relevant documents in the index list as per users query
 Indexing and searching are unavoidably connected
 you cannot search what was not first indexed in some manner
 indexing of documents is done in order to make them searchable
 there are many ways to do indexing
 to index, one needs an indexing language
 there are many indexing languages
 even the set of all words in a document could serve as an indexing language


Indexing Subsystem

Searching Subsystem

Application areas within IR


 Write another application area of an Information Retrieval with explanation.

 Cross language retrieval


 Speech/broadcast retrieval
 Text categorization
 Text summarization
 Structured document element retrieval (XML)


Chapter Two: Text Operations


Index Term Selection and Text Operations
Index term selection: noun words (or groups of noun words) are more representative of the
semantics of the document content.
Preprocess the text of the documents in the collection in order to select the meaningful/representative index terms.
• Control the size of the vocabulary. E.g., from "the house of the lord", only the nouns "house" and "lord" would be kept as index terms.
 Statistical Properties of Text
 How is the frequency of different words distributed?
 How fast does the vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system and can be used to select suitable term weights
and other aspects of the system.
 A few words are very common.
 The 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences in a
document.
 Most words are very rare.
 Half the words in a corpus appear only once; such words are sometimes called "hapax legomena" (read only once).
 This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".
Zipf's distribution: rank-frequency distribution

Word distribution: Zipf's Law

 Briefly write the main difference between Zipf's Law, Luhn's idea and Heaps' Law.


Zipf's Law, named after the Harvard linguistics professor George Kingsley Zipf (1902-1950),
attempts to capture the distribution of the frequencies (i.e., numbers of occurrences) of the words within
a text.
Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their
frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary
can be characterized by a constant rank-frequency law:
Law of Zipf: Frequency * Rank = constant
If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation:
r * f = c
In other words, frequency is inversely proportional to rank: the second most frequent word occurs roughly half as often as the most frequent one.

Different collections have different constants c.
As an example, in one collection of 336,310 documents containing
125,720,891 total words, there are 508,209 unique words, and the most frequent words follow this rank-frequency pattern.
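As a rough, optional check of the relation r * f ≈ c, the following Python sketch counts word frequencies in a plain-text file and prints rank * frequency for the top words (the file name corpus.txt is only a placeholder):

# Minimal sketch: check Zipf's law (rank * frequency ~ constant) on a text file.
# Assumes a plain-text file named "corpus.txt" exists; the name is only illustrative.
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z]+", f.read().lower())   # crude tokenization

freqs = Counter(words)
for rank, (word, freq) in enumerate(freqs.most_common(20), start=1):
    print(f"{rank:>4} {word:<12} f={freq:<8} rank*f={rank * freq}")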


Methods that Build on Zipf's Law


 Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
 Significant words: Take words in between the most frequent (upper cut-off) and least frequent
words (lower cut-off).
 Term weighting: Give differing weights to terms based on their frequency, with most frequent
words weighted less. Used by almost all ranking methods.
Explanations for Zipf’s Law
The law has been explained by the "principle of least effort", which makes it easier for a speaker or writer
of a language to repeat certain words instead of coining new and different words. Zipf's explanation
was that this principle balances the speaker's desire for a small vocabulary against the
hearer's desire for a large one.
Zipf’s Law Impact on IR
Good News: Stopwords will account for a large fraction of text so eliminating them greatly reduces
inverted-index storage costs.
Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for
correlation analysis for query expansion) is difficult since they are extremely rare.
Word significance: Luhn’s Ideas
Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word
significance.
Luhn suggested that both extremely common and extremely uncommon words are not very useful
for indexing. For this, Luhn specified two cutoff points: an upper and a lower cutoff, based on which non-
significant words are excluded. The words exceeding the upper cutoff are considered to be common.
 The words below the lower cutoff are considered to be rare
 Hence they do not contribute significantly to the content of the text
 The ability of words to discriminate content reaches a peak at a rank-order position halfway
between the two cutoffs
 Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of
word frequency; a plot relating f and r then shows the common words above the upper cutoff, the rare words below the lower cutoff, and the significant words in between


Vocabulary size: Heaps' Law

How do we estimate the number of vocabulary terms in a given corpus? Dictionaries contain about 600,000 words and
above, but they do not include names of people, locations, products, etc.
Vocabulary Growth: Heaps' Law
How does the size of the overall vocabulary (number of unique words) grow with the size of the
corpus? This determines how the size of the inverted index will scale with the size of the corpus.
Heaps' law estimates the number of vocabulary terms in a given corpus:
 The vocabulary size grows as O(n^β), where β is a constant between 0 and 1.
 If V is the size of the vocabulary and n is the length of the corpus in words,
 Heaps' law provides the following equation:
V = K * n^β
where the constants are typically K ≈ 10-100 and β ≈ 0.4-0.6 (approximately a square root).
Heaps' distributions
Distribution of the size of the vocabulary: on a log-log plot, there is a linear relationship between vocabulary size and the
number of tokens.


• Example: from 1,000,000,000 words, there may be 100,000 distinct words. Can you agree?
Example: We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However,
we only know the statistics computed on smaller corpora:
 For 100,000 words, there are 50,000 unique words
 For 500,000 words, there are 150,000 unique words
 Estimate the vocabulary size for the 1,000,000-word corpus.
 How about for a corpus of 1,000,000,000 words?
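A minimal Python sketch of one way to work this exercise: fit the constants K and β of V = K * n^β from the two sample points given above, then extrapolate (the code is illustrative, not part of the module):

# Fit V = K * n**beta from two (corpus size, vocabulary size) observations,
# then extrapolate to larger corpora (values taken from the exercise above).
import math

n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

beta = math.log(v2 / v1) / math.log(n2 / n1)   # ~0.68 for these two points
K = v1 / (n1 ** beta)

def heaps(n):
    return K * n ** beta

print(f"beta = {beta:.3f}, K = {K:.1f}")
print(f"Estimated vocabulary for 1,000,000 words:     {heaps(1_000_000):,.0f}")
print(f"Estimated vocabulary for 1,000,000,000 words: {heaps(1_000_000_000):,.0f}")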


Issues: recall and precision


 breaking up hyphenated terms increases recall but decreases precision
 preserving case distinctions enhances precision but decreases recall
 commercial information systems usually take a recall-enhancing approach (numbers and words
containing digits are index terms, and all are case-insensitive)
Text Operations
 Write the concepts of text operations in your own words, with brief examples.

 Are all words in a document important?


 Not all words in a document are equally significant for representing the contents/meaning of a
document
o Some words carry more meaning than others
o Noun words are the most representative of a document's content
 Therefore, we need to preprocess the text of the documents in a collection to select the terms to be used as index terms
 Using the set of all words in a collection to index documents creates too much noise for the
retrieval task
o Reducing noise means reducing the number of words which can be used to refer to the document
A text operation is the process of transforming text documents into their logical representations that
can be used as index terms.



Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words
used as index terms.
Preprocessing leads to an improvement in information retrieval performance. However, some
search engines on the Web omit preprocessing: in that case, every word in the document is an index term.
Main Text Operations
 5 main operations for selecting index terms, i.e. to choose words/stems (or groups of words) to
be used as indexing terms:
 Tokenization of the text: generate a set of words from text collection
 Elimination of stop words - filter out words which are not important in the retrieval
process
 Normalization – resolving artificial differences among words
 Stemming words - remove affixes (prefixes and suffixes) and group together word
variants with similar meaning
 Construction of term categorization structures such as thesaurus, to capture
relationship for allowing the expansion of the original query with related terms
Generating Document Representatives
 Text Processing System
 Input text – full text, abstract or title
 Output – a document representative adequate for use in an automatic retrieval system
 The document representative consists of a list of class names, each name representing a class of
words occurring in the total input text.
 A document will be indexed by a name if one of its significant words occurs as a member of
that class.

The logical view of the document is provided by representative keywords or index terms, which historically
were used to represent documents in a collection. On modern computers, retrieval
systems can adopt a full-text logical view of the document. However, with very large collections, the set of


representative keywords may have to be reduced. This process of reduction or compression of the set
of representative keywords is called text operations (or transformation).
Logical view of a document in text preprocessing

Goals of Text Operations

 Write the main goals of Text operations.

 Improve the quality of answer set (recall-precision figures)


 Reduce the space and search time
Document Preprocessing
 Lexical analysis of the text
 Elimination of stop words
 Stemming the remaining words
 Selection of index terms
 Construction of term categorization structures like thesauri and word/document clustering
1. Lexical Analysis/Tokenization of Text
Lexical analysis or Tokenization is a fundamental operation in both query processing and automatic
indexing. It is the process of converting an input stream of characters into a stream of words or
tokens. Tokens are groups of characters with collective significance. In other words, it is one of the
steps used to convert the text of the documents into the sequence of words, w1, w2 … wn to be
adopted as index terms. It is the process of demarcating and possibly classifying sections of a string
of input characters into words. Generally, Lexical analysis is the first stage of automatic indexing,
and of query processing. Automatic indexing is the process of algorithmically examining


information items to generate lists of index terms. The lexical analysis phase produces candidate
index terms that may be further processed and eventually added to indexes. Query processing is the
activity of analyzing a query and comparing it to indexes to find relevant items. Lexical analysis of
a query produces tokens that are parsed and turned into an internal representation suitable for
comparison with indexes.
Issues in Tokenization

 Write the main issues of Tokenization.

 The main objective of tokenization is the identification of words in the text document.
 Tokenization is greatly dependent on how the concept of a word is defined.
 The first decision that must be made in designing a lexical analyzer for an automatic indexing
system is: what counts as a word or token in the indexing scheme?
 Is it a sequence of characters, numbers, or alphanumeric ones? A word is a sequence of
letters terminated by a separator (period, comma, space, etc.).
 The definition of letter and separator is flexible; e.g., a hyphen could be defined as a letter or as
a separator. Usually, common words (such as "a", "the", "of", …) are ignored.
 The standard tokenization approach is single-word tokenization, where the input is split into words
using white-space characters as delimiters, ignoring characters other than words.
 This approach introduces errors at an early stage because it ignores multi-word units, numbers,
hyphens, punctuation marks, and apostrophes.
Numbers/Digits
 Most numbers are usually not good index terms – without a surrounding context, they are
inherently vague
 The preliminary approach is to remove all words containing sequences of digits unless specified
otherwise
 The advanced approach is to perform date and number normalization to unify formats
Hyphens
 Breaking up hyphenated words seems to be useful
 But some words include hyphens as an integrated part, e.g. anti-virus, anti-war, …


 Adopt a general rule to process hyphens and specify the possible exceptions

Punctuation marks
 Removed entirely in the process of lexical analysis
 But some are an integral part of a word
• e.g. 510B.C.
The case of letters
 Not important for the identification of index terms
 Convert all the text to either lower or upper case
 But parts of the semantics may be lost due to case conversion

 One word or multiple: how do you decide whether it is one token or two or more?
 Hewlett-Packard  Hewlett and Packard as two tokens?
 State-of-the-art: break up the hyphenated sequence.
 San Francisco, Los Angeles
 lowercase, lower-case, lower case?
 data base, database, data-base
 How to handle special cases involving apostrophes, hyphens, etc.? C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a
meaningful part of a token.
 However, frequently they are not.
The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens.
Generally, numbers are not indexed as text, but they are often very useful; systems will often index "metadata", including
creation date, format, etc., separately.
 Issues of tokenization are language specific
 Requires the language to be known
The application area for which the IR system is developed also dictates the nature of valid
tokens.
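A minimal Python sketch of the "simplest approach" described above (case-insensitive, unbroken alphabetic strings as tokens); a production tokenizer would add language-specific rules for hyphens, numbers and apostrophes:

# Minimal sketch of the "simplest approach": case-insensitive, unbroken
# alphabetic strings as tokens; hyphens, digits and punctuation act as separators.
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("State-of-the-art anti-virus software, since 510B.C.!"))
# ['state', 'of', 'the', 'art', 'anti', 'virus', 'software', 'since', 'b', 'c']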


2. Normalization
• Normalization means canonicalizing tokens so that matches occur despite superficial differences in the
character sequences of the tokens
• We need to "normalize" terms in the indexed text, as well as query terms, into the same form
• Example: we want to match U.S.A. and USA, by deleting periods in a term
 Case folding: it is often best to lowercase everything, since users will use lowercase regardless of
'correct' capitalization…
 Republican vs. republican
 Fasil vs. fasil vs. FASIL
 Anti-discriminatory vs. antidiscriminatory
 Car vs. Automobile?
Normalization issues
 Good for:
 Allow instances of Automobile at the beginning of a sentence to match with a query of
automobile
 Helps a search engine when most users type ferrari while they are interested in a
Ferrari car
 Not advisable for:
 Proper names vs. common nouns
 E.g. General Motors, Associated Press, Kebede…
 Solution:
 lowercase only words at the beginning of the sentence
 In IR, lowercasing is most practical because of the way users issue their queries
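A minimal normalization sketch along these lines (period deletion so that U.S.A. matches USA, plus case folding); purely illustrative:

# Minimal sketch: canonicalize tokens so that superficially different
# forms (U.S.A. vs USA, Republican vs republican) match at search time.
def normalize(token):
    token = token.replace(".", "")   # U.S.A. -> USA
    return token.lower()             # case folding

for t in ["U.S.A.", "USA", "Republican", "republican", "FASIL"]:
    print(t, "->", normalize(t))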
3. Elimination of Stop words
Stop words
 Words which are too frequent among the documents in the collection are not good discriminators
 A word occurring in 80% of the documents in the collection is useless for purposes of retrieval,
e.g. articles, prepositions, conjunctions, …
 Filtering out stop words achieves a compression of about 40% of the size of the indexing structure
 The extreme approach: some verbs, adverbs, and adjectives could also be treated as stop words
The stop word list usually contains hundreds of words.


Stop words are extremely common words across document collections that have no discriminatory
power
 They may occur in 80% of the documents in a collection.
 They would appear to be of little value in helping select documents matching a user need
and need to be filtered out from the index list
Examples of stop words:
 articles (a, an, the);
 pronouns:(I, he, she, it, their, his)
 prepositions (on, of, in, about, besides, against),
 conjunctions/ connectors (and, but, for, nor, or, so, yet),
 verbs (is, are, was, were),
 adverbs (here, there, out, because, soon, after) and
 adjectives (all, any, each, every, few, many, some)
Stop words are language dependent.
Intuition:
Stop words have little semantic content; it is typical to remove such high-frequency words.
Stop words take up about 50% of the text; hence, removing them reduces document size by 30-50%.
 Smaller indices for information retrieval
 Good compression techniques for indices: the 30 most common words account for
about 30% of the tokens in written text
 Better approximation of importance for classification, summarization, etc.
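A minimal sketch of stop-word elimination with a tiny hand-made stop list (real systems use lists of several hundred words):

# Minimal sketch: filter out high-frequency, low-content words before indexing.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are",
              "was", "were", "he", "she", "it", "i"}   # tiny illustrative list

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "house", "of", "the", "lord"]))
# ['house', 'lord']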
Term conflation
One of the problems involved in the use of free text for indexing and retrieval is the variation in word
forms that is likely to be encountered.
The most common types of variation are:
 spelling errors (father, fathor)
 alternative spellings, i.e. locality or national usage (color vs colour, labor vs labour)
 multi-word concepts (database, data base)
 affixes (dependent, independent, dependently)
 abbreviations (i.e., that is).
Function of conflation in terms of IR


Reducing the total number of distinct terms leads to a consequent reduction of dictionary size and of updating
problems
 Fewer terms to worry about
Bringing similar words, having similar meanings, to a common form, with the aim of increasing retrieval
effectiveness
 More accurate statistics
4. Stemming/Morphological analysis

 What is stemming? Write the main difference between stemming and lemmatization.

Stemming reduces tokens to their root form in order to recognize morphological variation.
The process involves the removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to
the same stem.
Stemming is a process that stems or removes the last few characters from a word, often leading to incorrect
meanings and spelling. Lemmatization considers the context and converts the word to its meaningful
base form, which is called the lemma.
Lemmatization is a text pre-processing technique used in natural language processing (NLP) models to
break a word down to its root meaning to identify similarities. For example, a lemmatization algorithm
would reduce the word better to its root word, or lemma, good. Lemmatization usually refers to doing
things properly with the use of a vocabulary and morphological analysis of words, normally aiming to
remove inflectional endings only and to return the base or dictionary form of a word, which is known
as the lemma.
The process of lemmatization seeks to get rid of inflectional suffixes and prefixes for the purpose of
bringing out the word’s dictionary form.

Some points to remember:


 It is slower than stemming
 Its accuracy is higher than that of stemming


 It is a dictionary-based approach

 It is preferred when retaining the meaning of the sentence matters
 It depends heavily on the POS tag for finding the root word or lemma
What is the process of stemming?
Stemming is a text-preprocessing technique. It is used for removing affixes from words to
convert them into their root/base form.
Stemming is the process of reducing a word to its stem by stripping its suffixes and prefixes, down to the
root form of the word. Stemming is important in natural language understanding (NLU)
and natural language processing (NLP).

 Write the main difference between inflectional & derivational morphology of a word with
their examples.
Often removes inflectional & derivational morphology of a word
i. Inflectional morphology: vary the form of words in order to express grammatical features, such
as singular/plural or past/present tense. E.g. Boy → boys, cut → cutting.
ii. Derivational morphology: makes new words from old ones. E.g. creation is formed from create,
but they are two separate words. And also, destruction → destroy
Stemming is language dependent.
Correct stemming is language specific and can be complex.
For example, compressed and compression are both accepted as equivalent to compress.
The final output from a conflation algorithm is a set of classes, one for each stem detected.
A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or
suffixes). Example: 'connect' is the stem for {connected, connecting, connection, connections}
Thus, [automate, automatic, automation] all reduce to  automat
A stem is used as an index term/keyword for document representation.
Queries: queries are handled in the same way.
Ways to implement stemming

 Write the ways to implement stemming.

There are basically two ways to implement stemming.


 The first approach is to create a big dictionary that maps words to their stems.


 The advantage of this approach is that it works perfectly (insofar as the stem of a word can be
defined perfectly); the disadvantages are the space required by the dictionary and the investment
required to maintain the dictionary as new words appear.
 The second approach is to use a set of rules that extract stems from words.
 The advantages of this approach are that the code is typically small, and it can gracefully handle
new words; the disadvantage is that it occasionally makes mistakes.
 But since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the
rule-based approach is the one that is generally chosen.
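As a toy illustration of the rule-based approach, the sketch below strips a few ad-hoc suffixes; a real stemmer such as the Porter stemmer applies many more rules with extra conditions:

# Toy rule-based stemmer: strip a few common suffixes.
# Real stemmers (e.g. the Porter stemmer) use many more rules and conditions.
SUFFIXES = ["ations", "ation", "ional", "ions", "ion", "ing", "ed", "es", "s"]

def stem(word):
    for suffix in SUFFIXES:
        # keep at least 3 characters of the stem to avoid over-stemming short words
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections", "cuts"]:
    print(w, "->", stem(w))   # the first four all reduce to "connect"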
Typical errors of stemming
 Write the types of stemming errors with their examples.
There are three major types of errors you would find in stemming:
 Over-stemming
 Under-stemming
 Miss-stemming
Over-stemming: occurs when too much of a word is removed (the error of taking off too much).
Examples:
croûtons  croût (since croûtons is the plural of croûton)
'wander' → 'wand'
'news' → 'new'
'universal', 'universe', 'universities', and 'university' → 'univers'
Under-stemming: occurs when too little is removed, so that words which share a root are not conflated to the same stem
(the error of taking off too small a suffix).
Examples:
croulons  croulon (since croulons is a form of the verb crouler)
'knavish' → 'knavish'
'data' → 'dat'
'datum' → 'datu'
Miss-stemming: taking off what looks like an ending, but is really part of the stem.
Example: reply  rep


NOTE: both data and datum have the same root, yet under-stemming leaves them as two separate stems.
Criteria for judging stemmers
Correctness:
Over-stemming: too much of a term is removed.
It can cause unrelated terms to be conflated  retrieval of non-relevant documents.
Under-stemming: too little of a term is removed.
It prevents related terms from being conflated  relevant documents may not be retrieved.

Thesauri
 What is Thesaurus? How do you differentiate Thesaurus from dictionary?

Mostly full-text searching cannot be accurate, since different authors may select different words to
represent the same concept
Problem: The same meaning can be expressed using different terms that are synonyms, homonyms,
and related terms
How can it be achieved such that for the same meaning the identical terms are used in the index
and the query?
Thesaurus: The vocabulary of a controlled indexing language, formally organized so that a priori
relationships between concepts (for example as "broader" and “related") are made explicit.
A thesaurus contains terms and relationships between terms
IR thesauri typically rely upon the use of symbols such as USE/UF (UF = used for), BT (broader term), and RT (related term) to
demonstrate inter-term relationships.
e.g., car = automobile, truck, bus, taxi, motor vehicle; color = colour, paint
Aim of Thesaurus


Thesaurus tries to control the use of the vocabulary by showing a set of related words to handle
synonyms and homonyms
The aims of a thesaurus are therefore:
 to provide a standard vocabulary for indexing and searching
 the thesaurus is rewritten to form equivalence classes, and we index such equivalences
 when the document contains automobile, we index it under car as well (usually, also vice versa)
 to assist users with locating terms for proper query formulation: when the query contains
automobile, look under car as well for expanding the query
 to provide classified hierarchies that allow the broadening and narrowing of the current request
according to user needs
Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and vehicles:
Term: Motor vehicles
UF: Automobiles, Cars, Trucks
BT: Vehicles
RT: Road Engineering, Road Transport
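A minimal sketch of how such USE/UF entries could drive query expansion; the tiny THESAURUS dictionary below is purely illustrative:

# Minimal sketch: expand a query term with its thesaurus equivalents (UF terms),
# so that a query for "automobile" also looks under "motor vehicle", "car", etc.
THESAURUS = {
    "motor vehicle": {"UF": ["automobile", "car", "truck"],
                      "BT": ["vehicle"],
                      "RT": ["road engineering", "road transport"]},
}

def expand(term):
    terms = {term}
    for head, rels in THESAURUS.items():
        if term == head or term in rels["UF"]:
            terms.add(head)
            terms.update(rels["UF"])
    return terms

print(expand("automobile"))   # {'automobile', 'motor vehicle', 'car', 'truck'}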


Chapter Three: Indexing Structure


Indexing: Basic Concepts
The usual unit for indexing is the word
Index terms - are used to look up records in a file.

 What is Indexing?

Indexing is an arrangement of index terms that permits fast searching while saving reading/memory space
 It is used to speed up access to desired information from the document collection as per the user's query
 It enhances efficiency in terms of retrieval time: relevant documents are searched and
retrieved quickly
An index file usually has its index terms in sorted order. Which list is easier to search?

Index files are much smaller than the original file. Remember Heaps' Law: for a 1 GB text collection
the vocabulary might have a size of close to 5 MB. This size may be further reduced by text operations.
Indexing: Basic Concepts
A simple alternative is to search the whole text sequentially (online search)
Another option is to build data structures over the text (called indices) to speed up the search
Major Steps in Index Construction
Source file: a collection of text documents.
A document is a collection of words/terms and other informational elements.
Tokenize: identify the words in a document, so that each document is represented by a list of keywords or
attributes.
Index term selection: apply text operations or preprocessing:
Stop word removal: words with high frequency are non-content-bearing and need to be removed
from the text collection.
Stemming: reduce words with similar meaning to their stem/root word.


Term weighting: different index terms have varying relevance when used to describe document
contents. This effect is captured through the assignment of numerical weights to each index term of
a document. There are different index term weighting methods, including TF, IDF, TF*IDF, … (a small weighting sketch follows this list).
Indexing structure: the set of index terms (the vocabulary) is organized in an
index file to easily identify the documents in which each term occurs.
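The weighting sketch referred to above: a minimal TF*IDF computation over a toy collection, using raw term frequency and idf = log(N/df), which is one common variant among several:

# Minimal sketch: TF*IDF weights for a toy collection.
# tf = raw count of the term in the document, idf = log(N / df).
import math
from collections import Counter

docs = {
    "d1": "the artificial heart implantation".split(),
    "d2": "the artificial process of intelligent system".split(),
    "d3": "the field of artificial intelligence".split(),
}

N = len(docs)
df = Counter()                     # document frequency of each term
for tokens in docs.values():
    df.update(set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term)
    return tf * math.log(N / df[term])

# "artificial" occurs in every document, so its idf (and weight) is 0;
# "intelligence" occurs only in d3, so it gets a non-zero weight there.
for doc_id, tokens in docs.items():
    print(doc_id,
          round(tf_idf("artificial", tokens), 3),
          round(tf_idf("intelligence", tokens), 3))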
Basic Indexing Process

Index file Evaluation Metrics


Running time of the main operations
Access/search time:
How long does it take to find the required search key in the list?
Update time (insertion time, deletion time):
How long does it take to update existing records in order to:
o add new terms,
o delete existing unnecessary terms, or
o change the term weight of an existing term?
Does the index structure allow incremental update, or does it require re-indexing?
Space overhead:
The computer storage space consumed for keeping the list.
Building Index file
An index file of a document collection is a file consisting of a list of index terms and, for each term, a link to one or more
documents that contain that index term.


An index file is a list of search terms that are organized for associative look-up, i.e., to answer the user's
query:
 In which documents does a specific search term appear?
 Where, within each document, does each term appear? (There may be several occurrences of a
term in a document.)
For organizing an index file for a collection of documents, there are various options available:
 Decide what data structure and/or file structure to use.
 Is it a sequential file, an inverted file, a suffix tree, etc.?
Sequential File
 The sequential file is the most primitive file structure.
 It has neither a vocabulary (unique list of words) nor linking pointers.
 The records are generally arranged serially, one after another, but in lexicographic order
on the value of some key field.
 A particular attribute is chosen as the primary key, whose value determines the order of
the records.
 When the first key fails to discriminate among records, a second key is chosen to give an
order.

Example: Given a collection of documents, they are parsed to extract words and these are saved with
the Document ID.


Sequential File
To access records we have to search serially: starting at the first record, we read and investigate all the
succeeding records until the required record is found or the end of the file is reached.
Its main advantages:


 easy to implement;
 provides fast access to the next record using lexicographic order;
 can be searched quickly, using binary search, in O(log n) time.
Question: "What is the update option?"
Does the index need to be rebuilt, or is incremental update supported?
Its disadvantages:
No weights are attached to terms, and individual words are treated independently.
Random access is slow: since similar terms are indexed individually, we need to find all terms that
match the query.
Inverted file
A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having
links to the documents containing it. Building and maintaining an inverted index has a relatively low
cost and low risk. On a text of n words an inverted index can be built in O(n) time. The list is "inverted"
from a list of terms in location order to a list of terms in alphabetical order.

Inverted file
 Write some notes on Term frequency, Document frequency, Collection frequency, vocabulary
files and posting files

Data to be held in the inverted file includes:

The vocabulary (list of terms): the set of all distinct words (index terms) in the text collection.
Having information about the vocabulary (list of terms) speeds up searching for relevant documents.
• For each term, the file contains information related to:
i. Location: all the text locations/positions where the word occurs
ii. Frequency of occurrence of the term in the document collection:
TFij, the number of occurrences of term tj in document di
DFj, the number of documents containing tj


CFj, the total frequency of tj in the corpus


mi, the maximum frequency of any term in di
N, the total number of documents in the collection
Inverted file
Having information about the location of each term within the document helps with:
 user interface design: highlighting the location of search terms
 proximity-based ranking: adjacency and NEAR operators (in Boolean searching)
E.g.: which one is more relevant for the query 'artificial intelligence'?
D1: the idea in artificial heart implantation is the intelligence in the field of science.
D2: the artificial process of intelligent system
D3: the field of artificial intelligence is multi disciplinary.
Having information about frequency is used for calculating term weights (like TF, TF*IDF, …) and
optimizing query processing.

• Is it possible to keep all this information during searching?


Postings File (Inverted List)

For each distinct term in the vocabulary, the postings file stores a list of pointers to the documents that contain that term.
Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document. A
separate inverted list is stored for each term in the index
file, and each list consists of one or many individual postings.
Advantage of dividing the inverted file:
Keeping a pointer in the vocabulary to the corresponding list in the posting file allows the vocabulary to be kept in
memory at search time, even for a large text collection, while the posting file is kept on disk for accessing
the documents to be given to the user.
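A minimal sketch of building such a vocabulary and postings file in Python, using the example documents D1-D3 from above (term positions are kept so that tf and proximity information remain available):

# Minimal sketch: build an inverted index.
# vocabulary maps term -> document frequency (df);
# postings maps term -> {doc_id: [positions]}, and tf is len(positions).
from collections import defaultdict

docs = {
    1: "the idea in artificial heart implantation is the intelligence in the field of science",
    2: "the artificial process of intelligent system",
    3: "the field of artificial intelligence is multi disciplinary",
}

postings = defaultdict(dict)
for doc_id, text in docs.items():
    for position, term in enumerate(text.split()):
        postings[term].setdefault(doc_id, []).append(position)

vocabulary = {term: len(plist) for term, plist in postings.items()}  # df per term

print(vocabulary["artificial"])   # 3 (the term appears in all three documents)
print(postings["intelligence"])   # {1: [8], 3: [4]}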
General structure of Inverted File
The following figure shows the general structure of an inverted index file.


Sorting the Vocabulary


After all documents have been tokenized, the inverted file is sorted by terms.




Searching on Inverted File

Since the whole index file is divided into two parts, searching can be done faster by loading the vocabulary list,
which takes little memory even for a large document collection.
 Using binary search, searching the vocabulary takes logarithmic time
 The search is performed on the vocabulary list
 Updating an inverted file is very complex.
Suffix trees
A suffix tree treats the text as one long string; there are no words.
It helps to handle:
 Complex queries
 Compacted trie structures
 String indexing
 The exact set matching problem
 The longest common substring
 Frequent substrings
Problem: space
If the query does not need exact matching on words, then the suffix tree can be the solution.
What are suffix arrays and trees?
They are text indexing data structures that are not word based; they allow searching for patterns or computing statistics.
Important properties:
 Size
 Speed of exact matching
 Space required for construction
 Time required for construction


Suffix tree (Trie)


What is a suffix? A suffix is a substring that exists at the end of a given string.
Each position in the text is considered as a text suffix.
If txt = t1 t2 ... ti ... tn is a string, then Ti = ti ti+1 ... tn is the suffix of txt that starts at
position i, where 1 ≤ i ≤ n.
Example: txt = mississippi
T1 = mississippi;
T2 = ississippi;
T3 = ssissippi;
T4 = sissippi;
T5 = issippi;
T6 = ssippi;
T7 = sippi;
T8 = ippi;
T9 = ppi;
T10 = pi;
T11 = i;
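A minimal sketch that enumerates the suffixes Ti of any string in the same way (it can also be used to answer the exercise below for "technology"):

# Minimal sketch: enumerate all suffixes T_i of a text, i = 1..n.
def suffixes(text):
    return [text[i:] for i in range(len(text))]

for i, s in enumerate(suffixes("mississippi"), start=1):
    print(f"T{i} = {s}")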


Exercise: generate the suffixes of "technology".


• A suffix TRIE is an ordinary trie in which the input strings are all possible suffixes.
• Principle: the idea behind the suffix TRIE is to assign to each symbol in a text an index
corresponding to its position in the text (i.e. the first symbol has index 1, the last symbol has index n,
the number of symbols in the text).
• To build the suffix TRIE we use these indices instead of the actual objects.
• The structure has several advantages:
– It requires less storage space.
– We do not have to worry about how the text is represented (binary, ASCII, etc.).
– We do not have to store the same object twice (no duplicates).

Suffix Tree
A suffix tree is an extension of the suffix trie that constructs a trie of all the proper suffixes of S.
The suffix tree is created by compacting the unary nodes of the suffix TRIE.


We store pointers rather than words in the leaves.


It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end
indices of the string, i.e. (3,7) for OGOL$, (1,2) for GO, (7,7) for $.

Search in suffix tree


Searching for all instances of a substring S in a suffix tree is easy, since every substring of the text is
the prefix of some suffix of the text.
Pseudo-code for searching in a suffix tree:
 Start at the root.
 Go down the tree, each time taking the edge whose label matches the next characters of S.
 If S corresponds to a node x, then return all leaves in the subtree rooted at x.
The places where S can be found are given by the pointers stored in all the leaves of the subtree rooted
at x. If a NIL pointer is encountered before the end of S is reached, then S is not in the tree.
• Example: If S = "GO" we take the GO path and return:
GOOGOL$, GOL$.
If S = "OR" we take the O path and then we hit a NIL pointer so "OR" is not in the tree.
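A minimal sketch (Python) of this idea, using an uncompacted suffix trie built from the example text
GOOGOL$; each leaf stores the starting position of its suffix. A real suffix tree would compact unary
nodes and store (a, b) index pairs on the edges, as described above.

def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:              # insert the suffix starting at position i (0-based)
            node = node.setdefault(ch, {})
        node.setdefault("$positions", []).append(i + 1)   # record the 1-based start position
    return root

def collect_positions(node):
    # Gather the start positions stored in all leaves below this node.
    positions = list(node.get("$positions", []))
    for ch, child in node.items():
        if ch != "$positions":
            positions.extend(collect_positions(child))
    return sorted(positions)

def search(root, pattern):
    node = root
    for ch in pattern:                   # walk down the trie character by character
        if ch not in node:
            return []                    # "NIL pointer": the pattern is not in the text
        node = node[ch]
    return collect_positions(node)       # all suffixes having the pattern as a prefix

trie = build_suffix_trie("GOOGOL$")
print(search(trie, "GO"))   # -> [1, 4]  (GOOGOL$ and GOL$ start at these positions)
print(search(trie, "OR"))   # -> []      (not in the text)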
Chapter Four: IR models and Retrieval Evaluation


IR Models - Basic Concepts
Word evidence: Bag of words
IR systems usually adopt index terms to index and retrieve documents. Each document is represented
by a set of representative keywords or index terms (called Bag of Words)
An index term is a word useful for remembering the document's main themes.
Not all terms are equally useful for representing the document contents:
• less frequent terms allow identifying a narrower set of documents.
However, no ordering information is attached to the bag of words identified from the document collection.
One central problem regarding IR systems is the issue of predicting which documents are relevant and
which are not. Such a decision usually depends on a ranking algorithm which attempts to
establish a simple ordering of the documents retrieved.
Documents appearing at the top of this ordering are considered to be more likely to be relevant.
Thus ranking algorithms are at the core of IR systems
The IR models determine the predictions of what is relevant and what is not, based on the notion of
relevance implemented by the system.
After preprocessing, N distinct terms remain; these unique terms form the VOCABULARY.
Let
ki be an index term i and dj be a document j
K = (k1, k2, …, kN) is the set of all index terms
• Each term i in a document or query j is given a real-valued weight wij.
– wij is the weight associated with the pair (ki, dj). If wij = 0, the term does not belong to document dj.
• The weight wij quantifies the importance of the index term for describing the document contents.
Vec(dj) = (w1j, w2j, …, wNj) is the term-weighted vector associated with the document dj.
Mapping Documents & Queries
Represent both documents and queries as N-dimensional vectors in a term-document matrix, which
records term occurrences; e.g., an entry in the matrix corresponds to the “weight” of a term in the document.
Weighting Terms in Vector Space


The importance of the index terms is represented by weights associated with them.
Problem: what weight should we assign to show the importance of an index term for describing the
document/query contents?
Solution 1: binary weights; a weight is 1 if the term is present and 0 otherwise. Similarity is then the
number of terms in common between the document and the query.
Problem: not all terms are equally interesting, e.g., “the” vs. “dog” vs. “cat”.
Solution 2: replace binary weights with non-binary weights:
dj = (w1,j, w2,j, ..., wN,j);  qk = (w1,k, w2,k, ..., wN,k)
Types of IR Model
Information Retrieval (IR) models are commonly classified as follows:
Classical IR Models
 Write the three classical models of IR.

These are the simplest and easiest-to-implement IR models. They are based on mathematical concepts
that are easily recognized and understood.
Following are the examples of classical IR models:
 Boolean models,

 Vector models,

 Probabilistic models.

The Boolean Model


The Boolean model is a simple retrieval model based on set theory.
The Boolean model imposes a binary criterion for deciding relevance.
Terms are either present or absent; thus, wij ∈ {0, 1}.
sim(q, dj) = 1 if the document satisfies the Boolean query, and 0 otherwise.

The Boolean Model: Example


Generate the relevant documents retrieved by the Boolean model for the query:
q = k1 ˄ (k2 ˅ ¬k3)

The Boolean Model: Example


Given the following determine documents retrieved by the Boolean model based IR system
Index Terms: K1, …, K8.
Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
Query: K1˄ (K2 ˅ ¬K3)
Answer: {D1, D2, D4, D6} ˄ ({D1, D2, D3, D6} ˅{D3, D5}) = {D1, D2, D6}
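The same evaluation can be reproduced with ordinary set operations; a small sketch (Python) using the
collection of this example:

# Document collection from the example above.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def docs_with(term):
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)

# Query: K1 AND (K2 OR NOT K3)
result = docs_with("K1") & (docs_with("K2") | (all_docs - docs_with("K3")))
print(sorted(result))   # -> ['D1', 'D2', 'D6']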
The Boolean Model: Further Example


Given the following three documents, Construct Term– document matrix and find the relevant
documents retrieved by the Boolean model for given query
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query: “gold silver truck”
The table below shows the document–term (ti) matrix.

• Also find the relevant documents for the queries:


(a) “gold delivery”;
(b) ship gold;
(c) “silver truck”
Exercise: - Given the following three documents with the following contents:
D1 =“computer information retrieval”
D2 =“computer retrieval”
D3 =“information”
D4 =“computer information”
What are the relevant documents retrieved for the queries:
Q1 = “information ˄ retrieval”
Q2 = “information ˄ ¬computer”
Computer: {D1, D2, D4} ¬ computer: {D3}
Information: {D1, D3, D4}


Retrieval: {D1, D2}
Q1 = (information ˄ retrieval) = {D1, D3, D4} ˄ {D1, D2} = {D1}
Q2 = (information ˄ ¬computer) = {D1, D3, D4} ˄ {D3} = {D3}

Exercise: What are the relevant documents retrieved for the query: ((Caesar OR milton) AND (swift
OR shakespeare))
Drawbacks of the Boolean Model

 Retrieval based on binary decision criteria with no notion of partial matching


 No ranking of the documents is provided (absence of a grading scale)
 Information need has to be translated into a Boolean expression which most users find awkward
 The Boolean queries formulated by the users are most often too simplistic
 As a consequence, the Boolean model frequently returns either too few or too many documents
in response to a user query
Advantages of the Boolean Model
Following are the advantages of the Boolean model:
 It is the simplest model based on sets.
 It is easy to understand and implement.
 It only retrieves exact matches.
 It gives the user, a sense of control over the system.
Disadvantages of the Boolean Model
Following are the disadvantages of the Boolean model:
 The model’s similarity function is Boolean. Hence, there would be no partial matches. This can
be annoying for the users.
 In this model, the Boolean operator usage has much more influence than a critical word.
 The query language is expressive, but it is complicated too.


 There is no ranking for retrieved documents by the model.
Vector-Space Model
This is the most commonly used strategy for measuring relevance of documents for a given query. This
is because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial matches
 The term weights are used to compute a degree of similarity between a query and each
document
 Ranked set of documents provides for better matching
The idea behind VSM is that the meaning of a document is conveyed by the words used in that
document and the weight it carries.
To find relevant documents for a given query:
First, Documents and queries are mapped into term vector space.
 Note that queries are treated as short documents.
 A short document means one with only a few words.
Second, in the vector space, queries and documents are represented as weighted vectors
There are different weighting technique; the most widely used one is computing tf*idf for each term
Third, similarity measurement is used to rank documents by the closeness of their vectors to the query.
Documents are ranked by closeness to the query. Closeness is determined by a similarity score
calculation.
Term-document matrix.
A collection of n documents and query can be represented in the vector space model by a term-
document matrix.
An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has
no significance in the document or it simply doesn’t exist in the document. Otherwise, wij > 0 whenever
ki ∈ dj.
Computing weights
How to compute weight for term i in document j (wij ) and weight for term i in query q (wiq)?
A good weight must take into account two effects:
Quantification of intra-document contents (similarity)
tf factor, the term frequency within a document
Quantification of inter-documents separation (dissimilarity)
idf factor, the inverse document frequency across documents
As a result of which most IR systems are using tf*idf weighting technique: wij = tf(i,j) * idf(i)
Let: N be the total number of documents in the collection,
ni be the number of documents which contain ki,
freq(i,j) be the raw frequency of ki within dj.
A normalized tf factor is given by f(i,j) = freq(i,j) / max(freq(l,j)), where the maximum is computed
over all terms l which occur within the document dj.
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every
document is different in length, it is possible that a term would appear much more times in long
documents than shorter ones. Thus, the term frequency is often divided by the document length (aka.
the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the
document).
The idf factor is computed as idf(i) = log(N/ni); the log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of information associated with the term ki.
The best term-weighting schemes use tf*idf weights which are given by
wij = tf(i,j) * log(N/ni)
• For the query term weights, a suggestion is


wiq = (0.5 + 0.5 * freq(i,q) / max(freq(l,q))) * log(N/ni)
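A minimal sketch (Python) of the document-weighting formula above, using the max-normalized tf and the
log idf just defined; the term names, document frequencies, and the base-10 logarithm here are
illustrative choices only, not values from the module's collection.

import math

def tf_idf_weights(doc_term_freqs, N, doc_freq):
    # doc_term_freqs: raw term frequencies freq(i,j) for one document dj
    # N: total number of documents in the collection
    # doc_freq: ni, the number of documents containing each term ki
    max_freq = max(doc_term_freqs.values())
    weights = {}
    for term, freq in doc_term_freqs.items():
        tf = freq / max_freq                      # f(i,j) = freq(i,j) / max(freq(l,j))
        idf = math.log10(N / doc_freq[term])      # idf(i) = log(N / ni)
        weights[term] = tf * idf                  # wij = f(i,j) * idf(i)
    return weights

# Illustrative numbers only.
print(tf_idf_weights({"gold": 2, "truck": 1}, N=1000, doc_freq={"gold": 50, "truck": 200}))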
The vector space model with tf*idf weights is a good ranking strategy with general collections
The vector space model is usually as good as the known ranking alternatives. It is also simple and fast
to compute. IDF: Inverse Document Frequency, which measures how important a term is. IDF measures
how rare a term is in collection. The IDF is a measure of the general importance of the term.
Invert the document frequency: - It diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely.
 Gives full weight to terms that occur in one document only.
 Gives lowest weight to terms that occur in all documents.
 Terms that appear in many different documents are less indicative of overall topic.
 IDF provides high values for rare words and low values for common words.
 IDF is an indication of a term’s discrimination power.
When does TF*IDF register a high weight?
o when a term t occurs many times within a small number of documents
o Highest tf*idf for a term shows a term has a high term frequency (in the given document) and
a low document frequency (in the whole collection of documents);
o the weights hence tend to filter out common terms thus giving high discriminating power to
those documents
Lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many
documents
Lowest TF*IDF is registered when the term occurs in virtually all documents
While computing TF, all terms are considered equally important. However it is known that certain
terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need
to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF (t) = log (Total number of documents / Number of documents with term t in it).
Example: Computing weights
o A collection includes 10,000 documents
o The term A appears 20 times in a particular document
o The maximum appearance of any term in this document is 50
o The term A appears in 2,000 of the collection documents.
o Compute TF*IDF weight?


o f(i,j) = freq(i,j)/max(freq(l,j)) = 20/50 = 0.4
o idf(i) = log(N/ni) = log (10,000/2,000) = log(5) = 0.699
o wij = f(i,j) * log(N/ni) = 0.4 * 0.699 = 0.2796
o Assume collection contains 10,000 documents and statistical analysis shows that document
frequencies (DF) of three terms are: A (50), B (1300), C (250). And also term frequencies (TF)
of these terms are: A (3), B (2), C (1). Compute TF*IDF for each term?
o A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
o B: tf = 2/3=0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
o C: tf = 1/3=0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774
Another example
Imagine the term t appears 20 times in a document that contains a total of 100 words. The Term
Frequency (TF) of t can be calculated as follow:
TF=20/100=0.2
Assume a collection of related documents containing 10,000 documents. If 100 documents out of
10,000 documents contain the term t, the Inverse Document Frequency (IDF) of t can be calculated as
follows
IDF=log (10000/100)=2
Using these two quantities, we can calculate the TF-IDF score of the term t for the document.
TF-IDF=0.2×2=0.4
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency
(i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the
word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated
as log(10,000,000 / 1,000) = 4.
Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
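These computations can be checked directly; a quick sketch (Python) reproducing the last two worked
examples with a base-10 logarithm:

import math

# Term appears 20 times in a 100-word document; 100 of 10,000 documents contain it.
tf = 20 / 100
idf = math.log10(10000 / 100)
print(round(tf * idf, 2))        # -> 0.4

# "cat" appears 3 times in a 100-word document; 1,000 of 10,000,000 documents contain it.
tf_cat = 3 / 100
idf_cat = math.log10(10000000 / 1000)
print(round(tf_cat * idf_cat, 2))   # -> 0.12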
Vector-Space Model: Example


Suppose we issue the query Q: “gold silver truck”. The collection consists of the following three
documents.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Assume that all terms are used, including common terms and stop words, and that no terms are reduced
to root forms. Show the retrieval results in ranked order.
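The worked solution is not reproduced here, but the computation can be sketched as follows (Python).
This sketch uses raw term counts as tf and a base-10 log idf, which is one reasonable choice among the
weighting variants discussed above, and ranks the documents by cosine similarity to the query; the exact
scores depend on the weighting scheme chosen.

import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}
vocab = sorted(set(query.split()).union(*[set(toks) for toks in tokenized.values()]))
df = {t: sum(1 for toks in tokenized.values() if t in toks) for t in vocab}

def weights(tokens):
    counts = Counter(tokens)
    # Raw tf times log10 idf; terms that occur in every document get weight 0.
    return [counts[t] * math.log10(N / df[t]) if df[t] else 0.0 for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = weights(query.split())
ranking = sorted(((cosine(q_vec, weights(toks)), d) for d, toks in tokenized.items()), reverse=True)
for score, d in ranking:
    print(d, round(score, 3))   # documents printed in decreasing order of similarity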
Advantages:
 term-weighting improves quality of the answer set since it displays in ranked order
 partial matching allows retrieval of documents that approximate the query conditions
 cosine ranking formula sorts documents according to degree of similarity to the query
Disadvantages: it assumes independence of index terms.
More Examples
Suppose the database collection consists of the following documents.


c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction”
Why System Evaluation?
It provides the ability to measure the difference between IR systems
 How well do our search engines work?
 Is system A better than system B?
 Under what conditions?
Evaluation drives what to research or improve
Identify techniques that work and do not work
There are many retrieval models/algorithms/systems; which one is the best?
What is the best choice for each component:
 Similarity measure (dot-product, cosine, …)
 Index term selection (stop-word removal, stemming, …)
 Term weighting (TF, TF*IDF, …)
Types of Evaluation Strategies
System-centered evaluation
Given documents, queries, & relevance judgments (by experts)
Try several variations of the system
Measure which system returns the “best” matching list of documents
User-centered evaluation
Given several users, and at least two IR systems
Have each user try the same task on both systems
Measure which system works the “best” for the users’ information need
How can we measure user’s satisfaction?


How do we know their impression towards the IR system?
Major Evaluation Criteria

 Write the major evaluation criteria for information retrieval system.

What are the main measures for evaluating an IR system’s performance?


Effectiveness
 How capable is the system of retrieving relevant documents from the collection?
 Is a system better than another one?
 User satisfaction: How “good” are the documents that are returned as a response to user query?
 “Relevance” of results to meet information need of users
Efficiency: time, space
 Speed in terms of retrieval time and indexing time
 Speed of query processing
 The space taken by corpus vs. index
 Is there a need for compression?
Difficulties in Evaluating IR System
 IR systems essentially facilitate communication between a user and document collections
 Relevance is a measure of the effectiveness of communication
 Effectiveness is related to the relevancy of retrieved items.
 Relevance: relates information need (query) and a document
 Relevancy is not typically binary (Yes-No) but Relative
 Even if relevancy is binary, it is a difficult judgment to make.
 Relevance is the degree of a correspondence existing between a document and a query as
determined by requester / information specialist/ external judge / other users
 Relevance judgments is made by
 The user who posed the retrieval problem
 An external judge or information specialists or system developer
 Is the relevance judgment made by users, information specialists or external person the same?
Why?
 Subjective: Depends upon a specific user’s judgment.
 Situational: Relates to user’s current needs.


 Cognitive: Depends on human perception and behavior.
 Dynamic: Changes over time.

Measuring Retrieval Effectiveness


Retrieval of documents may result in:
 False negative (false drop): some relevant documents may not be retrieved. (Type II error)
 False positive: some irrelevant documents may be retrieved. (Type I error)
For many applications a good index should not permit any false drops, but may permit a few false
positives.

 Write the relationship between Recall, Precision, F-Measure, and MAP.
Relevant performance metrics
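The detailed metric tables for this part of the module are not reproduced here, but the standard measures
can be summarized in a short sketch (Python): precision is the fraction of retrieved documents that are
relevant, recall is the fraction of relevant documents that are retrieved, and the F-measure is their
harmonic mean. The retrieved and relevant sets below are illustrative only.

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def f_measure(p, r):
    # Harmonic mean of precision and recall (F1).
    return 2 * p * r / (p + r) if (p + r) else 0.0

retrieved = {"D1", "D2", "D3", "D4"}    # documents returned by the system (illustrative)
relevant  = {"D1", "D3", "D5"}          # documents judged relevant (illustrative)
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r, f_measure(p, r))            # -> 0.5  0.666...  0.571...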
Graphing Precision and Recall


o Plot each (recall, precision) point on a graph. Two ways of plotting:
 Cutoff vs. Recall/Precision graph
 Recall vs. Precision graph
o Recall and Precision are inversely related
o Recall is a non-decreasing function of the number of documents retrieved,
o Precision usually decreases (in a good system)
o The plot is usually for a single query.
o How do we plot for two or more queries?
Precision/Recall tradeoff
 A system can increase recall by retrieving many documents (down to a low level of relevance
ranking),
 but many irrelevant documents would be fetched, reducing precision
 Can get high recall (but low precision) by retrieving all documents for all queries
Interpolation
It is a general form of precision/recall calculation
Precision changes w.r.t. recall (it is not a fixed point)
It is an empirical fact that on average as recall increases, precision decreases
Interpolate precision at 11 standard recall levels: rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
0.9, 1.0}, where j = 0, …, 10.
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall
level between the jth and (j + 1)th level:
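The formula itself is not reproduced here; in the usual convention, the interpolated precision at standard
recall level rj is the maximum precision observed at any recall level r ≥ rj. A small sketch (Python) of
that convention, with illustrative (recall, precision) points for a single query:

def interpolated_precision(points, levels=None):
    # points: list of (recall, precision) pairs observed for one query.
    # Returns the interpolated precision at the 11 standard recall levels.
    if levels is None:
        levels = [i / 10 for i in range(11)]
    return {
        level: max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    }

observed = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
for level, p in interpolated_precision(observed).items():
    print(level, p)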
Interpolating across queries


 For each query calculate precision at 11 standard recall levels
o Compute average precision at each standard recall level across all queries.
o Plot average precision/recall curves to evaluate overall system performance on a
document/query corpus.
 Average precision at seen relevant documents
 Typically average performance over a large set of queries.
 Favors systems which produce relevant documents high in rankings
Single-valued measures
 users may want a single value for each query to evaluate performance
 Such single valued measures include:
Average precision: calculated by averaging the precision values obtained each time a new relevant
document is retrieved (i.e., at the points where recall increases).
Mean Average Precision (MAP): across queries
R-precision, precision with fixed number of document ranking
More Examples:
Mean Average Precision (MAP):
 Often we have a number of queries to evaluate for a given system. For each query, we can calculate
average precision, and if we take average of those averages for a given system, it gives us Mean
Average Precision (MAP), which is a very popular measure to compare two systems.
R-precision: It is defined as precision after R documents retrieved, where R is the total number of
relevant documents for a given query.
Average precision and R-precision are shown to be highly correlated. In the previous example, since
the number of relevant documents (R) is 5, R-precision for both the rankings is 0.4 (value of precision
after 5 documents retrieved).
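A small sketch (Python) of average precision, MAP, and R-precision as described above, computed from
ranked result lists and relevance judgments; the ranking and judgments shown are illustrative, not the
module's missing example.

def average_precision(ranking, relevant):
    # Average of the precision values at each rank where a relevant document appears.
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranking, relevant_set) pairs, one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def r_precision(ranking, relevant):
    R = len(relevant)
    return len(set(ranking[:R]) & relevant) / R if R else 0.0

ranking = ["D3", "D7", "D1", "D9", "D4", "D2"]     # illustrative system output
relevant = {"D1", "D3", "D4"}                      # illustrative judgments
print(average_precision(ranking, relevant))        # -> (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
print(r_precision(ranking, relevant))              # -> 2/3 ≈ 0.667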
 What is interpolation? How it is related with measures of accuracy in information retrieval?


 Identify the concepts of Noise, Silence, Miss and Fallout
Chapter Five
Query Languages and Query Operation
Keyword-Based Querying
Queries are combinations of words. The document collection is searched for documents that contain
these words. Word queries are intuitive, easy to express, and provide fast ranking.
The concept of word must be defined. A word is a sequence of letters terminated by a separator
(period, comma, space, etc). Definition of letter and separator is flexible; e.g., hyphen could be defined
as a letter or as a separator. Usually, common words (such as “a”, “the”, “of”…) are ignored.
Single-word queries
 A query is a single word —
 Usually used for searching in document images —
 Simplest form of query.
 All documents that include this word are retrieved.
 Documents may be ranked by the frequency of this word in the document.
Phrase queries
 A query is a sequence of words treated as a single unit.
 Also called “literal string” or “exact phrase” query.
 Phrase is usually surrounded by quotation marks.
 All documents that include this phrase are retrieved.
 Usually, separators (commas, colons, etc.) and common words (e.g., “a”, “the”, “of”, “for”…)
in the phrase are ignored. —In effect, this query is for a set of words that must appear in
sequence.
 Allows users to specify a context and thus gain precision.
Example: “Information Processing for Document Retrieval”.
Multiple-word queries
 A query is a set of words (or phrases)
 Two options: A document is retrieved if it includes
 any of the query words, or
 each of the query words.
Documents are ranked by the number of query words they contain: A document containing n query
words is ranked higher than a document containing m < n query words.
Documents are ranked in decreasing order:
 those containing all the query words are ranked at the top, and those containing only one query
word are ranked at the bottom.
 Frequency counts may be used to break ties among documents that contain the same query
words. Example: what is the result for the query “Red Flag”?
Proximity queries
 Restrict distance within a document between two search terms.
 Important for large documents in which the two search words may appear in different contexts.
 Proximity specifications limit the acceptable occurrences and hence increase the precision of the
search. (Inclusion of irrelevant doc affects only precision and not recall)
 General Format: Word1 within m units of Word2.
 Unit may be character, word, paragraph, etc.
Examples:
Information within 5 words of Retrieval:
Finds documents that discuss “Information Processing for Document Retrieval” but not “Information
processing and searching for Relevant Document Retrieval”.
Nuclear within 0 paragraphs of science:
Finds documents that discuss “Nuclear” and “science” in the same paragraph.
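A minimal sketch (Python) of a word-level proximity check using a positional index; the index contents
and document names here are invented purely for illustration.

# Positional index: term -> {document -> list of word positions}.
positional_index = {
    "information": {"D1": [1, 12], "D2": [3]},
    "retrieval":   {"D1": [4],     "D2": [15]},
}

def within(term1, term2, m, index):
    # Return documents where term1 occurs within m words of term2.
    results = set()
    common_docs = set(index.get(term1, {})) & set(index.get(term2, {}))
    for doc in common_docs:
        for p1 in index[term1][doc]:
            if any(abs(p1 - p2) <= m for p2 in index[term2][doc]):
                results.add(doc)
                break
    return results

print(within("information", "retrieval", 5, positional_index))   # -> {'D1'}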
Boolean queries
Based on concepts from logic: AND, OR, NOT
It describes the information needed by relating multiple words with Boolean operators.
Operators: AND, OR, NOT
Semantics: For each query word w a corresponding set Dw is constructed that includes the documents
that contain w. The Boolean expression is then interpreted as an expression on the corresponding
document sets with corresponding set operators:
AND: Finds only documents containing all of the specified words or phrases.
OR: Finds documents containing at least one of the specified words or phrases.
NOT: Excludes documents containing the specified word or phrase. —
Precedence (order of operations): NOT, then AND, then OR; use parentheses to override precedence.
Operators with the same precedence are processed left to right.
— Truth Table
Examples: Boolean queries
1. Computer OR server —Finds documents containing either computer, server or both
2. (computer OR server) NOT mainframe —
Select all documents that discuss computers or servers, do not select any documents that discuss
mainframes.
3. Computer NOT (server OR mainframe) —Select all documents that discuss computers, and do not
discuss either servers or mainframes.
4. Computer OR server NOT mainframe —Select all documents that discuss computers, or documents
that discuss servers but do not discuss mainframes.
Natural language
Using natural language for querying is very easy and attractive for the user.
Example: “Find all the documents that discuss campaign finance reforms, including documents that
discuss violations of campaign financing regulations. Do not include documents that discuss campaign
contributions by the gun and the tobacco industries”. “Documents that contain information on bank and
the bank is not related to rivers but financial institute”. Natural language queries are converted to a
formal language for processing against a set of documents. Such translation requires intelligence and is
still a challenge for IR systems.
Pseudo Natural Language processing: System scans the text and extracts recognized terms and
Boolean connectors. The grammaticality of the text is not important. Often used by search engines.
Problem: Recognizing the negation in the search statement (“Do not include...”). — Compromise:
Users enter natural language clauses connected with Boolean operators. In the above example:
“campaign finance reforms” or “violations of campaign financing regulations” and not “campaign
contributions by the gun and the tobacco industries”.
Query Operations: - Relevance Feedback & Query Expansion
Problems with Keywords
Keywords may not retrieve relevant documents that include synonymous terms.
◦ “restaurant” vs. “café”


◦ “PRC” vs. “China”
Keyword based search may retrieve irrelevant documents that include ambiguous terms.
◦ “bit” (unit of data Vs act of eating)
◦ “Apple” (company Vs fruit)
◦ “bat” (baseball Vs mammal)
◦ “Bank” (River Vs institute)
Techniques for Intelligent IR
 Take into account the meaning of the words used
 Take into account the order of words in the query
 Adapt to the user based on automatic or semi-automatic feedback
 Extend search with related terms —
 Perform automatic spell checking / diacritics restoration
 Take into account the authority of the source.
Query operations
 Users have no detailed knowledge of collection and retrieval environment
 difficult to formulate queries well designed for retrieval
 Need many formulations of queries for effective retrieval
 First formulation: often naïve (simple) attempt to retrieve relevant information
 Documents initially retrieved:
o Can be examined for relevance information by user judgment or automatically by the
system
o Improve query formulations for retrieving additional relevant documents
Query reformulation
Two basic techniques to revise query (reformulation) to account for feedback:
A. Query expansion: Expanding original query with new terms from relevant documents. This is
done by adding new terms to query from relevant documents
Query expansion (QE) is a process in Information Retrieval which consists of selecting and adding
terms to the user’s query with the goal of minimizing query-document mismatch and thereby
improving retrieval performance.
The goal of query expansion is to enrich the user’s query by finding additional search terms, either
automatically, or semi automatically that represent the user’s information need more accurately and
completely, thus avoiding, at least to an extent, the aforementioned problems, and increasing the
chances of matching the user’s query to the representations of relevant ideas in documents. Query
expansion techniques may be categorized by the following criteria:
 Source of query expansion terms;
 Techniques used for weighting query expansion terms;
 Role and involvement of the user in the query expansion process.
Query expansion can be performed automatically or interactively. In automatic query expansion
(AQE), the system selects and adds terms to the user’s query, whereas in interactive query expansion
(IQE), the system selects candidate terms for query expansion, shows them to the user, and asks the
user to select (or deselect) terms that they want to include into (or exclude from) the query.
There are three main sources of QE terms: (i) hand-built knowledge resources such as dictionaries,
thesauri, and ontologies; (ii) the documents used in the retrieval process; (iii) external text collections
and resources (e.g., the WWW, Wikipedia).
Query expansion is a technique that modifies the original query of a user to retrieve more relevant
documents from a large collection of information. Relevance feedback is a process that allows the user
to indicate which documents are relevant or not, and then uses this information to refine the query
expansion.
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is
used to reweight terms in the documents.
In query expansion, users give additional input (good/bad search term) on words or phrases.
Examples of query expansion with relevance feedback
There are many examples of query expansion with relevance feedback in different domains and
applications. For instance, in web search, Google uses implicit feedback to personalize and refine the
search results based on the user's history and preferences. In academic search, Scopus uses explicit
feedback to allow the user to select the relevant fields, keywords, and sources for query expansion. In
image search, Pinterest uses both explicit and implicit feedback to suggest related images and keywords
based on the user's pins and interests.
B. Term reweighting in expanded query: Modify term weights based on user relevance judgments.
 Increase weight of terms in relevant documents
 Decrease weight of terms in irrelevant documents
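One widely used formulation of this reweighting idea (not spelled out in this module) is the Rocchio
method: the new query vector is the original query vector plus a weighted centroid of the relevant
documents minus a weighted centroid of the non-relevant ones. A minimal sketch (Python), with the alpha,
beta, and gamma constants and the example vectors chosen only for illustration:

def rocchio(query_vec, relevant_vecs, nonrelevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    # All vectors are dicts mapping term -> weight.
    terms = set(query_vec)
    for vec in relevant_vecs + nonrelevant_vecs:
        terms |= set(vec)
    new_query = {}
    for t in terms:
        pos = sum(v.get(t, 0.0) for v in relevant_vecs) / len(relevant_vecs) if relevant_vecs else 0.0
        neg = sum(v.get(t, 0.0) for v in nonrelevant_vecs) / len(nonrelevant_vecs) if nonrelevant_vecs else 0.0
        w = alpha * query_vec.get(t, 0.0) + beta * pos - gamma * neg
        new_query[t] = max(w, 0.0)      # negative weights are usually dropped
    return new_query

q = {"gold": 1.0, "truck": 1.0}
rel = [{"gold": 0.9, "silver": 0.8}]     # judged relevant (illustrative weights)
nonrel = [{"fire": 0.7, "gold": 0.2}]    # judged non-relevant (illustrative weights)
print(rocchio(q, rel, nonrel))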


Approaches for Relevance Feedback
1. Approaches based on Users relevance feedback
 Relevance feedback with user input
 Clustering hypothesis: known relevant documents contain terms which can be used to describe a
larger cluster of relevant documents
 Description of cluster built interactively with user assistance
2. Approaches based on pseudo relevance feedback
 Use relevance feedback methods without explicit user involvement.
 Obtain cluster description automatically
 Identify terms related to query terms
E.g. synonyms, stemming variations, term proximity (terms close to query terms in text)
Pseudo-Relevance Feedback (PRF) is a well-known method of query expansion for improving the performance
of information retrieval systems. All the terms of PRF documents are not important for expanding the user query.
Therefore selection of proper expansion term is very important for improving system performance. Individual
query expansion terms selection methods have been widely investigated for improving its performance. Every
individual expansion term selection method has its own weaknesses and strengths.
User Relevance Feedback
User Relevance Feedback is the most popular query reformulation strategy
Cycle:
1. User presented with list of retrieved documents
 After initial retrieval results are presented, allow the user to provide feedback on the relevance
of one or more of the retrieved documents.
2. User marks those which are relevant
 In practice: top 10-20 ranked documents are examined
3. Use this feedback information to reformulate the query.
4. Enhance importance of these terms in a new query
 Produce new results based on reformulated query.
5. Allows more interactive, multi-pass process.
Expected: New query moves towards relevant documents and away from non-relevant documents
User Relevance Feedback Architecture
Pseudo Relevance Feedback


 Just assume the top m retrieved documents are relevant, and use them to reformulate the query.
The value of m could be set to different values based on the IR system under investigation
 Allows for query expansion that includes terms that are correlated with the query terms.
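A rough sketch (Python) of this idea: assume the top m documents are relevant and add their most frequent
non-query terms to the query. Term selection here is by raw frequency only, which is a simplification of
the selection methods mentioned earlier; the ranked documents shown are invented for illustration.

from collections import Counter

def pseudo_relevance_expand(query_terms, ranked_docs, m=3, k=2):
    # ranked_docs: list of token lists, already ordered by the initial retrieval run.
    # Assume the top m documents are relevant; add the k most frequent new terms.
    counts = Counter()
    for doc in ranked_docs[:m]:
        counts.update(t for t in doc if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(k)]
    return list(query_terms) + expansion

ranked = [["gold", "silver", "truck", "delivery"],
          ["silver", "truck", "arrived"],
          ["gold", "fire"]]
print(pseudo_relevance_expand(["gold"], ranked, m=2, k=2))   # -> ['gold', 'silver', 'truck']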
Two strategies:
Local strategies: Approaches based on information derived from set of initially retrieved documents
(local set of documents)
Global strategies: Approaches based on global information derived from document collection (the
whole collection)
Pseudo Feedback Architecture

Types of Query Expansion


Local analysis
 Examine documents retrieved for query to determine query expansion
 Analysis of documents in result set
 No user assistance
 Synonymy association: terms that frequently co-occur inside local set of documents
 At query time, dynamically determine similar terms based on analysis of top-ranked retrieved
documents. Base correlation analysis on only the “local” set of retrieved documents for a
specific query.
 Avoids ambiguity by determining similar (correlated) terms only within relevant documents.
 “Apple computer” v “Apple computer Power book laptop”
Global analysis
 Expand query using information from whole set of documents in collection
 Thesaurus-based: a controlled vocabulary maintained by editors (e.g., MEDLINE), i.e., a manual thesaurus
 Approach to select terms for query expansion
 Determine term similarity through a pre-computed statistical correlation analysis of the
complete corpus.
 Compute association matrices which quantify term correlations in terms of how frequently they
co-occur.
 Expand queries with statistically most similar terms.
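A small sketch (Python) of the association-matrix idea for global analysis: count how often pairs of terms
co-occur in the same document across the whole collection, then expand a query term with its most strongly
correlated terms. Real systems normalize these counts; raw counts and a tiny invented collection are used
here only to keep the sketch short.

from collections import defaultdict
from itertools import combinations

def association_matrix(docs):
    # docs: list of token lists for the whole collection.
    cooc = defaultdict(int)
    for tokens in docs:
        for t1, t2 in combinations(sorted(set(tokens)), 2):
            cooc[(t1, t2)] += 1
    return cooc

def most_correlated(term, cooc, k=2):
    scores = defaultdict(int)
    for (t1, t2), c in cooc.items():
        if t1 == term:
            scores[t2] += c
        elif t2 == term:
            scores[t1] += c
    return sorted(scores, key=scores.get, reverse=True)[:k]

collection = [["gold", "shipment", "fire"],
              ["silver", "delivery", "truck"],
              ["gold", "shipment", "truck"]]
cooc = association_matrix(collection)
print(most_correlated("gold", cooc))   # -> ['shipment', ...]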
Problems with Global Analysis
Term ambiguity may introduce irrelevant statistically correlated terms.
e.g., “Apple computer” expanded to “Apple red fruit computer”.
Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
Global vs. Local Analysis
Global analysis requires intensive term correlation computation only once at system development
time. Local analysis requires intensive term correlation computation for every query at run time
(although number of terms and documents is less than in global analysis). But local analysis gives
better results.
Query Expansion Conclusions
Expansion of queries with related terms can improve performance, particularly recall. However, the
similar terms must be selected very carefully to avoid problems such as loss of precision.