Information Retrieval Notes

INFORMATION RETRIEVAL
An information retrieval (IR) system is a software program
It is used for accessing and retrieving the most appropriate information from document repositories (textual information)
Retrieval is based on a particular query given by the user, with the help of context-based indexing or metadata
An IR system reports the existence and location of documents that contain the required (relevant) information and so satisfy the user’s requirements
Example of IR: Google Search, used to get required information

IR SYSTEM
Basic IR system
• Indexing the collection of documents
• Transforming the query in the same way as the document content is represented
• Comparing the description of each document with that of the query
• Listing the results in order of relevancy
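The four steps of the basic IR system can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the sample documents, the `tokenize` helper, and the "count of shared terms" score are invented for the example and are not taken from any particular system.

```python
from collections import defaultdict

def tokenize(text):
    # Hypothetical normalisation: lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Step 1: index the collection (term -> set of document ids).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Step 2: transform the query the same way as the documents.
    terms = tokenize(query)
    # Step 3: compare each document with the query (count shared terms).
    scores = defaultdict(int)
    for term in set(terms):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Step 4: list results in order of relevancy (highest score first).
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "d1": "information retrieval with an inverted index",
    "d2": "boolean retrieval model",
    "d3": "cooking with garlic",
}
index = build_index(docs)
results = search(index, "Information Retrieval")
```

Here `results` lists the documents that share more terms with the query first. A real system would replace the overlap count with a proper ranking function.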
THE FULL IR PROCESS
[Figure: the full IR process. Components: browser / UI, text processing and modeling, query operations, indexing, crawler / data access, searching, index (inverted index), ranking, documents (Web or DB). Arrow labels: user interest, text, logical view, user feedback, query, retrieved docs, ranked docs.]
INFORMATION RETRIEVAL MODELS
A retrieval model consists of:
D: representation for documents
Q: representation for queries
F: a modelling framework for D, Q, and the relationships among them
R(q, di): a ranking or similarity function which orders the documents with respect to a query
CLASSICAL IR MODELS
• Boolean model
• Vector Space model
• Probabilistic model
THE BOOLEAN MODEL
Based on set theory and Boolean algebra
• Documents are sets of terms
• Queries are Boolean expressions on terms
Historically the most common model
• Library OPACs (Online Public Access Catalog)
• Dialog system
• Many web search engines, too
BOOLEAN MODEL
It is the simplest model
Based on concepts of set theory and Boolean logic (AND, OR, NOT)
Q = ti ∧ tj, where ti and tj are key terms: documents with both ti and tj are retrieved.
Q = ti ∨ tj: documents with either ti or tj are retrieved.
Q = ¬ti: documents without ti are retrieved.
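The three query forms map directly onto set operations. A minimal sketch, assuming each term has a known set of documents containing it; the document ids and the `postings` dictionary are made up for illustration.

```python
# Hypothetical postings: for each term, the set of documents containing it.
all_docs = {"d1", "d2", "d3", "d4"}
postings = {
    "ti": {"d1", "d2"},
    "tj": {"d2", "d3"},
}

both = postings["ti"] & postings["tj"]      # Q = ti AND tj
either = postings["ti"] | postings["tj"]    # Q = ti OR tj
without_ti = all_docs - postings["ti"]      # Q = NOT ti
```

NOT needs the full document collection (`all_docs`), because "documents without ti" is only defined relative to everything that was indexed.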
BOOLEAN MODEL CAN BE DEFINED AS
D − A set of words, i.e., the indexing terms present in a document. Here, each
term is either present (1) or absent (0).
Q − A Boolean expression, where terms are the index terms and operators are
the logical product (AND), logical sum (OR) and logical difference (NOT)
F − Boolean algebra over sets of terms as well as over sets of documents

BOOLEAN RELEVANCE PREDICTION
R − A document is predicted as relevant to the query expression if and only if it satisfies the query expression, for example:
((text ∨ information) ∧ retrieval ∧ ¬theory)
Each query term specifies a set of documents containing the term
AND (∧): the intersection of two sets
OR (∨): the union of two sets
NOT (¬): set inverse, or really set difference
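Relevance prediction can also be checked one document at a time. The sketch below treats each document as a set of terms and tests it against the example query ((text OR information) AND retrieval AND NOT theory). The three documents and the `is_relevant` helper are assumptions for illustration.

```python
def is_relevant(doc_terms):
    # ((text OR information) AND retrieval AND NOT theory)
    return (
        ("text" in doc_terms or "information" in doc_terms)
        and "retrieval" in doc_terms
        and "theory" not in doc_terms
    )

# Hypothetical documents, each represented as a set of index terms.
docs = {
    "d1": {"information", "retrieval", "systems"},
    "d2": {"text", "retrieval", "theory"},
    "d3": {"text", "mining"},
}
relevant = sorted(d for d, terms in docs.items() if is_relevant(terms))
```

The result is binary: a document either satisfies the expression or it does not, so the Boolean model gives no ordering among the retrieved documents.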

Step 1: Stemming and stop word removal
Step 2: Formation of the Term x Document matrix
Step 3: Check whether the user query is in DNF (disjunctive normal form). If not, convert it to DNF
Step 4: Find the similarity between the query and each document by applying similarity functions.
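Steps 2 to 4 can be sketched on a tiny term x document matrix. Assumptions: Step 1 has already been applied, the matrix values are invented, and similarity is taken to be 1 when a document matches at least one conjunction of the DNF query and 0 otherwise.

```python
docs = ["d1", "d2", "d3"]
# Step 2: term x document incidence matrix (1 = term present in document).
matrix = {
    "text":        [0, 1, 1],
    "information": [1, 0, 0],
    "retrieval":   [1, 1, 0],
    "theory":      [0, 1, 0],
}

# Step 3: (text OR information) AND retrieval AND NOT theory, rewritten in DNF as
# (text AND retrieval AND NOT theory) OR (information AND retrieval AND NOT theory).
# Each conjunction maps a term to the value it must have (1 = present, 0 = absent).
dnf_query = [
    {"text": 1, "retrieval": 1, "theory": 0},
    {"information": 1, "retrieval": 1, "theory": 0},
]

def similarity(doc_index):
    # Step 4: similarity is 1 if the document matches any conjunction, else 0.
    for conjunction in dnf_query:
        if all(matrix[term][doc_index] == value
               for term, value in conjunction.items()):
            return 1
    return 0

scores = {doc: similarity(i) for i, doc in enumerate(docs)}
```

Converting to DNF first means each document only has to be compared against a list of simple present/absent patterns.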
VECTOR SPACE MODEL
A mathematical framework used in information retrieval and natural language processing (NLP) to represent and analyse textual data
Represents documents and queries as vectors in a multi-dimensional space
Each dimension corresponds to a unique term in the entire corpus of documents, while each document and query is represented as a vector within that space
HOW THE VSM WORKS?
STEP 1: Document-Term Matrix
Build a Document-Term Matrix (DTM) or Term-Document Matrix (TDM)
In a DTM, rows represent documents and columns represent terms; a TDM is its transpose
Each cell contains a numerical value representing a term’s frequency or importance within a document
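A document-term matrix with raw counts can be built in a few lines. This is a sketch, assuming whitespace tokenisation and two made-up documents; real pipelines would apply stemming and stop word removal first.

```python
from collections import Counter

# Hypothetical corpus.
docs = {
    "d1": "retrieval of text",
    "d2": "text text mining",
}
# Vocabulary: one column per unique term in the corpus, in sorted order.
vocabulary = sorted({term for text in docs.values() for term in text.split()})
# One row per document; each cell is the raw count of that term in the document.
dtm = {}
for doc_id, text in docs.items():
    counts = Counter(text.split())
    dtm[doc_id] = [counts[term] for term in vocabulary]
```

Each row of `dtm` lines up with `vocabulary`, so column j always refers to the same term across documents.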
STEP 2: Term Frequency-Inverse Document Frequency (TF-IDF)
A measure that reflects the importance of a term within a document relative to its importance across all documents in the corpus
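TF-IDF has several variants. The sketch below assumes one common choice, which the notes do not specify: tf is the raw count of the term in the document, and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. The corpus is invented for illustration.

```python
import math

# Hypothetical tokenised corpus.
corpus = {
    "d1": ["text", "retrieval", "text"],
    "d2": ["text", "mining"],
    "d3": ["image", "retrieval"],
}

def tf_idf(term, doc_id):
    # tf: raw count of the term in the document.
    tf = corpus[doc_id].count(term)
    # df: number of documents that contain the term.
    df = sum(1 for tokens in corpus.values() if term in tokens)
    if df == 0:
        # Assumption: a term unseen in the corpus gets weight 0.
        return 0.0
    # idf: log of (number of documents / document frequency).
    idf = math.log(len(corpus) / df)
    return tf * idf
```

A term that is frequent in one document but rare across the corpus gets a high weight; a term that appears in every document gets idf = log(1) = 0.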
STEP 3: Vectorization
STEP 4: Cosine Similarity
To compare documents or perform text retrieval, use cosine similarity as a metric to measure the similarity between two document vectors
COSINE SIMILARITY
Cosine similarity is a metric that measures the similarity between two vectors in a multi-dimensional space, such as the vectors representing documents in the VSM:
cos(θ) = (A⋅B) / (∥A∥ ∥B∥)
Where:
θ is the angle between vectors A and B.
A⋅B represents the dot product of vectors A and B.
∥A∥ and ∥B∥ represent the Euclidean norms (magnitudes) of vectors A and B, respectively.
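Cosine similarity is the dot product of A and B divided by the product of their Euclidean norms. A minimal sketch using plain Python lists; returning 0.0 for a zero vector is an assumption made here to avoid division by zero.

```python
import math

def cosine_similarity(a, b):
    # Dot product A.B of two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms ||A|| and ||B||.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat similarity with a zero vector as 0.
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1 regardless of length, and vectors with no terms in common score 0, which is why the measure is insensitive to document length.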
STEPS - VSM
Vector Representation: We represent documents and queries as vectors using techniques like TF-IDF. Each document in the corpus and the query are converted into vectors in the same high-dimensional space.
Cosine Similarity Calculation: To determine the relevance of a document to a query, we calculate the cosine similarity between the query vector and the vectors representing each document in the corpus.
Ranking: Documents with higher cosine similarity scores to the query are considered more relevant and are ranked higher. Those with lower scores are ranked lower.
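The three VSM steps can be put together in one small sketch. Assumptions: whitespace tokenisation, tf as raw count, idf as log(N / df), query terms outside the vocabulary are ignored, and the documents and query are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus and query.
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "cooking recipes with garlic",
    "d3": "boolean retrieval model",
}
query = "information retrieval"

tokenised = {d: text.split() for d, text in docs.items()}
vocabulary = sorted({t for tokens in tokenised.values() for t in tokens})
n_docs = len(docs)
# idf = log(N / df); every vocabulary term occurs in at least one document.
idf = {
    t: math.log(n_docs / sum(1 for tokens in tokenised.values() if t in tokens))
    for t in vocabulary
}

def vectorise(tokens):
    # Vector representation: one tf * idf weight per vocabulary term.
    counts = Counter(tokens)
    return [counts[t] * idf[t] for t in vocabulary]

def cosine(a, b):
    # Cosine similarity; 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

query_vector = vectorise(query.split())
scores = {d: cosine(query_vector, vectorise(tokens))
          for d, tokens in tokenised.items()}
# Ranking: highest cosine similarity first.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike the Boolean model, every document gets a graded score, so a partial match such as d3 (which shares only "retrieval" with the query) is ranked below a fuller match rather than being discarded.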
Step 1: Stemming and stop word removal
Step 2: Intra-cluster similarity (term frequency, tf)
Step 3: Inter-cluster dissimilarity (inverse document frequency, idf)
Step 4: Cosine similarity between document vector and query vector
