
INFORMATION RETRIEVAL
An information retrieval (IR) system is a software program

It is used for accessing and retrieving the most appropriate information from document repositories (textual information)

It works from a particular query given by the user, with the help of context-based indexing or metadata

An IR system reports the existence and location of documents that contain the required (relevant) information and so satisfy the user’s requirements

Example: Google Search, used to find required information


IR SYSTEM
Basic IR system
Indexing the collection of documents
Transforming the query in the same way as the document content is represented
Comparing the description of each document with that of the query
Listing the results in order of relevancy
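
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the toy collection `DOCS`, the helper names, and the scoring rule (count of shared terms) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy collection; the texts are illustrative only.
DOCS = {
    "d1": "information retrieval ranks documents",
    "d2": "boolean retrieval uses set theory",
    "d3": "cooking pasta at home",
}

def to_terms(text):
    """Represent text as a set of lowercase terms (same treatment for documents and queries)."""
    return set(text.lower().split())

def build_index(docs):
    """Index the collection: map each term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in to_terms(text):
            index[term].add(doc_id)
    return index

def search(query, index):
    """Compare the query with the documents and list matches by number of shared terms."""
    scores = defaultdict(int)
    for term in to_terms(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(DOCS)
results = search("information retrieval", index)
print(results)
```

Documents that share no term with the query never enter `scores`, so they are simply not listed.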
THE FULL IR PROCESS
[Figure: the full IR process. Components: Browser / UI, Text Processing and Modeling, Query Operations, Indexing, Crawler / Data Access, Searching, Index (inverted index), Ranking, Documents (Web or DB). Labelled flows: user interest, text, logical view, user feedback, query, retrieved docs, ranked docs.]
INFORMATION RETRIEVAL MODELS

A retrieval model consists of:

D: representation for documents

Q: representation for queries

F: a modelling framework for D, Q, and the relationships among them

R(q, di): a ranking or similarity function which orders the documents with respect to a query
CLASSICAL IR MODELS
• Boolean model

• Vector Space model

• Probabilistic model
THE BOOLEAN MODEL
Based on set theory and Boolean algebra
• Documents are sets of terms
• Queries are Boolean expressions on terms
Historically the most common model
• Library OPACs (Online Public Access Catalog)
• Dialog system
• Many web search engines, too
BOOLEAN MODEL
It is the simplest model
Based on concepts of set theory and Boolean logic (AND, OR, NOT)
Q = ti ∧ tj, where ti and tj are key terms: documents with both ti and tj are retrieved.
Q = ti ∨ tj: documents with either ti or tj are retrieved.
Q = ¬ti: documents without ti are retrieved.
BOOLEAN MODEL CAN BE DEFINED AS

D − A set of words, i.e., the indexing terms present in a document. Here, each
term is either present (1) or absent (0).

Q − A Boolean expression, where the terms are index terms and the operators are logical product (AND), logical sum (OR) and logical difference (NOT)

F − Boolean algebra over sets of terms as well as over sets of documents


BOOLEAN RELEVANCE PREDICTION

R − A document is predicted as relevant to the query expression if and only if it satisfies the query expression, for example:

((text ∨ information) ∧ retrieval ∧ ¬theory)

Each query term specifies a set of documents containing the term

AND (∧): the intersection of two sets

OR (∨): the union of two sets

NOT (¬): set inverse, or really set difference
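
As a small illustration of these set operations, the example query ((text ∨ information) ∧ retrieval ∧ ¬theory) can be evaluated with Python sets. The four-document collection and the helper `postings` are hypothetical, chosen only to make the result easy to check by hand.

```python
# Hypothetical toy collection: document id -> set of index terms.
DOCS = {
    1: {"text", "retrieval"},
    2: {"information", "retrieval", "theory"},
    3: {"information", "retrieval"},
    4: {"text", "theory"},
}

ALL_DOCS = set(DOCS)

def postings(term):
    """Set of documents that contain the term."""
    return {doc_id for doc_id, terms in DOCS.items() if term in terms}

# OR is set union, AND is set intersection,
# NOT is the difference from the whole collection.
result = (
    (postings("text") | postings("information"))
    & postings("retrieval")
    & (ALL_DOCS - postings("theory"))
)
print(sorted(result))  # [1, 3]
```

Documents 2 and 4 are excluded because they contain "theory"; document 4 would also fail the "retrieval" requirement.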


Step 1: Stemming and stop word removal
Step 2: Formation of the Term x Document matrix
Step 3: Check whether the user query is in disjunctive normal form (DNF). If not, convert it to DNF
Step 4: Find the similarity between the query and each document by applying similarity functions.
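
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the stop list and documents are made up, stemming is omitted, the query is converted to DNF by hand rather than by code, and the similarity function is the usual binary one (1 if any conjunct of the DNF is satisfied, 0 otherwise).

```python
STOP_WORDS = {"the", "of", "a", "and", "is"}  # hypothetical stop list

DOCS = {
    "d1": "the theory of information retrieval",
    "d2": "text retrieval is a practice",
    "d3": "information and text",
}

def preprocess(text):
    """Step 1 (partial): lowercase, split, drop stop words. Stemming is omitted here."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Step 2: binary term x document matrix, stored as term -> {doc_id: 0 or 1}.
doc_terms = {doc_id: set(preprocess(text)) for doc_id, text in DOCS.items()}
vocab = sorted(set().union(*doc_terms.values()))
matrix = {t: {d: int(t in doc_terms[d]) for d in DOCS} for t in vocab}

# Step 3: (text OR information) AND retrieval AND NOT theory, rewritten by hand as
# (text AND retrieval AND NOT theory) OR (information AND retrieval AND NOT theory).
# Each conjunct maps a term to the value it requires (1 = present, 0 = absent).
QUERY_DNF = [
    {"text": 1, "retrieval": 1, "theory": 0},
    {"information": 1, "retrieval": 1, "theory": 0},
]

def similarity(doc_id):
    """Step 4: 1 if the document satisfies at least one conjunct, else 0."""
    for conjunct in QUERY_DNF:
        if all(matrix.get(t, {}).get(doc_id, 0) == v for t, v in conjunct.items()):
            return 1
    return 0

scores = {d: similarity(d) for d in DOCS}
print(scores)  # {'d1': 0, 'd2': 1, 'd3': 0}
```

Only d2 is retrieved: d1 contains "theory", and d3 lacks "retrieval". Because the score is 0 or 1, the Boolean model retrieves a set of documents but does not rank them.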
VECTOR SPACE MODEL
A mathematical framework used in information retrieval and natural language processing (NLP) to represent and analyse textual data

Represents documents and queries as vectors in a multi-dimensional space

Each dimension corresponds to a unique term in the entire corpus of documents, while each document and each query is represented as a vector within that space
HOW THE VSM WORKS
STEP 1: Document-Term Matrix

Build a Document-Term Matrix (DTM) or Term-Document Matrix (TDM)

Rows in a DTM represent documents, and columns represent terms (a TDM is the transpose)

Each cell contains a numerical value representing a term’s frequency or importance within a document
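
A minimal sketch of a document-term matrix with raw term counts in the cells. The three documents are hypothetical and whitespace tokenization is an assumption made to keep the example short.

```python
from collections import Counter

# Hypothetical toy corpus.
DOCS = ["apple banana apple", "banana cherry", "cherry cherry apple"]

tokenized = [doc.split() for doc in DOCS]
vocab = sorted({term for tokens in tokenized for term in tokens})

# One row per document, one column per vocabulary term; each cell is a raw count.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocab])

print(vocab)  # ['apple', 'banana', 'cherry']
for row in dtm:
    print(row)
# [2, 1, 0]
# [0, 1, 1]
# [1, 0, 2]
```

Transposing `dtm` (for example with `list(zip(*dtm))`) gives the term-document layout.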
STEP 2: Term Frequency-Inverse Document Frequency (TF-IDF)
A measure that reflects the importance of a term within a document relative to its importance across all documents in the corpus
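
TF-IDF has several variants; the sketch below assumes one common form, tf = (count of the term in the document) / (document length) and idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. The corpus and function names are hypothetical.

```python
import math

# Hypothetical toy corpus, already tokenized.
DOCS = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "cherry", "apple", "date"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: ln(N / df); 0.0 for terms that occur nowhere."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Both terms occur once in the last document, but "date" is rarer in the corpus.
score_date = tf_idf("date", DOCS[2], DOCS)
score_apple = tf_idf("apple", DOCS[2], DOCS)
print(score_date > score_apple)  # True
```

With equal term frequency, the term that appears in fewer documents receives the higher weight, which is the effect the idf factor is meant to produce.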

STEP 3: Vectorization

STEP 4: Cosine Similarity
To compare documents or perform text retrieval, use cosine similarity as a metric to measure the similarity between two document vectors
COSINE SIMILARITY
Cosine similarity is a metric that measures the similarity between two vectors in a multi-dimensional space, such as the vectors representing documents in the VSM

cos(θ) = (A⋅B) / (∥A∥ ∥B∥)

Where:
θ is the angle between vectors A and B.
A⋅B represents the dot product of vectors A and B.
∥A∥ and ∥B∥ represent the Euclidean norms (magnitudes) of vectors A and B, respectively.
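
A short sketch of the calculation, assuming dense vectors of equal length given as Python lists. Returning 0.0 for an all-zero vector is a convention chosen for this example, since the ratio is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # no shared dimension
print(round(same_direction, 6), orthogonal)
```

Parallel vectors score (up to rounding) 1 regardless of their lengths, and vectors with no shared non-zero dimension score 0, so the measure depends on direction rather than magnitude.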
STEPS - VSM
Vector Representation: We represent documents and queries as vectors using
techniques like TF-IDF. Each document in the corpus and the query are converted
into vectors in the same high-dimensional space.

Cosine Similarity Calculation: To determine the relevance of a document to a query, we calculate the cosine similarity between the query vector and the vectors representing each document in the corpus.

Ranking: Documents with higher cosine similarity scores to the query are considered
more relevant and are ranked higher. Those with lower scores are ranked lower.
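
The three steps can be combined into a small ranking sketch. Assumptions: the corpus is hypothetical, vectors are stored sparsely as term-to-weight dictionaries, the weight is raw count times ln(N / df), and query terms that never occur in the corpus are dropped.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
DOCS = {
    "d1": "information retrieval with the vector space model",
    "d2": "boolean model and set theory",
    "d3": "cooking pasta at home",
}

tokenized = {d: text.split() for d, text in DOCS.items()}
N = len(tokenized)
# Document frequency: number of documents in which each term occurs.
df = Counter(term for tokens in tokenized.values() for term in set(tokens))

def vectorize(tokens):
    """Sparse TF-IDF vector (term -> weight); terms unseen in the corpus are dropped."""
    counts = Counter(tokens)
    return {t: c * math.log(N / df[t]) for t, c in counts.items() if t in df}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query):
    """Score every document against the query and sort by descending similarity."""
    q = vectorize(query.split())
    scores = {d: cosine(q, vectorize(tokens)) for d, tokens in tokenized.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranking = rank("vector space model")
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Here d1 ranks first because it contains all three query terms, d2 follows because it shares only "model", and d3 scores 0 because it shares no term with the query.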
Step 1: Stemming and stop word removal
Step 2: Intra-cluster similarity (term frequency, tf)
Step 3: Inter-cluster dissimilarity (inverse document frequency, idf)
Step 4: Cosine similarity between the document vector and the query vector
