Information Retrieval Notes
INFORMATION RETRIEVAL
An information retrieval (IR) system is a software program.
It is used for accessing and retrieving the most appropriate information from document
repositories (textual information).
Retrieval is based on a particular query given by the user, with the help of context-based
indexing or metadata.
An IR system reports the existence and location of documents that contain the
required (relevant) information and so satisfy the user's requirements.
R(q, di): a ranking or similarity function that orders the documents di with
respect to a query q.
CLASSICAL IR MODELS
• Boolean model
• Vector space model
• Probabilistic model
THE BOOLEAN MODEL
Based on set theory and Boolean algebra
• Documents are sets of terms
• Queries are Boolean expressions on terms
Historically the most common model
• Library OPACs (Online Public Access Catalog)
• Dialog system
• Many web search engines, too
BOOLEAN MODEL
It is the simplest model
Based on concepts of set theory and Boolean logic (AND, OR, NOT)
Q = ti ∧ tj, where ti and tj are key terms: documents containing both ti and tj are
retrieved.
Q = ti ∨ tj: documents containing either ti or tj are retrieved.
Q = ¬ti: documents that do not contain ti are retrieved.
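The three query forms map directly onto set operations. The following is a minimal sketch, assuming a tiny made-up corpus and a whitespace tokenizer; the names docs, build_index and postings are illustrative and do not come from the notes.

```python
# Toy corpus (hypothetical): document id -> text.
docs = {
    1: "information retrieval with boolean queries",
    2: "boolean algebra and set theory",
    3: "vector space retrieval model",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)
all_ids = set(docs)

def postings(term):
    """Return the set of documents containing term (empty if unseen)."""
    return index.get(term, set())

# Q = ti AND tj: set intersection.
and_result = postings("boolean") & postings("retrieval")
# Q = ti OR tj: set union.
or_result = postings("boolean") | postings("retrieval")
# Q = NOT ti: complement with respect to the whole collection.
not_result = all_ids - postings("boolean")

print(and_result, or_result, not_result)
```

Note that NOT needs the full set of document ids, which is why it is computed as a difference from all_ids rather than from a single postings set.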
BOOLEAN MODEL CAN BE DEFINED AS
D: a set of words, i.e., the index terms present in a document. Each
term is either present (1) or absent (0).
Q: a Boolean expression whose operands are index terms and whose operators are
logical product (AND), logical sum (OR) and logical difference (NOT).
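The present (1) / absent (0) view of D can be sketched as a binary incidence vector per document, with the query evaluated on those bits. The vocabulary, the documents and the query below are assumptions chosen only for illustration.

```python
# Fixed vocabulary of index terms (hypothetical).
terms = ["boolean", "model", "retrieval", "vector"]

# Hypothetical documents, each given as its set of index terms.
documents = {
    "d1": {"boolean", "model", "retrieval"},
    "d2": {"vector", "model", "retrieval"},
    "d3": {"boolean", "model"},
}

def incidence(doc_terms):
    """Return a 0/1 vector: 1 if the term is present, 0 if absent."""
    return [1 if t in doc_terms else 0 for t in terms]

D = {name: incidence(ts) for name, ts in documents.items()}

def matches(vec):
    """Evaluate Q = (boolean OR vector) AND NOT retrieval on one vector."""
    w = dict(zip(terms, vec))
    return bool((w["boolean"] | w["vector"]) & (1 - w["retrieval"]))

result = [name for name, vec in D.items() if matches(vec)]
print(D)
print(result)
```

A document either satisfies the expression or it does not, so the result is an unordered match set with no ranking among the retrieved documents.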
Step 3:
Vectorization
Step 4:
Cosine Similarity
cos(θ) = (A⋅B) / (∥A∥ ∥B∥)
Where:
A⋅B represents the dot product of vectors A and B.
∥A∥ and ∥B∥ represent the Euclidean norms (magnitudes) of vectors A
and B, respectively.
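Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. A minimal sketch in plain Python follows; returning 0.0 for a zero vector is a convention assumed here, not something the notes specify.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between equal-length vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))      # A . B
    norm_a = math.sqrt(sum(x * x for x in a))   # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))   # ||B||
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity([3, 4], [4, 3]))
```

For non-negative weights such as term counts, the score lies between 0 (no shared terms) and 1 (same direction), independent of vector length.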
STEPS - VSM
Vector Representation: We represent documents and queries as vectors using
techniques like TF-IDF. Each document in the corpus and the query are converted
into vectors in the same high-dimensional space.
Ranking: Documents with higher cosine similarity scores to the query are considered
more relevant and are ranked higher. Those with lower scores are ranked lower.
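The two steps above can be sketched end to end. This is an illustration under stated assumptions: the corpus and query are made up, tf is the raw term count, and idf is taken as log(N / df), which is one common variant among several. Query terms that do not occur in the corpus are ignored.

```python
import math
from collections import Counter

# Hypothetical corpus and query.
corpus = {
    "d1": "boolean model uses set theory",
    "d2": "vector space model ranks documents by cosine similarity",
    "d3": "probabilistic model estimates relevance",
}
query = "cosine similarity ranking model"

def tokenize(text):
    return text.lower().split()

doc_tokens = {name: tokenize(text) for name, text in corpus.items()}
vocab = sorted({t for toks in doc_tokens.values() for t in toks})
n_docs = len(corpus)

# Document frequency: number of documents containing each term.
df = Counter(t for toks in doc_tokens.values() for t in set(toks))
# Assumed idf variant: log(N / df). A term found in every document gets 0.
idf = {t: math.log(n_docs / df[t]) for t in vocab}

def tfidf_vector(tokens):
    """Vector over vocab with weight tf * idf; unknown terms are dropped."""
    tf = Counter(tokens)
    return [tf[t] * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

q_vec = tfidf_vector(tokenize(query))
scores = {name: cosine(tfidf_vector(toks), q_vec)
          for name, toks in doc_tokens.items()}
# Higher cosine similarity means higher rank.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Because "model" occurs in every document of this toy corpus, its idf is 0 and it contributes nothing to the scores; only "cosine" and "similarity" separate d2 from the rest.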
Step 1: Stemming and stop word removal
Step 2: Intra-cluster similarity (term frequency, tf)
Step 3: Inter-cluster dissimilarity (inverse document frequency, idf)
Step 4: Cosine similarity between the document vector and the query vector
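Step 1 can be sketched as a small preprocessing function. The stop word list and the suffix-stripping rule below are deliberately naive assumptions for illustration; a real system would typically use an established stemmer such as Porter's algorithm and a fuller stop word list.

```python
import re

# Tiny, hypothetical stop word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}
# Suffixes tried in order; this is NOT the Porter stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The documents are ranked with indexing terms"))
```

The resulting term lists are what the tf and idf weights in Steps 2 and 3 would be computed over, so documents and queries must go through the same preprocessing.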