
Information Retrieval

Assignment 3

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science


University of Engineering and Technology
Lahore, Pakistan
Libraries Used:

1. os:
○ Purpose: Provides functions for interacting with the operating system,
particularly used for file operations and directory traversal.
2. nltk:
○ Purpose: The Natural Language Toolkit (NLTK) library is used for
natural language processing tasks such as tokenization, stemming,
and part-of-speech tagging.
3. nltk.corpus.stopwords:
○ Purpose: NLTK's stopwords corpus provides a list of common English
stopwords, which are words typically excluded from text analysis due to
their high frequency and low informativeness.
4. nltk.stem.PorterStemmer:
○ Purpose: The PorterStemmer class from NLTK implements the Porter
stemming algorithm, which reduces words to their root or base form,
standardizing words for analysis.

Code Flow:

Preprocessing and Creating the Inverted Index:

● The code begins by importing the necessary libraries and initializing NLTK's
PorterStemmer and English stopwords.
● The create_index function is defined to create an inverted index and a
binary term-document matrix for a collection of text documents in a specified
directory.
● It iterates through each text file in the directory, reads the content, and
tokenizes it into sentences.
● For each sentence, it tokenizes it into words and tags their parts of speech
using NLTK's pos_tag.
● The code identifies words that are nouns (NN, NNS, NNP, NNPS) and not in
the list of English stopwords. These words are stemmed using the Porter
stemmer.
● Entries are added to the inverted index, where the stemmed word is the key,
and a list of filenames where the word appears is the value.
● The binary term-document matrix is also created, where each term maps to
the documents in which it appears, each with a binary weight of 1.
● UnicodeDecodeError exceptions are handled for files that cannot be decoded.
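
The two data structures described above can be sketched as follows. This is a minimal illustration, not the assignment's actual code: it assumes the NLTK preprocessing (tokenizing, POS filtering, stopword removal, stemming) has already run, so the hypothetical `build_index` function takes a mapping from filename to already-stemmed terms. Listing each filename only once per term is also an assumption of this sketch.

```python
from collections import defaultdict


def build_index(doc_terms):
    """Build an inverted index and a binary term-document matrix.

    doc_terms maps a filename to the list of already-stemmed noun terms
    found in that file (hypothetical input; preprocessing is assumed done).
    """
    inverted_index = defaultdict(list)   # term -> [filenames]
    term_doc_matrix = defaultdict(dict)  # term -> {filename: 1}
    for filename, terms in doc_terms.items():
        for term in terms:
            # Record each (term, file) pair only once, with weight 1.
            if filename not in term_doc_matrix[term]:
                inverted_index[term].append(filename)
                term_doc_matrix[term][filename] = 1
    return dict(inverted_index), dict(term_doc_matrix)


# Toy collection with made-up filenames and stems.
docs = {
    "a.txt": ["retriev", "index", "index"],
    "b.txt": ["index", "queri"],
}
inverted_index, term_doc_matrix = build_index(docs)
```

Storing the matrix as a dictionary of dictionaries keeps only the non-zero entries, which suits a sparse binary matrix over a small collection.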

Representing a Query and Scoring Documents:

● The represent_query function tokenizes and stems a user's search query
and represents it as a query vector.
● The score_documents function calculates document scores based on the
query vector and the binary term-document matrix.
● Document scores are normalized by dividing them by the number of terms in
each document.
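
A rough sketch of the query representation and scoring steps is given below. The function names follow the report, but the signatures and bodies are assumptions: the `normalize` parameter stands in for the Porter stemmer (it defaults to lowercasing so the sketch runs without NLTK), and "number of terms in each document" is interpreted here as the number of distinct indexed terms per document.

```python
def represent_query(query, vocabulary, normalize=str.lower):
    """Return a binary query vector {term: 1} over the indexed vocabulary.

    normalize is a stand-in for the stemmer (assumption of this sketch).
    """
    query_vector = {}
    for token in query.split():
        term = normalize(token)
        if term in vocabulary:
            query_vector[term] = 1
    return query_vector


def score_documents(query_vector, term_doc_matrix):
    """Score documents by term overlap, normalized by document length."""
    # Document length = number of distinct indexed terms in the document.
    doc_lengths = {}
    for postings in term_doc_matrix.values():
        for doc in postings:
            doc_lengths[doc] = doc_lengths.get(doc, 0) + 1

    # Dot product of the query vector with each document's binary column.
    scores = {}
    for term, q_weight in query_vector.items():
        for doc, d_weight in term_doc_matrix.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + q_weight * d_weight

    return {doc: score / doc_lengths[doc] for doc, score in scores.items()}


# Toy matrix with made-up filenames and stems.
matrix = {
    "index": {"a.txt": 1, "b.txt": 1},
    "retriev": {"a.txt": 1},
    "queri": {"b.txt": 1},
    "rank": {"b.txt": 1},
}
query_vector = represent_query("Index retriev", matrix)
scores = score_documents(query_vector, matrix)
```

In this toy example a.txt matches both query terms and contains two indexed terms, while b.txt matches one of its three, so the normalization favors the shorter, fully matching document.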

Ranking and Retrieving Documents:

● The rank_documents function sorts documents by their scores in
descending order.
● The retrieve_top_k_documents function retrieves the top-K documents
from the ranked list.
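
Ranking and top-K retrieval might look like the sketch below. The function names come from the report; the bodies are illustrative, and breaking ties by filename is an assumption added here to make the ordering deterministic.

```python
def rank_documents(scores):
    """Return (filename, score) pairs sorted by score, highest first.

    Ties are broken alphabetically by filename (assumption of this sketch).
    """
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def retrieve_top_k_documents(ranked, k):
    """Return the first k entries of an already ranked list."""
    return ranked[:k]


# Made-up scores for three hypothetical files.
scores = {"a.txt": 1.0, "b.txt": 0.25, "c.txt": 0.5}
ranked = rank_documents(scores)
top_two = retrieve_top_k_documents(ranked, 2)
```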

Main Execution:

● The script obtains the directory path of the code file and creates the inverted
index and binary term-document matrix using the create_index function.
● It enters a loop where the user can input search queries interactively.
● For each query, it represents the query, scores documents, ranks them,
retrieves the top 2 documents, and presents the results.
● The loop continues until the user enters "exit".
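
The interactive loop can be sketched as follows. The `query_loop` helper and its `search`, `read`, and `write` parameters are hypothetical names introduced for illustration: `search` stands for the represent, score, rank, and retrieve pipeline applied to one query, while `read` and `write` default to the built-in `input` and `print` so the loop can also be driven without a terminal.

```python
def query_loop(search, read=input, write=print):
    """Repeatedly read queries and show results until the user types 'exit'.

    search(query) is assumed to return a list of (filename, score) pairs,
    already ranked and truncated to the top K.
    """
    while True:
        query = read("Enter a query (or 'exit' to quit): ").strip()
        if query.lower() == "exit":
            break
        results = search(query)
        if not results:
            write("No matching documents.")
            continue
        for filename, score in results:
            write(f"{filename}\t{score:.3f}")


# Example (not executed here): query_loop(my_search_function)
```

Passing the input and output functions as parameters is a design choice of this sketch; it keeps the loop testable, whereas the original script presumably calls `input` and `print` directly.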

Block Diagram and Data Flow Diagram (DFD):
