Assignment 3 BIM IR

This document describes an information retrieval assignment that involves creating an inverted index and binary term-document matrix from a collection of text documents. It discusses the libraries used, preprocessing steps like tokenization and stemming, representing a query, scoring and ranking documents, and retrieving the top results. The code flow details creating the index, handling queries, scoring documents based on the query and matrix, ranking results, and returning the top matches to the user interactively.

Information Retrieval

Assignment 3

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science

University of Engineering and Technology
Lahore Pakistan
Libraries Used:

1. os:
○ Purpose: Provides functions for interacting with the operating system,
particularly used for file operations and directory traversal.
2. nltk:
○ Purpose: The Natural Language Toolkit (NLTK) library is used for
natural language processing tasks such as tokenization, stemming,
and part-of-speech tagging.
3. nltk.corpus.stopwords:
○ Purpose: NLTK's stopwords corpus provides a list of common English
stopwords, which are words typically excluded from text analysis due to
their high frequency and low informativeness.
4. nltk.stem.PorterStemmer:
○ Purpose: The PorterStemmer class from NLTK implements the Porter
stemming algorithm, which reduces words to their root or base form,
standardizing words for analysis.
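
As a rough illustration of how the outputs of these libraries fit together, the sketch below filters part-of-speech-tagged tokens down to stemmed nouns. It is not the report's code. To stay runnable without NLTK's downloadable data, it takes (word, tag) pairs in the shape that nltk.pos_tag returns, plus a stopword set and a stem function passed in as parameters; in the report these would presumably come from stopwords.words('english') and PorterStemmer().stem. The name noun_terms and the lowercasing step are assumptions made for this example.

```python
# Penn Treebank noun tags named in the report.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_terms(tagged_tokens, stop_words, stem):
    """Return stemmed nouns from (word, tag) pairs, skipping stopwords.

    tagged_tokens: iterable of (word, tag) pairs, as produced by a POS tagger.
    stop_words:    set of lowercase words to exclude (hypothetical parameter).
    stem:          callable mapping a word to its stem (hypothetical parameter).
    """
    terms = []
    for word, tag in tagged_tokens:
        lowered = word.lower()  # assumption: compare and stem in lowercase
        if tag in NOUN_TAGS and lowered not in stop_words:
            terms.append(stem(lowered))
    return terms
```

Passing the stemmer and stopword list as arguments keeps the filter easy to test with small stand-ins before wiring in the NLTK objects.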

Code Flow:

Preprocessing and Creating the Inverted Index:

● The code begins by importing necessary libraries and initializing NLTK's
PorterStemmer and English stopwords.
● The create_index function is defined to create an inverted index and a
binary term-document matrix for a collection of text documents in a specified
directory.
● It iterates through each text file in the directory, reads the content, and
tokenizes it into sentences.
● For each sentence, it tokenizes it into words and tags their parts of speech
using NLTK's pos_tag.
● The code identifies words that are nouns (NN, NNS, NNP, NNPS) and not in
the list of English stopwords. These words are stemmed using the Porter
stemmer.
● Entries are added to the inverted index, where the stemmed word is the key,
and a list of filenames where the word appears is the value.
● The binary term-document matrix is also created, where each term is
associated with documents in which it appears with a binary weight of 1.
● UnicodeDecodeError exceptions are handled for files that cannot be decoded.
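
The indexing steps above can be sketched roughly as follows. This is a minimal approximation, not the report's implementation: the helper simple_terms is a hypothetical stand-in that only lowercases and strips punctuation, replacing the NLTK pipeline (sentence and word tokenization, noun filtering, stopword removal, Porter stemming). The .txt filter, the UTF-8 encoding, and the choice to skip undecodable files silently are also assumptions.

```python
import os
from collections import defaultdict


def simple_terms(text):
    # Hypothetical stand-in for the NLTK preprocessing described in the report.
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def create_index(directory, preprocess=simple_terms):
    """Build an inverted index and a binary term-document matrix.

    inverted_index:  term -> list of filenames containing the term
    term_doc_matrix: term -> {filename: 1} for each document containing it
    """
    inverted_index = defaultdict(list)
    term_doc_matrix = defaultdict(dict)
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):  # assumption: only .txt files are indexed
            continue
        path = os.path.join(directory, filename)
        try:
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
        except UnicodeDecodeError:
            continue  # skip files that cannot be decoded
        for term in preprocess(text):
            if filename not in inverted_index[term]:
                inverted_index[term].append(filename)
            term_doc_matrix[term][filename] = 1  # binary weight
    return dict(inverted_index), dict(term_doc_matrix)
```

Storing the matrix as a dictionary of dictionaries keeps only the nonzero entries, which suits a binary model where most term-document pairs are 0.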

Representing a Query and Scoring Documents:

● The represent_query function tokenizes and stems a user's search query
and represents it as a query vector.
● The score_documents function calculates document scores based on the
query vector and the binary term-document matrix.
● Document scores are normalized by dividing them by the number of terms in
each document.
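
One way the query representation and scoring could look is sketched below. The function names follow the report, but the bodies are assumptions: the query vector is taken to be a dictionary of binary weights, the preprocess parameter is a hypothetical callable standing in for the tokenizing and stemming step, and "the number of terms in each document" is interpreted as the number of distinct indexed terms in that document.

```python
def represent_query(query, preprocess):
    """Map each preprocessed query term to a binary weight of 1."""
    return {term: 1 for term in preprocess(query)}


def score_documents(query_vector, term_doc_matrix):
    """Score documents by term overlap with the query, then normalize.

    term_doc_matrix: term -> {document: 1}
    Returns document -> score, only for documents matching at least one term.
    """
    # Count the distinct indexed terms per document (normalization factor).
    doc_term_counts = {}
    for postings in term_doc_matrix.values():
        for doc in postings:
            doc_term_counts[doc] = doc_term_counts.get(doc, 0) + 1

    # Dot product of the query vector with each document's binary vector.
    scores = {}
    for term, query_weight in query_vector.items():
        for doc, doc_weight in term_doc_matrix.get(term, {}).items():
            scores[doc] = scores.get(doc, 0) + query_weight * doc_weight

    return {doc: raw / doc_term_counts[doc] for doc, raw in scores.items()}
```

For example, with a two-term query, a document that contains both query terms and has two indexed terms in total scores 2/2 = 1.0, while a document with three indexed terms that matches only one query term scores 1/3.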

Ranking and Retrieving Documents:

● The rank_documents function sorts documents by their scores in
descending order.
● The retrieve_top_k_documents function retrieves the top-K documents
from the ranked list.
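
The two functions above are small enough to sketch directly. The version below assumes the scores arrive as a dictionary of document to score and that the ranked list is a list of (document, score) pairs; the report does not state either detail.

```python
def rank_documents(scores):
    """Return (document, score) pairs sorted by score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def retrieve_top_k_documents(ranked_documents, k):
    """Return the first k entries of an already ranked list."""
    return ranked_documents[:k]
```

Slicing handles the case where fewer than k documents matched: it simply returns whatever is available.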

Main Execution:

● The script obtains the directory path of the code file and creates the inverted
index and binary term-document matrix using the create_index function.
● It enters a loop where the user can input search queries interactively.
● For each query, it represents the query, scores documents, ranks them,
retrieves the top 2 documents, and presents the results.
● The loop continues until the user enters "exit".
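
The interactive loop might be structured along these lines. This is a sketch under several assumptions: search_loop and its parameters are hypothetical names, answer_query stands for the represent, score, rank, and retrieve steps with K = 2, the prompt text and output format are invented, and matching "exit" case-insensitively is a choice made here. The script's own directory would typically be obtained with os.path.dirname(os.path.abspath(__file__)) before building the index.

```python
def search_loop(answer_query, read=input, write=print):
    """Repeatedly read a query and report results until the user types 'exit'.

    answer_query: callable taking a query string and returning a list of
                  (document, score) pairs, e.g. the top 2 ranked documents.
    read, write:  injectable I/O functions so the loop can be exercised
                  without a terminal (hypothetical parameters).
    """
    while True:
        query = read("Enter a query (or 'exit' to quit): ")
        if query.strip().lower() == "exit":
            break
        results = answer_query(query)
        if not results:
            write("No matching documents.")
            continue
        for document, score in results:
            write(f"{document}\t{score:.3f}")
```

Injecting read and write keeps the control flow identical to a plain input/print loop while making it straightforward to test with scripted queries.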

Block Diagram and Data Flow Diagram (DFD):

Project Report
Document17 pages
Project Report
aayush tanwar
0% (2)
Assignment 2 IR
Document6 pages
Assignment 2 IR
Pac SaQii
No ratings yet
Assignment 1 IR
Document4 pages
Assignment 1 IR
Pac SaQii
No ratings yet
Lab3 IR BIM
Document14 pages
Lab3 IR BIM
Pac SaQii
No ratings yet
Lab1 IR
Document14 pages
Lab1 IR
Pac SaQii
No ratings yet
Assignment 4
Document3 pages
Assignment 4
Pac SaQii
No ratings yet
Chapter 1: Boolean Retrieval
Document9 pages
Chapter 1: Boolean Retrieval
Amber Saxena
No ratings yet
Assignment 3 NonOverlap IR
Document3 pages
Assignment 3 NonOverlap IR
Pac SaQii
No ratings yet
Ram Chandra Padwal - Pratical Guide To NLTK For Data Science
Document37 pages
Ram Chandra Padwal - Pratical Guide To NLTK For Data Science
Zander Catta Preta
No ratings yet
Pt. L R Group of Institutions Faridabad Delhi NCR
Document21 pages
Pt. L R Group of Institutions Faridabad Delhi NCR
frodykaam
No ratings yet
Lab2 IR
Document16 pages
Lab2 IR
Pac SaQii
No ratings yet
NLP - Short Assignments
Document8 pages
NLP - Short Assignments
wemela1891
No ratings yet
V3I10201482
Document3 pages
V3I10201482
Aritra Dattagupta
No ratings yet
Industrial Visit Report
Document23 pages
Industrial Visit Report
yousuf
No ratings yet
Keystroke Logging in Second Language Writing: Fabio Pruneri
Document13 pages
Keystroke Logging in Second Language Writing: Fabio Pruneri
cafio
No ratings yet
Dsbdal A7
Document65 pages
Dsbdal A7
airprojectjnv2020
No ratings yet
We Are Intechopen, The World'S Leading Publisher of Open Access Books Built by Scientists, For Scientists
Document19 pages
We Are Intechopen, The World'S Leading Publisher of Open Access Books Built by Scientists, For Scientists
Nicholas Callahan
No ratings yet
NLTK: The Natural Language Toolkit: Steven Bird Edward Loper
Document4 pages
NLTK: The Natural Language Toolkit: Steven Bird Edward Loper
Yash Gautam
No ratings yet
Openedgar: Open Source Software For Sec Edgar Analysis: Mit Computational Law Report
Document18 pages
Openedgar: Open Source Software For Sec Edgar Analysis: Mit Computational Law Report
Gunda Venkata Sai
No ratings yet
A Summer Training Report On Python and It's Libraries Under The Guidance of
Document20 pages
A Summer Training Report On Python and It's Libraries Under The Guidance of
Newton Sathyavety
No ratings yet
Text Mining
Document31 pages
Text Mining
Anonymous sETEf2rtz
No ratings yet
Text Summarization Using The T5 Transformer Model
Document3 pages
Text Summarization Using The T5 Transformer Model
onkarrborude02
No ratings yet
COURSEWORK1 Details
Document3 pages
COURSEWORK1 Details
Bouzid Moulkaf
No ratings yet
PPS Unit-5
Document23 pages
PPS Unit-5
sayyamverma0027
No ratings yet
Dakshina Ranjan Kisku Associate Professor Department of Computer Science and Engineering National Institute of Technology Durgapur
Document16 pages
Dakshina Ranjan Kisku Associate Professor Department of Computer Science and Engineering National Institute of Technology Durgapur
Agrawal Darpan
No ratings yet
The Web As Corpus: The WWW As The Largest Existing Repository of Texts
Document12 pages
The Web As Corpus: The WWW As The Largest Existing Repository of Texts
granaina
No ratings yet
Assignment No. 1: Name: Omkar Joshi Roll No: BE20S05F004 Sub: SPCC
Document6 pages
Assignment No. 1: Name: Omkar Joshi Roll No: BE20S05F004 Sub: SPCC
TECH TALKS
No ratings yet
Text Mining: Open Source Tokenization Tools - An Analysis
Document11 pages
Text Mining: Open Source Tokenization Tools - An Analysis
acii journal
No ratings yet
Information Retrieval
Document62 pages
Information Retrieval
latigudata
No ratings yet
Assignment SCDlab
Document5 pages
Assignment SCDlab
Noor-Ul Ain
No ratings yet
INternship Report
Document22 pages
INternship Report
Kaushik Joshi
No ratings yet
NLP Assignment 2
Document12 pages
NLP Assignment 2
Radhe Shyam
No ratings yet
10 Marks System Software
Document8 pages
10 Marks System Software
Rohan Ray
No ratings yet
Report On Python
Document24 pages
Report On Python
Neha Gupta
No ratings yet
Dakshina Ranjan Kisku Associate Professor Department of Computer Science and Engineering National Institute of Technology Durgapur
Document31 pages
Dakshina Ranjan Kisku Associate Professor Department of Computer Science and Engineering National Institute of Technology Durgapur
Agrawal Darpan
No ratings yet
Chapter 3 IR
Document56 pages
Chapter 3 IR
Oumer Hussen
No ratings yet
Compiler Construction Past Paper 2022 Solution
Document12 pages
Compiler Construction Past Paper 2022 Solution
Zubair Jamil
No ratings yet
LP Vi Manual
Document77 pages
LP Vi Manual
Jahan Chaware
No ratings yet
Technical Seminar Presentation-2004: Presented by
Document20 pages
Technical Seminar Presentation-2004: Presented by
Swaraj Mohapatra
No ratings yet
Python Basic
Document6 pages
Python Basic
bizzpy n
No ratings yet
A Multilingual Database Management System For Ideographic Languages
Document10 pages
A Multilingual Database Management System For Ideographic Languages
Niranjan Nageswara
No ratings yet
Programming Language Design and Implementation-Pratt
Document15 pages
Programming Language Design and Implementation-Pratt
asdf_asdfasdfasdfasd
0% (2)
Assignment 3 Proximal Node IR
Document3 pages
Assignment 3 Proximal Node IR
Pac SaQii
No ratings yet
Lecture 1 2
Document21 pages
Lecture 1 2
Kamran Khan
No ratings yet
A Framework For Literate Programming
Document8 pages
A Framework For Literate Programming
Tio Penas
No ratings yet
Elastic
Document61 pages
Elastic
rim.moussa
No ratings yet
Comparision of Different Types of Parser and Parsing Techniques
Document4 pages
Comparision of Different Types of Parser and Parsing Techniques
erpublication
No ratings yet
CD Course File
Document114 pages
CD Course File
Ashutosh Jharkhade
No ratings yet
Converting Ontologies Into DSLs
Document8 pages
Converting Ontologies Into DSLs
Jonnathan Riquelmo
No ratings yet
Introduction To Natural Language Processing and NLTK
Document23 pages
Introduction To Natural Language Processing and NLTK
Nikhil Saini
No ratings yet
Natural Language Processing: Practical 1
Document64 pages
Natural Language Processing: Practical 1
hamza
No ratings yet
Thesis Proposal
Document4 pages
Thesis Proposal
beki4
No ratings yet
FAF 233 Nicolai Petcov 8
Document5 pages
FAF 233 Nicolai Petcov 8
petcovnicola
No ratings yet
Lexical Analyzer Synopsis Final
Document20 pages
Lexical Analyzer Synopsis Final
Sourabh Nigam
0% (1)
Final Compiler
Document14 pages
Final Compiler
usf94598
No ratings yet
Lesson 2: Everything Becomes Programmable
Document26 pages
Lesson 2: Everything Becomes Programmable
Ashley Villanueva
No ratings yet
MCA System Programming MC0073
Document15 pages
MCA System Programming MC0073
Heena Adhikari
0% (1)
Unit-Ii Notes
Document17 pages
Unit-Ii Notes
Sowmya Lakshmi
No ratings yet
Text Mining Techniques
Document7 pages
Text Mining Techniques
202101639
No ratings yet
Python For Data Science
From Everand
Python For Data Science
Kevin Clark
No ratings yet
Assignment 3 NonOverlap IR
Document3 pages
Assignment 3 NonOverlap IR
Pac SaQii
No ratings yet
Assignment 1 IR
Document4 pages
Assignment 1 IR
Pac SaQii
No ratings yet
Assignment 4
Document3 pages
Assignment 4
Pac SaQii
No ratings yet
Lab 1 DFD
Document1 page
Lab 1 DFD
Pac SaQii
No ratings yet
DATA-PROCESSING WAEC Syllabus
Document7 pages
DATA-PROCESSING WAEC Syllabus
okugberedominic08
No ratings yet
Aranzo Cyber Security of Government Websites of The Philippines
Document8 pages
Aranzo Cyber Security of Government Websites of The Philippines
ELLA MAE ARANZO
No ratings yet
Chapter4 Pipelining END FA11
Document84 pages
Chapter4 Pipelining END FA11
K Sri
No ratings yet
String Practical FAQ
Document8 pages
String Practical FAQ
Prerna Gour
No ratings yet
Prelim Quiz 2
Document11 pages
Prelim Quiz 2
RianneNael
No ratings yet
KV (-B) Firmware V2.2.53 - 220713 Release Note - NO
Document24 pages
KV (-B) Firmware V2.2.53 - 220713 Release Note - NO
Ionut Lapuste
No ratings yet
ABX00080 Datasheet
Document34 pages
ABX00080 Datasheet
pre freeda
No ratings yet
Yozo Log
Document2 pages
Yozo Log
Hisbullah hidayat
No ratings yet
BeckmanConnect PC Networking v2 6-2021 EN
Document3 pages
BeckmanConnect PC Networking v2 6-2021 EN
sougat 21123
No ratings yet
POSIX Threads Programming
Document15 pages
POSIX Threads Programming
Maria Luisa Nuñez Mendoza
No ratings yet
7.6 Test Data and Trace Tables To Document Dry Runs of Algorithms
Document5 pages
7.6 Test Data and Trace Tables To Document Dry Runs of Algorithms
Anisha Bushra Akond
No ratings yet
Apache and PHP Install
Document5 pages
Apache and PHP Install
Wiseman Baraka Ordination Mgongolwa
No ratings yet
Client Server
Document127 pages
Client Server
Fahad Ahmad
100% (1)
Use Case Diagram
Document8 pages
Use Case Diagram
The Mind
No ratings yet
Ict1 - Unit 4 - Msexcel
Document4 pages
Ict1 - Unit 4 - Msexcel
Everlyn D. Buglosa
No ratings yet
Stefano Markidis, Erwin Laure - Solving Software Challenges For Exascale 2015
Document154 pages
Stefano Markidis, Erwin Laure - Solving Software Challenges For Exascale 2015
Nguyen Thanh Binh
No ratings yet
JAVA Min
Document312 pages
JAVA Min
HANISHA SAALIH
No ratings yet
Checkweigher RFQ Form
Document2 pages
Checkweigher RFQ Form
wasim Khokhar
No ratings yet
Chapter 2. Pair Programming
Document15 pages
Chapter 2. Pair Programming
rONALD
No ratings yet
Hardware Abstraction Layer (Hal) Module of Motorware: User Interface
Document42 pages
Hardware Abstraction Layer (Hal) Module of Motorware: User Interface
Anitha Yala
No ratings yet
ADL300اجيفران انفرتر
Document34 pages
ADL300اجيفران انفرتر
رجل من الزمن الجميل
No ratings yet
SCM Express Trouble Shooting Guide
Document140 pages
SCM Express Trouble Shooting Guide
bob bo
100% (1)
c019147 - Web GPI Readme
Document2 pages
c019147 - Web GPI Readme
ميلاد النعيري
No ratings yet
Proposal
Document16 pages
Proposal
Vikas Sharma
No ratings yet
3G Catalog 2016 20
Document119 pages
3G Catalog 2016 20
kriz anthony zuniega
No ratings yet
ICP-Programming Assignment-I PDF
Document10 pages
ICP-Programming Assignment-I PDF
Abhijit Aroop
No ratings yet
Fyp Dis2016 Projek Info Responses - Compress
Document50 pages
Fyp Dis2016 Projek Info Responses - Compress
aleksandar7
100% (1)
Srujana Short Resume
Document2 pages
Srujana Short Resume
Harshvardhini Munwar
No ratings yet
Event Log
Document15,637 pages
Event Log
Riya Singh
No ratings yet
Variables and Data Types
Document17 pages
Variables and Data Types
Sanjeet Kumar
No ratings yet