Welcome to Scribd!

Algorithms For Information Retrieval: Index Construction

Uploaded by

0% found this document useful (0 votes)

10 views12 pages

Single pass in-memory indexing (SPIMI) is a more scalable algorithm for information retrieval index construction. SPIMI uses term instead of termIDs, writes each block's dictionary and posting list to disk as it is generated, and then starts a new dictionary for the next block. This eliminates the need for sorting tokens and results in a time complexity of O(T). Compression can further improve SPIMI efficiency by compressing terms and postings lists. However, SPIMI is still limited by the size of the indexing computer's storage and requires a static corpus.

Original Description:

Original Title

24_SPMI

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

0% found this document useful (0 votes)

10 views12 pages

Algorithms For Information Retrieval: Index Construction

Uploaded by

gpswain

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

Jump to Page

You are on page 1of 12

Search inside document

ALGORITHMS FOR INFORMATION

RETRIEVAL
Index Construction

Bhaskarjyoti Das
Department of Computer Science Engineering
ALGORITHMS FOR INFORMATION
RETRIEVAL

Single Pass In-memory Indexing

(SPIMI)

Bhaskarjyoti Das
Department of Computer Science Engineering
ALGORITHMS FOR INFORMATION RETRIEVAL
Opportunities for Improvement with BSBI Sort Based Algorithm

1. BSBI sort based indexing has excellent scaling properties

2. But …
• We work with a term id and get term-> term id from a
mapping. We always have to carry this dictionary in main
memory.
• Our assumption was: we can keep the large dictionary
in memory. But that may not be feasible !
• In BSBI, each block was sorted and kept before the merge
step. Can we avoid this expensive step ?
ALGORITHMS FOR INFORMATION RETRIEVAL
Single Pass In-memory Indexing (SPIMI)

▪Key idea 1: Generate separate dictionary for each

block i.e. no need to maintain term-term ID mapping
across blocks. Use term instead of termid in posting
list ?

▪Key idea 2: Don’t sort. Accumulate postings in

postings lists as they occur instead of first collecting
all termid-docid pairs and then sorting them.
ALGORITHMS FOR INFORMATION RETRIEVAL
Single Pass In-memory Indexing (SPIMI)

1. We are also creating a separate dictionary for each block

(no need to carry the dictionary in memory across blocks)
2. Since docids are sequentially accessed, the posting list
will be pre-sorted on docid ! We eliminate a “sort” step.
3. Compression technique can be used to store a compact
version of the index for a block (require less space on
disk)
Advantages:
1. We are able to directly track term-> docid through this
dynamic (growing) posting list . So it saves memory
2. Overall it is faster due to elimination of one sorting step
3. Compression allows Individual blocks to be larger and
overall algorithm efficiency higher ( eliminate more disk
seek time)
ALGORITHMS FOR INFORMATION RETRIEVAL
How Do these Ideas work ?

▪Till we run out of main memory, work with the current block
▪docid sequentially accessed
▪Let‘s say a document ( docid n) is getting parsed
▪If the term is new, add it to the dictionary for the current
block
▪else identify the record for the particular term that is
already existing
▪If initially allocated small posting list is full, double it
▪ Add the docid n to the posting list for this term
▪Sort on terms for lexicographic ordering
▪Write the index block (dictionary and posting list) into the disk
▪Start the next block with a new dictionary
ALGORITHMS FOR INFORMATION RETRIEVAL
SPIMI-INVERT
Parse document and generate token stream of (term-docid) pair
…….
Each call writes a Block to disk

Generate the next token pair (term, doc id)

ALGORITHMS FOR INFORMATION RETRIEVAL
SPIMI Summary

• single-pass in-memory indexing or SPIMI is more

scalable.
• SPIMI uses term instead of termIDs, writes each
block's dictionary and posting list to disk, and then
starts a new dictionary for the next block.
• Time complexity of SPIMI is O(T) since no sorting of
tokens involved and all operations are at best linear
to size of collections
• Compression can help make SPIMI even better
ALGORITHMS FOR INFORMATION RETRIEVAL
SPIMI Compression is an Added Feature !

▪Compression makes SPIMI even more efficient.

▪Compression of terms
▪Compression of postings
ALGORITHMS FOR INFORMATION RETRIEVAL
Same Limitations Like BSBI

• Collection should fit into hard disk of

the computer generating the index

• The corpus should be static

ALGORITHMS FOR INFORMATION RETRIEVAL
More Advanced Index Construction ?

▪Corpus need not be static. The corpus can

continuously change.

▪It may not be possible to fit the Corpus even in a

large hard disk of the computer processing the index.

▪For example, web corpus will never be static and

will be so large that it cannot be fitted in a hard
disk of a computer

▪This is where the Distributed and Dynamic Index

come in !!
THANK YOU

Bhaskarjyoti Das
Department of Computer Science Engineering

Optimizing Large Database Imports: Logical
Document11 pages
Optimizing Large Database Imports: Logical
Rafael Corro Molina
No ratings yet
UE20CS332 Unit2 Slides PDF
Document264 pages
UE20CS332 Unit2 Slides PDF
Gaurav Mahajan
No ratings yet
Lecture 4-Indexconstruction
Document45 pages
Lecture 4-Indexconstruction
Yash Gupta
No ratings yet
AI6122 Topic 3.1 - Index
Document40 pages
AI6122 Topic 3.1 - Index
Yujia Tian
No ratings yet
DBMS Unit - 3
Document29 pages
DBMS Unit - 3
jain jin
No ratings yet
DBMS Unit - 3
Document29 pages
DBMS Unit - 3
Nikhil Ranjan 211
No ratings yet
Assignment (DS)
Document8 pages
Assignment (DS)
Gaurav Gupta
No ratings yet
Overview - Explain - Measuring Performance - Disk Architectures - Indexes - Join Algorithms (CTD.)
Document69 pages
Overview - Explain - Measuring Performance - Disk Architectures - Indexes - Join Algorithms (CTD.)
Purushothama Reddy
No ratings yet
Algorithms For Information Retrieval: Index Construction
Document13 pages
Algorithms For Information Retrieval: Index Construction
gpswain
No ratings yet
Building A Scalable Time-Series Database Using Postgres: Mike Freedman
Document45 pages
Building A Scalable Time-Series Database Using Postgres: Mike Freedman
pokecho
No ratings yet
Os Unit 2
Document14 pages
Os Unit 2
AASTHA
No ratings yet
23 - BSBI - Part 2
Document13 pages
23 - BSBI - Part 2
gpswain
No ratings yet
Memory Management: Background Swapping Contiguous Allocation Paging Segmentation Segmentation With Paging
Document55 pages
Memory Management: Background Swapping Contiguous Allocation Paging Segmentation Segmentation With Paging
sumipriyaa
No ratings yet
Lecture (Chapter 9)
Document55 pages
Lecture (Chapter 9)
ekiholo
No ratings yet
DB2 Basics
Document33 pages
DB2 Basics
cksshimoga
No ratings yet
File Structure and Indexing
Document18 pages
File Structure and Indexing
Ankan Das
No ratings yet
Chapter 11. File Organisation and Indexes
Document56 pages
Chapter 11. File Organisation and Indexes
Pankaj Dadhich
No ratings yet
Unit-3 Os Notes
Document33 pages
Unit-3 Os Notes
linren2005
No ratings yet
Storing Data: Disks and Files: (R&G Chapter 9)
Document39 pages
Storing Data: Disks and Files: (R&G Chapter 9)
raw.junk
No ratings yet
Memory Organization
Document74 pages
Memory Organization
Sumi Chatterjee
No ratings yet
CS8492 /database Management Systems 2017 Regulations: Unit Iv Implementation Techniques
Document22 pages
CS8492 /database Management Systems 2017 Regulations: Unit Iv Implementation Techniques
Harshath M
No ratings yet
Basics of Operating Systems
Document211 pages
Basics of Operating Systems
Arpit
No ratings yet
02 Coding Conventions
Document30 pages
02 Coding Conventions
Mohsin Muzawar
No ratings yet
File Organization and Indexing
Document13 pages
File Organization and Indexing
Rashmi Das
No ratings yet
DBMS - Unit 4
Document52 pages
DBMS - Unit 4
logunaathan
No ratings yet
UNIT 4 Dbms
Document22 pages
UNIT 4 Dbms
041-IT KAVIYA
No ratings yet
CS 241 Section Week #9 (04/09/09)
Document85 pages
CS 241 Section Week #9 (04/09/09)
abel bahiru
No ratings yet
09 Indexconcurrency
Document3 pages
09 Indexconcurrency
sondos
No ratings yet
MS SQL Server Architecture: Febriliyan Samopa
Document27 pages
MS SQL Server Architecture: Febriliyan Samopa
Pramuditha Muhammad Ikhwan
No ratings yet
Part-A Q.1 Enlist A Few Objectives of The File Management System
Document7 pages
Part-A Q.1 Enlist A Few Objectives of The File Management System
Puneet Bansal
No ratings yet
Block Diagram of A DBMS: (R&G Chapter 9)
Document6 pages
Block Diagram of A DBMS: (R&G Chapter 9)
Akash Singh
No ratings yet
26 - Distributed Indexing Using MAP REDUCE
Document12 pages
26 - Distributed Indexing Using MAP REDUCE
gpswain
No ratings yet
Document Oriented Database
Document50 pages
Document Oriented Database
Karthik Amarnath
No ratings yet
Inls 623 - Database Systems Ii - File Structures, Indexing, and Hashing
Document41 pages
Inls 623 - Database Systems Ii - File Structures, Indexing, and Hashing
Akriti Agrawal
No ratings yet
Ch. 3 Lecture 1 - 3 PDF
Document83 pages
Ch. 3 Lecture 1 - 3 PDF
mikias
No ratings yet
DBMS QB 5 Ans
Document25 pages
DBMS QB 5 Ans
Ram Anand
No ratings yet
CS276 PA1 Report Rukmani Ravi Sundaram Tayyab Tariq 1.description of The Structure of The Program: Index - Py
Document2 pages
CS276 PA1 Report Rukmani Ravi Sundaram Tayyab Tariq 1.description of The Structure of The Program: Index - Py
Khozainuz Zuhri
No ratings yet
CIT 401 Lecture Note
Document46 pages
CIT 401 Lecture Note
Daniel Izevbuwa
No ratings yet
Review: (R&G Chapter 9) - Aren't Databases Great? - Relational Model - SQL
Document7 pages
Review: (R&G Chapter 9) - Aren't Databases Great? - Relational Model - SQL
khilanvekaria
No ratings yet
DS 7 5 Quest Ansrs Tuning
Document73 pages
DS 7 5 Quest Ansrs Tuning
George E. Coles
No ratings yet
C10 IR M2021 IndexConstruction SimpleandDistributed
Document42 pages
C10 IR M2021 IndexConstruction SimpleandDistributed
Naveen Kumar Kalagarla
No ratings yet
DBMS Unit V
Document17 pages
DBMS Unit V
Keshava Varma
No ratings yet
Memory Management
Document57 pages
Memory Management
1nshsankrit
No ratings yet
8 DataStorageIndexingStructures Updated
Document57 pages
8 DataStorageIndexingStructures Updated
Rosmarinus
No ratings yet
File and File Structure: Overview of Storage Device
Document29 pages
File and File Structure: Overview of Storage Device
Nabin Shrestha
No ratings yet
Course Code: CS 283 Course Title: Computer Architecture: Class Day: Friday Timing: 12:00 To 1:30
Document23 pages
Course Code: CS 283 Course Title: Computer Architecture: Class Day: Friday Timing: 12:00 To 1:30
Niaz Ahmed Khan
No ratings yet
Computer Architecture and OS
Document46 pages
Computer Architecture and OS
David Darbaidze
No ratings yet
Datastage Interview Questions
Document18 pages
Datastage Interview Questions
Ganesh Kumar
100% (1)
Weel 12 Memory
Document91 pages
Weel 12 Memory
JATIN gla
No ratings yet
Cs8493 Operating System 3
Document37 pages
Cs8493 Operating System 3
alexb072002
No ratings yet
Module 7 - Main Memory
Document41 pages
Module 7 - Main Memory
hexeko8100
No ratings yet
Lecture12 Disks and Files PartI 20feb 2018
Document41 pages
Lecture12 Disks and Files PartI 20feb 2018
Ramakrishna Yellapragada
No ratings yet
CSC 1002 Computer Architecture Third Assignment
Document4 pages
CSC 1002 Computer Architecture Third Assignment
Achyut Neupane
No ratings yet
01 - Hadoop - HDFS
Document49 pages
01 - Hadoop - HDFS
Veera V Ch
No ratings yet
Embedded Systems Unit 8 Notes
Document17 pages
Embedded Systems Unit 8 Notes
Pradeep Kumar Goud Nadikuda
No ratings yet
Execution
Document37 pages
Execution
Luyeye Pedro Lopes
No ratings yet
(17CS82) 8 Semester CSE: Big Data Analytics
Document169 pages
(17CS82) 8 Semester CSE: Big Data Analytics
Prakash G
No ratings yet
Database File Organisation Lecture
Document32 pages
Database File Organisation Lecture
mccreary.michael95
No ratings yet
CS1120 - Principles of Operating Systems: Lectures 10,11,12 And13 - Memory Management Mr. Rowan N. Elomina
Document83 pages
CS1120 - Principles of Operating Systems: Lectures 10,11,12 And13 - Memory Management Mr. Rowan N. Elomina
Melvin Leyva
No ratings yet
SAS Interview Questions You'll Most Likely Be Asked
From Everand
SAS Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
26 - Distributed Indexing Using MAP REDUCE
Document12 pages
26 - Distributed Indexing Using MAP REDUCE
gpswain
No ratings yet
Algorithms For Information Retrieval: Index Construction
Document13 pages
Algorithms For Information Retrieval: Index Construction
gpswain
No ratings yet
25 - Map Reduce Recap
Document12 pages
25 - Map Reduce Recap
gpswain
No ratings yet
23 - BSBI - Part 2
Document13 pages
23 - BSBI - Part 2
gpswain
No ratings yet
Algorithms For Information Retrieval: Index Construction
Document12 pages
Algorithms For Information Retrieval: Index Construction
gpswain
No ratings yet
Lecture Note 02 - CSE4310
Document18 pages
Lecture Note 02 - CSE4310
ken masters
No ratings yet
2.3 - Error Detection & Correction
Document19 pages
2.3 - Error Detection & Correction
bobjones
No ratings yet
K-Means Clustering
Document38 pages
K-Means Clustering
madhullika204
No ratings yet
Chapter 3 - Uniprocessor Scheduling
Document40 pages
Chapter 3 - Uniprocessor Scheduling
KarthikhaKV
No ratings yet
Signals and Systems DTFT
Document69 pages
Signals and Systems DTFT
A.Mohammad idhris
No ratings yet
Unit-6 Storage Strategies
Document43 pages
Unit-6 Storage Strategies
Shiv Patel
No ratings yet
Digital Video Processing, Second Edition, 2015: June 2015
Document9 pages
Digital Video Processing, Second Edition, 2015: June 2015
SAURABH DEGDAWALA
No ratings yet
Fixed Point Iteration: Roots of Equation
Document4 pages
Fixed Point Iteration: Roots of Equation
Mehedi Hasan Apu
No ratings yet
Sampling Rate Conversion
Document25 pages
Sampling Rate Conversion
Narendra Chaurasia
No ratings yet
BME 3112 Exp#04 FIR Filter Window Method
Document6 pages
BME 3112 Exp#04 FIR Filter Window Method
Muhammad Muinul Islam
No ratings yet
Discrete Approximation of Continuous Systems: CSE 421 Digital Control
Document15 pages
Discrete Approximation of Continuous Systems: CSE 421 Digital Control
Ahmed Younis
No ratings yet
Math 101 ST Periodical
Document2 pages
Math 101 ST Periodical
Felipe R. Illeses
50% (2)
1 s2.0 S0020025521002693 Main
Document16 pages
1 s2.0 S0020025521002693 Main
APRAJITA SRIVASTAVA
No ratings yet
What Is Computer Vision?
Document120 pages
What Is Computer Vision?
4227AKSHA KHAIRMODE
No ratings yet
Linear Programming: Artificial Variable Technique: Big - M Method
Document4 pages
Linear Programming: Artificial Variable Technique: Big - M Method
Fatima Al-Doski
No ratings yet
Machine Learning-Csen 3233-2023
Document4 pages
Machine Learning-Csen 3233-2023
cmmaity2017
No ratings yet
Unit 4 Attribute-Closure
Document6 pages
Unit 4 Attribute-Closure
Subhashree Dash
No ratings yet
8-19-21 (Submission)
Document52 pages
8-19-21 (Submission)
Hur Hasan
No ratings yet
Pre-Analysis: Example: Steady One-Dimensional Heat Conduction in A Bar
Document12 pages
Pre-Analysis: Example: Steady One-Dimensional Heat Conduction in A Bar
วรศิษฐ์ อ๋อง
No ratings yet
ML 2
Document3 pages
ML 2
bhavyaraikoti
No ratings yet
Week 1 Assignment
Document7 pages
Week 1 Assignment
Jaime Palizardo
No ratings yet
Image Transforms: DFT-Properties, Walsh, Hadamard, Discrete Cosine, Haar and Slant Transforms The Hotelling Transform
Document25 pages
Image Transforms: DFT-Properties, Walsh, Hadamard, Discrete Cosine, Haar and Slant Transforms The Hotelling Transform
ramadevi
No ratings yet
Ec8393 Fundamentals of Datastructures in C
Document12 pages
Ec8393 Fundamentals of Datastructures in C
Balaji GN
No ratings yet
Index Compression
Document38 pages
Index Compression
Utkarsh Vaish
100% (1)
Bit 3101a BSD 2101 Bac 2102 Bisf 2102 Bbit 106 Data Structures and Algorithms
Document2 pages
Bit 3101a BSD 2101 Bac 2102 Bisf 2102 Bbit 106 Data Structures and Algorithms
Quenter Njoroge
No ratings yet
Assignment On Transform Pairs
Document7 pages
Assignment On Transform Pairs
Manvendra Singh
No ratings yet
Scikit Learn
Document17 pages
Scikit Learn
RR
No ratings yet
AI Game Playing
Document46 pages
AI Game Playing
kaushi123
0% (1)
Over-Sampling Algorithm For Imbalanced Data Classification: XU Xiaolong, Chen Wen, and SUN Yanfei
Document10 pages
Over-Sampling Algorithm For Imbalanced Data Classification: XU Xiaolong, Chen Wen, and SUN Yanfei
vikasbhowate
No ratings yet
Limitations of GA
Document2 pages
Limitations of GA
Abhiram Don
No ratings yet