
Text Indexing, Storage & Compression

Information Retrieval (3170718)


Outline
❖ Text encoding: tokenization, stemming, stop words,
phrases, index optimization
❖ Index compression: lexicon compression and postings lists compression
❖ Gap encoding, Gamma codes, Zipf's Law
❖ Index construction
2
Outline(Cont...)
❖ Postings size estimation
❖ Merge sort
❖ Dynamic indexing
❖ Positional indexes
❖ n-gram indexes
❖ Real-world issues
3
Text encoding: tokenization,
stemming, stop words,
phrases, index optimization
4
Tokenization
❖ Tokenization is a way of separating a piece of text into smaller
units called tokens.
❖ Tokens can be either words, characters, or subwords.
❖ Tokenization can be broadly classified into 3 types – word,
character, and subword (n-gram characters) tokenization.
❖ For example, consider the sentence: “Never give up”.
- The most common way of forming tokens is based on space.
5
Tokenization(Cont...)
- Assuming space as a delimiter, the tokenization of the sentence
results in 3 tokens – Never-give-up. As each token is a word, it
becomes an example of Word tokenization.
❖ Similarly, tokens can be either characters or subwords. For
example, let us consider “smarter”:
- Character tokens: s-m-a-r-t-e-r
- Subword tokens: smart-er
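❖ A minimal Python sketch of the three granularities; the subword split shown is hand-picked for illustration (real subword tokenizers, e.g. BPE, learn their splits from data):

sentence = "Never give up"

# Word tokenization: split on whitespace (the delimiter assumed above).
word_tokens = sentence.split()        # ['Never', 'give', 'up']

# Character tokenization: every character becomes a token.
char_tokens = list("smarter")         # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Subword tokenization: a hand-picked split used purely for illustration.
subword_tokens = ["smart", "er"]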
6
Word Tokenization
❖ Word Tokenization is the most commonly used tokenization
algorithm.
❖ It splits a piece of text into individual words based on a certain
delimiter.
❖ Depending upon delimiters, different word-level tokens are formed.
❖ Pretrained Word Embeddings such as Word2Vec and GloVe come under word tokenization.
7
Word Tokenization - Drawback
❖ One of the major issues with word tokens is dealing with Out Of
Vocabulary (OOV) words.
❖ OOV words refer to new words that are encountered at test time.
❖ These new words do not exist in the vocabulary. Hence, these
methods fail in handling OOV words.

8
Word Tokenization - Solution
❖ A small trick can rescue word tokenizers from OOV words.
❖ The trick is to form the vocabulary with the Top K Frequent Words
and replace the rare words in training data with unknown
tokens (UNK).
❖ This helps the model to learn the representation of OOV words in
terms of UNK tokens.
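❖ A minimal sketch of the Top-K / UNK trick described above; the toy corpus, K, and the "<UNK>" symbol are illustrative choices:

from collections import Counter

corpus_tokens = ["never", "never", "never", "give", "give", "up", "up", "in", "stay"]
K = 4  # illustrative vocabulary size

# Keep only the K most frequent words in the vocabulary.
vocab = {w for w, _ in Counter(corpus_tokens).most_common(K)}

# Replace every out-of-vocabulary (rare or unseen) word with UNK.
def encode(tokens, unk="<UNK>"):
    return [t if t in vocab else unk for t in tokens]

print(encode(["never", "give", "up", "forever"]))   # ['never', 'give', 'up', '<UNK>']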

9
Stemming
❖ Stemming is a technique used to extract the base form of the words
by removing affixes from them.
❖ It is just like cutting down the branches of a tree to its stems.
❖ For example, the stem of the words eating, eats, eaten is eat.
❖ Search engines use stemming for indexing the words. That’s why
rather than storing all forms of a word, a search engine can store only
the stems.
❖ This helps reduce the size of the index and increase retrieval accuracy.
10
Stemming(Cont...)
❖ A stemming algorithm reduces the words “chocolates”, “chocolatey”,
“choco” to the root word, “chocolate” and “retrieval”, “retrieved”,
“retrieves” reduce to the stem “retrieve”.
❖ Stemming is an important part of the pipelining process in Natural
language processing.
❖ The input to the stemmer is tokenized words.
❖ Example: Words that stem to the root "like" include "likes", "liked", "likely", "liking", etc.
11
Various Stemming Algorithms
❖ Porter stemming algorithm
❖ Lancaster stemming algorithm
❖ Regular Expression stemming algorithm
❖ Snowball stemming algorithm

Note: For more information, refer to https://www.geeksforgeeks.org/introduction-to-stemming/
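❖ A short sketch using NLTK's implementations of two of these algorithms (this assumes the nltk package is installed; the word list is illustrative):

from nltk.stem import PorterStemmer, LancasterStemmer

words = ["likes", "liked", "likely", "liking", "chocolates", "retrieval"]
porter, lancaster = PorterStemmer(), LancasterStemmer()

for w in words:
    # The two algorithms can disagree: Lancaster is more aggressive
    # and tends to over-stem compared with Porter.
    print(w, porter.stem(w), lancaster.stem(w))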

12
Stemming Issues
❖ Overstemming
- Over-stemming occurs when two words with different stems are reduced to the same root.
- Over-stemming can also be regarded as a false positive.
❖ Understemming
- Under-stemming occurs when two words that should be reduced to the same stem are not.
- Under-stemming can be interpreted as a false negative.
13
Stop Words
❖ Stop words are a set of commonly used words in a language.
❖ Examples of stop words in English are "a", "the", "is", "are", etc.
❖ Stop words are commonly used in Text Mining and Natural Language
Processing (NLP) to eliminate words that are so commonly used that
they carry very little useful information.
❖ E.g., in the context of a search system, if your query is "what is a stop word?", you want the search system to focus on surfacing documents that talk about "stop word" over documents that talk about "what is a".
14
Stop Words(Cont...)
❖ This can be done by maintaining a list of stop words (which can be
manually or automatically curated) and preventing all words from
your stop word list from being analyzed.
❖ In this example, the words "what", "is" and "a" could be eliminated, leaving only the words "stop word".
❖ This ensures that documents that are topically relevant have a high
rank in your search results.
❖ List of English Stop Words: https://xpo6.com/list-of-english-stop-words/
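❖ A minimal sketch of stop word filtering for the query above; the tiny stop list is illustrative, real systems use a curated list such as the one linked:

STOP_WORDS = {"a", "an", "the", "is", "are", "what", "of", "and"}

def remove_stop_words(query):
    # Keep only the tokens that are not in the stop word list.
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("what is a stop word"))   # ['stop', 'word']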
15
Stop Words - Applications
❖ Supervised machine learning - removing stop words from the
feature space
❖ Clustering - removing stop words prior to generating clusters
❖ Information retrieval - preventing stop words from being indexed
❖ Text summarization - excluding stop words from contributing to
summarization scores & removing stop words when computing
ROUGE scores

16
Types of Stop Words
❖ Determiners – Determiners tend to mark nouns where a determiner
usually will be followed by a noun
examples: the, a, an, another, etc.
❖ Coordinating Conjunctions – Coordinating conjunctions connect
words, phrases, and clauses
examples: for, and, nor, but, or, yet, so, etc.
❖ Prepositions – Prepositions express temporal or spatial relations
examples: in, under, towards, before, etc.
17
Phrase Queries
❖ Many complex or technical concepts and many organization and
product names are multi word compounds or phrases.
❖ We would like to be able to pose a query such as Stanford University
by treating it as a phrase so that a sentence in a document like The
inventor Stanford Ovshinsky never went to university. is not a match.
❖ Most recent search engines support a double quotes syntax
(“stanford university”) for phrase queries, which has proven to be
very easily understood and successfully used by users.
18
Phrase Queries(Cont...)
❖ As many as 10% of web queries are phrase queries, and many more
are implicit phrase queries (such as person names), entered without use
of double quotes.
❖ To be able to support such queries, it is no longer sufficient for postings
lists to be simply lists of documents that contain individual terms.
❖ Two approaches to support phrase queries and their combination:
- Biword Indexes &
- Positional Indexes
19
Phrase Queries - Biword Index
❖ One approach to handling phrases is to consider every pair of
consecutive terms in a document as a phrase.
❖ For example, the text Friends, Romans, Countrymen would generate
the biwords:
- friends romans
- romans countrymen
❖ Here, we treat each of these biwords as a vocabulary term.
❖ Being able to process two-word phrase queries is immediate.
20
Phrase Queries - Biword Index(Cont...)
❖ Longer phrases can be processed by breaking them down.
❖ The query stanford university palo alto can be broken into the Boolean
query on biwords:
- “stanford university” AND “university palo” AND “palo alto”
❖ This query could be expected to work fairly well in practice, but there can
and will be occasional false positives.
❖ The concept of a biword index can be extended to longer sequences of
words, and if the index includes variable length word sequences, it is
generally referred to as a phrase index.
21
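❖ A sketch of biword generation and the Boolean decomposition of a longer phrase query (the function name is illustrative):

def biwords(tokens):
    # Every pair of consecutive terms becomes one vocabulary entry.
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']

# A longer phrase is broken into a conjunction of biwords, which can
# occasionally return false positives.
print(" AND ".join('"' + b + '"' for b in biwords(["stanford", "university", "palo", "alto"])))
# "stanford university" AND "university palo" AND "palo alto"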
Phrase Queries - Positional Index
❖ For obvious reasons (the vocabulary of biwords becomes very large, and false positives remain), a biword index is not the standard solution.
❖ A positional index is most commonly employed for phrase queries.
❖ Here, for each term in the vocabulary, we store postings of the form
docID: <position1, position2, . . . >, where each position is a token
index in the document.
❖ Example on the next slide.

22
Phrase Queries - Positional Index Example
Example: to, 993427:
<1, 6: <7, 18, 33, 72, 86, 231>;
 2, 5: <1, 17, 74, 222, 255>;
 4, 5: <8, 16, 190, 429, 433>;
 5, 2: <363, 367>;
 7, 3: <13, 23, 191>; . . .>
Interpretation: The word "to" has a document frequency of 993,427, and
occurs 6 times in document 1 at positions 7, 18, 33, etc.
23
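❖ A sketch of the same idea as a Python dictionary, together with a naive adjacency check for a two-word phrase (the postings shown are illustrative, not taken from the example above):

# term -> {docID: [positions...]}; document frequency is the number of docIDs.
positional_index = {
    "stanford":   {1: [3, 17], 2: [40]},
    "university": {1: [4, 29], 3: [7]},
}

def phrase_in_doc(index, term1, term2, doc_id):
    # True if term2 occurs at the position immediately after term1 in doc_id.
    pos1 = index.get(term1, {}).get(doc_id, [])
    pos2 = set(index.get(term2, {}).get(doc_id, []))
    return any(p + 1 in pos2 for p in pos1)

print(phrase_in_doc(positional_index, "stanford", "university", 1))   # True
print(phrase_in_doc(positional_index, "stanford", "university", 2))   # False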
Index compression: lexicon
compression and postings
lists compression
25
Why Index Compression?
❖ We need less disk space: Compression ratios of 1:4 are easy to achieve,
potentially cutting the cost of storing the index by 75%.
❖ Increased use of caching: Search systems use some parts of the
dictionary and the index much more than others.
❖ Faster transfer of data from disk to memory: Efficient decompression
algorithms run so fast on modern hardware that the total time of
transferring a compressed chunk of data from disk & then decompressing
it is less than transferring the same chunk of data in uncompressed form.
26
What is Inverted Index?
❖ In computer science, an inverted index (also referred to as a postings
file or inverted file) is a database index storing a mapping from
content, such as words or numbers, to its locations in a table, or in a
document or a set of documents (named in contrast to a
forward index, which maps from documents to content).
❖ The purpose of an inverted index is to allow fast full-text searches, at
a cost of increased processing when a document is added to the
database.
27
What is Inverted Index?(cont...)
❖ A forward index is a map from documents to terms (and positions).
These are used when you search within a document.
❖ An inverted index is a map from terms to documents (and
positions). These are used when you want to find a term in any
document.
❖ The inverted file may be the database file itself, rather than its index.
❖ It is the most popular data structure used in document retrieval
systems, used on a large scale, for example, in search engines.
28
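❖ A minimal sketch contrasting the forward and inverted views for two tiny illustrative documents:

from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july"}

# Forward index: document -> terms.
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> set of documents containing it.
inverted_index = defaultdict(set)
for doc_id, terms in forward_index.items():
    for term in terms:
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["sales"]))   # [1, 2]
print(sorted(inverted_index["july"]))    # [2]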
Lexicon/Dictionary Compression
❖ The dictionary file, also known as lexicon contains one entry for
every token indexed from the collection.
❖ These can be k-grams, or single words, which might be preprocessed
using linguistic techniques.
❖ The lexicon is responsible for mapping every token into the position
of its corresponding posting on disk.

29
Lexicon/Dictionary Compression(Cont...)
❖ For effective ranked retrieval, every entry must, at least, store the
information about:
- The actual token, represented as a stream of characters (string)
- The number of documents the term appears in (document frequency)
- A pointer to the on-disk posting file, where the concordance
information for the term is stored.

30
Lexicon/Dictionary Compression(Cont...)
❖ When a query formed of a number of query terms arrives at the system, the first step is to look up the dictionary file to see if the collection contains those terms.
❖ If so, then the dictionary pointers provide a link to the rest of the
data needed for the retrieval process. This look-up must be fast.
❖ Some options for this structure are hashing tables, search trees, and dictionary-as-a-string, which is commonly used as the final dictionary structure in IR systems.
31
Lexicon/Dictionary Compression(Cont...)
❖ Hashing tables are commonly employed during the indexing process,
search trees are useful for suffix search, and dictionary-as-a-string is
a common form of a sorted array.
❖ Hashing tables transform the character representation of the term
into an integer in a certain range.
❖ If the vocabulary set is known in advance, it is possible to design the mapping function in such a way that every term is uniquely assigned an integer.
32
Lexicon/Dictionary Compression(Cont...)
❖ If that is not the case, collisions may appear, and they have to be
resolved using external structures, usually hard to maintain.
❖ The look-up process using hashing tables is very fast: it involves hashing the query term and accessing a vector entry in O(1) time.
❖ Hashing only provides results for exact term matching.
❖ On the other hand, search trees allow for suffix search. They are a suitable
choice if the dictionary does not fit completely in main memory.
33
Lexicon/Dictionary Compression(Cont...)
❖ String search inside a tree begins at the root and descends through the tree, performing a test at each branch to decide which path to follow.
❖ Search trees need to be balanced in order to achieve optimum efficiency.
❖ This means that the depth of the different sub-trees for a given node must be the same or differ by at most 1.
❖ One of the most widely used search trees is the B-tree.
❖ This kind of search tree is convenient because it eases the effort of rebalancing.
34
Lexicon/Dictionary Compression(Cont...)
❖ The number of terms inside each node is determined by the disk block size.
❖ This optimises the fetching of terms from disk when needed.

Fig. 1: Inverted file with 2-level dictionary-as-a-string


35
Lexicon/Dictionary Compression(Cont...)
❖ Dictionary as a string stores the whole term set as a single string.
❖ Lookups into the lexicon are realised through a simple binary search.
❖ For the binary search to be efficient, the entries need to be stored within the same amount of space.
❖ Fragmentation, due to terms not all having the same length, is handled by a two-level scheme, where the first level is composed of pointers to the beginning of each term, and the second level is composed of the terms themselves (Figure 1, on the previous slide).
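❖ A sketch of the two-level idea: the first level is an array of offsets into one long string, the second level is the string itself, and look-ups binary search the offsets (the term list is illustrative):

terms = sorted(["boston", "chocolate", "retrieve", "stanford", "zipf"])

# Level 2: all terms concatenated into a single string.
dictionary_string = "".join(terms)

# Level 1: offset of each term's first character, plus an end sentinel.
offsets, pos = [], 0
for t in terms:
    offsets.append(pos)
    pos += len(t)
offsets.append(pos)

def lookup(term):
    # Binary search over the offset table; returns the term's index or -1.
    lo, hi = 0, len(offsets) - 2
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = dictionary_string[offsets[mid]:offsets[mid + 1]]
        if candidate == term:
            return mid
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(lookup("retrieve"), lookup("missing"))   # 2 -1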
36
Lexicon/Dictionary Compression(Cont...)
❖ Other forms of dictionary files exploit string compression techniques
in order to reduce their space requirements up to a point that they fit
completely in main memory.
❖ They also make the indexing process more efficient.
❖ These dictionary compression techniques are also useful in dynamic
environments, unlike hashing tables.

37
Postings Lists Compression
❖ The second data structure of an inverted file is the postings file or
concordance file which contains the information of every term
occurrence in a document.
❖ This information is not only restricted to the presence or absence of
terms.
❖ It can be enriched with other data such as frequency of terms in
documents or the exact position of a term inside a document.

38
Postings Lists Compression(Cont...)
❖ Positional information can also refer to structural parts of the
documents (e.g., fields in HTML text), or arbitrary subdivisions of the
documents, in sentences, paragraphs, or blocks of words of a certain
size.
❖ The format of the posting lists reflects the granularity of the inverted
file, addressing in which documents and positions the term appears.
❖ Its simplest form only records binary occurrences of terms in
documents:
- < t; df_t; d_1, d_2, . . . , d_{df_t} >
39
Postings Lists Compression(Cont...)
❖ (w.r.t. the last slide) df_t stands for the document frequency of term t (the number of documents in which t appears) and d_i is a document identifier.


❖ A document identifier is an integer bearing the internal
representation of a document in the system.
❖ As the notation implies, the document identifiers are ordered.
❖ This ordered disposition is not the only format option for postings
lists, but it is useful for providing high compression ratios.
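❖ Because the identifiers are stored in increasing order, a postings list can be rewritten as a list of gaps (d-gaps); the gaps are small integers, which is what makes variable-length schemes such as gamma codes effective. A minimal sketch (the example postings are illustrative):

def to_gaps(doc_ids):
    # Keep the first docID, then store the distance to the previous one.
    return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    # Recover the original docIDs by a running sum.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [283042, 283043, 283044, 283045, 283154]
print(to_gaps(postings))                            # [283042, 1, 1, 1, 109]
print(from_gaps(to_gaps(postings)) == postings)     # True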
40
Gap encoding, Gamma
codes, Zipf's Law

41
Zipf’s Law - Introduction
❖ Zipf’s Law models the distribution of terms in a corpus:
- How is the frequency of different words distributed?
- How many times does the k-th most frequent word appear in a corpus of size N words?
- Important for determining index terms and properties of
compression algorithms.

42
Zipf’s Law - Word Distribution
❖ A few words are very common.
- 2 most frequent words (e.g. “the” , “of”) can account for about 10% of
word occurrences.
❖ Most words are very rare.
- Half the words in a corpus appear only once, called hapax legomena
(Greek for “read only once”)
❖ Called a “heavy tailed” distribution, since most of the probability
mass is in the "tail".
43
Sample Word Frequency Data
Frequent Word Number of Occurrences Percentage of Total

the 7,398,934 5.9

of 3,893790 3.1

to 3,364,653 2.7

and 3,320,687 2.6

in 2,311,785 1.8

is 1,559,147 1.2

for 1,313,561 1.0

The 1,144,860 0.9

44
Zipf’s Law
❖ Rank(r): The numerical position of a word in a list sorted by
decreasing frequency (f).
❖ Zipf (1949) “discovered” that:
- f ⋅r = k (for constant k)
❖ If the probability of the word of rank r is p_r and N is the total number of word occurrences:
- p_r = f / N
46
Predicting Occurrence Frequencies
❖ By Zipf, a word appearing f times has rank r_f = AN/f
❖ Several words may occur f times; assume rank r_f applies to the last of these.
❖ Therefore, r_f words occur f or more times and r_{f+1} words occur f+1 or more times.
❖ So, the number of words appearing exactly f times is:
- I_f = r_f − r_{f+1} = AN/f − AN/(f+1) = AN/(f(f+1))
47
Predicting Word Frequencies
❖ Assume the rarest term occurs once and therefore has the highest rank, D = AN/1 (so D is the number of distinct words).
❖ The fraction of words with frequency f is:
- I_f / D = 1/(f(f+1))
❖ Fraction of words appearing only once is therefore ½.
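❖ A quick numerical check of this prediction: the fractions 1/(f(f+1)) start at 1/2 for f = 1 and their cumulative sum approaches 1.

from fractions import Fraction

for f in range(1, 6):
    print(f, Fraction(1, f * (f + 1)))                        # 1 1/2, 2 1/6, 3 1/12, ...
print(sum(Fraction(1, f * (f + 1)) for f in range(1, 6)))     # 5/6; the partial sum up to f is f/(f+1)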

48
Zipf’s Law - Explanation
❖ Zipf’s explanation was his “principle of least effort.” Balance between
speaker’s desire for a small vocabulary and hearer’s desire for a large
one.
❖ Herbert Simon’s explanation is “rich get richer.”
❖ Li (1992) shows that just random typing of letters including a space
will generate “words” with a Zipfian distribution.

49
Zipf’s Law Impact on IR
❖ Good News: Stopwords will account for a large fraction of text so
eliminating them greatly reduces inverted index storage costs.
❖ Bad News: For most words, gathering sufficient data for meaningful
statistical analysis (e.g. for correlation analysis for query expansion)
is difficult since they are extremely rare.

50
Index Construction

51
Index Construction
❖ Hardware basics
❖ Block based sorting index method
❖ Single-pass in-memory indexing
❖ Distributed indexing
❖ Dynamic indexing

52
Hardware Basics
When building an information retrieval system, many decisions are
related to the hardware environment of the system.
❖ Accessing data in memory is much faster than accessing data on disk, so we should keep as much data as possible in memory, especially frequently accessed data.
❖ This technique of keeping frequently accessed disk data in memory is called caching.

53
Hardware Basics(Cont...)
❖ When reading from or writing to a disk, the seek time (the time to move the head to the track where the data is located) is very costly. No data is transferred during the seek.
❖ To maximize the data transfer rate, blocks that are read together should be stored contiguously on disk.
❖ Operating systems generally read and write from disk in units of whole blocks. Therefore, reading a single byte from disk may take as much time as reading an entire block.
54
Hardware Basics(Cont...)
❖ The data block size is usually 8KB, 16KB, 32KB or 64KB.
❖ The area of memory that holds the blocks being read and written is called the buffer.
❖ The transfer of data from the disk to the memory is implemented by
the system bus instead of the processor, which means that the
processor can still process the data during disk I/O.
❖ We can exploit this to speed up data transfer, for example by compressing the data before storing it on disk.
55
Hardware Basics(Cont...)
❖ Assuming an efficient decompression algorithm is used, reading compressed data from disk and then decompressing it often takes less time than reading the same data uncompressed.

56
Block based sorting index(BSBI) method
The basic steps to build an index that does not contain location
information are:
1. Scan the document collection to get all the term-document ID pairs.
2. Sort the pairs, using the term as the primary key and the document ID as the secondary key.
3. Organize the document ID of each term into an inverted record table.

57
Block based sorting index method(Cont...)
❖ For small-scale document collections, the process on the last slide can be completed entirely in memory; for large-scale document collections it cannot.
❖ We now replace each term with its term ID, where the ID of each term is unique.
❖ For a large-scale document set, it is very difficult to sort all term ID-document ID pairs in memory.
❖ For many large corpora, even the compressed inverted record table cannot be loaded into memory.
58
Block based sorting index method(Cont...)
❖ Due to insufficient memory, we must use an external disk-based
sorting algorithm.
❖ The core requirement of this algorithm is to minimize the number of
random disk seeks when sorting.
❖ BSBI (blocked sort-based indexing, the block-based sorting index algorithm) is one solution:
1. Divide the document set into several parts of equal size.
2. Sort the term ID-document ID pairs of each part.
59
Block based sorting index method(Cont...)
❖ In this algorithm, we choose a suitable block size, parse the documents into term ID-document ID pairs, load them into memory, and sort them quickly in memory.
❖ Convert the sorted results into inverted index format and write them
to disk.
❖ Finally, all the block indexes are merged into a single index file.

60
Block based sorting index method(Cont...)
❖ The block-based sort indexing algorithm is given below; it saves the inverted index of each block into the files f1, f2, ..., fn and finally merges them.
BSBIndexConstruction()
  n ← 0
  while (all documents have not been processed)
  do n ← n + 1
     block ← ParseNextBlock()
     BSBI-Invert(block)
     WriteBlockToDisk(block, fn)
  MergeBlocks(f1, f2, ..., fn)
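❖ A hedged Python sketch of the same loop, with in-memory lists standing in for the disk files f1, ..., fn; the function names and the tiny block data are illustrative simplifications of the textbook routines:

import heapq
from itertools import groupby

def bsbi_invert(pairs):
    # Sort (termID, docID) pairs and group them into per-term postings lists.
    pairs.sort()
    return [(t, [d for _, d in grp]) for t, grp in groupby(pairs, key=lambda p: p[0])]

def bsbi_index(blocks):
    # blocks: one list of (termID, docID) pairs per block.
    on_disk = [bsbi_invert(block) for block in blocks]                  # stands in for WriteBlockToDisk
    index = {}
    for term, postings in heapq.merge(*on_disk, key=lambda e: e[0]):    # stands in for MergeBlocks
        index.setdefault(term, []).extend(postings)
    return index

blocks = [[(2, 1), (1, 1), (1, 2)], [(1, 3), (3, 3)]]
print(bsbi_index(blocks))   # {1: [1, 2, 3], 2: [1], 3: [3]}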
61
Single-pass in-memory indexing(SPIMI)
❖ The block-based sort indexing algorithm has good scalability, but it has a drawback:
❖ It needs to map each term to a term ID, so the mapping between terms and their IDs must be kept in memory.
❖ For large-scale data sets, this mapping may not fit in memory.
❖ SPIMI (single-pass in memory indexing, memory-based single-pass
scanning indexing algorithm) is more scalable.
❖ It uses terms instead of IDs.
62
Single-pass in-memory indexing(Cont...)
❖ It writes each block's dictionary to disk and then starts a fresh dictionary for the next block.
1. The algorithm processes each term-document ID pair one by one. If the term appears for the first time, it is added to the dictionary (preferably via a hash table) and a new inverted record table is created for it; if the term has already been seen, its existing inverted record table is returned directly.
Note: The inverted record tables here are all kept in memory.
63
Single-pass in-memory indexing(Cont...)
2. Add a new document ID to the inverted record table obtained above.
Note: Unlike BSBI, there is no ordering of term ID-document ID here.
3. When memory is exhausted, the terms are sorted, and the block index containing the dictionary and the inverted record tables is written to disk. The purpose of the sorting is to facilitate the later merging of blocks.
4. Use the new dictionary again and repeat the above process.
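❖ A sketch of one SPIMI-Invert pass; the token stream and the max_terms memory limit are illustrative stand-ins for "memory is exhausted":

def spimi_invert(token_stream, max_terms=100000):
    dictionary = {}
    for term, doc_id in token_stream:
        # Terms are used directly as keys: no term-to-ID mapping is kept.
        postings = dictionary.setdefault(term, [])   # new list on first occurrence
        postings.append(doc_id)                      # no sorting of pairs is needed here
        if len(dictionary) >= max_terms:             # memory is "exhausted"
            break                                    # the block would be written to disk now
    # Terms are sorted only when the block is written, to ease later merging.
    return sorted(dictionary.items())

print(spimi_invert([("home", 1), ("sales", 1), ("home", 2), ("rise", 2)]))
# [('home', [1, 2]), ('rise', [2]), ('sales', [1])]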

64
Single-pass in-memory indexing(Cont...)
❖ Overall, there is not much difference between SPIMI and BSBI.
❖ Both construct indexes block by block and then merge the blocks to get the overall inverted index table.
❖ The difference is that BSBI needs to maintain the mapping
relationship between terms and their IDs in memory.
❖ In addition, BSBI's inverted record table is sorted, while SPIMI is not
sorted.

65
Distributed Indexing
❖ In practice, the document set is usually very large.
❖ Web search engines in particular usually use distributed index construction algorithms; the index is partitioned by term or by document and distributed over multiple computers.
❖ Most search engines prefer document-partitioned indexes.
❖ https://www.programmersought.com/article/14875527520/
66
End of the Topic

67
