Indexing and Searching: Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto


Indexing and Searching

Modern Information Retrieval


by R. Baeza-Yates and B. Ribeiro-Neto
Chapter 8

Outline
- Inverted Files
- Other Indices for Text
- Sequential Searching
- Pattern Matching
- Compression

Inverted Files
- An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up searching.
- Structure: vocabulary and occurrences
- Block addressing
  - The text is divided into blocks, and the occurrences point to the blocks
- Full inverted indices: exact occurrences

Inverted Files
- The search algorithm on an inverted index
  - Vocabulary search
  - Retrieval of occurrences
  - Manipulation of occurrences
- Construction (split the index into two files)
  - Posting file: the lists of occurrences are stored contiguously
  - The vocabulary is stored in lexicographical order, and each word points to its list.
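
The structure and search steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the book's implementation: the function names, the `\w+` tokenizer, and the 0-based character positions are assumptions made for the example.

```python
import bisect
import re
from collections import defaultdict


def build_inverted_index(text):
    """Return (vocabulary, postings) for a full inverted index.

    vocabulary: words in lexicographical order.
    postings[i]: character positions (0-based) where vocabulary[i] occurs.
    """
    occurrences = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        occurrences[match.group()].append(match.start())
    vocabulary = sorted(occurrences)
    postings = [occurrences[word] for word in vocabulary]
    return vocabulary, postings


def search(vocabulary, postings, word):
    """Vocabulary search by binary search, then retrieval of occurrences."""
    word = word.lower()
    i = bisect.bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return postings[i]
    return []


text = "This is a text. A text has many words. Words are made from letters."
vocabulary, postings = build_inverted_index(text)
print(search(vocabulary, postings, "text"))
```

Keeping the vocabulary and the posting lists in two parallel structures mirrors the two-file split: the vocabulary is small and searched first, and only the matching list of occurrences is then read.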
Inverted Files
- For large texts
  - Partial indices
  - Merging two indices consists of merging the sorted vocabularies.
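
A rough sketch of the merge step, assuming each partial index is a list of `(word, occurrences)` pairs sorted by word, and that the first index covers an earlier part of the text than the second (so concatenating two lists keeps the occurrences in order). The name `merge_indices` and this representation are hypothetical.

```python
def merge_indices(index_a, index_b):
    """Merge two partial indices given as sorted lists of (word, occurrences)."""
    merged = []
    i = j = 0
    while i < len(index_a) and j < len(index_b):
        word_a, occ_a = index_a[i]
        word_b, occ_b = index_b[j]
        if word_a < word_b:
            merged.append((word_a, occ_a))
            i += 1
        elif word_b < word_a:
            merged.append((word_b, occ_b))
            j += 1
        else:
            # Same word in both vocabularies: join the two occurrence lists.
            merged.append((word_a, occ_a + occ_b))
            i += 1
            j += 1
    # At most one of the two tails is non-empty.
    merged.extend(index_a[i:])
    merged.extend(index_b[j:])
    return merged


first = [("a", [8]), ("is", [5]), ("text", [10])]
second = [("has", [23]), ("text", [18])]
print(merge_indices(first, second))
```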

Other Indices for Text
- Suffix Trees
- Suffix Arrays
- Signature Files

Suffix Trees and Suffix Arrays
- Each position in the text is considered as a text suffix
- Index points are selected from the text; they point to the beginning of the text positions that will be retrievable

Suffix arrays
- The main drawback of suffix arrays is their costly construction process.
- They allow binary searches, done by comparing the contents of each pointer.
- Supra-indices (for large suffix arrays)
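
A small sketch of a suffix array with word beginnings as index points, searched by binary search on the text each pointer refers to. It is a naive in-memory illustration (sorting whole suffixes is exactly the costly step a real construction avoids); the function names and the `\w+` tokenizer are assumptions.

```python
import re


def build_suffix_array(text):
    """Take word beginnings as index points and sort them by their suffix."""
    index_points = [m.start() for m in re.finditer(r"\w+", text)]
    return sorted(index_points, key=lambda p: text[p:])


def search_prefix(text, suffix_array, pattern):
    """Return the text positions whose suffix starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first pointer whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        p = suffix_array[mid]
        if text[p:p + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first pointer whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        p = suffix_array[mid]
        if text[p:p + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


text = "this is a text. a text has many words. words are made from letters."
suffix_array = build_suffix_array(text)
print(search_prefix(text, suffix_array, "text"))
```

Because all suffixes sharing a prefix are contiguous in the array, two binary searches delimit the whole answer; the same routine answers word and prefix queries.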

Construction of Suffix Arrays for Large Texts

Signature Files
- Word-oriented index structures based on hashing
- Map words to bit masks of B bits
- Divide the text into blocks of b words each
- The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block.
- Hash the query word to a bit mask W
- If W & B_i = W, the text block i may contain the word
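
The scheme above can be sketched as follows. The concrete choices are assumptions made only for illustration: B = 64 bits, three bits set per word, b = 4 words per block, and SHA-256 as the hash function. A block whose mask passes the test is only a candidate, since unrelated words may set the same bits, so it must still be checked sequentially.

```python
import hashlib

B = 64             # signature width in bits (assumed value)
BITS_PER_WORD = 3  # bits set by each word signature (assumed value)
BLOCK_SIZE = 4     # b: words per text block (assumed value)


def word_signature(word):
    """Map a word to a bit mask of B bits with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask


def build_signature_file(words):
    """Split the words into blocks and OR the word signatures of each block."""
    blocks = [words[i:i + BLOCK_SIZE] for i in range(0, len(words), BLOCK_SIZE)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= word_signature(word)
        masks.append(mask)
    return blocks, masks


def candidate_blocks(masks, word):
    """Return indices i of blocks with W & B_i == W (possible false positives)."""
    w = word_signature(word)
    return [i for i, block_mask in enumerate(masks) if (w & block_mask) == w]


words = "this is a text a text has many words words are made from letters".split()
blocks, masks = build_signature_file(words)
hits = [i for i in candidate_blocks(masks, "text") if "text" in blocks[i]]
print(hits)
```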

Sequential Searching
- Brute Force
- Knuth-Morris-Pratt
- Boyer-Moore Family
- Shift-Or
- Suffix Automaton
  - Backward DAWG matching (BDM)
  - BNDM

Knuth-Morris-Pratt
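
A minimal Python sketch of Knuth-Morris-Pratt matching, given here as one common textbook formulation rather than the slide's own figure. The function names are hypothetical.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # reuse the matched prefix; never re-read text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches


print(kmp_search("abracadabra", "abra"))
```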

Boyer-Moore Family

Shift-Or
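
A minimal sketch of Shift-Or, assuming Python integers as the bit vector (so the pattern length is not limited by a machine word, unlike the usual presentation). Bit i of the state is 0 exactly when the first i + 1 pattern characters match the text ending at the current position. The function name is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # table[c] has a 0 in bit i exactly when pattern[i] == c.
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = table.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for j, ch in enumerate(text):
        # Shift in a 0 (the empty prefix always matches), then OR the mask.
        state = ((state << 1) | table.get(ch, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            matches.append(j - m + 1)
    return matches


print(shift_or_search("abracadabra", "abra"))
```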

Suffix Automaton

Pattern Matching
- Searching allowing errors
  - Dynamic Programming
  - Automaton
- Regular Expressions and Extended Patterns
- Pattern Matching Using Indices
  - Inverted files
  - Suffix Trees and Suffix Arrays

Dynamic Programming
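
A minimal sketch of searching allowing errors by dynamic programming, in the classical column-by-column form: the first entry of each column stays 0 so that a match may start at any text position, and a match is reported wherever the last entry is at most k. The function name and the choice to report end positions are assumptions.

```python
def approximate_search(text, pattern, k):
    """Return the end positions (0-based, inclusive) in text where pattern
    matches with at most k errors (insertions, deletions, substitutions)."""
    m = len(pattern)
    # column[i] = fewest errors matching pattern[:i] against a text suffix
    # ending at the current position; before any text is read, column[i] = i.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        previous_diagonal = column[0]  # column[0] is always 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(previous_diagonal + cost,  # match or substitution
                            column[i] + 1,             # extra text character
                            column[i - 1] + 1)         # missing pattern character
            previous_diagonal = column[i]
            column[i] = new_value
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_search("surgery", "survey", 2))
```

Only one column is kept in memory, so the space is O(m) while the time is O(mn) for a pattern of length m and a text of length n.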

Automaton

Regular Expressions

Pattern Matching Using Indices
- Inverted Files
  - Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary
  - The restriction: approximate matches or regular expressions that span many words cannot be found this way.

Pattern Matching Using Indices
- Suffix Trees
  - Suffix trees are able to perform complex searches
    - Word, prefix, suffix, substring, and range queries
    - Regular expressions
    - Unrestricted approximate string matching
  - Useful in specific areas
    - Find the longest substring
    - Find the most common substring of a fixed size

Pattern Matching Using Indices
- Suffix Arrays
  - Some patterns can be searched directly in the suffix array without simulating the suffix tree
    - Word, prefix, suffix, subword search and range search

Compression
- Compressed text: Huffman coding
  - Taking words as symbols
  - Using an alphabet of bytes instead of bits
- Compressed indices
  - Inverted Files
  - Suffix Trees and Suffix Arrays
  - Signature Files
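
For the compressed-text part of this slide, a rough sketch of Huffman coding with words as symbols. For brevity it emits bit strings (the plain binary variant), not the byte-oriented code mentioned above, and it ignores separators between words. All function names are hypothetical.

```python
import heapq
from collections import Counter


def build_word_codes(words):
    """Return {word: bit string} for a Huffman code over word frequencies."""
    counts = Counter(words)
    if not counts:
        return {}
    if len(counts) == 1:  # a single distinct word still needs one bit
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {word: partial code}).
    heap = [(freq, n, {word: ""}) for n, (word, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        # Joining two subtrees prepends one bit to every code below them.
        merged = {w: "0" + c for w, c in codes_a.items()}
        merged.update({w: "1" + c for w, c in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(words, codes):
    return "".join(codes[w] for w in words)


def decode(bits, codes):
    """Decode greedily; this works because Huffman codes are prefix-free."""
    inverse = {code: word for word, code in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            words.append(inverse[current])
            current = ""
    return words


words = "for each rose a rose is a rose".split()
codes = build_word_codes(words)
print(decode(encode(words, codes), codes) == words)
```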
