Module 2-Data Structures and Algorithms For Retrieval-Cat1
Dr.D.SARASWATHI 1
DATA STRUCTURES
• Introduction to Data Structures
• Stemming Algorithms
• Inverted File Structure
• N-Gram Data Structure
• PAT Data Structure
• Signature File Structure
• Hypertext and XML Data Structures
CATALOGING AND INDEXING
INDEXING
• The transformation from a received item to a searchable data
structure is called indexing.
• The process can be manual or automatic.
• Indexing supports either a direct search of the document database or an
indirect search through index files.
• Concept-based representation: instead of transforming the
input into a searchable format, some systems transform the
input into a different representation that is concept based.
• Search: find and return the items that satisfy the incoming request.
• History of indexing: shows the dependency of
information processing capabilities on manual and
then automatic processing systems.
• Indexing was originally called cataloguing: the oldest
technique to identify the contents of items to assist in
retrieval.
• Items overlap between full-item indexing and public and
private indexing of files.
Indexing process
Introduction
Motivation and Definition
Inverted Files
Inverted Search Example
Structures used in Inverted Files
Sorted Arrays
Tries
How to build an Inverted index
Algorithm
• Fetching the Document
• Removing the Stop Words
• Stem to the Root Word
• Record Document IDs
• Merge and Sort the Items
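The five steps above (fetch, remove stop words, stem, record document IDs, merge and sort) can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the `naive_stem` suffix stripper and the sample documents are assumptions made for the example, not part of the original slides, and a real system would use a proper stemmer such as Porter's.

```python
import re
from collections import defaultdict

# Assumed, illustrative stop-word list; real systems use much larger lists.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "in", "to"}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(documents):
    """documents maps document ID -> text; returns term -> sorted list of IDs."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():              # fetch the document
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in STOP_WORDS:                      # remove stop words
                continue
            postings[naive_stem(token)].add(doc_id)      # stem, record the ID
    # merge and sort: terms in alphabetical order, IDs in ascending order
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical sample documents.
docs = {
    1: "The cat is chasing the mice",
    2: "Cats chased a mouse",
    3: "The dog barks",
}
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
```

A query for a term is then a dictionary lookup on its stem, and a Boolean AND of two terms is the intersection of their ID lists.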
Summary
N-GRAM DATA STRUCTURE
• N-Grams can be viewed as a special technique for conflation
(stemming) and as a unique data structure in information
systems.
Examples of bigrams, trigrams and pentagrams
“sea colony.”
• Bigrams (no interword symbols)
• se ea co ol lo on ny
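The bigram list above can be reproduced with a short function. The sketch below assumes that n-grams are taken word by word, so that no n-gram spans an interword space; the function name `char_ngrams` is chosen for illustration.

```python
def char_ngrams(text, n):
    """Return the character n-grams of each word; none crosses a word boundary."""
    grams = []
    for word in text.lower().split():
        # A word shorter than n yields an empty range and contributes nothing.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


bigrams = char_ngrams("sea colony", 2)
trigrams = char_ngrams("sea colony", 3)
print(bigrams)   # ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
print(trigrams)  # ['sea', 'col', 'olo', 'lon', 'ony']
```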
• Zamora showed trigram analysis provided a viable data
structure for identifying misspellings and transposed
characters.
• Trigrams have been used for text compression
and to manipulate the length of index terms.
Advantage of n-grams
• They place a finite limit on the number of searchable tokens.
PAT data structure
(PATRICIA: Practical Algorithm To Retrieve Information Coded In
Alphanumeric)
• PAT structure, PAT tree or PAT array: a data structure for continuous text
input that treats the text as a string (like the N-Gram data structure).
• Binary trees are the most common class for prefix search, but PAT
trees are sorted logically, which facilitates range search, and they are
more accurate than inverted files.
• A substring can start at any point in the text and can be
uniquely indexed by its starting location and length.
• If all strings run to the end of the input, only the starting
location is needed, since the length is the difference between the
total length of the item and the starting location.
• A substring can extend beyond the length of the
input stream if additional null characters are appended.
• These substrings are called sistrings (semi-infinite strings).
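As a small illustration of these points, the sketch below lists the sistrings of a short input, keyed only by their starting location. The 0-based offsets, the helper names and the use of `"\0"` as the null character are assumptions made for the example.

```python
def sistrings(text):
    """Map each 0-based starting location to the sistring that begins there."""
    return {start: text[start:] for start in range(len(text))}


def pad(sistring, length, null="\0"):
    """Extend a sistring past the end of the input with null characters."""
    return sistring.ljust(length, null)


text = "home"
table = sistrings(text)
for start, s in table.items():
    # The length never has to be stored: it is total length minus start.
    print(start, repr(s), len(text) - start)
```

Because every sistring runs to the end of the input, storing the starting location alone is enough to recover it.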
• A PAT tree is an unbalanced, binary digital tree defined by
the sistrings.
• The individual bits of the sistrings decide the branching
patterns with zeros branching left and ones branching right.
• PAT trees also allow each node in the tree to specify which
bit is used to determine the branching via bit position or the
number of bits to skip from the parent node.
• This is useful in skipping over levels that do not require
branching.
Examples of sistrings
• The key values are stored at the leaf nodes (bottom
nodes) in the PAT tree.
• For a text input of size “n” there are “n” leaf nodes
and at most “n-1” higher-level nodes.
• It is possible to place additional constraints on
sistrings for the leaf nodes.
• We may be interested in limiting our searches to word
boundaries.
• Thus we could limit our sistrings to those that are immediately after
an interword symbol.
• Example of the sistrings used in generating a PAT tree.
• If the binary representations of “h” is (100), “o” is (110), “m”
is (001) and “e” is (101) then the word “home” produces the
input 100110001101.
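The encoding can be checked mechanically. The sketch below uses the four 3-bit codes given above and then lists the sistrings of the resulting bit stream, assuming one sistring per bit position; the names `CODES` and `encode` are illustrative.

```python
# 3-bit codes taken from the text above.
CODES = {"h": "100", "o": "110", "m": "001", "e": "101"}


def encode(word):
    """Concatenate the binary code of every character in the word."""
    return "".join(CODES[ch] for ch in word)


bits = encode("home")
print(bits)  # 100110001101

# One sistring per starting bit position (assumption: no word-boundary limit).
bit_sistrings = [bits[i:] for i in range(len(bits))]
for number, s in enumerate(bit_sistrings, start=1):
    print(f"sistring {number}: {s}")
```

These bit strings are what drive the branching of the PAT tree: a 0 bit branches left and a 1 bit branches right.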
Algorithms
PAT Binary Tree for input
“100110001101”
• The value in the intermediate nodes (indicated by
rectangles) is the number of bits to skip until the next
bit to compare that causes differences between
similar terms.
• This final version saves space, but requires comparing
a search value to the leaf node (in an oval) contents to
ensure the skipped bits match the search term (i.e.,
skipped bits are not compared).
PAT Tree skipping bits for “100110001101”
• PAT trees (and arrays) provide an alternative structure if
string searching is the goal.
• They store the text in an alternative structure supporting
string manipulation.
• The structure does not have facilities to store more abstract
concepts and their relationships associated with an item.
• The structure has interesting potential applications, but is
not used in any major commercial products at this time.
Signature file structure
• The coding is based upon the words in the item.
• The words are mapped into word signatures.
• A word signature is a fixed-length code with a fixed
number of bits set to 1.
• The bit positions that are set to 1 are determined
via a hash function of the word.
• The word signatures are ORed together to create the
signature of an item.
• An item is partitioned into blocks, where a block
is simply a fixed-size set of words; in the example the code length is 16
bits.
• Search is accomplished by template matching on
the bit positions.
• Signature files provide a practical solution for parallel
processing, distributed environments, etc.
• To avoid signatures being too dense with “1”s, a maximum number of
words is specified and an item is partitioned into blocks of that size.
• The block size is set at five words, the code length is 16 bits and the
number of bits that are allowed to be “1” for each word is five.
• TEXT: Computer Science graduate students study (assume block size is
five words)
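The parameters above (block size of five words, 16-bit code, five bits set per word) can be turned into a small sketch of superimposed coding. The hash scheme is an assumption: the slides do not specify one, so the example derives bit positions from MD5 digests purely to get repeatable results. The bit patterns it produces therefore will not match any figure in the original material.

```python
import hashlib

CODE_LENGTH = 16    # bits per signature (from the text)
BITS_PER_WORD = 5   # bits set to 1 for each word (from the text)
BLOCK_SIZE = 5      # words per block (from the text)


def word_signature(word):
    """Hypothetical hash scheme: pick BITS_PER_WORD distinct bit positions."""
    positions = set()
    counter = 0
    while len(positions) < BITS_PER_WORD:
        digest = hashlib.md5(f"{word.lower()}:{counter}".encode()).digest()
        positions.add(digest[0] % CODE_LENGTH)
        counter += 1
    signature = 0
    for pos in positions:
        signature |= 1 << pos
    return signature


def block_signatures(text):
    """Split the item into blocks and OR the word signatures of each block."""
    words = text.split()
    signatures = []
    for i in range(0, len(words), BLOCK_SIZE):
        sig = 0
        for word in words[i:i + BLOCK_SIZE]:
            sig |= word_signature(word)
        signatures.append(sig)
    return signatures


def may_contain(block_sig, word):
    """Template match: every bit of the word signature must be set in the block.

    A True result can be a false positive; a False result is definitive.
    """
    ws = word_signature(word)
    return (block_sig & ws) == ws


text = "Computer Science graduate students study"
sigs = block_signatures(text)
print([format(s, "016b") for s in sigs])
print(may_contain(sigs[0], "graduate"))
```

Because several word signatures are superimposed, a match only says the word may be in the block; the candidate block still has to be checked against the text.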
Superimposed Coding
Application(s)/Advantage(s)
• Signature files provide a practical solution for storing
and locating information in a number of different
situations.
• Signature files have been applied to medium-size
databases, databases with low frequency of terms,
WORM devices, parallel processing machines, and
distributed environments.
Optical Disk File Structure
https://youtu.be/VhaZhdfyDUY
Trie
• Search trees are typically used to store collections of numerical
values, but they are poorly suited to storing collections of words or
strings.
• A trie is a data structure used to store a collection of strings;
it makes searching for a pattern in words easier.
• The term trie comes from the word retrieval.
• A trie is also called a prefix tree and sometimes a digital tree.
• It is a multi-way tree.
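A minimal trie can be sketched as follows. The class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `starts_with`) are chosen for illustration; each node keeps a dictionary of children, which is what makes the tree multi-way.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # one child per next character (multi-way)
        self.is_word = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from the root; return the last node or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ("sea", "seat", "colony"):
    trie.insert(w)

print(trie.contains("sea"))     # True
print(trie.contains("se"))      # False: only a prefix, not a stored word
print(trie.starts_with("col"))  # True
```

Lookup cost depends on the length of the search string rather than on the number of stored strings, which is why the structure suits prefix search.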
The following operations are performed on a
B-Tree...
• Search
• Insertion
• Deletion
Search
• The search process starts from the root node and makes an n-way decision
at every node.
• Here 'n' is the total number of children the node has.
• Time complexity is O(log n), where this n is the number of keys in the tree.
• Step 1 - Read the search element from the user.
• Step 2 - Compare the search element with the first key value of the root node in the tree.
• Step 3 - If both are matched, then display "Given node is found!!!" and terminate
the function.
• Step 4 - If both are not matched, then check whether the search element is smaller
or larger than that key value.
• Step 5 - If the search element is smaller, then continue the search process in the left
subtree.
• Step 6 - If the search element is larger, then compare the search element with the next
key value in the same node and repeat steps 3, 4, 5 and 6 until we find the exact
match or until the search element is compared with the last key value in the leaf
node.
• Step 7 - If the last key value in the leaf node is also not matched, then display
"Element is not found" and terminate the function.
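The search steps above can be sketched as a loop that scans the keys of a node from left to right and then descends into the matching child. The node class and the hand-built sample tree are assumptions made for the example; a leaf is represented by an empty `children` list.

```python
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted key values in this node
        self.children = children or []   # empty list means leaf node


def btree_search(node, target):
    """Return True if target is stored in the B-Tree rooted at node."""
    while node is not None:
        i = 0
        # Steps 4-6: move right past every key smaller than the target.
        while i < len(node.keys) and target > node.keys[i]:
            i += 1
        # Step 3: exact match found in this node.
        if i < len(node.keys) and node.keys[i] == target:
            return True
        # Step 7: no match in a leaf means the element is absent.
        if not node.children:
            return False
        # Step 5: descend into the subtree to the left of keys[i].
        node = node.children[i]
    return False


# Hypothetical order-3 B-Tree holding the keys 1..10.
root = BTreeNode(
    [4],
    [
        BTreeNode([2], [BTreeNode([1]), BTreeNode([3])]),
        BTreeNode([6, 8], [BTreeNode([5]), BTreeNode([7]), BTreeNode([9, 10])]),
    ],
)

print(btree_search(root, 7))   # True
print(btree_search(root, 11))  # False
```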
Insertion Operation in B-Tree
• In a B-Tree, a new element must be added only at the leaf node.
• That means, the new keyValue is always attached to the leaf node
only
• Step 1 - Check whether the tree is empty.
• Step 2 - If the tree is empty, then create a new node with the new key value
and insert it into the tree as the root node.
• Step 3 - If the tree is not empty, then find the suitable leaf node to which
the new key value is added, using Binary Search Tree logic.
• Step 4 - If that leaf node has an empty position, add the new key value
to that leaf node in ascending order of key value within the node.
• Step 5 - If that leaf node is already full, split that leaf node by sending
the middle value to its parent node. Repeat the same until the sent
value is fixed into a node.
• Step 6 - If the splitting is performed at the root node, then the middle
value becomes the new root node for the tree and the height of the tree
is increased by one.
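The insertion steps above can be sketched for a B-Tree of order 3. The sketch takes "order" to mean the maximum number of children per node, so a node holds at most two keys, and it assumes all keys are distinct. The names `Node`, `insert` and `in_order` are illustrative. A node that overflows is split and its middle key is sent up to the parent, as in steps 5 and 6.

```python
import bisect

ORDER = 3              # assumed meaning: maximum children per node
MAX_KEYS = ORDER - 1   # so at most two keys per node


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list means leaf node


def _insert(node, key):
    """Insert key below node; return (middle_key, right_node) if node split."""
    if not node.children:                       # steps 3-4: reached a leaf
        bisect.insort(node.keys, key)
    else:
        i = bisect.bisect_left(node.keys, key)  # choose the child to descend
        split = _insert(node.children[i], key)
        if split:                               # child sent its middle key up
            middle, right = split
            node.keys.insert(i, middle)
            node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2                   # step 5: split the full node
    middle = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return middle, right


def insert(root, key):
    """Insert key and return the (possibly new) root of the tree."""
    if root is None:                            # steps 1-2: empty tree
        return Node([key])
    split = _insert(root, key)
    if split:                                   # step 6: the root itself split
        middle, right = split
        return Node([middle], [root, right])
    return root


def in_order(node):
    """Return all keys in ascending order."""
    if not node.children:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out.extend(in_order(child))
        if i < len(node.keys):
            out.append(node.keys[i])
    return out


root = None
for value in range(1, 11):
    root = insert(root, value)

print(root.keys)       # [4]
print(in_order(root))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Inserting the numbers 1 to 10 in ascending order with this sketch ends with 4 in the root, 2 in the left child, and 6 and 8 in the right child.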
Example
• Construct a B-Tree of Order 3 by inserting numbers from 1 to 10.
• When it comes to storing and searching large
amounts of data, traditional binary search trees
can become impractical due to their poor
performance and high memory usage.
• B-Trees (often expanded as Balanced Trees)
are a type of self-balancing tree that was
specifically designed to overcome these
limitations.
Search operation in B-Tree
• Input: Search 120 in the given B-Tree.
Deletion Operation:
Applications of B-Trees:
• It is used in large databases to access data stored on the disk
• Searching for data in a data set can be achieved in significantly less
time using the B-Tree
• With the indexing feature, multilevel indexing can be achieved.
• Many database and file servers also use the B-Tree approach for their indexes.
• B-Trees are used in CAD systems to organize and search geometric
data.
• B-Trees are also used in other areas such as natural language
processing, computer networks, and cryptography.
Advantages of B-Trees:
• B-Trees have a guaranteed time complexity of O(log n) for basic
operations like insertion, deletion, and searching, which makes them
suitable for large data sets and real-time applications.
• B-Trees are self-balancing.
• High-concurrency and high-throughput.
• Efficient storage utilization.
Disadvantages of B-Trees:
• B-Trees are designed for disk-based storage and can have high
disk usage.
• They are not the best choice for all cases.
• They can be slower than simpler data structures, for example on small data sets held in memory.
HYPERTEXT AND XML DATA STRUCTURES
• The advent of the Internet and its exponential growth
and wide acceptance as a new global information
network has introduced new mechanisms for
representing information.
• This structure is called hypertext and differs from
traditional information storage data structures in
format and use.
• Hypertext is stored in HTML and XML formats.
• Both of these languages provide detailed descriptions
for subsets of text, similar to zoning.
• Hypertext allows one item to reference another item
via an embedded pointer.
• HTML defines the internal structure for information
exchange over the WWW on the Internet.
• XML: defined by DTD, DOM, XSL, etc.
What is XML?
• Extensible Markup Language
• XML is not a programming language
• It is a data description language, whereas HTML is a presentation language
• Used to store and transport data (HTML is used to visualize data)
• Used to specify the logical structure of documents
• Used by web services to send data back and forth
• XML's main importance is METADATA
• Data about data
• Example
• Article
XML Basics
• Elements
• Attributes
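To make elements and attributes concrete, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree` module. The document itself is hypothetical: the element names (`article`, `title`, `author`, `section`) and the attributes (`id`, `language`, `number`) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are illustrative only.
XML_TEXT = """
<article id="a1" language="en">
  <title>Data Structures for Retrieval</title>
  <author>Example Author</author>
  <section number="1">Inverted files</section>
  <section number="2">Signature files</section>
</article>
""".strip()

root = ET.fromstring(XML_TEXT)

# Elements carry the content; attributes carry metadata about an element.
article_id = root.attrib["id"]
title = root.find("title").text
sections = [(s.get("number"), s.text) for s in root.findall("section")]

print(root.tag, article_id)
print(title)
print(sections)
```

The nesting of `title`, `author` and `section` under `article` is the tree structure that an XML retrieval system can index and search.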
XML Document
XML Tree
Challenges in XML Retrieval
Hidden Markov Models
• named entities (Bikel-97)
• optical character recognition (Bazzi-98)
• Topic identification (Kubala-97)
• More recently, information retrieval search
• A comprehensive description of Hidden Markov Models was written by
Dr. Lawrence Rabiner (Rabiner-89)
Discrete Markov process - Example
• Three state Markov Model of the Stock Market
State 1 (S1): market decreased
State 2 (S2): market did not change
State 3 (S3): market increased in value
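A discrete Markov process over these three states is described by a table of transition probabilities. The sketch below computes the probability of a sequence of daily states given the first state. The probability values are hypothetical, chosen only so that each row sums to 1; they are not taken from the original example.

```python
# Hypothetical transition probabilities: TRANSITIONS[i][j] = P(next = j | now = i).
TRANSITIONS = {
    "S1": {"S1": 0.5, "S2": 0.3, "S3": 0.2},    # after a decrease
    "S2": {"S1": 0.25, "S2": 0.5, "S3": 0.25},  # after no change
    "S3": {"S1": 0.2, "S2": 0.3, "S3": 0.5},    # after an increase
}


def sequence_probability(states):
    """Probability of observing states[1:] given that we start in states[0]."""
    probability = 1.0
    for current, following in zip(states, states[1:]):
        # First-order assumption: the next state depends only on the current one.
        probability *= TRANSITIONS[current][following]
    return probability


# Market increased, then did not change for two days, then decreased.
p = sequence_probability(["S3", "S2", "S2", "S1"])
print(round(p, 4))  # 0.0375 = 0.3 * 0.5 * 0.25
```

In a hidden Markov model the states themselves are not observed directly; this sketch covers only the observable, discrete case named in the slide title.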