
Module 2

DATA STRUCTURES AND ALGORITHMS FOR RETRIEVAL

Dr.D.SARASWATHI 1
DATA STRUCTURES
• Introduction to Data Structures
• Stemming Algorithms
• Inverted File Structure
• N-Gram Data Structure
• PAT Data Structure
• Signature File Structure
• Hypertext and XML Data Structures

CATALOGING AND INDEXING
INDEXING
• Indexing is the transformation of a received item into a searchable data structure.
• The process can be manual or automatic.
• It supports either a direct search of the document database or an indirect search through index files.
• Concept-based representation: instead of transforming the input into a searchable format, some systems transform it into a different, concept-based representation.
• Search: locate and return the items that match the incoming query.
• History of indexing: it shows how information processing capabilities depended first on manual and then on automatic processing systems.
• Indexing was originally called cataloguing: the oldest technique to identify the contents of items to assist in retrieval.
• Items overlap between full-item indexing and public and private indexing of files.
Indexing process

Introduction

Motivation and Definition

Inverted Files

Inverted Search Example

Structures used in Inverted Files

Sorted Arrays

Tries

How to build an Inverted index

Algorithm
• Fetching the Document

• Removing the Stop Words

• Stem to the Root Word

• Record the Document IDs

• Then combine the document IDs recorded for each stem into a single text entry, as follows:

Retriev==>docId104007&docId154033

Merge and Sort the items
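
The steps of the algorithm (fetch, remove stop words, stem, record document IDs, merge and sort) can be sketched roughly as follows. This is a minimal illustration, not the slides' own implementation: the stop list, the crude suffix-stripping `stem` function (it is not the Porter stemmer) and the two sample document texts are assumptions made for the example. Only the document IDs and the `term==>id&id` entry format come from the slide above.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # assumed tiny stop list
SUFFIXES = ("ing", "al", "ed", "s")  # assumed crude suffix list, not a real stemmer


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(documents):
    """Map each stem to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():            # fetch the document
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in STOP_WORDS:                   # remove stop words
                continue
            postings[stem(token)].add(doc_id)         # stem, record the document ID
    # merge and sort: terms in alphabetical order, IDs sorted within each term
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def format_entry(term, doc_ids):
    """Render one entry in the 'term==>id&id' style shown on the slide."""
    return term + "==>" + "&".join(doc_ids)


# Hypothetical document texts; only the IDs appear in the slides.
documents = {
    "docId104007": "Retrieval of the indexed documents",
    "docId154033": "Retrieving documents is fast",
}
index = build_inverted_index(documents)
for term, doc_ids in index.items():
    print(format_entry(term, doc_ids))
```

Both "Retrieval" and "Retrieving" reduce to the same stem here, so their document IDs end up merged under one entry.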

Summary

N-GRAM DATA STRUCTURE
• N-grams can be viewed as a special technique for conflation (stemming) and as a unique data structure in information systems.

• N-grams are fixed-length consecutive series of “n” characters.

• Unlike stemming, which generally tries to determine the stem of a word that represents its semantic meaning, n-grams do not care about semantics.
• The searchable data structure is transformed into
overlapping n-grams, which are then used to create
the searchable database.

Examples of bigrams, trigrams and pentagrams for the phrase “sea colony”
• Bigrams (no interword symbols)
• se ea co ol lo on ny

• Trigrams (no interword symbols)
• sea col olo lon ony

• Trigrams (with interword symbol #)
• #se sea ea# #co col olo lon ony ny#

• Pentagrams (with interword symbol #)
• #sea# #colo colon olony lony#

• The symbol # represents the interword symbol, which is any one of a set of symbols (e.g., blank, period, semicolon, colon, etc.).

• Each n-gram created becomes a separate processing token and is searchable.

• The same n-gram can be created multiple times from a single word.
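
The “sea colony” examples can be reproduced with a short sketch like the one below. The function name `ngrams` and its `interword` parameter are assumptions made for illustration; the sketch treats any run of non-alphanumeric characters as an interword symbol and generates overlapping n-grams word by word.

```python
import re


def ngrams(text, n, interword=None):
    """Return the overlapping n-grams of each word in text.

    If interword is given (e.g. "#"), it is added before and after every
    word, so that n-grams can mark word boundaries.
    """
    grams = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if interword:
            word = interword + word + interword
        # A word shorter than n yields no n-grams.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(ngrams("sea colony", 2))        # bigrams, no interword symbols
print(ngrams("sea colony", 3))        # trigrams, no interword symbols
print(ngrams("sea colony", 3, "#"))   # trigrams with interword symbol
print(ngrams("sea colony", 5, "#"))   # pentagrams with interword symbol
```

The list is returned with duplicates kept, which makes it easy to see when one word produces the same n-gram more than once.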
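Uses:
• Widely used in cryptography during World War II.
• Spelling error detection and correction.
• Bigrams can be used for conflating terms.
• N-grams can be used to identify potentially erroneous words.
• Damerau specified 4 categories of errors: single-character insertion, single-character deletion, single-character substitution, and transposition of two adjacent characters.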
• Zamora showed that trigram analysis provides a viable data structure for identifying misspellings and transposed characters.

• This impacts information systems as a possible basis for identifying potential input errors for correction, as a procedure within the normalization process.

• The frequency of occurrence of n-gram patterns can also be used to identify the language of an item.
• Trigrams have been used for text compression and to manipulate the length of index terms.

• N-grams have been used to encode profiles for the Selective Dissemination of Information.

• N-grams have been used to store the searchable document file for retrospective search databases.
Advantage of n-grams
• They place a finite limit on the number of searchable tokens.

• The maximum number of unique n-grams that can be generated, MaxSeg, can be calculated as a function of n, the length of the n-grams, and λ, the number of processable symbols in the alphabet (i.e., non-interword symbols).
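
Since each of the n positions in an n-gram can hold any of the λ processable symbols, the bound works out to MaxSeg = λ^n. The small sketch below just evaluates that expression. The function name `max_seg` and the choice of a 26-letter alphabet are assumptions for illustration.

```python
def max_seg(n, alphabet_size):
    """Upper bound on the number of distinct n-grams: alphabet_size ** n."""
    return alphabet_size ** n


# Assumed alphabet: the 26 lower-case letters, no interword symbol.
for n in (2, 3, 5):
    print(f"n = {n}: at most {max_seg(n, 26)} unique n-grams")
```

With 26 symbols this gives 676 possible bigrams and 17576 possible trigrams, which is why the token vocabulary stays bounded no matter how many items are indexed.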
• Disadvantage: the longer the n-gram, the larger the inversion lists become.
• Reported performance: about 85% precision.

PAT data structure
(Practical Algorithm To Retrieve Information Coded In Alphanumeric)
• PAT structure, PAT tree or PAT array: data structures for continuous text input (string-oriented, like the n-gram data structure).

• The input stream is transformed into a searchable data structure consisting of substrings; all substrings are unique.

• Each position in an input string is an anchor point for a substring.
• In the creation of PAT trees, each position in the input string is the anchor point for a substring that starts at that point and includes all new text up to the end of the input.

• All substrings are unique.

• This view of text lends itself to many different search processing structures.
• Binary trees are the most common class for prefix search, but PAT trees are sorted logically, which facilitates range searches, and they are more accurate than inversion files.

• PAT trees provide an alternative structure when supporting string search.
• A substring can start at any point in the text and can be uniquely indexed by its starting location and length.
• If all substrings run to the end of the input, only the starting location is needed, since the length is the difference between the total length of the item and the starting location.
• A substring can extend beyond the length of the input stream by adding null characters.
• These substrings are called sistrings (semi-infinite strings).
• A PAT tree is an unbalanced, binary digital tree defined by
the sistrings.
• The individual bits of the sistrings decide the branching
patterns with zeros branching left and ones branching right.
• PAT trees also allow each node in the tree to specify which
bit is used to determine the branching via bit position or the
number of bits to skip from the parent node.
• This is useful in skipping over levels that do not require
branching.
Examples of sistrings

• The key values are stored at the leaf nodes (bottom nodes) in the PAT tree.
• For a text input of size “n” there are “n” leaf nodes and at most “n-1” higher-level nodes.
• It is possible to place additional constraints on sistrings for the leaf nodes.
• For example, we may be interested in limiting our searches to word boundaries.
• Thus we could limit our sistrings to those that are immediately after
an interword symbol.
• Example of the sistrings used in generating a PAT tree.

• If the binary representation of “h” is (100), “o” is (110), “m” is (001) and “e” is (101), then the word “home” produces the input 100110001101.
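
As a rough illustration of sistrings on this input, the sketch below lists the sistrings of 100110001101 and then sorts their starting positions, which is the idea behind a PAT array. It is a simplified stand-in for the PAT tree itself: null padding is omitted, and the helper names `sistrings`, `pat_array` and `prefix_search` are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def sistrings(bits):
    """One sistring per position: the text from that position to the end."""
    return [bits[i:] for i in range(len(bits))]


def pat_array(bits):
    """Starting positions (0-based) ordered by the sistring that begins there."""
    return sorted(range(len(bits)), key=lambda i: bits[i:])


def prefix_search(bits, array, prefix):
    """Return the 0-based positions whose sistring starts with prefix."""
    # Truncating sorted sistrings to len(prefix) keeps them in sorted order,
    # so the matches form one contiguous run found by binary search.
    keys = [bits[i:i + len(prefix)] for i in array]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix)
    return sorted(array[lo:hi])


bits = "100110001101"  # "home" with h=100, o=110, m=001, e=101
for position, s in enumerate(sistrings(bits)[:4], start=1):
    print(f"sistring {position}: {s}")

array = pat_array(bits)
print(prefix_search(bits, array, "110"))  # positions where the code for "o" occurs
```

Because the sistrings are kept in sorted order, a prefix query and a range query both reduce to locating one contiguous slice of the array.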

Algorithms

PAT Binary Tree for input
“100110001101”

• The value in the intermediate nodes (indicated by
rectangles) is the number of bits to skip until the next
bit to compare that causes differences between
similar terms.
• This final version saves space, but requires comparing
a search value to the leaf node (in an oval) contents to
ensure the skipped bits match the search term (i.e.,
skipped bits are not compared).
PAT Tree skipping bits for “100110001101”

• PAT trees (and arrays) provide an alternative structure if string searching is the goal.
• They store the text in an alternative structure supporting
string manipulation.
• The structure does not have facilities to store more abstract
concepts and their relationships associated with an item.
• The structure has interesting potential applications, but is
not used in any major commercial products at this time.

Signature file structure
• The coding is based upon the words in the item.
• The words are mapped into word signatures.
• A word signature is a fixed-length code with a fixed number of bits set to “1”.
• The bit positions that are set to “1” are determined via a hash function of the word.
• The word signatures are ORed together to create the signature of an item.
• Words are partitioned into blocks of a fixed size (a block is simply a set of words); in the example, the code length is 16 bits.
• Search is accomplished by template matching on the bit positions.
• Signature files provide a practical solution that has been applied in parallel processing, distributed environments, etc.
• To avoid signatures being too dense with “1”s, a maximum number of words is specified and an item is partitioned into blocks of that size.
• In the example, the block size is five words, the code length is 16 bits and the number of bits allowed to be “1” for each word is five.
• TEXT: Computer Science graduate students study (assume the block size is five words)
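
The scheme above can be sketched as follows, using the stated parameters (16-bit code, five bits set per word, blocks of five words). The hash used to pick bit positions is an assumption: the slides do not name one, so this sketch derives positions from SHA-256, and the resulting bit patterns will not match those in the original figure.

```python
import hashlib

CODE_LENGTH = 16    # bits per signature
BITS_PER_WORD = 5   # bits set to 1 in each word signature
BLOCK_SIZE = 5      # words per block


def word_signature(word):
    """Set BITS_PER_WORD distinct bit positions chosen by hashing the word."""
    signature, counter, bits_set = 0, 0, 0
    while bits_set < BITS_PER_WORD:
        # Assumed hash: first byte of SHA-256 over the word plus a counter.
        digest = hashlib.sha256(f"{word.lower()}:{counter}".encode()).digest()
        position = digest[0] % CODE_LENGTH
        if not signature & (1 << position):
            signature |= 1 << position
            bits_set += 1
        counter += 1
    return signature


def block_signatures(text):
    """Split text into blocks of BLOCK_SIZE words and OR the word signatures."""
    words = text.split()
    signatures = []
    for start in range(0, len(words), BLOCK_SIZE):
        block = 0
        for word in words[start:start + BLOCK_SIZE]:
            block |= word_signature(word)   # superimposed coding
        signatures.append(block)
    return signatures


def might_contain(block_signature, word):
    """Template match: every bit of the word signature must be set in the block."""
    query = word_signature(word)
    return (block_signature & query) == query


blocks = block_signatures("Computer Science graduate students study")
for word in ("Computer", "graduate"):
    print(f"{word:10s} {word_signature(word):016b}")
print(f"{'block':10s} {blocks[0]:016b}")
print(might_contain(blocks[0], "students"))
```

A positive match only means the block may contain the word: because signatures are superimposed, unrelated words can occasionally match too, so candidates still have to be checked against the text.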

Superimposed Coding

Application(s)/Advantage(s)
• Signature files provide a practical solution for storing
and locating information in a number of different
situations.
• Signature files have been applied to medium-size databases, databases with low frequency of terms, WORM devices, parallel processing machines, and distributed environments.
Optical Disk File Structure

https://youtu.be/VhaZhdfyDUY

Trie

Trie
• Search trees are generally used to store collections of numerical values; they are not well suited to storing collections of words or strings.
• A trie is a data structure used to store a collection of strings; it makes searching for a pattern in words easier.
• The term trie comes from the word retrieval.
• A trie is also called a prefix tree and sometimes a digital tree.
• It is a multi-way tree.

A trie is a tree-like data structure used to store a collection of strings.
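
A minimal trie sketch covering insertion, word search and word deletion is given below. The class and method names are assumptions for illustration, and children are kept in a dictionary per node rather than in a fixed-size array.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # one child per next character (multi-way branching)
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, text):
        """Follow text from the root; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def search(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def delete(self, word):
        """Unmark word and prune nodes that no other word needs."""
        path = [self.root]
        for ch in word:
            nxt = path[-1].children.get(ch)
            if nxt is None:
                return False
            path.append(nxt)
        if not path[-1].is_word:
            return False
        path[-1].is_word = False
        for i in range(len(word) - 1, -1, -1):
            child = path[i + 1]
            if child.children or child.is_word:
                break                       # still needed by another word
            del path[i].children[word[i]]
        return True


trie = Trie()
for w in ("bear", "bell", "bid"):
    trie.insert(w)
print(trie.search("bell"), trie.search("be"), trie.starts_with("be"))
trie.delete("bell")
print(trie.search("bell"), trie.search("bear"))
```

Deletion walks back up from the end of the word and stops at the first node that is still shared, so words with a common prefix are left intact.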


Word searching
Word Deletion
B-Tree
• In search trees such as the binary search tree, AVL tree and red-black tree, every node contains only one value (key) and a maximum of two children.
• In a B-Tree, a node can contain more than one value (key) and more than two children.
• The B-Tree was developed in 1972 by Bayer and McCreight under the name Height Balanced m-way Search Tree. It was later named B-Tree.

A B-Tree is a self-balanced search tree in which a node can contain multiple keys and have more than two children.
A B-Tree of order m has the following properties:
• Property #1 - All leaf nodes must be at the same level.
• Property #2 - All nodes except the root must have at least ⌈m/2⌉-1 keys and a maximum of m-1 keys.
• Property #3 - All non-leaf nodes except the root (i.e. all internal nodes) must have at least ⌈m/2⌉ children.
• Property #4 - If the root node is a non-leaf node, then it must have at least 2 children.
• Property #5 - A non-leaf node with n-1 keys must have n children.
• Property #6 - All the key values in a node must be in ascending order.
Example
• B-Tree of Order 4 contains a maximum of 3 key values in a node and
maximum of 4 children for a node.

The following operations are performed on a
B-Tree...
• Search
• Insertion
• Deletion

Search
• The search process starts from the root node and makes an n-way decision at every node,
• where 'n' is the total number of children the node has.
• Time complexity: O(log N), where N is the number of keys stored in the tree.
• Step 1 - Read the search element from the user.
• Step 2 - Compare the search element with the first key value of the root node in the tree.
• Step 3 - If both match, then display "Given node is found!!!" and terminate the function.
• Step 4 - If they do not match, then check whether the search element is smaller or larger than that key value.
• Step 5 - If the search element is smaller, then continue the search process in the left subtree of that key value.
• Step 6 - If the search element is larger, then compare the search element with the next key value in the same node and repeat steps 3, 4, 5 and 6 until an exact match is found or until the search element has been compared with the last key value in a leaf node.
• Step 7 - If the last key value in the leaf node also does not match, then display "Element is not found" and terminate the function.
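
The search steps can be sketched as below. The `BTreeNode` class and the sample tree are assumptions made for illustration (the tree in the original slides is a figure and is not reproduced here); the scan inside each node follows the key-by-key comparison described in the steps.

```python
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf node


def btree_search(node, target):
    """Return True if target is stored in the B-Tree rooted at node."""
    while True:
        # Compare with each key in the node, left to right (steps 2 to 6).
        i = 0
        while i < len(node.keys) and target > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == target:
            return True                 # step 3: match found
        if not node.children:
            return False                # step 7: leaf reached without a match
        node = node.children[i]         # step 5: descend into the subtree


# Hypothetical B-Tree of order 4, built by hand for the example.
root = BTreeNode([100], [
    BTreeNode([35, 65], [
        BTreeNode([10, 20]), BTreeNode([40, 50]), BTreeNode([70, 80, 90]),
    ]),
    BTreeNode([130, 180], [
        BTreeNode([110, 120]), BTreeNode([140, 160]), BTreeNode([190, 240, 260]),
    ]),
])

print(btree_search(root, 120))
print(btree_search(root, 125))
```

Only one node per level is visited, so the number of nodes read is bounded by the height of the tree.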

Insertion Operation in B-Tree
• In a B-Tree, a new element must be added only at a leaf node.
• That is, the new key value is always attached to a leaf node.
• Step 1 - Check whether the tree is empty.
• Step 2 - If the tree is empty, then create a new node with the new key value and insert it into the tree as the root node.
• Step 3 - If the tree is not empty, then find the suitable leaf node to which the new key value is added, using Binary Search Tree logic.
• Step 4 - If that leaf node has an empty position, add the new key value to that leaf node in ascending order of key value within the node.
• Step 5 - If that leaf node is already full, split the leaf node by sending the middle value to its parent node. Repeat this until the promoted value fits into a node.
• Step 6 - If the splitting is performed at the root node, then the middle value becomes the new root node of the tree and the height of the tree increases by one.
Example
• Construct a B-Tree of Order 3 by inserting numbers from 1 to 10.
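
The example can be traced with a short sketch of the insertion steps. The names `Node`, `insert` and `levels` are assumptions for illustration; the sketch adds the key to a leaf first and splits any node that ends up holding m keys, sending the middle key to the parent. It assumes distinct keys.

```python
import bisect


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty for a leaf node


def _insert(node, key, order):
    """Insert key below node; return (middle_key, right_node) if node split."""
    i = bisect.bisect_left(node.keys, key)
    if not node.children:
        node.keys.insert(i, key)                     # always add at a leaf
    else:
        promoted = _insert(node.children[i], key, order)
        if promoted:
            middle_key, right = promoted
            node.keys.insert(i, middle_key)          # middle value moves up
            node.children.insert(i + 1, right)
    if len(node.keys) < order:                       # at most order - 1 keys
        return None
    mid = len(node.keys) // 2
    middle_key = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return middle_key, right


def insert(root, key, order):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return Node([key])
    promoted = _insert(root, key, order)
    if promoted:                                     # root split: tree grows
        middle_key, right = promoted
        return Node([middle_key], [root, right])
    return root


def levels(root):
    """Keys of every node, grouped level by level from the root down."""
    result, level = [], [root]
    while level:
        result.append([node.keys for node in level])
        level = [child for node in level for child in node.children]
    return result


root = None
for value in range(1, 11):
    root = insert(root, value, order=3)
for depth, nodes in enumerate(levels(root)):
    print(depth, nodes)
```

With this splitting rule, the ten insertions end with 4 in the root, the nodes [2] and [6, 8] below it, and the leaves [1], [3], [5], [7] and [9, 10] all on the same level.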

• When storing and searching large amounts of data, traditional binary search trees can become impractical because of poor performance and high memory usage.
• B-Trees (sometimes expanded as Balanced Trees) are a type of self-balancing tree that was specifically designed to overcome these limitations.
Search operation in B-Tree
• Input: Search 120 in the given B-Tree.

Deletion Operation:

Applications of B-Trees:
• They are used in large databases to access data stored on disk.
• Searching a data set can be done in significantly less time using a B-Tree.
• With the indexing feature, multilevel indexing can be achieved.
• Most servers also use the B-Tree approach.
• B-Trees are used in CAD systems to organize and search geometric data.
• B-Trees are also used in other areas such as natural language processing, computer networks, and cryptography.
Advantages of B-Trees:
• B-Trees have a guaranteed time complexity of O(log n) for basic
operations like insertion, deletion, and searching, which makes them
suitable for large data sets and real-time applications.
• B-Trees are self-balancing.
• High-concurrency and high-throughput.
• Efficient storage utilization.

Disadvantages of B-Trees:
• B-Trees are disk-based data structures and can have high disk usage.
• They are not the best choice for all cases.
• They can be slow in comparison to other data structures.
HYPERTEXT AND XML DATA STRUCTURES
• The advent of the Internet, with its exponential growth and wide acceptance as a new global information network, has introduced new mechanisms for representing information.
• This structure is called hypertext and differs from traditional information storage data structures in format and use.
• Hypertext is stored in HTML and XML formats.
• Both of these languages provide detailed descriptions for subsets of text, similar to zoning.
• Hypertext allows one item to reference another item via an embedded pointer.
• HTML defines the internal structure for information exchange over the WWW on the Internet.
• XML: defined by DTD, DOM, XSL, etc.
What is XML?
• Extensible Markup Language
• XML is not a programming language
• It is a data description language, whereas HTML is a presentation language
• Used to store and transport data (HTML is used to display data)
• Used to specify the logical structure of documents
• Used by web services to send data back and forth
• XML's main importance is metadata
• Data about data
• Example
• Article
XML Basics
• Elements
• Attributes
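
To make elements and attributes concrete, the sketch below parses a small article document with Python's standard `xml.etree.ElementTree` module. The XML content itself (tag names, attribute names and text) is a made-up example, not the document shown in the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical article: <article>, <title>, <author>, <section> and <para>
# are elements; id, lang and name are attributes.
ARTICLE_XML = """
<article id="a1" lang="en">
  <title>Data Structures for Retrieval</title>
  <author>Example Author</author>
  <section name="Inverted Files">
    <para>Each term points to the documents that contain it.</para>
  </section>
</article>
"""

root = ET.fromstring(ARTICLE_XML)


def outline(element, depth=0):
    """List (depth, tag, attributes) for the element and all its descendants."""
    rows = [(depth, element.tag, dict(element.attrib))]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


for depth, tag, attributes in outline(root):
    print("  " * depth + tag, attributes)

print(root.find("title").text)
print(root.find("section").get("name"))
```

The indented outline mirrors the tree view of the document: each element is a node, nested elements are its children, and attributes are name/value pairs attached to a node.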

XML Document

XML Tree

Challenges in XML Retrieval

Hidden Markov Models
• Named entities (Bikel-97)
• Optical character recognition (Bazzi-98)
• Topic identification (Kubala-97)
• More recently, information retrieval search
• Tutorial: Dr. Lawrence Rabiner (Rabiner-89)
Discrete Markov process - Example
• Three state Markov Model of the Stock Market
State 1 (S1): market decreased
State 2 (S2): market did not change
State 3 (S3): market increased in value

• The movement between states can be defined by a state transition matrix of state transition probabilities.
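
A sketch of how such a matrix is used is given below. The probabilities in `TRANSITIONS` are invented for illustration only (the matrix from the original slide is not reproduced here); entry `[i][j]` is the assumed probability of moving from state i to state j, so each row sums to 1.

```python
STATES = ("S1: market decreased", "S2: market did not change", "S3: market increased")

# Hypothetical state transition matrix; row = current state, column = next state.
TRANSITIONS = [
    [0.5, 0.3, 0.2],   # from S1
    [0.2, 0.6, 0.2],   # from S2
    [0.1, 0.3, 0.6],   # from S3
]


def sequence_probability(sequence, transitions):
    """Probability of following the given states in order, given the first one.

    States are 0-based indices: 0 for S1, 1 for S2, 2 for S3.
    """
    probability = 1.0
    for current, following in zip(sequence, sequence[1:]):
        probability *= transitions[current][following]
    return probability


# Market increases twice, then is unchanged, then decreases: S3, S3, S2, S1.
path = [2, 2, 1, 0]
print(sequence_probability(path, TRANSITIONS))
```

In a discrete Markov process the next state depends only on the current state, which is why the probability of the whole sequence is just the product of the individual transition probabilities.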
