IT202-DS-Unit 5 - Advanced Search Techniques

IT202 – DS - Unit 5

Advanced Search Techniques

Unit 5 - Syllabus
Advanced Search Techniques
Binary search tree – B-tree indexing – B+ trees – Trie indexing – AVL
trees - Hash table – hash functions – collision resolution and open
addressing
Binary Search Tree

A binary search tree is a binary tree with the following properties:


• The left sub-tree of a node N contains values that are less than N’s value.
• The right sub-tree of a node N contains values that are greater than N’s
value.
• Both the left and the right sub trees also satisfy these properties and,
thus, are binary search trees.
Binary Search Tree - operations
Create a binary search tree using the following data elements:
45, 39, 56, 12, 34, 78, 32, 10, 89, 54, 67, 81
Binary Search Tree - operations
Searching a node with value 67 in the given binary search tree
Binary Search Tree - operations
Searching a node with value 40 in the given binary search tree
Binary Search Tree - operations
Inserting node with value 12 in the given binary search tree
Binary Search Tree - operations
Inserting node with value 55 in the given binary search tree
Binary Search Tree - operations
Deleting a Node that has No Children - delete 78
Binary Search Tree - operations
Deleting a Node that has One Child - delete 54
Binary Search Tree - operations
Deleting a Node with two Children - delete 56
B Trees
Each node of a B-tree of order m can have a maximum of m–1 keys and m pointers to
its sub-trees. A node may therefore contain a large number of key values and
pointers to sub-trees. Storing a large number of keys in a single node
keeps the height of the tree relatively small.
Eg: A B-Tree of order 4 given below
B-Tree Properties

1. Every node in the B tree has at most (maximum) m children.


2. Every node in the B tree except the root node and leaf nodes has at
least (minimum) ⌈m/2⌉ children. This condition helps to keep the tree
bushy so that the path from the root node to the leaf is very short, even
in a tree that stores a lot of data.
3. The root node has at least two children if it is not a terminal (leaf)
node.
4. All leaf nodes are at the same level.
Operations on B-Tree
While performing insertion and deletion operations in a B tree, the
number of child nodes may change.
So, in order to maintain a minimum number of children, the internal
nodes may be joined or split.
Possible operations on B-Tree: search, creation / insertion, and
deletion
Searching for an Element in a B Tree – search 59

Successful Search: 59
Time required: O(log₂ n)
Searching for an Element in a B Tree – search 9

Unsuccessful Search: 9
Time required: O(log₂ n)
Inserting a New Element in a B Tree - steps
1. Search the B tree to find the leaf node where the new key value
should be inserted.
2. If the leaf node is not full, that is, it contains fewer than m–1 key values,
then insert the new element in the node, keeping the node’s elements
ordered.
3. If the leaf node is full, that is, the leaf node already contains m–1 key
values, then
◦ (a) insert the new value in order into the existing set of keys,
◦ (b) split the node at its median into two nodes (note that the split nodes are half full), and
◦ (c) push the median element up to its parent’s node. If the parent’s node is already full, then
split the parent node by following the same steps.
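The insertion steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the slides' own code: the names BTreeNode, btree_insert, and btree_search are hypothetical, duplicate keys are ignored, and a node is split as soon as it holds more than m–1 keys, with the median pushed up to the parent.

```python
import bisect


class BTreeNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # keys kept in sorted order
        self.children = children or []  # empty list for a leaf node

    @property
    def is_leaf(self):
        return not self.children


def _insert(node, key, m):
    """Insert key below node; return (median, right_node) if node was split."""
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return None  # duplicate key: nothing to do
    if node.is_leaf:
        node.keys.insert(i, key)  # steps 2 and 3(a): insert in order
    else:
        promoted = _insert(node.children[i], key, m)
        if promoted is None:
            return None
        median, right = promoted
        node.keys.insert(i, median)
        node.children.insert(i + 1, right)
    if len(node.keys) <= m - 1:
        return None
    # Step 3(b) and 3(c): split at the median and push the median up.
    mid = len(node.keys) // 2
    median = node.keys[mid]
    right = BTreeNode(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return median, right


def btree_insert(root, key, m):
    """Insert key into a B-tree of order m; return the (possibly new) root."""
    promoted = _insert(root, key, m)
    if promoted is None:
        return root
    median, right = promoted
    return BTreeNode([median], [root, right])  # the tree grows by one level


def btree_search(node, key):
    """Return True if key is stored anywhere in the tree rooted at node."""
    while True:
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if node.is_leaf:
            return False
        node = node.children[i]


root = BTreeNode()
for k in [3, 14, 7, 1, 8]:
    root = btree_insert(root, k, m=5)
print(root.keys, [c.keys for c in root.children])  # [7] [[1, 3], [8, 14]]
```

With order 5, the fifth key overflows the single leaf, so the node is split at its median 7, which becomes the new root.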
Inserting a New Element in a B Tree

Look at the B tree of order 5 given below and insert 8, 9, 39, and 4 into it.
Inserting a New Element in a B Tree – insert 8 (before and after)
Inserting a New Element in a B Tree – insert 9 (in the same node, as the order is 5)
Inserting a New Element in a B Tree – insert 39 after splitting the node
Inserting a New Element in a B Tree – insert 4 – 2 level splitting
Deleting an Element from a B Tree - steps
1. Locate the leaf node that contains the key to be deleted.
2. If the leaf node contains more than the minimum number of key values (the minimum is
⌈m/2⌉–1 keys), then delete the value.
3. Else, if the leaf node contains only the minimum number of keys, then fill the node by taking an
element either from the left or from the right sibling.
◦ (a) If the left sibling has more than the minimum number of key values, push its largest key into its parent’s node
and pull down the intervening element from the parent node to the leaf node where the key is deleted.
◦ (b) Else, if the right sibling has more than the minimum number of key values, push its smallest key into its parent
node and pull down the intervening element from the parent node to the leaf node where the key is deleted.
4. Else, if both left and right siblings contain only the minimum number of elements, then
create a new leaf node by combining the two leaf nodes and the intervening element of
the parent node (ensuring that the number of elements does not exceed the maximum
number of keys a node can have, that is, m–1). If pulling the intervening element from
the parent node leaves it with fewer than the minimum number of keys, then
propagate the process upwards, which may reduce the height of the B tree.
Delete values 93, 201, 180, and 72 from B Tree of order 5
Delete 93 (within the node)
Delete 201 (deletion at non-leaf node – replaced by bottom node data)
Delete 180 (with one level upward rotation)
Delete 72 (with node merging and height reduction)
Create a B Tree
Create a B tree of order 5 with elements: 3, 14, 7, 1, 8, 5, 11, 17, 13, 6, 23, 12, 20, 26, 4, 16, 18, 24, 25, and 19.
B+ Trees
1. A B+ tree is a variant of a B tree which stores sorted data in a way that allows
for efficient insertion, retrieval, and removal of records, each of which is
identified by a key.
2. A B tree can store both keys and records in its interior nodes. A B+ tree, in
contrast, stores all the records at the leaf level of the tree; only keys are stored
in the interior nodes.
3. The leaf nodes of a B+ tree are often linked to one another in a linked list. This
has an added advantage of making the queries simpler and more efficient.
4. Typically, B+ trees are used to store large amounts of data that cannot be
stored in the main memory.
5. With B+ trees, the secondary storage (magnetic disk) is used to store the leaf
nodes of trees and the internal nodes of trees are stored in the main memory.
B+ Tree of order 3
Comparison between B Tree and B+ Tree
Insertion in B+ Tree of order 4
Deletion in B+ Tree – delete node 15
AVL Trees
An AVL tree is a self-balancing binary search tree invented by G.M.
Adelson-Velsky and E.M. Landis.
In an AVL tree, the heights of the two sub-trees of a node may differ
by at most one.
Due to this property, the AVL tree is also known as a height-balanced
tree.
The key advantage of using an AVL tree is that it takes O(log n) time
to perform search, insert, and delete operations in an average case as
well as the worst case because the height of the tree is limited to
O(log n).
AVL Tree and Balance Factor
The structure of an AVL tree is the same as that of a binary search
tree but it stores an additional variable called the BalanceFactor.
The balance factor of a node is calculated by subtracting the height
of its right sub-tree from the height of its left sub-tree.
A binary search tree in which every node has a balance factor of –1,
0, or 1 is said to be height balanced.
A node with any other balance factor is considered to be unbalanced
and requires rebalancing of the tree.
Balance factor = Height (left sub-tree) – Height (right sub-tree)
AVL Tree with Balance factor
Operations on AVL Trees - Search
Searching in an AVL tree is performed exactly the same way as it is
performed in a binary search tree.
Due to the height-balancing of the tree, the search operation takes
O(log n) time to complete.
Inserting a New Node in an AVL Tree
Insertion in an AVL tree is also done in the same way as it is done in a binary
search tree.
In the AVL tree, the new node is always inserted as the leaf node.
But the step of insertion is usually followed by an additional step of rotation.
Rotation is done to restore the balance of the tree.
If insertion of the new node does not disturb the balance factor, that is, if the
balance factor of every node is still –1, 0, or 1, then rotations are not required.
During insertion, the new node is inserted as the leaf node, so it will always have a
balance factor equal to zero. The only nodes whose balance factors will change are
those which lie in the path between the root of the tree and the newly inserted
node.
Insertion in AVL Trees – rotation categories
LL rotation: The new node is inserted in the left sub-tree of the left
sub-tree of the critical node.
RR rotation: The new node is inserted in the right sub-tree of the
right sub-tree of the critical node.
LR rotation: The new node is inserted in the right sub-tree of the left
sub-tree of the critical node.
RL rotation: The new node is inserted in the left sub-tree of the right
sub-tree of the critical node.
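The four rotation categories can be sketched in Python as follows. This is a minimal illustration under assumptions: the names AVLNode, avl_insert, rotate_left, and rotate_right are hypothetical, each node stores its height (an empty sub-tree has height 0) instead of the BalanceFactor variable, and the balance factor is computed as Height(left sub-tree) minus Height(right sub-tree).

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # a single node has height 1


def height(node):
    return node.height if node else 0


def balance_factor(node):
    # Height(left sub-tree) - Height(right sub-tree)
    return height(node.left) - height(node.right)


def _update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(a):
    """Single right rotation around a; used for the LL case."""
    b = a.left
    a.left, b.right = b.right, a
    _update(a)
    _update(b)
    return b


def rotate_left(a):
    """Single left rotation around a; used for the RR case."""
    b = a.right
    a.right, b.left = b.left, a
    _update(a)
    _update(b)
    return b


def avl_insert(node, key):
    """Insert key as in a BST, then rebalance on the way back up."""
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = avl_insert(node.left, key)
    elif key > node.key:
        node.right = avl_insert(node.right, key)
    else:
        return node  # duplicate keys are ignored
    _update(node)
    bf = balance_factor(node)
    if bf > 1:  # node is the critical node, left side too tall
        if key > node.left.key:  # LR: new key in right sub-tree of left child
            node.left = rotate_left(node.left)
        return rotate_right(node)  # LL (or second step of LR)
    if bf < -1:  # critical node, right side too tall
        if key < node.right.key:  # RL: new key in left sub-tree of right child
            node.right = rotate_right(node.right)
        return rotate_left(node)  # RR (or second step of RL)
    return node


root = None
for k in [63, 9, 19, 27, 18, 108, 99, 81]:
    root = avl_insert(root, k)
print(root.key, root.left.key, root.right.key)  # 19 9 63
```

In this sequence, inserting 19 triggers an LR rotation at 63, and inserting 81 triggers an LL rotation at 108.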
LL rotation in an AVL tree – insert 18
RR rotation in an AVL tree – insert 89
LR rotation in an AVL tree
RL rotation in an AVL tree
Creation of an AVL Tree
Construct an AVL tree by inserting the following elements in the given order. 63, 9, 19,
27, 18, 108, 99, 81.
Deleting a Node from an AVL Tree
Deletion of a node in an AVL tree is similar to that of binary search
trees.
Deletion may disturb the AVLness of the tree, so to rebalance the AVL
tree, we need to perform rotations.
There are two classes of rotations that can be performed on an AVL
tree after deleting a given node.
These rotations are R rotation and L rotation
Deleting a Node from an AVL Tree
On deletion of node X from the AVL tree, if node A becomes the
critical node (closest ancestor node on the path from X to the root
node that does not have its balance factor as 1, 0, or –1), then the
type of rotation depends on whether X is in the left sub-tree of A or
in its right sub-tree.
If the node to be deleted is present in the left sub-tree of A, then L
rotation is applied, else if X is in the right sub-tree, R rotation is
performed.
R0 Rotation – delete 72
Let B be the root of the left or right sub-tree of A (critical node). R0
rotation is applied if the balance factor of B is 0.
R1 rotation in an AVL tree – delete 72
Let B be the root of the left or right sub-tree of A (critical node). R1
rotation is applied if the balance factor of B is 1.
R–1 Rotation in AVL Tree
Let B be the root of the left or right sub-tree of A (critical node). R–1
rotation is applied if the balance factor of B is –1.
AVL Tree Deletion
Delete 52, 36 and 61
Hashing and collision - Preamble
Binary search and binary search trees are efficient ways to search
for an element (in O(log n) time).
But what if we want to perform the search operation in O(1) time?
Two possible solutions for storing employee records are given below.
Solution 1: employee id = array index (infeasible)
Solution 2: employee key value = array index (wastage of storage)

Hashing – transformation of the key value to the array index (feasible and less storage)
Hash Tables
A hash table is a data structure in which keys are mapped to array positions by a hash
function. A value stored in a hash table can be searched in O(1) time on average by using a hash
function which generates an array index or address from the key. This process of mapping
the keys to appropriate locations (or indices) in a hash table is called hashing. Note that in the
figure, keys k2 and k6 point to the same memory location. This is known as a collision.
Hash Functions
A hash function is a mathematical formula which, when applied to a
key, produces an integer which can be used as an index for the key in
the hash table.
The main aim of a hash function is that elements should be distributed
relatively randomly and uniformly.
It produces a unique set of integers within some suitable range in order
to reduce the number of collisions.
In practice, there is no hash function that eliminates collisions
completely. A good hash function can only minimize the number of
collisions by spreading the elements uniformly throughout the array.
Hash Function - Types
In real-world applications we have alphanumeric keys rather than simple
numeric keys. In such cases, the ASCII value of the character can be
used to transform it into its equivalent numeric key.
Hash Function Types or Transformation Methods
1. Division method
2. Multiplication method
3. Mid square method
4. Folding method
Hash Function - Division method
This method divides x by M and then uses the remainder obtained. In this
case, the hash function can be given as h(x) = x mod M
Example
◦ Calculate the hash values of keys 1234 and 5642 with M = 97.
◦ h(1234) = 1234 % 97 = 70
◦ h(5642) = 5642 % 97 = 16

It is the simplest method of hashing an integer; it requires only a single division operation and so
works very fast. However, extra care should be taken to select a suitable value for M.

A potential drawback of the division method is that while using this method, consecutive keys
map to consecutive hash values. On one hand, this is good as it ensures that consecutive keys do
not collide, but on the other, it also means that consecutive array locations will be occupied. This
may lead to degradation in performance.
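The division method can be sketched in one line of Python. The function name hash_division is an assumption used for illustration; the second print shows how consecutive keys map to consecutive hash values.

```python
def hash_division(key, m):
    """Division method: h(x) = x mod M."""
    return key % m


print(hash_division(1234, 97))  # 70
print(hash_division(5642, 97))  # 16
print([hash_division(k, 97) for k in range(1234, 1238)])  # [70, 71, 72, 73]
```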
Hash Function - Multiplication method
The steps involved in the multiplication method are as follows:
◦ Step 1: Choose a constant A such that 0 < A < 1.
◦ Step 2: Multiply the key k by A.
◦ Step 3: Extract the fractional part of kA.
◦ Step 4: Multiply the result of Step 3 by the size of hash table (m).
Hence, the hash function can be given as:
h(k) = ⌊m (kA mod 1)⌋ where (kA mod 1) gives the fractional part of kA and m is the total number of indices in the hash table.
Example
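Given a hash table of size 1000, map the key 12345 to an appropriate location in the hash table. Use A = 0.618033, m = 1000,
and k = 12345.
h(12345) = ⌊1000 × (12345 × 0.618033 mod 1)⌋ = ⌊1000 × (7629.617385 mod 1)⌋ = ⌊1000 × 0.617385⌋ = ⌊617.385⌋ = 617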

The greatest advantage of this method is that it works practically with any value of A. Although the algorithm works better
with some values, the optimal choice depends on the characteristics of the data being hashed.
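The four steps of the multiplication method can be sketched in Python as follows. The function name hash_multiplication and its default value of A are assumptions taken from the example on this slide; floating-point arithmetic is used, which is adequate for keys of this size.

```python
import math


def hash_multiplication(k, m, A=0.618033):
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    fractional = (k * A) % 1          # Steps 2 and 3: fractional part of kA
    return math.floor(m * fractional)  # Step 4: scale by the table size


print(hash_multiplication(12345, 1000))  # 617
```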
Hash Function – Mid square method
The mid-square method is a good hash function which works in two
steps:
◦ Step 1: Square the value of the key. That is, find k².
◦ Step 2: Extract the middle r digits of the result obtained in Step 1.
The algorithm works well because most or all digits of the key value contribute to the result.
This is because all the digits in the original key value contribute to produce the middle digits
of the squared value. Therefore, the result is not dominated by the distribution of the
bottom digit or the top digit of the original key value.
In the mid-square method, the same r digits must be chosen from all the keys. Therefore,
the hash function can be given as:
h(k) = s
where s is obtained by selecting r digits from k².
Hash Function – Mid square method
Calculate the hash value for keys 1234 and 5642 using the mid-square method. The hash table has 100 memory
locations.
Solution Note that the hash table has 100 memory locations whose indices vary from 0 to 99.
This means that only two digits are needed to map the key to a location in the hash table, so r = 2.
When k = 1234, k² = 1522756, h(1234) = 27
When k = 5642, k² = 31832164, h(5642) = 21
Observe that the 3rd and 4th digits starting from the right are chosen.
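A minimal Python sketch of the mid-square method, assuming the convention used in this example (r = 2 digits, taken as the 3rd and 4th digits from the right). The function name mid_square_hash and the skip parameter are illustrative; other choices of "middle" digits are possible as long as the same positions are used for every key.

```python
def mid_square_hash(key, r=2, skip=2):
    """Return r digits of key squared, after skipping `skip` low-order digits."""
    return (key * key // 10 ** skip) % 10 ** r


print(mid_square_hash(1234))  # 27
print(mid_square_hash(5642))  # 21
```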
Hash Function – Folding method
The folding method works in the following two steps:
Step 1: Divide the key value into a number of parts. That is, divide k into parts k1, k2, ..., kn,
where each part has the same number of digits except the last part, which may have fewer
digits than the other parts.
Step 2: Add the individual parts. That is, obtain the sum of k1 + k2 + … + kn. The hash value is
produced by ignoring the last carry, if any.
Example: Given a hash table of 100 locations, calculate the hash value using folding method
for keys 5678, 321, and 34567.
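Solution: Since there are 100 memory locations to address, we will break the key into parts
where each part except the last contains two digits. The hash value is obtained as follows:
◦ 5678: parts 56 and 78; sum 134; hash value 34 (ignoring the last carry)
◦ 321: parts 32 and 1; sum 33; hash value 33
◦ 34567: parts 34, 56, and 7; sum 97; hash value 97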
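The two folding steps can be sketched in Python as follows. The function name fold_hash and the part_len parameter are assumptions for illustration; the key is split left to right into two-digit parts, and the last carry is dropped by taking the sum modulo 100.

```python
def fold_hash(key, part_len=2):
    """Folding method: split the key into parts, add them, drop the last carry."""
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % 10 ** part_len


for key in (5678, 321, 34567):
    print(key, fold_hash(key))
# 5678 34
# 321 33
# 34567 97
```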
COLLISIONS
Collisions occur when the hash function maps two different keys to the same location.
Obviously, two records cannot be stored in the same location.
Therefore, a method used to solve the problem of collision, also called a collision resolution
technique, is applied. The two popular methods of resolving collisions are:
1. Collision Resolution by Open Addressing
2. Collision Resolution by Chaining
Collision Resolution by Open Addressing
Once a collision takes place, open addressing (also called closed hashing) computes new positions
using a probe sequence and the next record is stored in that position.
In this technique, all the values are stored in the hash table.
The hash table contains two types of values: sentinel values (e.g., –1) and data values. The
presence of a sentinel value indicates that the location contains no data value at present but
can be used to hold a value.
The process of examining memory locations in the hash table is called probing. Open
addressing technique can be implemented using linear probing, quadratic probing, double
hashing, and rehashing.
Linear Probing
The simplest approach to resolve a collision is linear probing. In this technique, if a value is
already stored at a location generated by h(k), then the following hash function is used to
resolve the collision:
h(k, i) = [h’(k) + i] mod m, i = 0, 1, 2, ..., m–1

where m is the size of the hash table, h’(k) = (k mod m), and i is the probe number that varies
from 0 to m–1. Therefore, for a given key k, the location generated by [h’(k) mod m] is
probed first, because i = 0 for the first probe. If the location is free, the value is stored in it; otherwise the
second probe generates the address of the location given by [h’(k) + 1] mod m. Similarly, if that
location is occupied, then subsequent probes generate the addresses [h’(k) + 2] mod m, [h’(k) +
3] mod m, [h’(k) + 4] mod m, [h’(k) + 5] mod m, and so on, until a free location is found.
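Insertion with linear probing can be sketched in Python as follows. The names EMPTY and linear_probe_insert are assumptions for illustration; –1 is used as the sentinel for a free slot, and the keys are those of the exercise that follows (table size 10).

```python
EMPTY = -1  # sentinel value marking a free slot


def linear_probe_insert(table, key):
    """Store key at the first free slot of the sequence (h'(k) + i) mod m."""
    m = len(table)
    home = key % m  # h'(k) = k mod m
    for i in range(m):  # probe number i = 0, 1, ..., m-1
        slot = (home + i) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [EMPTY] * 10
for k in [72, 27, 36, 24, 63, 81, 92, 101]:
    linear_probe_insert(table, k)
print(table)  # [-1, 81, 72, 63, 24, 92, 36, 27, 101, -1]
```

Key 92 collides at position 2 and settles at position 5; key 101 collides at position 1 and needs seven further probes before reaching position 8.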
Linear Probing – Exercise (Hash Table Insertion)
Consider a hash table of size 10. Using linear probing, insert the keys
72, 27, 36, 24, 63, 81, 92, and 101 into the table.
Let h’(k) = k mod m, m = 10
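After inserting 72, 27, 36, 24, 63, 81, and 92 (92 collides at position 2 and is placed at position 5):
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 92 36 27 -1 -1

After inserting 101, using linear probing h(k, i) = [h’(k) + i] mod m with i = 1, 2, 3, ..., 7:
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 92 36 27 101 -1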
Searching a Value using Linear Probing
Apply the hash function to the key and compare the key with the value stored at that location.
If the key does not match, then
◦ the search function begins a sequential search of the array that continues until the value is found,
or
◦ the search function encounters a vacant location in the array, indicating that the value is not
present, or
◦ the search function terminates because it reaches the end of the table and the value is not
present.
Exercise: Search 24, 101, 200
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 92 36 27 101 -1
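A minimal Python sketch of the search, using the table shown above. The names EMPTY and linear_probe_search are assumptions; the function returns the index where the key is found, or None when it hits a vacant slot or has probed every slot.

```python
EMPTY = -1  # sentinel value marking a free slot


def linear_probe_search(table, key):
    """Return the index of key, or None if it is not in the table."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] == key:
            return slot
        if table[slot] == EMPTY:
            return None  # vacant location: the key cannot be further along
    return None  # every slot probed without success


table = [-1, 81, 72, 63, 24, 92, 36, 27, 101, -1]
for key in (24, 101, 200):
    print(key, linear_probe_search(table, key))
# 24 4
# 101 8
# 200 None
```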
Open Addressing – Quadratic Probing
h(k, i) = (h’(k) + i²) mod m, where m is the hash table size and i = 0, 1, 2, 3, ...
Exercise: Consider a hash table of size 10. Using quadratic probing,
insert the keys 72, 27, 36, 24, 63, 81, and 101 into the table
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 -1 36 27 -1 -1

Insert 101
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 101 36 27 -1 -1
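The exercise can be reproduced with a short Python sketch. The names EMPTY and quadratic_probe_insert are assumptions, and h’(k) = k mod m is assumed as in the linear probing exercise. Note that quadratic probing does not visit every slot, so the sketch raises an error if no free slot is found within m probes.

```python
EMPTY = -1  # sentinel value marking a free slot


def quadratic_probe_insert(table, key):
    """Store key at the first free slot of the sequence (h'(k) + i*i) mod m."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i * i) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found along the probe sequence")


table = [EMPTY] * 10
for k in [72, 27, 36, 24, 63, 81, 101]:
    quadratic_probe_insert(table, k)
print(table)  # [-1, 81, 72, 63, 24, 101, 36, 27, -1, -1]
```

Key 101 finds position 1 occupied (i = 0), then position 2 occupied (i = 1), and is stored at position (1 + 4) mod 10 = 5 (i = 2).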
Searching a Value using Quadratic Probing
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 101 36 27 -1 -1

The search function follows the probe sequence through the array until:
◦ the value is found, or
◦ the search function encounters a vacant location in the array, indicating that the value is not present, or
◦ the search function terminates because it has probed the whole table and the value is not present.
Open Addressing – Double hashing
Two hash functions are used to avoid repeated collisions.
h(k, i) = [h1(k) + ih2(k)] mod m
where m is the size of the hash table, h1(k) and h2(k) are two hash
functions
h1(k) = k mod m,
h2(k) = k mod m',
i is the probe number that varies from 0 to m–1,
m' is chosen to be less than m.
We can choose m' = m–1 or m–2.
Exercise
Consider a hash table of size = 10. Using double hashing, insert the
keys 72, 27, 36, 24, 63, 81, 92, and 101 into the table. Take h1 = (k
mod 10) and h2 = (k mod 8).
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 -1 36 27 -1 -1

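Inserting 92 causes a collision at position 2, so double hashing is applied with i = 1:
h(92, 1) = (h1(92) + 1 × h2(92)) mod 10 = (2 + 4) mod 10 = 6 (collision again).
Double hashing is applied again with i = 2:
h(92, 2) = (h1(92) + 2 × h2(92)) mod 10 = (2 + 8) mod 10 = 0, which is free, so 92 is stored at position 0.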
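The double hashing steps for this exercise can be sketched in Python as follows. The names EMPTY and double_hash_insert and the parameter m2 (standing for m') are assumptions; h1 = k mod 10 and h2 = k mod 8 are taken from the exercise.

```python
EMPTY = -1  # sentinel value marking a free slot


def double_hash_insert(table, key, m2=8):
    """Store key at the first free slot of (h1(k) + i*h2(k)) mod m."""
    m = len(table)
    h1, h2 = key % m, key % m2
    for i in range(m):  # probe number i = 0, 1, ..., m-1
        slot = (h1 + i * h2) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found along the probe sequence")


table = [EMPTY] * 10
for k in [72, 27, 36, 24, 63, 81, 92]:
    double_hash_insert(table, k)
print(table)  # [92, 81, 72, 63, 24, -1, 36, 27, -1, -1]
```

With these particular functions, key 101 has h1 = 1 and h2 = 5, so its probe sequence only alternates between positions 1 and 6, both of which are occupied; the sketch raises an error for it instead of looping forever. This is one reason the table size m is usually chosen to be a prime number.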
Rehashing
When the hash table becomes nearly full, the number of collisions
increases, thereby degrading the performance of insertion and
search operations. In such cases, a better option is to create a new
hash table double the size of the original hash table and to move every
entry into it by recomputing its hash value for the new table size.
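Rehashing can be sketched in Python as follows. This is an illustration under assumptions: the names EMPTY, probe_insert, and rehash are hypothetical, and linear probing with h(k) = k mod m is assumed for the new table, although any collision resolution technique could be used.

```python
EMPTY = -1  # sentinel value marking a free slot


def probe_insert(table, key):
    """Linear probing insert (assumed here for the new table)."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def rehash(table):
    """Create a table of double the size and re-insert every stored key."""
    new_table = [EMPTY] * (2 * len(table))
    for key in table:
        if key != EMPTY:
            probe_insert(new_table, key)  # the hash uses the new size
    return new_table


old = [-1, 81, 72, 63, 24, 92, 36, 27, 101, -1]
new = rehash(old)
print(len(new), new.index(72), new.index(92))  # 20 12 13
```

Because the hash value depends on the table size, keys generally land at different positions in the new table: 72 moves from position 2 to position 12.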
Collision Resolution by Chaining
Chained Hash Table
Searching for a value in a chained hash table is as simple as scanning a
linked list for an entry with the given key. The insertion operation appends
the key to the end of the linked list pointed to by the hashed location.
Deleting a key requires searching the list and removing the element.
Exercise
Insert the keys 7, 24, 18, 52, 36, 54, 11, and 23 in a chained hash table of 9 memory
locations. Use h(k) = k mod m.
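A chained hash table for this exercise can be sketched in Python as follows. The class name ChainedHashTable is an assumption, and Python lists stand in for the linked lists of the slides; new keys are appended to the end of the chain, as described above.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.chains = [[] for _ in range(m)]  # one chain per location

    def insert(self, key):
        self.chains[key % self.m].append(key)  # append to the end of the chain

    def search(self, key):
        return key in self.chains[key % self.m]  # scan only one chain

    def delete(self, key):
        chain = self.chains[key % self.m]
        if key in chain:
            chain.remove(key)
            return True
        return False


table = ChainedHashTable(9)
for k in [7, 24, 18, 52, 36, 54, 11, 23]:
    table.insert(k)
print(table.chains)
# [[18, 36, 54], [], [11], [], [], [23], [24], [7, 52], []]
```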
Bucket Hashing
In closed hashing, all the records are directly stored in the hash
table.
◦ Each record with a key value k is stored in a location called its home position. The home
position is calculated by applying some hash function.
◦ In case the home position of the record with key k is already occupied by another record
then the record will be stored in some other location in the hash table. This other location
will be determined by the technique that is used for resolving collisions.
Bucket Hashing
One implementation of closed hashing groups the hash table
into buckets where M slots of the hash table are divided into B
buckets.
Therefore,
◦ Each bucket contains M/B slots.
◦ Now when a new record has to be inserted, the hash function computes the home
position.
◦ If the slot is free, the record is inserted. Otherwise, the bucket’s slots are
sequentially searched until an open slot is found.
◦ In case the entire bucket is full, the record is inserted into an overflow bucket. The
overflow bucket has infinite capacity at the end of the table and is shared by all the
buckets.
Exercise – Bucket hashing
Create a hash table with the following data and hash function. Assume that the hash table has 3 slots in each
bucket and linear probing for resolving overflow.
Data[21] = {17, 09, 34, 56, 11, 71, 86, 55, 22, 10, 4, 39, 49, 52, 82, 13, 40, 31, 35, 28, 44}
H(X) = X mod 9

Bucket No   Slot 1   Slot 2   Slot 3
0           9        44*
1           55       10       82
2           56       11       28*
3           39
4           22       4        49
5           86       13*      40*
6           31*
7           34       52
8           17       71       35

Unmarked numbers are in their hashed (home) bucket, in the first available slot.
Numbers marked with * are in their slots after overflow with linear probing.

Questions
- How many items are not in the home bucket?
- How many comparisons are required for searching 56, 82, 17, 40, 31, 85, 72?
- Identify the changes in the positions of items on removing / deleting 22.
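The bucket table for this exercise can be reproduced with a short Python sketch. The function name bucket_insert is an assumption; each bucket is a list of at most 3 keys, and a full bucket overflows into the next bucket (wrapping around), which is how linear probing is interpreted here. The key 09 is written as 9 because Python does not allow leading zeros in integer literals.

```python
def bucket_insert(buckets, key, slots=3):
    """Insert key into its home bucket, or the next bucket with a free slot."""
    b = len(buckets)
    home = key % b  # H(X) = X mod 9 when there are 9 buckets
    for i in range(b):
        index = (home + i) % b
        if len(buckets[index]) < slots:
            buckets[index].append(key)
            return index
    raise RuntimeError("all buckets are full")


data = [17, 9, 34, 56, 11, 71, 86, 55, 22, 10, 4, 39, 49, 52,
        82, 13, 40, 31, 35, 28, 44]
buckets = [[] for _ in range(9)]
overflowed = [k for k in data if bucket_insert(buckets, k) != k % 9]

for number, bucket in enumerate(buckets):
    print(number, bucket)
print("not in home bucket:", overflowed)  # [13, 40, 31, 28, 44]
```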
Trie – Data Structure
• It is a multiway search tree.
• Data or key is made of a subset of symbols.
◦ e.g., binary data is made of 0’s and 1’s;
◦ strings are made of letters of the alphabet.
• The search time depends on the length of the key.
• Accessing a trie is based on the symbols or characters of the key.
◦ The first level or the root of the trie corresponds to the first character of the key.
◦ The second level corresponds to the second character of the key, and so on.
• A trie provides faster retrieval with a partial key but is heavy on memory space due to multiway branching.
• A trie satisfies the following properties:
◦ There are two types of nodes – branch node and data node.
◦ A branch node has as many address fields as the number of possible symbols that constitute a key.
◦ An address field of a branch node stores the address of another branch node or that of a data node.
◦ A data node stores the actual key, which is a string of characters formed from the symbol set.
Trie – an example – alphabetical data
Trie – an example – numerical data
Searching for a key in tries
The search begins at the root. The first letter of the key is used as an
index to determine which branch to take
An empty branch means that the key being sought is not in the tree
Otherwise we use the second letter of the key to determine the
branch at the next level and so continue
When we reach the end of the word, the information pointer directs
us to the desired information
A NULL information pointer shows that the string is not a word in the
trie
No. of searches = no. of characters in the key
Trie – insertion – points to remember
A branch node has as many address fields as the number of characters in
the symbol set from which the keys are formed
This set of characters is ordered, meaning their position is fixed in the set.
This positional parameter or index is required for identifying the exact
address field corresponding to the symbol for address / pointer update
We must trace our way down the trie to the appropriate point and set the
data pointer to the information record for the new key
If on the way we hit a NULL branch in the trie, we must not terminate the
search but create new nodes and put them into the trie so as to complete
the path corresponding to the new key
Trie - insertion
Deletion from a Trie
We trace down the path corresponding to the key being deleted and
when we reach the appropriate node we set the corresponding data
member to NULL
If now this node has all its members NULL (all branches and the data
members) then we should delete this node. To do so
◦ We can set up a stack of pointers to the nodes on the path from root to the last node
reached
◦ Alternatively we can use recursion in the deletion algorithm and avoid the need for
explicit stack
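Search, insertion, and deletion in a trie can be sketched in Python as follows. This is a simplified illustration, not the branch-node and data-node layout of the slides: the names TrieNode, trie_insert, trie_search, and trie_delete are assumptions, a dictionary stands in for the fixed array of address fields, and the data field plays the role of the information pointer (None means that no key ends at this node). Deletion uses recursion instead of an explicit stack.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # symbol -> TrieNode (stands in for address fields)
        self.data = None    # information record; None means "no key ends here"


def trie_insert(root, key, data=True):
    """Trace the path for key, creating nodes where a branch is empty."""
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.data = data


def trie_search(root, key):
    """Return the stored data for key, or None if key is not in the trie."""
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return None  # empty branch: the key is not present
    return node.data


def trie_delete(node, key, depth=0):
    """Remove key; return True if node is now empty and may be pruned."""
    if depth == len(key):
        node.data = None
    else:
        child = node.children.get(key[depth])
        if child is None:
            return False  # key not present: nothing changes
        if trie_delete(child, key, depth + 1):
            del node.children[key[depth]]  # prune the empty child
    return node.data is None and not node.children


root = TrieNode()
for word in ["car", "cart", "cat"]:
    trie_insert(root, word)
print(trie_search(root, "cart"), trie_search(root, "ca"))  # True None
trie_delete(root, "cart")
print(trie_search(root, "cart"), trie_search(root, "car"))  # None True
```

The number of branch steps in each operation equals the number of characters in the key, independent of how many keys the trie holds.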
