IT202-DS-Unit 5 - Advanced Search Techniques
Unit 5 - Syllabus
Advanced Search Techniques
Binary search tree – B-tree indexing – B+ trees – Trie indexing – AVL
trees - Hash table – hash functions – collision resolution and open
addressing
Binary Search Tree
Successful search: 59
Time required: O(log₂ n)
Searching for an Element in a B Tree – search 9
Unsuccessful search: 9
Time required: O(log₂ n)
Inserting a New Element in a B Tree - steps
1. Search the B tree to find the leaf node where the new key value
should be inserted.
2. If the leaf node is not full, that is, it contains fewer than m–1 key values,
then insert the new element in the node, keeping the node’s elements
ordered.
3. If the leaf node is full, that is, it already contains m–1 key
values, then
◦ (a) insert the new value in order into the existing set of keys,
◦ (b) split the node at its median into two nodes (note that the split nodes are half full), and
◦ (c) push the median element up into the parent node. If the parent node is already full, then
split the parent node by following the same steps.
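The steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the class and method names (`BTreeNode`, `BTree`, `_insert`, `inorder`) are hypothetical, the order m is taken to mean "at most m–1 keys per node" as in the steps, and keys are assumed to be distinct.

```python
import bisect


class BTreeNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # key values, kept sorted
        self.children = children or []  # empty list for a leaf node

    @property
    def is_leaf(self):
        return not self.children


class BTree:
    def __init__(self, order):
        self.m = order                  # a node holds at most m - 1 keys
        self.root = BTreeNode()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:           # the root was split: the tree grows by one level
            median, right = split
            self.root = BTreeNode([median], [self.root, right])

    def _insert(self, node, key):
        """Insert key below node. Return (median, new_right_node) if node was split."""
        i = bisect.bisect_left(node.keys, key)
        if node.is_leaf:
            node.keys.insert(i, key)    # steps 2 and 3(a): insert in order
        else:
            split = self._insert(node.children[i], key)   # step 1: descend
            if split is None:
                return None
            median, right = split       # step 3(c): median pushed up from the child
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
        if len(node.keys) <= self.m - 1:
            return None                 # no overflow, nothing to propagate
        mid = len(node.keys) // 2       # step 3(b): split at the median
        median = node.keys[mid]
        right = BTreeNode(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return median, right

    def inorder(self, node=None):
        """Return all keys in ascending order."""
        node = node or self.root
        if node.is_leaf:
            return list(node.keys)
        out = []
        for i, child in enumerate(node.children):
            out += self.inorder(child)
            if i < len(node.keys):
                out.append(node.keys[i])
        return out


tree = BTree(order=5)
for k in [3, 14, 7, 1, 8, 5, 11, 17, 13, 6, 23, 12, 20, 26, 4, 16, 18, 24, 25, 19]:
    tree.insert(k)
print(tree.root.keys)                             # [13]
print([c.keys for c in tree.root.children])       # [[4, 7], [17, 20]]
```

The sketch inserts first and splits afterwards when a node overflows, which mirrors steps 3(a) to 3(c); the split is reported to the caller so that a full parent is split in the same way.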
Inserting a New Element in a B Tree
Look at the B tree of order 5 given below and insert 8, 9, 39, and 4 into it.
Inserting a New Element in a B Tree – insert 8 (before and after)
Inserting a New Element in a B Tree – insert 9 (fits in the same node because the order is 5)
Inserting a New Element in a B Tree – insert 39 (after splitting the node)
Inserting a New Element in a B Tree – insert 4 (two-level splitting)
Deleting an Element from a B Tree - steps
1. Locate the leaf node that contains the key to be deleted.
2. If the leaf node contains more than the minimum number of key values (more than m/2
elements), then delete the value.
3. Else, if the leaf node does not contain m/2 elements, then fill the node by taking an
element either from the left or from the right sibling.
◦ (a) If the left sibling has more than the minimum number of key values, push its largest key into the parent node
and pull down the intervening element from the parent node to the leaf node where the key is deleted.
◦ (b) Else, if the right sibling has more than the minimum number of key values, push its smallest key into the parent
node and pull down the intervening element from the parent node to the leaf node where the key is deleted.
4. Else, if both left and right siblings contain only the minimum number of elements, then
create a new leaf node by combining the two leaf nodes and the intervening element of
the parent node (ensuring that the number of elements does not exceed the maximum
number of elements a node can have, that is, m–1). If pulling the intervening element from
the parent node leaves it with fewer than the minimum number of keys, then
propagate the process upwards, which may reduce the height of the B tree.
Delete values 93, 201, 180, and 72 from B Tree of order 5
Delete 93 (within the node)
Delete 201 (deletion at a non-leaf node; the key is replaced by a key from a lower node)
Delete 180 (with one level upward rotation)
Delete 72 (with node merging and height reduction)
Create a B Tree
Create a B tree of order 5 with elements: 3, 14, 7, 1, 8, 5, 11, 17, 13, 6, 23, 12, 20, 26, 4, 16, 18, 24, 25, and 19.
B+ Trees
1. A B+ tree is a variant of a B tree which stores sorted data in a way that allows
for efficient insertion, retrieval, and removal of records, each of which is
identified by a key.
2. While a B tree can store both keys and records in its interior nodes, a B+ tree
stores all the records at the leaf level of the tree; only keys are stored
in the interior nodes.
3. The leaf nodes of a B+ tree are often linked to one another in a linked list. This
has an added advantage of making the queries simpler and more efficient.
4. Typically, B+ trees are used to store large amounts of data that cannot be
stored in the main memory.
5. With B+ trees, secondary storage (magnetic disk) is used to store the leaf
nodes, while the internal nodes are kept in the main memory.
B+ Tree of order 3
Comparison between B Tree and B+ Tree
Insertion in B+ Tree of order 4
Deletion in B+ Tree – delete node 15
AVL Trees
An AVL tree is a self-balancing binary search tree invented by G.M.
Adelson-Velsky and E.M. Landis.
In an AVL tree, the heights of the two sub-trees of a node may differ
by at most one.
Due to this property, the AVL tree is also known as a height-balanced
tree.
The key advantage of using an AVL tree is that it takes O(log n) time
to perform search, insert, and delete operations in the average case as
well as in the worst case, because the height of the tree is limited to
O(log n).
AVL Tree and Balance Factor
The structure of an AVL tree is the same as that of a binary search
tree but it stores an additional variable called the BalanceFactor.
The balance factor of a node is calculated by subtracting the height
of its right sub-tree from the height of its left sub-tree.
A binary search tree in which every node has a balance factor of –1,
0, or 1 is said to be height balanced.
A node with any other balance factor is considered to be unbalanced
and requires rebalancing of the tree.
Balance factor = Height (left sub-tree) – Height (right sub-tree)
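As a small illustration of the formula, the sketch below computes the balance factor of a node and checks whether a whole tree is height balanced. The `Node` class and the function names are assumptions made for this example; height is counted in nodes, so an empty sub-tree has height 0.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def height(node):
    """Height counted in nodes: an empty sub-tree has height 0."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def balance_factor(node):
    """Balance factor = Height(left sub-tree) - Height(right sub-tree)."""
    return height(node.left) - height(node.right)


def is_height_balanced(node):
    """True if every node has a balance factor of -1, 0, or 1."""
    if node is None:
        return True
    return (abs(balance_factor(node)) <= 1
            and is_height_balanced(node.left)
            and is_height_balanced(node.right))


balanced = Node(2, Node(1), Node(3))
left_chain = Node(3, Node(2, Node(1)))
print(balance_factor(balanced), is_height_balanced(balanced))      # 0 True
print(balance_factor(left_chain), is_height_balanced(left_chain))  # 2 False
```

Recomputing heights recursively keeps the sketch short; a practical AVL implementation would usually store the height (or the balance factor) in each node instead.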
AVL Tree with Balance factor
Operations on AVL Trees - Search
Searching in an AVL tree is performed exactly the same way as it is
performed in a binary search tree.
Due to the height-balancing of the tree, the search operation takes
O(log n) time to complete.
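Because the search does not depend on balance information, a plain binary search tree lookup works unchanged on an AVL tree. A minimal sketch, assuming a hypothetical `Node` class with `key`, `left`, and `right` fields:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def search(root, key):
    """Return the node holding key, or None if the key is not in the tree."""
    node = root
    while node is not None:
        if key == node.key:
            return node
        # Smaller keys live in the left sub-tree, larger keys in the right one.
        node = node.left if key < node.key else node.right
    return None


root = Node(45, Node(36, Node(27), Node(39)), Node(63, Node(54), Node(72)))
print(search(root, 54) is not None)   # True
print(search(root, 50) is not None)   # False
```

Each comparison moves one level down, so the number of steps is bounded by the height of the tree.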
Inserting a New Node in an AVL Tree
Insertion in an AVL tree is also done in the same way as it is done in a binary
search tree.
In the AVL tree, the new node is always inserted as the leaf node.
But the step of insertion is usually followed by an additional step of rotation.
Rotation is done to restore the balance of the tree.
If insertion of the new node does not disturb the balance factor, that is, if the
balance factor of every node is still –1, 0, or 1, then rotations are not required.
During insertion, the new node is inserted as the leaf node, so it will always have a
balance factor equal to zero. The only nodes whose balance factors will change are
those which lie in the path between the root of the tree and the newly inserted
node.
Insertion in AVL Trees – rotation categories
LL rotation: The new node is inserted in the left sub-tree of the left
sub-tree of the critical node.
RR rotation: The new node is inserted in the right sub-tree of the
right sub-tree of the critical node.
LR rotation: The new node is inserted in the right sub-tree of the left
sub-tree of the critical node.
RL rotation: The new node is inserted in the left sub-tree of the right
sub-tree of the critical node.
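The four cases can be sketched with two primitive rotations: a single right rotation handles the LL case, a single left rotation handles the RR case, and the LR and RL cases each combine the two. The code below is an illustrative sketch only; the names (`AVLNode`, `rotate_left`, `rotate_right`, `insert`) are assumptions, each node stores its height, and keys are assumed to be distinct.

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1            # height counted in nodes


def _height(node):
    return node.height if node else 0


def _update(node):
    node.height = 1 + max(_height(node.left), _height(node.right))


def _balance(node):
    return _height(node.left) - _height(node.right)


def rotate_right(a):
    """Single right rotation around critical node a (LL case)."""
    b = a.left
    a.left, b.right = b.right, a
    _update(a)
    _update(b)
    return b


def rotate_left(a):
    """Single left rotation around critical node a (RR case)."""
    b = a.right
    a.right, b.left = b.left, a
    _update(a)
    _update(b)
    return b


def insert(node, key):
    """Insert key as a leaf, then rebalance on the way back up to the root."""
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    _update(node)
    bf = _balance(node)
    if bf > 1:                                  # left sub-tree too tall
        if key > node.left.key:                 # LR: new key in right sub-tree of left child
            node.left = rotate_left(node.left)
        return rotate_right(node)               # LL (or second half of LR)
    if bf < -1:                                 # right sub-tree too tall
        if key < node.right.key:                # RL: new key in left sub-tree of right child
            node.right = rotate_right(node.right)
        return rotate_left(node)                # RR (or second half of RL)
    return node


def inorder(node):
    return inorder(node.left) + [node.key] + inorder(node.right) if node else []


root = None
for k in [63, 9, 19, 27, 18, 108, 99, 81]:
    root = insert(root, k)
print(root.key, root.left.key, root.right.key)   # 19 9 63
print(inorder(root))                             # [9, 18, 19, 27, 63, 81, 99, 108]
```

With this insertion order, inserting 19 triggers an LR rotation at 63 and inserting 81 triggers an LL rotation at 108.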
LL rotation in an AVL tree – insert 18
RR rotation in an AVL tree – insert 89
LR rotation in an AVL tree
RL rotation in an AVL tree
Creation of an AVL Tree
Construct an AVL tree by inserting the following elements in the given order. 63, 9, 19,
27, 18, 108, 99, 81.
Deleting a Node from an AVL Tree
Deletion of a node in an AVL tree is similar to that of binary search
trees.
Deletion may disturb the AVLness of the tree, so to rebalance the AVL
tree, we need to perform rotations.
There are two classes of rotations that can be performed on an AVL
tree after deleting a given node.
These rotations are the R rotation and the L rotation.
Deleting a Node from an AVL Tree
On deletion of node X from the AVL tree, if node A becomes the
critical node (closest ancestor node on the path from X to the root
node that does not have its balance factor as 1, 0, or –1), then the
type of rotation depends on whether X is in the left sub-tree of A or
in its right sub-tree.
If the node to be deleted is present in the left sub-tree of A, then L
rotation is applied, else if X is in the right sub-tree, R rotation is
performed.
R0 Rotation – delete 72
Let B be the root of the left or right sub-tree of A (critical node). R0
rotation is applied if the balance factor of B is 0.
R1 rotation in an AVL tree – delete 72
Let B be the root of the left or right sub-tree of A (critical node). R1
rotation is applied if the balance factor of B is 1.
R–1 Rotation in AVL Tree
Let B be the root of the left or right sub-tree of A (critical node). R–1
rotation is applied if the balance factor of B is –1.
AVL Tree Deletion
Delete 52, 36 and 61
Hashing and collision - Preamble
Binary search and binary search trees are efficient ways to search
for an element (in O(log n) time).
But what if we want to perform the search operation in O(1) time?
Two solutions are possible for storing employee records, as given below.
◦ Employee id = array index (infeasible)
◦ Employee key value = array index (wastage of storage)
Hashing: transformation of the key value to an array index (feasible and uses less storage)
Hash Tables
A hash table is a data structure in which keys are mapped to array positions by a hash
function. A value stored in a hash table can be searched in O(1) time on average by using a hash
function which generates an array index or address from the key. This process of mapping
the keys to appropriate locations (or indices) in a hash table is called hashing. Note that
keys k2 and k6 point to the same memory location. This is known as a collision.
Hash Functions
A hash function is a mathematical formula which, when applied to a
key, produces an integer which can be used as an index for the key in
the hash table.
The main aim of a hash function is to distribute the elements relatively
randomly and uniformly.
It produces a unique set of integers within some suitable range in order
to reduce the number of collisions.
In practice, there is no hash function that eliminates collisions
completely. A good hash function can only minimize the number of
collisions by spreading the elements uniformly throughout the array.
Hash Function - Types
In real-world applications we have alphanumeric keys rather than simple
numeric keys. In such cases, the ASCII values of the characters can be
used to transform the key into an equivalent numeric key.
Hash Function Types or Transformation Methods
1. Division method
2. Multiplication method
3. Mid square method
4. Folding method
Hash Function - Division method
This method divides x by M and then uses the remainder obtained. In this
case, the hash function can be given as h(x) = x mod M
Example
◦ Calculate the hash values of keys 1234 and 5642 with M = 97.
◦ h(1234) = 1234 % 97 = 70
◦ h(5642) = 5642 % 97 = 16
It is the simplest method of hashing an integer; it requires only a single division operation and so
works very fast. However, extra care should be taken to select a suitable value for M.
A potential drawback of the division method is that while using this method, consecutive keys
map to consecutive hash values. On one hand, this is good as it ensures that consecutive keys do
not collide, but on the other, it also means that consecutive array locations will be occupied. This
may lead to degradation in performance.
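A minimal sketch of the division method, reproducing the example values and showing how consecutive keys land in consecutive slots. The function name `h_division` is an assumption for illustration.

```python
def h_division(x, M):
    """Division method: h(x) = x mod M."""
    return x % M


print(h_division(1234, 97))   # 70
print(h_division(5642, 97))   # 16

# Consecutive keys map to consecutive hash values:
print([h_division(k, 97) for k in range(1234, 1238)])   # [70, 71, 72, 73]
```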
Hash Function - Multiplication method
The steps involved in the multiplication method are as follows:
◦ Step 1: Choose a constant A such that 0 < A < 1.
◦ Step 2: Multiply the key k by A.
◦ Step 3: Extract the fractional part of kA.
◦ Step 4: Multiply the result of Step 3 by the size of hash table (m).
Hence, the hash function can be given as:
h(k) = ⌊m (kA mod 1)⌋, where (kA mod 1) gives the fractional part of kA and m is the total number of indices in the hash table.
Example
Given a hash table of size 1000, map the key 12345 to an appropriate location in the hash table. Use A = 0.618033, m = 1000,
and k = 12345.
h(12345) = ⌊1000 × (12345 × 0.618033 mod 1)⌋ = ⌊1000 × (7629.617385 mod 1)⌋ = ⌊1000 × 0.617385⌋ = ⌊617.385⌋ = 617
The greatest advantage of this method is that it works practically with any value of A. Although the algorithm works better
with some values, the optimal choice depends on the characteristics of the data being hashed.
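The four steps translate almost directly into code. The sketch below is an illustration under the example's parameters; the function name `h_multiplication` is an assumption, and floating-point arithmetic is used, so results for very large keys may be affected by rounding.

```python
import math


def h_multiplication(k, A, m):
    """Multiplication method: h(k) = floor(m * (k*A mod 1)), with 0 < A < 1."""
    fractional = (k * A) % 1.0        # Steps 2 and 3: fractional part of k*A
    return math.floor(m * fractional) # Step 4: scale by the table size


print(h_multiplication(12345, 0.618033, 1000))   # 617
```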
Hash Function – Mid square method
The mid-square method is a good hash function which works in two
steps:
◦ Step 1: Square the value of the key. That is, find k².
◦ Step 2: Extract the middle r digits of the result obtained in Step 1.
The algorithm works well because most or all digits of the key value contribute to the
middle digits of the squared value. Therefore, the result is not dominated by the distribution of the
bottom digit or the top digit of the original key value.
In the mid-square method, the same r digits must be chosen from all the keys. Therefore,
the hash function can be given as:
h(k) = s
where s is obtained by selecting r digits from k².
Hash Function – Mid square method
Calculate the hash value for keys 1234 and 5642 using the mid-square method. The hash table has 100 memory
locations.
Solution: Note that the hash table has 100 memory locations whose indices vary from 0 to 99.
This means that only two digits are needed to map the key to a location in the hash table, so r = 2.
When k = 1234, k² = 1522756, h(1234) = 27
When k = 5642, k² = 31832164, h(5642) = 21
Observe that the 3rd and 4th digits starting from the right are chosen.
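The digit selection used in this example can be sketched with integer arithmetic. The function name `h_mid_square` and the `skip` parameter (how many rightmost digits to drop before taking r digits) are assumptions chosen to match the "3rd and 4th digits from the right" rule; other positions could equally be chosen, as long as they are the same for every key.

```python
def h_mid_square(k, r=2, skip=2):
    """Square k, drop the `skip` rightmost digits, keep the next r digits."""
    squared = k * k
    return (squared // 10 ** skip) % 10 ** r


print(h_mid_square(1234))   # 27  (1234^2 = 1522756)
print(h_mid_square(5642))   # 21  (5642^2 = 31832164)
```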
Hash Function – Folding method
The folding method works in the following two steps:
Step 1: Divide the key value into a number of parts. That is, divide k into parts k1, k2, ..., kn,
where each part has the same number of digits except the last part, which may have fewer
digits than the other parts.
Step 2: Add the individual parts. That is, obtain the sum k1 + k2 + … + kn. The hash value is
produced by ignoring the last carry, if any.
Example: Given a hash table of 100 locations, calculate the hash value using the folding method
for keys 5678, 321, and 34567.
Solution: Since there are 100 memory locations to address, we will break the key into parts
where each part except the last contains two digits. The hash values are obtained as below.
Key          5678                         321         34567
Parts        56 and 78                    32 and 1    34, 56, and 7
Sum          134                          33          97
Hash value   34 (ignore the last carry)   33          97
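A short sketch of the folding method for this example. The function name `h_folding` and the `part_digits` parameter are assumptions; the key is split from the left into two-digit parts, and "ignoring the last carry" is implemented as taking the sum modulo 100.

```python
def h_folding(k, part_digits=2):
    """Split k into parts of part_digits digits (left to right), add them,
    and drop any carry beyond part_digits digits."""
    digits = str(k)
    parts = [int(digits[i:i + part_digits])
             for i in range(0, len(digits), part_digits)]
    return sum(parts) % 10 ** part_digits


for key in (5678, 321, 34567):
    print(key, h_folding(key))
# 5678 34
# 321 33
# 34567 97
```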
COLLISIONS
Collisions occur when the hash function maps two different keys to the same location.
Obviously, two records cannot be stored in the same location.
Therefore, a method called a collision resolution technique is applied to solve the problem
of collisions. The two popular methods of resolving collisions are:
1. Collision Resolution by Open Addressing
2. Collision Resolution by Chaining
Collision Resolution by Open Addressing
Once a collision takes place, open addressing (also called closed hashing) computes new positions
using a probe sequence, and the record is stored in the first free position found.
In this technique, all the values are stored in the hash table.
The hash table contains two types of values: sentinel values (e.g., –1) and data values. The
presence of a sentinel value indicates that the location contains no data value at present but
can be used to hold a value.
The process of examining memory locations in the hash table is called probing. Open
addressing technique can be implemented using linear probing, quadratic probing, double
hashing, and rehashing.
Linear Probing
The simplest approach to resolve a collision is linear probing. In this technique, if a value is
already stored at a location generated by h(k), then the following hash function is used to
resolve the collision:
h(k, i) = [h’(k) + i] mod m, i = 0, 1, 2, ..., m–1
Where m is the size of the hash table, h’(k) = (k mod m), and i is the probe number that varies
from 0 to m–1. Therefore, for a given key k, first the location generated by [h’(k) mod m] is
probed because for the first time i=0. If the location is free, the value is stored in it, else the
second probe generates the address of the location given by [h’(k) + 1]mod m. Similarly, if the
location is occupied, then subsequent probes generate the address as [h’(k) + 2]mod m, [h’(k) +
3]mod m, [h’(k) + 4]mod m, [h’(k) + 5]mod m, and so on, until a free location is found.
Linear Probing – Exercise (Hash Table Insertion)
Consider a hash table of size 10. Using linear probing, insert the keys
72, 27, 36, 24, 63, 81, 92, and 101 into the table.
Let h’(k) = k mod m, m = 10
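After inserting 72, 27, 36, 24, 63, 81, and 92 (92 collides at index 2 and is placed at the next free index, 5):
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 92 36 27 -1 -1
After inserting 101 using linear probing, h(k, i) = [h’(k) + i] mod m, i = 1, 2, 3, ..., 7 (indices 1 to 7 are occupied, so 101 is placed at index 8):
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 92 36 27 101 -1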
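The exercise can be reproduced with a short sketch. The names `EMPTY` and `linear_probe_insert` are assumptions for illustration; –1 is used as the sentinel for a vacant slot, as in the table, so keys are assumed to be non-negative.

```python
EMPTY = -1   # sentinel value marking a vacant slot


def linear_probe_insert(table, k):
    """Insert k using h(k, i) = (h'(k) + i) mod m with h'(k) = k mod m.
    Return the index used, or None if the table is full."""
    m = len(table)
    for i in range(m):
        index = (k % m + i) % m
        if table[index] == EMPTY:
            table[index] = k
            return index
    return None


table = [EMPTY] * 10
for key in [72, 27, 36, 24, 63, 81, 92, 101]:
    linear_probe_insert(table, key)
print(table)   # [-1, 81, 72, 63, 24, 92, 36, 27, 101, -1]
```

Key 92 needs four probes (indices 2, 3, 4, 5) and key 101 needs eight (indices 1 to 8), which illustrates how occupied runs grow under linear probing.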
Searching a Value using Linear Probing
Compute the index by applying the hash function to the key, and compare the key with the value stored there.
If the key does not match, then the search function begins a sequential search of the array that continues until
◦ the value is found, or
◦ the search function encounters a vacant location in the array, indicating that the value is not
present, or
◦ the search function terminates because it reaches the end of the table and the value is not
present.
Exercise: Search 24, 101, 200
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 92 36 27 101 -1
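The search exercise can be checked with the sketch below. The names are assumptions for illustration, and one design choice differs slightly from the description: instead of stopping at the end of the array, the sketch wraps around with mod m and gives up after m probes, so it follows the same probe sequence that insertion uses.

```python
EMPTY = -1   # sentinel value marking a vacant slot


def linear_probe_search(table, k):
    """Return the index of k, or None if k is not in the table."""
    m = len(table)
    for i in range(m):
        index = (k % m + i) % m
        if table[index] == k:
            return index            # value found
        if table[index] == EMPTY:
            return None             # vacant slot: k cannot be further along
    return None                     # every slot probed without success


table = [-1, 81, 72, 63, 24, 92, 36, 27, 101, -1]
print(linear_probe_search(table, 24))    # 4
print(linear_probe_search(table, 101))   # 8
print(linear_probe_search(table, 200))   # None
```

Key 24 is found on the first probe, 101 only after probing indices 1 to 8, and 200 fails immediately because index 0 is vacant.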
Open Addressing – Quadratic Probing
h(k, i) = (h’(k) + i²) mod m, where m is the hash table size and i = 0, 1, 2, 3, ...
Exercise: Consider a hash table of size 10. Using quadratic probing,
insert the keys 72, 27, 36, 24, 63, 81, and 101 into the table.
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 -1 36 27 -1 -1
Insert 101
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 101 36 27 -1 -1
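A sketch of the same exercise in code, using the probe function h(k, i) = (h’(k) + i²) mod m with h’(k) = k mod m. The names `EMPTY` and `quadratic_probe_insert` are assumptions; the probe count is capped at m as a simple stopping rule, and quadratic probing is not guaranteed to visit every slot.

```python
EMPTY = -1   # sentinel value marking a vacant slot


def quadratic_probe_insert(table, k):
    """Insert k using h(k, i) = (k mod m + i*i) mod m.
    Return the index used, or None if no free slot is reached in m probes."""
    m = len(table)
    for i in range(m):
        index = (k % m + i * i) % m
        if table[index] == EMPTY:
            table[index] = k
            return index
    return None


table = [EMPTY] * 10
for key in [72, 27, 36, 24, 63, 81]:
    quadratic_probe_insert(table, key)
print(table)                               # [-1, 81, 72, 63, 24, -1, 36, 27, -1, -1]
print(quadratic_probe_insert(table, 101))  # 5  (probes indices 1, 2, then 5)
print(table)                               # [-1, 81, 72, 63, 24, 101, 36, 27, -1, -1]
```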
Searching a Value using Quadratic Probing
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 101 36 27 -1 -1
The search function follows the same quadratic probe sequence that was used for insertion
and continues until:
◦ the value is found, or
◦ the search function encounters a vacant location in the array, indicating that the value is
not present, or
◦ the search function terminates because the probe sequence is exhausted and
the value is not present.
Open Addressing – Double hashing
Two hash functions are used to avoid repeated collisions:
h(k, i) = [h1(k) + i·h2(k)] mod m
where m is the size of the hash table, and h1(k) and h2(k) are two hash
functions:
h1(k) = k mod m,
h2(k) = k mod m',
i is the probe number that varies from 0 to m–1, and
m' is chosen to be less than m.
We can choose m' = m–1 or m–2.
Exercise
Consider a hash table of size = 10. Using double hashing, insert the
keys 72, 27, 36, 24, 63, 81, 92, and 101 into the table. Take h1 = (k
mod 10) and h2 = (k mod 8).
0 1 2 3 4 5 6 7 8 9
-1 81 72 63 24 -1 36 27 -1 -1
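The exercise can be worked through with the sketch below. The names `EMPTY` and `double_hash_insert` are assumptions, and giving up after m probes is a choice made for this illustration. With m = 10 and h2 = k mod 8, key 92 is placed at index 0 after three probes, while key 101 finds no free slot: its step size h2(101) = 5 shares a factor with the table size 10, so the probe sequence only alternates between indices 1 and 6.

```python
EMPTY = -1   # sentinel value marking a vacant slot


def double_hash_insert(table, k, m2=8):
    """Insert k using h(k, i) = (h1(k) + i*h2(k)) mod m,
    with h1(k) = k mod m and h2(k) = k mod m2.
    Return the index used, or None if no free slot is found in m probes."""
    m = len(table)
    h1, h2 = k % m, k % m2
    for i in range(m):
        index = (h1 + i * h2) % m
        if table[index] == EMPTY:
            table[index] = k
            return index
    return None   # the probe sequence never reached a vacant slot


table = [EMPTY] * 10
for key in [72, 27, 36, 24, 63, 81, 92, 101]:
    print(key, double_hash_insert(table, key))
print(table)   # [92, 81, 72, 63, 24, -1, 36, 27, -1, -1]
```

This is one reason double hashing is usually paired with a prime table size: every non-zero step size then visits all slots. Note also that h2(k) = k mod m' can evaluate to 0 (for example, for k = 72), in which case the probe position never changes.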