
Hashing Techniques

• Hashing is a very efficient method for finding an exact data item by its key, using a hash table.

• A record's storage block is located from the hash value h(k) of its key k.

• A hash function is a mathematical function that maps the search key to the address where the actual record is placed. That address is passed to the operating system, and the record is retrieved.

• The user gives the key, the function

Mapping in a hashed file
Internal Hashing
• Hashing applied to an internal (in-memory) file is called internal hashing.
• Internal hashing is implemented as a hash table, using an array of records held in memory.
• The array index ranges from 0 to m-1.
• A hash function transforms the hash field value into an integer between 0 and m-1.
• A common choice is h(k) = k mod m, which yields the value used as the record address.
• For character strings, the numeric ASCII codes of the characters can be used.
• Example: a hashing algorithm that applies the mod hash function to a character string k of 20 characters.
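
The two hash functions described above can be sketched in Python. This is only an illustration: the names `hash_int` and `hash_string`, the space padding, and the choice to multiply the character codes together modulo m are assumptions, since the slide does not show the algorithm itself.

```python
def hash_int(k: int, m: int) -> int:
    """h(k) = k mod m: maps an integer key to an index in 0..m-1."""
    return k % m


def hash_string(key: str, m: int, length: int = 20) -> int:
    """Mod hash for a fixed-length character string (assumed variant).

    The key is padded with spaces (or cut) to `length` characters, and the
    character codes are multiplied together, reducing modulo m at each step
    so the intermediate value stays small.
    """
    padded = key.ljust(length)[:length]
    temp = 1
    for ch in padded:
        temp = (temp * ord(ch)) % m
    return temp % m


if __name__ == "__main__":
    print(hash_int(27, 13))          # 27 mod 13
    print(hash_string("SMITH", 13))  # some index between 0 and 12
```

Both functions always return a value in the range 0 to m-1, so the result can be used directly as an array index.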

Figure 1.9 Modulo division
Internal Hashing (con’t)
• A collision occurs when the hash field value of a record being inserted hashes to an address that already contains a different record.

• The process of finding another position for this record is called collision resolution.

Collisions
Open Addressing:
Once a position specified by the hash address is found to be occupied, the
program checks the subsequent positions in order until an unused position
is found.
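
A minimal sketch of open addressing follows, assuming integer keys, the hash function h(k) = k mod m, and wrap-around to position 0 at the end of the array. The class name `OpenAddressingTable` is hypothetical.

```python
class OpenAddressingTable:
    """Hash table that resolves collisions by checking subsequent positions."""

    def __init__(self, m: int):
        self.m = m
        self.slots = [None] * m  # None marks an unused position

    def insert(self, key: int) -> int:
        """Store key and return the position it was placed in."""
        start = key % self.m
        for step in range(self.m):
            pos = (start + step) % self.m  # wrap around at the end
            if self.slots[pos] is None or self.slots[pos] == key:
                self.slots[pos] = key
                return pos
        raise OverflowError("hash table is full")

    def find(self, key: int):
        """Return the position of key, or None if it is not stored."""
        start = key % self.m
        for step in range(self.m):
            pos = (start + step) % self.m
            if self.slots[pos] is None:
                return None  # an unused position ends the probe sequence
            if self.slots[pos] == key:
                return pos
        return None
```

With m = 7, the keys 10, 17, and 24 all hash to position 3, so they end up in positions 3, 4, and 5. The sketch does not handle deletion, which needs extra care because emptying a slot would cut the probe sequence.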

Chaining:
Various overflow locations are maintained by extending the array with a
number of overflow positions.
A pointer field is added to each record location.
A collision is resolved by placing the new record in an unused overflow
location and setting the pointer of the occupied hash address location to
the address of that overflow location.
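
The chaining scheme described above might look like the following sketch. It assumes integer keys, h(k) = k mod m, and a fixed-size overflow area appended to the array; the class name `ChainedTable` and the use of -1 as a null pointer are illustrative choices.

```python
class ChainedTable:
    """Array of m hash positions followed by a fixed overflow area."""

    def __init__(self, m: int, overflow_size: int):
        self.m = m
        total = m + overflow_size
        self.keys = [None] * total
        self.next = [-1] * total  # pointer field per location; -1 = no link
        self.free = m             # first unused overflow location

    def insert(self, key: int) -> int:
        """Store key and return the location it was placed in."""
        home = key % self.m
        if self.keys[home] is None:
            self.keys[home] = key
            return home
        if self.free >= len(self.keys):
            raise OverflowError("overflow area is full")
        loc = self.free
        self.free += 1
        self.keys[loc] = key
        # Point the occupied hash address location at the new overflow
        # location, keeping any earlier chain behind the new record.
        self.next[loc] = self.next[home]
        self.next[home] = loc
        return loc

    def find(self, key: int):
        """Follow the chain from the hash address; None if key is absent."""
        loc = key % self.m
        if self.keys[loc] is None:
            return None
        while loc != -1:
            if self.keys[loc] == key:
                return loc
            loc = self.next[loc]
        return None
```

With m = 5, the keys 7, 12, and 17 all hash to position 2; the first stays there and the other two go to overflow locations 5 and 6, linked from position 2.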

Multiple hashing:
If the first hash function results in a collision, then the program applies a
second hash function. If another collision results, the program uses open
addressing or applies a third hash function and then uses open addressing
if necessary.
1. Open addressing resolution

Figure 1.11 Open addressing resolution

2. Linked list resolution

Figure 1.12 Linked list resolution


External Hashing
• Hashing for disk files is called external
hashing.
• The target address space is made of buckets,
each of which holds multiple records.
• A bucket is either one disk block or a cluster of
contiguous blocks.
• The hash function maps a key into a relative
bucket number, rather than an absolute block
address for the bucket

Types of External Hashing
• Using a fixed address space is called static
hashing.
• Dynamically changing address space:
– Extendible hashing / Dynamic hashing
– Linear hashing

Static Hashing:
A hashing scheme in which a fixed number of buckets, m, is allocated for storing records is called static hashing.

• The hash value determines a bucket number; looking that number up in the block directory array yields the block address.

• If n records fit into each block, this method allows up to n*m records to be stored.

• A major drawback of static hashing is that the hash address space is fixed, so the file cannot be expanded or shrunk dynamically.
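
As a rough sketch of static hashing, the class below uses Python lists as stand-ins for disk blocks and assumes integer keys with h(k) = k mod m. The name `StaticHashFile` is hypothetical, and the sketch simply rejects a record when its bucket is full rather than chaining an overflow bucket.

```python
class StaticHashFile:
    """Fixed set of m buckets, each one block holding at most n records."""

    def __init__(self, m: int, n: int):
        self.m = m
        self.n = n
        self.blocks = [[] for _ in range(m)]  # stand-in for disk blocks
        self.directory = list(range(m))       # bucket number -> block address

    def capacity(self) -> int:
        """Maximum number of records without overflow: n * m."""
        return self.n * self.m

    def insert(self, key: int) -> bool:
        """Return True if stored, False if the target bucket is full."""
        bucket_number = key % self.m
        block = self.blocks[self.directory[bucket_number]]
        if len(block) >= self.n:
            return False  # the address space is fixed; it cannot grow
        block.append(key)
        return True
```

With m = 4 and n = 2 the file holds at most 8 records, yet a third key hashing to the same bucket is already rejected, which illustrates the drawback of a fixed address space.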

Static hashing
Extendible Hashing
• In extendible hashing, a directory is maintained as an array of 2^d bucket addresses, where d is the global depth of the directory: the first d high-order (leftmost) bits of the hash value are used as the directory index. However, there does NOT have to be a DISTINCT bucket for each directory entry.
• A local depth d’ is stored with each bucket to indicate the number of bits used for that bucket.

Example keys and their binary representations:
16 - 10000
4 - 00100
6 - 00110
22 - 10110
24 - 11000
5 - 00101
Initially, the global depth and every local depth are 1.
Assume the bucket size is 3.
Overflow (Bucket Splitting)
• When a bucket overflows, that bucket is split.
• This is done by dynamically allocating a new bucket and redistributing the contents of the old bucket between the old and new buckets, based on the increased local depth d’+1 of both buckets.

• The new bucket’s address must now be added to the directory.
• If, after the split, the bucket’s new local depth d’ is still less than or equal to the global depth d, the directory entries are adjusted accordingly. (No change in the directory size is made.)

Overflow (Bucket Doubling)
• If the overflow occurred in a bucket whose local depth d’ is now greater than the global depth d, the global depth must be increased accordingly.
• This doubles the directory size each time d is increased by 1, and the directory entries must be adjusted appropriately.
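
The splitting and doubling rules can be put together in a short sketch. It assumes integer keys that fit in a fixed number of bits (5 here, as in the example keys), uses the key's own bits as the hash value, and indexes the directory by the leftmost d bits as described above. The names `Bucket` and `ExtendibleHashTable` are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHashTable:
    def __init__(self, key_bits: int = 5, bucket_size: int = 3):
        self.key_bits = key_bits
        self.bucket_size = bucket_size
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # 2^d entries, d = 1

    def _index(self, key: int) -> int:
        # The leftmost global_depth bits of the key select the entry.
        return key >> (self.key_bits - self.global_depth)

    def insert(self, key: int) -> None:
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < self.bucket_size:
            bucket.keys.append(key)
            return
        self._split(bucket)
        self.insert(key)  # retry; may trigger a further split

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.key_bits:
            raise OverflowError("no bits left to split on")
        if bucket.local_depth == self.global_depth:
            # New local depth would exceed d: double the directory.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Redistribute on the newly used bit (counted from the left).
        shift = self.key_bits - bucket.local_depth
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            target = new_bucket if (k >> shift) & 1 else bucket
            target.keys.append(k)
        # Re-point the directory entries whose new bit is 1.
        dshift = self.global_depth - bucket.local_depth
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> dshift) & 1:
                self.directory[i] = new_bucket
```

Inserting the example keys 16, 4, 6, 22, 24, 5 fills both initial buckets without overflow. Inserting one more key such as 20 (10100, a value chosen here for illustration) overflows the bucket for prefix 1, whose local depth equals the global depth, so the directory doubles to four entries while the bucket for prefix 0 stays shared by two directory entries.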

Linear Hashing
• Linear hashing allows the hash file to expand and shrink its number of buckets dynamically without needing a directory.
• It starts with M buckets numbered 0 to M-1 and uses the mod hash function
h_i(K) = K mod M
as the initial hash function, called h_i.

Linear Hashing (Con’t)

• Overflow is handled by maintaining an individual overflow chain for each bucket.
• The file grows by methodically splitting the original buckets, starting with bucket 0: the contents of bucket 0 are redistributed between bucket 0 and bucket M (the new bucket) using a secondary hash function:
h_{i+1}(K) = K mod 2M
• Buckets are split in order (0, 1, …, M-1) REGARDLESS of the bucket in which the collision occurred. A variable n keeps track of the next bucket to be split, so after bucket 0 is split, n is incremented to 1.
• When a record hashes to a bucket number less than n, the secondary hash function is used to determine which of the two buckets it belongs in.
• When all of the original M buckets have been split, there are 2M buckets and n = M.
• At that point M is reset to 2M, n is reset to 0, and the secondary hash function becomes the primary hash function.
• Shrinking of the file is driven by the load factor and uses the reverse of splitting.
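
The growth side of linear hashing can be sketched as follows. The sketch assumes integer keys, triggers a split whenever an insertion overflows a bucket, and models each bucket together with its overflow chain as a single Python list; shrinking is left out. The class name `LinearHashFile` is hypothetical.

```python
class LinearHashFile:
    def __init__(self, initial_buckets: int = 4, bucket_size: int = 2):
        self.m = initial_buckets  # M in the text
        self.n = 0                # next bucket to be split
        self.bucket_size = bucket_size
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key: int) -> int:
        b = key % self.m              # primary function: K mod M
        if b < self.n:                # bucket b has already been split
            b = key % (2 * self.m)    # secondary function: K mod 2M
        return b

    def insert(self, key: int) -> None:
        bucket = self.buckets[self._address(key)]
        bucket.append(key)  # entries past bucket_size act as the overflow chain
        if len(bucket) > self.bucket_size:
            self._split_next()  # always split bucket n, not the full one

    def _split_next(self) -> None:
        old = self.buckets[self.n]
        self.buckets[self.n] = []
        self.buckets.append([])  # new bucket, number M + n
        for k in old:
            self.buckets[k % (2 * self.m)].append(k)
        self.n += 1
        if self.n == self.m:  # every original bucket has been split
            self.m *= 2       # secondary function becomes the primary one
            self.n = 0

    def contains(self, key: int) -> bool:
        return key in self.buckets[self._address(key)]
```

With M = 4 and a bucket size of 2, inserting 0, 4, and 8 overflows bucket 0, so it is split: 0 and 8 stay in bucket 0, 4 moves to the new bucket 4, and n becomes 1.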

Redundant Array of Independent Disks
• RAID, or Redundant Array of Independent Disks, is a technology that connects multiple secondary storage devices and uses them as a single storage medium.

• RAID consists of an array of disks in which multiple disks are connected together to provide:
– Increased capacity
– Higher availability
– Increased performance

RAID Levels
Level  Description
0      Striped array with no fault tolerance
1      Disk mirroring
3      Parallel access array with dedicated parity disk
4      Striped array with independent disks and a dedicated parity disk
5      Striped array with independent disks and distributed parity
6      Striped array with independent disks and dual distributed parity
DATA PROTECTION: RAID - 31
Solution: Exploit Parallelism with Data Striping
Striping distributes the data transparently across an array of disks so that they appear as a single disk.
Example: consider a big file striped across N disks
• stripe width is S bytes
• hence each stripe unit is S/N bytes
• a sequential read transfers S bytes at a time

Stripe units on each disk (s i,j is unit j of stripe i):

Disk 0    Disk 1    Disk 2    •••    Disk N-1
s0,0      s0,1      s0,2      •••    s0,N-1
s1,0      s1,1      s1,2      •••    s1,N-1
s2,0      s2,1      s2,2      •••    s2,N-1
•••       •••       •••       •••    •••
Disk Mirroring

Figure: the host writes through the RAID controller to two disks, each holding a copy of Block 0 and Block 1.
Mirroring is a technique whereby data is stored on two different HDDs, yielding two copies of the data. In the event of one HDD failure, the data is intact on the surviving HDD.
RAID 0
◦ In this level, a striped array of disks is implemented.
◦ The data is broken down into blocks and the blocks are
distributed among disks.
◦ Each disk receives a block of data to write/read in parallel.
◦ It enhances the speed and performance of the storage
device.
◦ There is no parity and backup in Level 0.

RAID Level-0
file data: block 0, block 1, block 2, block 3, block 4

sector    Disk 0     Disk 1
0         block 0    block 1
1         block 2    block 3
2         block 4
3
4
5

Data is striped across the HDDs in a RAID set
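
The block placement in the RAID Level-0 figure follows a simple round-robin rule, sketched below. The function names `locate_block` and `raid0_layout` are hypothetical, and the sketch assumes one block per sector position, as in the figure.

```python
def locate_block(block: int, num_disks: int) -> tuple[int, int]:
    """Return (disk, sector) for a logical block under round-robin striping."""
    return block % num_disks, block // num_disks


def raid0_layout(num_blocks: int, num_disks: int) -> list[list[int]]:
    """Return, for each disk, the logical blocks it stores in sector order."""
    disks = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        disk, _sector = locate_block(block, num_disks)
        disks[disk].append(block)
    return disks


if __name__ == "__main__":
    for disk, blocks in enumerate(raid0_layout(5, 2)):
        print(f"Disk {disk}: {blocks}")
```

For five blocks on two disks this places blocks 0, 2, 4 on Disk 0 and blocks 1, 3 on Disk 1, matching the figure.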
RAID 1
Data is mirrored to improve fault tolerance
A RAID 1 group consists of at least two mirrored disks, which provide redundancy and improved read performance.

In the event of disk failure, the impact on data recovery is the least
among all RAID implementations.
RAID 1 is suitable for applications that require high availability.
RAID Level-1
file data: block 0, block 1, block 2, block 3, block 4

sector    Disk 0     Disk 1
0         block 0    block 0
1         block 1    block 1
2         block 2    block 2
3         block 3    block 3
4         block 4    block 4
5
RAID Level 2
This level uses bit-level striping: the bits are striped across the disks.

In the accompanying diagram, b1, b2, b3 are bits and E1, E2, E3 are error correction codes.

Two groups of disks are needed. One group is used to write the data, and the other group is used to write the error correction codes.

This level uses a Hamming error correction code (ECC) and stores this information on the redundancy disks.

When data is written, the ECC for the data is calculated on the fly; the data bits are striped to the data disks and the ECC is written to the redundancy disks.

When data is read from the disks, the corresponding ECC is also read from the redundancy disks.
RAID 3: Bit Interleaved Parity
Logical record: 10010011 11001101 10010011 ...
Striped physical records (one bit per data disk, plus the parity bit P):

data bits           P
1 0 0 1 0 0 1 1     0
1 1 0 0 1 1 0 1     1
1 0 0 1 0 0 1 1     0
• Error detection and correction
• One separate parity disk
• Bit-level striping: the bits of each byte are split across multiple disks
• Complete stripes of data are always read and written across all disks, as the drives operate in parallel. There are no partial writes that update one out of many strips in a stripe.
• Only one request can be serviced at a time
• Targeted at high-bandwidth applications: multimedia, image processing

CSCE430/830 Disk Storage Systems: RAID


Redundant Array of Inexpensive Disks RAID 3: Parity Disk
Logical record: 10010011 11001101 10010011 ...
Striped physical records:

Disk 0      Disk 1      Disk 2      P
10010011    11001101    10010011    11001101

• P contains the sum of the other disks per stripe, mod 2 ("parity").
• If a disk fails, subtract P from the sum of the other disks to find the missing information.
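
The parity rule can be checked with a few lines of Python, using the three bit patterns shown on the slide. Addition and subtraction mod 2 are both the bitwise XOR operation, which is what this sketch relies on; the function names `parity` and `reconstruct` are hypothetical.

```python
from functools import reduce
from operator import xor


def parity(strips: list[int]) -> int:
    """Bitwise sum mod 2 (XOR) of all data strips in one stripe."""
    return reduce(xor, strips, 0)


def reconstruct(surviving: list[int], p: int) -> int:
    """Recover the strip of a failed disk from the survivors and the parity."""
    return reduce(xor, surviving, p)


if __name__ == "__main__":
    data = [0b10010011, 0b11001101, 0b10010011]
    p = parity(data)
    print(f"parity  = {p:08b}")
    # Pretend the middle disk failed and rebuild its contents.
    rebuilt = reconstruct([data[0], data[2]], p)
    print(f"rebuilt = {rebuilt:08b}")
```

Because the first and third strips are identical, they cancel out and the parity equals the middle strip, 11001101.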

CS252/Culler
2/7/02
Lec 6.42
RAID 3 – Parallel Transfer with Dedicated Parity Disk
Figure: the host sends data to the RAID controller, which writes Bit 0, Bit 1, Bit 2, and Bit 3 to separate data disks and the generated parity P0123 to a dedicated parity disk.



RAID 4: Block Interleaved Parity

Disk 0      Disk 1      Disk 2      Disk 3      Parity disk
block 0     block 1     block 2     block 3     P(0-3)
block 4     block 5     block 6     block 7     P(4-7)
block 8     block 9     block 10    block 11    P(8-11)
block 12    block 13    block 14    block 15    P(12-15)

• Uses block-level striping.
• Uses multiple data disks and a dedicated disk to store parity.
• Allows parallel access by multiple I/O requests.
• Doing multiple small reads is now faster than before.



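RAID 4 – Striping with Dedicated Parity Disk
Figure: the host sends blocks to the RAID controller, which generates the parity. The four data disks hold Block 0 and Block 4, Block 1 and Block 5, Block 2 and Block 6, and Block 3 and Block 7; the dedicated parity disk holds P0123 and P4567.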

RAID 5: Block Interleaved Distributed-Parity

Disk 0      Disk 1      Disk 2      Disk 3      Disk 4
block 0     block 1     block 2     block 3     P(0-3)
block 4     block 5     block 6     P(4-7)      block 7
block 8     block 9     P(8-11)     block 10    block 11
block 12    P(12-15)    block 13    block 14    block 15
P(16-19)    block 16    block 17    block 18    block 19

• It uses striping, and the disks (strips) are independently accessible.
• The difference between RAID 4 and RAID 5 is the parity location.
• In RAID 4, parity is written to a dedicated disk, creating a write bottleneck at the parity disk.
• In RAID 5, parity is distributed across all disks.
• The distribution of parity in RAID 5 overcomes the write bottleneck.
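
The rotating parity position in the RAID 5 table can be generated with a short sketch. It assumes the rotation shown in the table, where the parity block starts on the last disk and moves one disk to the left with each stripe; other RAID 5 implementations may rotate differently. The function name `raid5_layout` is hypothetical.

```python
def raid5_layout(num_disks: int, num_stripes: int) -> list[list[str]]:
    """Return one row of labels per stripe, one label per disk."""
    rows = []
    block = 0
    data_per_stripe = num_disks - 1
    for stripe in range(num_stripes):
        # Parity starts on the last disk and moves left each stripe.
        parity_disk = (num_disks - 1 - stripe) % num_disks
        parity_label = f"P({block}-{block + data_per_stripe - 1})"
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append(parity_label)
            else:
                row.append(f"block {block}")
                block += 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in raid5_layout(num_disks=5, num_stripes=5):
        print("  ".join(f"{cell:<10}" for cell in row))
```

Because the parity block lands on a different disk in each stripe, parity writes for different stripes go to different disks instead of queuing on one dedicated disk.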
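RAID 5 – Independent Disks with Distributed Parity
Figure: the host sends blocks to the RAID controller, which generates the parity. The five disks hold Block 0 and Block 4, Block 1 and Block 5, Block 2 and Block 6, Block 3 and P4567, and P0123 and Block 7, so the parity is spread across the disks.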



RAID 6 – Dual Parity RAID
In single-parity schemes such as RAID-3, 4, and 5, two disk failures in a RAID set lead to data unavailability and data loss.
RAID-6 protects against two disk failures by maintaining two parities:
◦ Horizontal parity, which is the same as RAID-5 parity
◦ Diagonal parity, which is calculated by taking diagonal sets of data blocks from the RAID set members
