TUTORIAL ON B+TREE & HASH INDEXING

Index
• An index is a data structure that supports efficient access to data (i.e., rows) without having to scan the entire table
• It is based on a search key: rows having a particular value for the search-key attributes can be quickly located
[Figure: a matching condition on an index attribute value (the search key) selects a set of records.]
CSD Univ. of Crete Fall 2005
Index Structure
• Index entries
  - The row itself (index and table are integrated in this case), or
  - A search key value and a pointer to a row having that value (the table is stored separately in this case)
• Location mechanism
  - Algorithm + data structure for locating an index entry with a given search key value
  - Various data structures can serve as indexes, e.g.,
    · simple indexes on sorted (sequential) files
    · secondary indexes on unsorted (heap) files
    · hierarchical index structures (B-trees)
    · hash tables
• Index entries are stored in accordance with the search key value
  - Entries with the same search key value are stored together (hash, B+ tree)
  - Entries may be sorted on search key value (B+ tree)
Index Structure
[Figure: given a search key value S, the location mechanism makes it easy to find the index entry for S among the index entries.]
Storage Structure
• Structure of the file containing a table
  - Heap file (no index, not integrated)
  - Sorted file (no index, not integrated)
  - Integrated file containing index and rows (index entries contain the rows in this case), e.g.,
    · ISAM
    · B+ tree
    · Hash
• The simplest case: given a sorted file (the data file), create another file (the index file) consisting of key-pointer pairs:
  - a search key K is associated with a pointer to a record of the data file that has search key K
[Figure: an index file (a secondary index) with its own location mechanism and index entries points into a storage structure that contains the table and its (main) index.]
• In this case, the storage structure might be a heap or sorted file, but it is generally an integrated file with a main index
Clustered Index
• Clustered index: index entries and rows are ordered in the same way
  - An integrated storage structure is always clustered (since the rows and the index entries are the same)
  - The particular index structure (e.g., hash, tree) dictates how the rows are organized in the storage structure
    · There can be at most one clustered index on a table
  - CREATE TABLE generally creates an integrated, clustered (main) index
• Good for range searches, when a range of search key values is requested
  - Use the location mechanism to locate the index entry at the start of the range
    · This locates the first row
  - Subsequent rows are stored in successive locations if the index is clustered (not so if unclustered)
  - Minimizes page transfers and maximizes the likelihood of cache hits
[Figure: a clustered index; the storage structure contains the table and its (main) index, and the rows are contained in the index entries.]
Unclustered Index
• Unclustered (secondary) index: index entries and rows are not ordered in the same way
• Sparse index: not every record in the data file has an index entry
  - One index entry for each page of the data file
    · The index entry indicates the block of records
• A good compromise
  - A sparse index with one index entry per block
  - Trades off access time against space overhead
[Figure: a data file (Id, Name, Dept) sorted on Id, with a sparse, clustered index sorted on Id and a dense, unclustered index sorted on Name.]

Sparse Index
• Example: 1,000 index blocks (4 MB) are needed for the 100,000 data blocks
• A dense index can answer the query "Does there exist a record with key value K?" without retrieving the block containing the record
• A sparse index needs a disk I/O to retrieve the block on which the record with key K might be found
• To find a record with key K, we search the index for the largest key that is ≤ K and follow the pointer to the data block; then we search the block for the record
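The lookup just described can be sketched in Python; the index layout (a sorted list of hypothetical (key, block) pairs, one per data block) and the record format are assumptions for illustration:

```python
import bisect

def sparse_index_lookup(index, data_blocks, k):
    """Find the record with key k: binary-search the sparse index for the
    largest key <= k, then scan the pointed-to data block."""
    keys = [key for key, _ in index]
    pos = bisect.bisect_right(keys, k) - 1   # largest index key <= k
    if pos < 0:
        return None                          # k precedes every indexed key
    _, block_no = index[pos]
    for record in data_blocks[block_no]:     # one disk I/O in the real system
        if record[0] == k:
            return record
    return None

# One sparse index entry per block of the sorted data file.
blocks = [[(10, 'a'), (20, 'b')], [(30, 'c'), (40, 'd')]]
index = [(10, 0), (30, 1)]
print(sparse_index_lookup(index, blocks, 40))   # (40, 'd')
```

In a disk-based system the binary search touches only index blocks; exactly one data block is then transferred.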
Two-Level Index
Multilevel Index
Secondary Indices
• Must be dense
  - Since the records are not physically stored in search-key order
• Can use multiple levels
  - A sequential scan of the records in the search-key order of a secondary index is often very slow
    · It requires many block accesses
• Updates of the data lead to updates of both indices
  - Likely a significant overhead on modification of the database
Secondary Indices
[Figure: a sparse secondary index on the sequence field over an unsorted file; a sparse index does not make sense here!]
[Figure: a multilevel secondary index: a sparse higher level over the lowest level, which has to be dense.]
• Assume that the order of records with identical search-key values does not matter
• A dense index can be built with one entry with key K for each record that has search key K
• Searching: look up the first record with key K in the index file and find all the other records with the same value (they must immediately follow)
Duplicate Keys
Dense Index
[Figure: a dense index on a file with duplicate keys: one index entry, with a record pointer, per record.]
Dense Index
[Figure: a dense index on a file with duplicate keys: one index entry, with a record pointer, per distinct key value.]
Sparse Index
[Figure: a sparse index, with one block pointer per block, on a file with duplicate keys; be careful when searching for 20 or 30!]
one option...
[Figure: a sparse index variant for duplicate keys; problem: excess overhead in both disk space and search time!]
Yet another idea: chain records with the same key?
[Figure: the index and data file with records of the same key linked in a chain.]
Problems:
• Need to add fields to the records, which messes up maintenance
• Need to follow the chain to know all the records
[Figure: the same index, but each key points to a bucket of record pointers; the bucket collects all records with that key.]
• Indexes:
  - Name: primary
  - Dept: secondary
  - Floor: secondary
[Figure: the Toy bucket (Dept index) and the 2nd-floor bucket (Floor index); intersect the Toy bucket and the 2nd-floor bucket to get the set of matching EMPs!]
• If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index
B+ Tree Properties
• A B+ tree is a balanced tree whose leaves contain a sequence of key-pointer pairs
  - All paths from the root to a leaf have the same length
  - Given a search key K, the access time is nearly the same for different values of K
B+ Tree Properties
• The search keys in a node are sorted: K1 < K2 < … < Kn
• Non-leaf node: pointers P1, …, Pn+1 interleaved with the keys
  - Every search-key value in the subtree Si pointed to by Pi satisfies Ki-1 ≤ value < Ki
  - Key values in S1 < K1
  - K1 ≤ key values in S2 < K2
• Leaf node: each key Ki is stored with a pointer Pi to the record of Ki
B+ Tree Searching
• Given a search value k
  - Start from the root: in the current node, look for the largest search-key value Kl that is ≤ k
  - Follow pointer Pl+1 to the next level; repeat until a leaf node is reached
  - If k equals Kl in the leaf, follow Pl to fetch the record (or bucket) of Kl
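A minimal in-memory sketch of this descent, assuming a simplified node layout (the `Node` class and its fields are hypothetical, not the slides' exact representation):

```python
class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # sorted search-key values
        self.children = children    # child nodes (internal node) or None
        self.records = records      # records aligned with keys (leaf) or None

def bptree_search(node, k):
    """Descend from the root: in each internal node follow the pointer just
    after the largest key <= k; in the leaf, return the record for k."""
    while node.children is not None:
        l = 0
        while l < len(node.keys) and node.keys[l] <= k:
            l += 1
        node = node.children[l]     # pointer P(l+1) in the slides' notation
    for key, rec in zip(node.keys, node.records):
        if key == k:
            return rec
    return None

leaf1 = Node([7, 9], records=['r7', 'r9'])
leaf2 = Node([13, 15], records=['r13', 'r15'])
root = Node([13], children=[leaf1, leaf2])
print(bptree_search(root, 9))    # 'r9'
```

Since keys equal to a separator go right, searching for 13 follows the second child, matching the Ki-1 ≤ value < Ki rule above.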
B+ Tree Insertion
• Overflow
  - Occurs when the number of search-key values in a node exceeds n
• Leaf node
  - Split into two nodes:
    · The 1st node contains the first ⌈(n+1)/2⌉ values
    · The 2nd node contains the remaining values
    · Copy the smallest search-key value of the 2nd node up to the parent node
• Example: inserting 8 into the full leaf [7, 9, 13, 15] causes an overflow; the leaf is split and the smallest key of the new right node is copied into the parent
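The leaf-split rule can be sketched directly; the list-based leaf and the example keys from the slide are the only inputs, everything else is illustrative:

```python
import math

def split_leaf(keys, new_key, n):
    """Insert new_key into a full leaf of capacity n, then split:
    the left node keeps ceil((n+1)/2) values, the right node the rest,
    and the right node's smallest key is copied up to the parent."""
    keys = sorted(keys + [new_key])
    cut = math.ceil((n + 1) / 2)
    left, right = keys[:cut], keys[cut:]
    return left, right, right[0]   # right[0] is copied into the parent

# Inserting 8 into the full leaf [7, 9, 13, 15] with n = 4:
print(split_leaf([7, 9, 13, 15], 8, 4))   # ([7, 8, 9], [13, 15], 13)
```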
B+ Tree Deletion
• Underflow
  - Occurs when the number of search-key values drops below ⌈(n+1)/2⌉ - 1
• Leaf node
  - Redistribute entries with a sibling
    · The right node should not end up with fewer values than the left node
    · Replace the in-between value in the parent with the smallest value of the right node
• Example: deleting 10 from the leaf [9, 10] leaves [9]; borrowing 13 from the sibling [13, 14, 16] gives leaves [9, 13] and [14, 16], and the parent separator 13 is replaced by 14
Example on B+ Tree
• Construct a B+ tree
  - Assume the tree is initially empty
  - All paths from the root to a leaf are of the same length
  - The number of pointers that fit in one node is four (n = 3)
  - Leaf nodes must hold between 2 and 3 values (⌊(n+1)/2⌋ and n)
  - Non-leaf nodes other than the root must have between 2 and 4 children (⌈(n+1)/2⌉ and n+1)
  - The root must have at least 2 children
Insertions

[Figures: the tree is built by inserting 2, 3, 5, then 7, 11, 17, 19, 23, 29, 31, 9, 10, and 8 one at a time; leaves and internal nodes split as they overflow (e.g., after inserting 31 the root becomes [11] over internal nodes [5] and [19, 29]). Then 23, 19, 17, 10, and 11 are deleted one at a time, redistributing and merging nodes on underflow; the final tree has root [5, 9] over leaves [2, 3], [5, 7, 8], and [9, 29, 31].]
[Figure: an example file of 16 student records (name, age, gpa, dept): Bob, 21, 3.7, CS; Kane, 19, 3.8, ME; Louis, 32, 4, LS; Chris, 22, 3.9, CS; Mary, 24, 3, ECE; Lam, 22, 2.8, ME; Martha, 29, 3.8, CS; Chad, 28, 2.3, LS; Tom, 20, 3.2, EE; Chang, 18, 2.5, CS; James, 24, 3.1, ME; Leila, 20, 3.5, LS; Kathy, 18, 3.8, LS; Vera, 17, 3.9, EE; Pat, 19, 2.8, EE; Shideh, 16, 4, CS.]
[Figure: a B+ tree on gpa over the unsorted student file; leaf entries are (gpa, (page, slot)) pairs, e.g., (2.3, (4, 2)), pointing to records scattered across the file (an unclustered index).]
[Figure: leaf entries (gpa, (page, slot)) in gpa order, e.g., (2.3, (1, 1)), over the student file sorted on gpa.]

Clustered B+-Tree
• A clustered B+ tree on the gpa attribute: the index entries and the records of the file are ordered the same way on gpa
Clarifications on B+ Trees
• B+ trees can be used to store relations as well as index structures
• In the B+ trees drawn here we assume (this is not the only scheme) that an internal node with q pointers stores the maximum key of each of the first q-1 subtrees it points to; that is, it contains q-1 keys
• Before a B+ tree can be generated, the following parameters have to be chosen (based on the available block size; it is assumed one node is stored in one block):
  - the order p (n+1, so far) of the tree: p is the maximum number of pointers an intermediate node may have; if it is not the root, it must have between (p+1)/2 and p pointers ('/' is integer division)
  - the maximum number m of entries a leaf node can hold: in general, leaf nodes (except the root) must hold between (m+1)/2 and m entries
• Internal nodes usually store more entries than leaf nodes
  - Example: with a 2000-byte block, m ≤ 2000/(4+12+4) = 2000/20 = 100 entries per leaf
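The arithmetic above can be reproduced; interpreting the 4 + 12 + 4 bytes as record pointer, key, and per-entry overhead is an assumption:

```python
block_size = 2000          # bytes per block (one node per block)
record_ptr = 4             # bytes for a record pointer (assumed)
key_size = 12              # bytes for a search key (assumed)
overhead = 4               # bytes of per-entry overhead (assumed)

entry = record_ptr + key_size + overhead
m = block_size // entry    # max entries a leaf can hold
print(entry, m)            # 20 100
```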
• The insertion algorithm is the same, but care must be taken when an old (duplicate) value is inserted in a leaf and we must split
  - The parent might get a NULL key value pointing to the second leaf of the split
• The deletion algorithm is the same, but care must be taken when there is underflow, because in this case the parent node must be updated
B+-Tree Performance
• Tree levels
  - Tree fanout
    · Size of key
    · Page utilization
• Tree maintenance
  - Online
    · On inserts
    · On deletes
  - Offline
• Tree locking
B+-Tree Locking
• Tree traversal
  - Update, Read
  - Insert, Delete
    · Phantom problem: need for range locking
[Figure: a B+ tree with root A, children B and C, and leaves D, E, F; transaction T1 holds locks along its root-to-leaf traversal path.]
• Pointer size = 5 bytes; index entry size = 30 bytes (5 + 25); index entries per page = 33
• Assume that there are 1,000 job titles; an index on job title then contains only two levels (assuming that all index nodes are full)
  - We use the index to retrieve the first record; since the file is clustered, the remaining Manager records are stored sequentially after the first one
• Clustered B+ tree index on ename
  - No use of the index: file scan
• What is the minimum height of the B+ tree? How many block accesses are needed for the following query, using a B+ tree of minimum height?

SELECT A
FROM r
WHERE A = 150
• Total number of block accesses = index block accesses + data block accesses = 13
B+ Trees Summary
but…
• maintenance overhead
• overkill for small, static files
• duplicate keys?
• relies on a random distribution of key values for efficient insertion
Hashing
• Buckets:
  - Set up an area to keep the records: the primary area
  - Divide the primary area into buckets
  - Issue: non-uniform distribution of the records
  - More than one key may hash to the same bucket number: a collision
• Separate chaining:
  - A new record with the same hash-function value has been added: how to handle it?
  - Add the address of a new bucket in a special area called the overflow area
  - Keep some overflow area in each cylinder
Static Hashing
• Static hashing
  - A hash function h maps a search-key value K to the address of a bucket
  - A commonly used hash function: the hash value mod nB, where nB is the number of buckets
  - E.g., with 10 buckets, h(Brighton) = (2+18+9+7+8+20+15+14) mod 10 = 93 mod 10 = 3
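A sketch of the letter-sum hash from the example, assuming the usual a = 1 … z = 26 letter values implied by the Brighton computation:

```python
def h(key, n_buckets=10):
    """Sum the alphabet positions of the letters (a=1 ... z=26),
    then take the result mod the number of buckets."""
    total = sum(ord(c) - ord('a') + 1 for c in key.lower())
    return total % n_buckets

print(h("Brighton"))   # 93 mod 10 = 3
```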
Static Hashing
• Clustered hash index:
  - The index structure and its buckets are represented as a file (say file.hash)
  - The relation is stored in file.hash (i.e., each entry in file.hash corresponds to a record in the relation)
  - Assuming no duplicates: a record can be accessed in 1 I/O
[Figure: the student records hashed into buckets, with overflow records (Shideh; Leila; James) chained off full primary buckets.]
Load Factor
• The load factor Lf: the number of records in the file divided by the number of places for records in the primary area
  - The primary area should be 70% to 90% full
• Lf = R / (B * c)
  - Lf → load factor
  - R → number of records in the file
  - c → number of records one bucket can hold
  - B → number of buckets (blocks) in the primary area
• How can the choice of Lf and c affect the time to fetch a record?
Fetching Using Buckets & Chains
• As long as there is no overflow,
  Tf = s + r + dtt
  - s → average seek time
  - r → average rotational latency
  - dtt → time to transfer the data in one bucket
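The two formulas can be evaluated for illustrative numbers (the 560 records, 80 buckets, and millisecond timings below are made up):

```python
def load_factor(R, B, c):
    """Lf = R / (B * c): records in the file over primary-area capacity."""
    return R / (B * c)

def fetch_time(s, r, dtt):
    """Tf = s + r + dtt when the record is found without overflow."""
    return s + r + dtt

# e.g., 560 records in 80 buckets of 10 records each -> 70% full
print(load_factor(560, 80, 10))          # 0.7
# e.g., 8 ms seek + 4 ms latency + 1 ms transfer
print(fetch_time(8, 4, 1))               # 13 ms
```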
• To sum up:
  - Performance can be enhanced by the choice of bucket size and load factor
  - Space utilization: policy for handling overflow
  - If the file does not grow much, hashing can give excellent performance and space utilization
Database Growth
• Consider a database that grows too much. Will the buckets suffice?
• x → average chain length in the overflow area
• Tf(successful) = s + r + dtt + (x/2) * (s + r + dtt)
  - One access to get to the first record in the chain
  - Traverse half of the chain on average
• Tf(unsuccessful) = s + r + dtt + x * (s + r + dtt)
  - One access to get to the first record in the chain
  - Traverse the whole chain
• Deletion:
  - Merely mark the deleted record; when the whole hash table is reorganized, marked records are not copied to the new table
  - Or: replace the deleted record with the last record; de-allocate empty buckets
    · Cost: Td = Tf(unsuccessful) + 2r
Database Growth
• Insertion:
  - New insertions go at the end of the chain
  - Assuming no new buckets are necessary:
    · Ti = Tf(unsuccessful) + 2r (a fetch plus one disk rotation)
• Sequential operations:
  - Hashing is useless for sequential operations. Why?
  - Given that n sequential reads are to be processed: Tx = n * Tf
• If the growth of a file can be predicted, hashing with buckets is excellent
  - Range queries? No!
  - For a mixture of sequential operations, range searches, and individual searches: prefer B+ trees
• If the file grows considerably, reorganization is necessary
  - Linear hashing
  - Bounded-index extensible hashing
  - No file reorganization is required
Extensible Hashing
• Extensible hashing
  - The hash function returns b bits
  - Only the prefix of i bits is used to hash the item
  - There are 2^i entries in the bucket address table
  - Let ij be the length of the common hash prefix for data bucket j; then 2^(i-ij) entries of the bucket address table point to bucket j
[Figure: a bucket address table whose entries point to bucket2 (local depth i2) and bucket3 (local depth i3).]
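A minimal sketch of the directory arithmetic, assuming the hash value is given as a bit string:

```python
def bucket_index(hash_bits, i):
    """Use only the first i bits of the b-bit hash value as the
    index into the 2**i-entry bucket address table."""
    return int(hash_bits[:i], 2) if i > 0 else 0

def entries_pointing_to(i, i_j):
    """A data bucket with common-prefix length i_j is shared by
    2**(i - i_j) directory entries."""
    return 2 ** (i - i_j)

print(bucket_index("10111", 2))    # prefix '10' -> directory entry 2
print(entries_pointing_to(3, 1))   # 4 directory entries share the bucket
```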
Splitting
• Splitting, Case 1: ij = i
  - Only one entry in the bucket address table points to data bucket j
  - Increment i; split data bucket j into j and z; set ij = iz = i; rehash all items previously in j
[Figure: the directory doubles from 2 bits (00, 01, 10, 11) to 3 bits (000 through 111) and the full bucket is split; the other buckets keep their local depths.]
Splitting (cont.)
[Figure: splitting a bucket whose local depth is smaller than the directory depth: no directory doubling is needed; the directory entries are simply redistributed between the two resulting buckets.]
• Show the extensible hash structure if the hash function is h(x) = x mod 8 and buckets can hold three records
• Insert 2, 3, 5: all fit in a single bucket {2, 3, 5} (directory depth 0)
• Insert 7: the bucket overflows; the directory grows to depth 1, with prefix 0 → {2, 3} and prefix 1 → {5, 7}
• Insert 11 (011): the prefix-0 bucket becomes {2, 3, 11}
• Insert 17 (001): the prefix-0 bucket overflows; the directory doubles to depth 2, with 00 → {17} and 01 → {2, 3, 11}, while {5, 7} keeps local depth 1 (shared by entries 10 and 11)
• Insert 19 (011): the 01 bucket overflows; the directory doubles to depth 3, with 010 → {2} and 011 → {3, 11, 19}; {17} keeps local depth 2 and {5, 7} keeps local depth 1
• Insert 23 (111): {5, 7} becomes {5, 7, 23}
• Insert 29 (101): {5, 7, 23} overflows and splits at local depth 2, with 10 → {5, 29} and 11 → {7, 23}
• Insert 31 (111): the 11 bucket becomes {7, 23, 31}
• Delete 17: its bucket becomes empty
• Insert 17, 27 (011): 27 lands in the full bucket {3, 11, 19}, whose local depth already equals the directory depth; rather than doubling the directory again, an overflow bucket holding {27} is chained to it (overflow bucket vs. directory doubling: why?)
[Figure: the location mechanism hashes search-key values: pete → 11010, mary → 00000, jane → 11110, bill → 00000, john → 01001, vince → 10101, karen → 10111.]
[Figures: a linear hashing example, inserting 43, 37, 29, 22, 66, 34, and 50 in turn; buckets 00, 01, 10, … are split in round-robin order as the file grows.]
[Figure: inserting 0101 raises R to 4, so R/B = 2 exceeds the R > 1.7B rule; B is incremented, a new bucket 10 is allocated, and the keys are distributed between buckets 00 and 10.]
• Example:
  - Initially 8 primary-area buckets
  - Load factor of 70%
  - Bucket factor of 10
  - Initially 56 records
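The example's numbers follow from the load-factor formula; only the variable names are mine:

```python
buckets = 8      # initial primary-area buckets
c = 10           # bucket factor: records per bucket
lf_pct = 70      # target load factor, in percent

records = buckets * c * lf_pct // 100   # 8 buckets of 10, filled to 70%
threshold = c * lf_pct // 100           # Lf * c: grow/shrink step in records
print(records, threshold)               # 56 7
```

Integer arithmetic is used so the 56-record and 7-record figures come out exactly.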
• Replace the deleted record with the last record in the chain
• If the last overflow bucket becomes empty, de-allocate it
• When the file holds c * Lf fewer records than needed, shrink the primary area by one bucket
Deletion
• Linear hashing table contraction: to keep the load factor constant, the hash table can be contracted by one bucket every time there are Lf * c records fewer than are needed for the desired fixed load factor. In the example, where the load factor is 70% and the bucket factor is 10, a bucket is removed from the primary area whenever there are 7 records fewer than needed for an Lf of 70%.
[Figure: delete 4 records from bucket 010, deallocating its chain; also delete 2 records from 011 and 1 from 100; then contract the table.]
• Option 2:
  - Partition the space of hash values
Hash-Index Summary
• Extensible hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it (duplicates may require overflow pages)
  - If the directory fits in memory, an equality search is answered with one disk access; otherwise two
  - The directory grows in spurts and can become large for skewed hash distributions
    · Collisions, i.e., multiple entries with the same hash value, cause problems!
• Linear hashing avoids a directory by splitting buckets round-robin and using overflow pages
  - Overflow pages are not likely to be long
  - Duplicates are handled easily
  - Space utilization could be lower than in extensible hashing, since splits are not concentrated on "dense" data areas
    · The criterion for triggering splits can be tuned to trade slightly longer chains for better space utilization
• For hash-based indexes, a skewed data distribution is one in which the hash values of the data entries are not uniformly distributed!
Hash-Index Summary
• Hashing is the best file structure for single-record fetches (e.g., ATM, airline reservations):
  - Insertions and deletions are very fast
  - There are more insertions than deletions
  - Reorganization schemes: linear and bounded extensible hashing
• Linear hashing:
  - As new records are added, new buckets are appended to the primary area
  - The load factor is kept constant
• Bounded extensible hashing:
  - After the index reaches a predetermined size, the primary-area buckets double in size
  - The first few digits of the hash number give the location of the index entry in memory
  - The next few determine the address of the data bucket
  - Subsequent digits yield the offset of the block
Hash-Index Summary
• Formulas:
  - Load factor: Lf = R / (B * c)
  - Single-record fetch without overflow: Tf = s + r + dtt
  - If there are overflow chains with an average length x:
    · Tf(successful) = (1 + x/2) * (s + r + dtt)
    · Tf(unsuccessful) = (1 + x) * (s + r + dtt)
  - For insertion and deletion: Ti = Td = Tf(unsuccessful) + 2r
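The summary formulas can be evaluated together; the millisecond timings and chain length below are illustrative:

```python
def tf_success(s, r, dtt, x):
    """Tf(successful) = (1 + x/2) * (s + r + dtt)."""
    return (1 + x / 2) * (s + r + dtt)

def tf_fail(s, r, dtt, x):
    """Tf(unsuccessful) = (1 + x) * (s + r + dtt)."""
    return (1 + x) * (s + r + dtt)

def t_insert(s, r, dtt, x):
    """Ti = Td = Tf(unsuccessful) + 2r."""
    return tf_fail(s, r, dtt, x) + 2 * r

# e.g., s = 8 ms seek, r = 4 ms latency, dtt = 1 ms, chain length x = 2
print(tf_success(8, 4, 1, 2))   # 26.0
print(tf_fail(8, 4, 1, 2))      # 39
print(t_insert(8, 4, 1, 2))     # 47
```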
Overall Summary
• Tree-structured indexes
  - ISAM
    · Static structure
    · Only leaf pages are modified
    · Overflow pages needed
  - B+ tree
    · Dynamic structure
    · Inserts/deletes leave the tree height-balanced; cost = log_F N
    · High fanout (F); depth rarely more than 3 or 4
• Hash-based indexes
  - Static hashing
    · Long overflow chains
  - Extensible hashing
    · Directory to keep track of buckets; doubles periodically
    · Avoids overflow pages by splitting a full bucket when a new data entry is to be added to it
  - Linear hashing
    · Avoids a directory by splitting buckets round-robin and using overflow pages