Professional Documents
Culture Documents
CS 245: Database System Principles: Notes 4: Indexing
CS 245: Database System Principles: Notes 4: Indexing
CS 245: Database System Principles: Notes 4: Indexing
Principles
Notes 4: Indexing
Hector Garcia-Molina
CS 245 Notes 4 1
Chapter 4
CS 245 Notes 4 2
Topics
• Conventional indexes
• B-trees
• Hashing schemes
CS 245 Notes 4 3
Sequential File
10
20
30
40
50
60
70
80
90
100
CS 245 Notes 4 4
Dense Index Sequential File
10 10
20 20
30
40
30
40
50
60 50
70 60
80
70
90 80
100 90
110 100
120
CS 245 Notes 4 5
Sparse Index Sequential File
10 10
30 20
50
70
30
40
90
110 50
130 60
150
70
170 80
190 90
210 100
230
CS 245 Notes 4 6
Sparse 2nd level Sequential File
10 10 10
90 30 20
170 50
250 70
30
40
90
330 50
110
410 60
130
490
150
570 70
170 80
190 90
210 100
230
CS 245 Notes 4 7
• Comment:
{FILE,INDEX} may be contiguous
or not (blocks chained)
CS 245 Notes 4 8
Question:
CS 245 Notes 4 9
Notes on pointers:
BP
RP
CS 245 Notes 4 10
Notes on pointers:
(2) If file is contiguous, then we can omit
pointers (i.e., compute them)
CS 245 Notes 4 11
R1
K1 R2
K2
K3 R3
K4
R4
CS 245 Notes 4 12
R1
K1 R2 say:
K2 1024 B
per block
K3 R3
K4
R4
• if we want K3 block:
get it at offset
(3-1)1024
= 2048 bytes
CS 245 Notes 4 13
Sparse vs. Dense Tradeoff
(Later:
– sparse better for insertions
– dense needed for secondary indexes)
CS 245 Notes 4 14
Terms
• Index sequential file
• Search key ( primary key)
• Primary index (on Sequencing field)
• Secondary index
• Dense index (all Search Key values in)
• Sparse index
• Multi-level index
CS 245 Notes 4 15
Next:
• Duplicate keys
• Deletion/Insertion
• Secondary indexes
CS 245 Notes 4 16
Duplicate keys
10
10
10
20
20
30
30
30
40
45
CS 245 Notes 4 17
Duplicate keys
Dense index, one way to implement?
10
10 10
10
10 10
20 20
20 20
30 30
30 30
30 30
40
45
CS 245 Notes 4 18
Duplicate keys
Dense index, better way?
10
10 10
20
30 10
40 20
20
30
30
30
40
45
CS 245 Notes 4 19
Duplicate keys
Sparse index, one way?
10
10 10
10
20 10
30 20
20
30
30
30
40
45
CS 245 Notes 4 20
Duplicate keys
Sparse index, one way?
10
careful if looking
10 10
for 20 or 30!
10
20 10
30 20
20
30
30
30
40
45
CS 245 Notes 4 21
Duplicate keys
Sparse index, another way?
– place first new key from block
10
10 10
20
30 10
30 20
20
30
30
30
40
45
CS 245 Notes 4 22
Duplicate keys
Sparse index, another way?
– place first new key from block
10
should 10 10
20
this be 30 10
40? 30 20
20
30
30
30
40
45
CS 245 Notes 4 23
Summary Duplicate values,
primary index
• Index may point to first instance of
each value only
File
Index a
a a
.
.
b
CS 245 Notes 4 24
Deletion from sparse index
10
10 20
30
50 30
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 25
Deletion from sparse index
– delete record 40
10
10 20
30
50 30
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 26
Deletion from sparse index
– delete record 40
10
10 20
30
50 30
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 27
Deletion from sparse index
– delete record 30
10
10 20
30
50 30
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 28
Deletion from sparse index
– delete record 30
10
10 20
40 30
50 30 40
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 29
Deletion from sparse index
– delete records 30 & 40
10
10 20
30
50 30
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 30
Deletion from sparse index
– delete records 30 & 40
10
10 20
30
50 30
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 31
Deletion from sparse index
– delete records 30 & 40
10
10 20
50 30
70 50 30
70 40
90 50
110 60
130 70
150 80
CS 245 Notes 4 32
Deletion from dense index
10
10 20
20
30 30
40 40
50 50
60 60
70 70
80 80
CS 245 Notes 4 33
Deletion from dense index
– delete record 30
10
10 20
20
30 30
40 40
50 50
60 60
70 70
80 80
CS 245 Notes 4 34
Deletion from dense index
– delete record 30
10
10 20
20
30 30 40
40 40
50 50
60 60
70 70
80 80
CS 245 Notes 4 35
Deletion from dense index
– delete record 30
10
10 20
20
40 30 30 40
40 40
50 50
60 60
70 70
80 80
CS 245 Notes 4 36
Insertion, sparse index case
10
10 20
30
40 30
60
40
50
60
CS 245 Notes 4 37
Insertion, sparse index case
– insert record 34
10
10 20
30
40 30
60
40
50
60
CS 245 Notes 4 38
Insertion, sparse index case
– insert record 34
10
10 20
30
40 30
60 34
40
50
• our lucky day! 60
we have free space
where we need it!
CS 245 Notes 4 39
Insertion, sparse index case
– insert record 15
10
10 20
30
40 30
60
40
50
60
CS 245 Notes 4 40
Insertion, sparse index case
– insert record 15
10
10 20 15
20 30
40 30 20
60 30
40
50
60
CS 245 Notes 4 41
Insertion, sparse index case
– insert record 15
10
10 20 15
20 30
40 30 20
60 30
40
50
• Illustrated: Immediate reorganization 60
• Variation:
– insert new block (chained file)
– update index
CS 245 Notes 4 42
Insertion, sparse index case
– insert record 25
10
10 20
30
40 30
60
40
50
60
CS 245 Notes 4 43
Insertion, sparse index case
– insert record 25
10 25
10 20
30
40 30 overflow blocks
60 (reorganize later...)
40
50
60
CS 245 Notes 4 44
Insertion, dense index case
• Similar
• Often more expensive . . .
CS 245 Notes 4 45
Secondary indexes
Sequence
field
30
50
20
70
80
40
100
10
90
60
CS 245 Notes 4 46
Secondary indexes
Sequence
• Sparse index field
30
30
20
50
80 20
100 70
90 80
... 40
100
10
90
60
CS 245 Notes 4 47
Secondary indexes
Sequence
• Sparse index field
30
30
20
50
80 20
100 70
90 80
... 40
100
10
does not make sense! 90
60
CS 245 Notes 4 48
Secondary indexes
Sequence
• Dense index field
30
50
20
70
80
40
100
10
90
60
CS 245 Notes 4 49
Secondary indexes
Sequence
• Dense index field
10 30
20 50
30
40 20
70
50
60
80
70
40
... 100
10
90
60
CS 245 Notes 4 50
Secondary indexes
Sequence
• Dense index field
10 30
20 50
10 30
40 20
50
70
90
... 50
60
80
sparse 70
40
high ... 100
level 10
90
60
CS 245 Notes 4 51
With secondary indexes:
• Lowest level is dense
• Other levels are sparse
CS 245 Notes 4 52
Duplicate values & secondary indexes
20
10
20
40
10
40
10
40
30
40
CS 245 Notes 4 53
Duplicate values & secondary indexes
one option...
10 20
10 10
10
20 20
40
20
30 10
40 40
40
10
40 40
40
... 30
40
CS 245 Notes 4 54
Duplicate values & secondary indexes
one option...
10 20
10 10
10
20
Problem: 20
40
20
excess overhead! 30 10
• disk space 40 40
40
• search time 10
40 40
40
... 30
40
CS 245 Notes 4 55
Duplicate values & secondary indexes
another option...
10 20
10
20
20
40
10
30
40
40 10
40
30
40
CS 245 Notes 4 56
Duplicate values & secondary indexes
another option...
10 20
10
20
Problem: 20
40
variable size 10
records in 30
40
index! 40 10
40
30
40
CS 245 Notes 4 57
Duplicate values & secondary indexes
20
10 10
20
30 20
40 40
50 10
60 40
...
10
40
Another idea (suggested in class): 30
Chain records with same key?
40
CS 245 Notes 4 58
Duplicate values & secondary indexes
20
10 10
20
30 20
40 40
50 10
60 40
...
10
40
Another idea (suggested in class): 30
Chain records with same key?
40
Problems:
• Need to add fields to records
• Need to follow chain to know records
CS 245 Notes 4 59
Duplicate values & secondary indexes
20
10 10
20
30 20
40 40
50 10
60 40
...
10
40
30
40
buckets
CS 245 Notes 4 60
Why “bucket” idea is useful
Indexes Records
Name: primary EMP (name,dept,floor,...)
Dept: secondary
Floor: secondary
CS 245 Notes 4 61
Query: Get employees in
(Toy Dept) ^ (2nd floor)
Dept. index EMP Floor index
Toy 2nd
CS 245 Notes 4 62
Query: Get employees in
(Toy Dept) ^ (2nd floor)
Dept. index EMP Floor index
Toy 2nd
Documents
...the cat is
fat ...
...was raining
cats and dogs...
...Fido the
dog ...
CS 245 Notes 4 64
This idea used in
text information retrieval
Documents
cat
...the cat is
fat ...
dog
...was raining
cats and dogs...
...Fido the
dog ...
Inverted lists
CS 245 Notes 4 65
IR QUERIES
• Find articles with “cat” and “dog”
• Find articles with “cat” or “dog”
• Find articles with “cat” and not “dog”
CS 245 Notes 4 66
IR QUERIES
• Find articles with “cat” and “dog”
• Find articles with “cat” or “dog”
• Find articles with “cat” and not “dog”
CS 245 Notes 4 67
Common technique:
more info in inverted list
d1
cat Title 5
Author 10
Abstract 57
d3 d2
CS 245 Notes 4 68
Posting: an entry in inverted list.
Represents occurrence of
term in article
Size of a list: 1 Rare words or
(in postings) miss-spellings
CS 245 Notes 4 69
IR DISCUSSION
• Stop words
• Truncation
• Thesaurus
• Full text vs. Abstracts
• Vector model
CS 245 Notes 4 70
Vector space model
w1 w2 w3 w4 w5 w6 w7 …
DOC = <1 0 0 1 1 0 0 …>
CS 245 Notes 4 71
Vector space model
w1 w2 w3 w4 w5 w6 w7 …
DOC = <1 0 0 1 1 0 0 …>
CS 245 Notes 4 72
• Tricks to weigh scores + normalize
CS 245 Notes 4 73
• How to process V.S. Queries?
w1 w2 w3 w4 w5 w6 …
Q=<0 0 0 1 1 0 …>
CS 245 Notes 4 74
• Try Stanford Libraries
• Try Google, Yahoo, ...
CS 245 Notes 4 75
Summary so far
• Conventional index
– Basic Ideas: sparse, dense, multi-level…
– Duplicate Keys
– Deletion/Insertion
– Secondary indexes
– Buckets of Postings List
CS 245 Notes 4 76
Conventional indexes
Advantage:
- Simple
- Index is sequential file
good for scans
Disadvantage:
- Inserts expensive, and/or
- Lose sequentiality & balance
CS 245 Notes 4 77
Example Index (sequential)
10
20
30
continuous
40
50
60
free space
70
80
90
CS 245 Notes 4 78
Example Index (sequential)
10
20 39
30 31
33 35
continuous 36
40
50
60 32
38
free space 34
70
80
90 overflow area
(not sequential)
CS 245 Notes 4 79
Outline:
• Conventional indexes
• B-Trees NEXT
• Hashing schemes
CS 245 Notes 4 80
• NEXT: Another type of index
– Give up on sequentiality of index
– Try to get “balance”
CS 245 Notes 4 81
3
5
CS 245
11
30
30
35
100
101
B+Tree Example
110
100
Notes 4
Root
120
130
150
156 120
179 150
180
180
n=3
200
82
Sample non-leaf
57
81
95
to keys to keys to keys to keys
< 57 57 k<81 81k<95 95
CS 245 Notes 4 83
Sample leaf node:
From non-leaf node
to next leaf
in sequence
57
81
95
with key 57
with key 81
with key 85
To record
To record
To record
CS 245 Notes 4 84
In textbook’s notation n=3
Leaf:
30 35
30
35
Non-leaf:
30
30
CS 245 Notes 4 85
Size of nodes: n+1 pointers
(fixed)
n keys
CS 245 Notes 4 86
Don’t want nodes to be too empty
• Use at least
CS 245 Notes 4 87
n=3
Non-leaf
120
150
180
30
30
35
3
5
CS 245 Notes 4 88
B+tree rules tree of order n
CS 245 Notes 4 89
(3) Number of pointers/keys for B+tree
CS 245 Notes 4 90
Insert into B+tree
CS 245 Notes 4 91
(a) Insert key = 32 n=3
100
30
11
30
31
3
5
CS 245 Notes 4 92
(a) Insert key = 32 n=3
100
30
11
30
31
32
3
5
CS 245 Notes 4 93
(a) Insert key = 7 n=3
100
30
11
30
31
3
5
CS 245 Notes 4 94
(a) Insert key = 7 n=3
100
30
57
11
30
31
3
5
CS 245 Notes 4 95
(a) Insert key = 7 n=3
100
30
7
57
11
30
31
3
5
CS 245 Notes 4 96
(c) Insert key = 160 n=3
100
120
150
180
180
200
150
156
179
CS 245 Notes 4 97
(c) Insert key = 160 n=3
100
120
150
180
180
200
150
156
179
160
179
CS 245 Notes 4 98
(c) Insert key = 160 n=3
100
120
150
180
180
180
200
150
156
179
160
179
CS 245 Notes 4 99
(c) Insert key = 160 n=3
160
100
120
150
180
180
180
200
150
156
179
160
179
CS 245 Notes 4 100
(d) New root, insert 45 n=3
10
20
30
12
10
20
25
30
32
40
1
2
3
10
20
30
12
10
20
25
30
32
40
40
45
1
2
3
40
10
20
30
12
10
20
25
30
32
40
40
45
1
2
3
new root
30
40
10
20
30
12
10
20
25
30
32
40
40
45
1
2
3
100
10
40
10
20
30
40
50
100
10
40
40
10
20
30
40
50
100
10
40
10
20
30
35
40
50
CS 245 Notes 4 108
(c) Redistribute keys
n=4
– Delete 50
40 35
100
10
35
10
20
30
35
40
50
CS 245 Notes 4 109
(d) Non-leaf coalese
n=4
– Delete 37
25
10
20
30
40
37
30
26
10
14
20
22
25
40
45
1
3
25
10
20
30
40
30
37
30
26
10
14
20
22
25
40
45
1
3
25
40
10
20
30
40
30
37
30
26
10
14
20
22
25
40
45
1
3
25
new root
40
25
10
20
30
40
30
37
30
26
10
14
20
22
25
40
45
1
3
127 keys k2
k3 k3
• Buffering
– B-tree: has fixed buffer requirements
– Static index: must read several overflow
blocks to be efficient
(large & variable size
buffers needed for this)
Of course not!
Should try to keep root in memory
at all times
(and perhaps some nodes from second level)
f(n)
nopt n
• Idea:
– Avoid duplicate keys
– Have record pointers in non-leaf nodes
CS 245
20
30
40
25
50
45
60
B-tree example
70
80
90 85 65
Notes 4
100 105 125
110
120
130
140 145
150 165
160
n=2
170
136
180
B-tree example n=2
• sequence pointers
not useful now!
125
(but keep space for simplicity)
65
105
145
165
25
45
85
110
160
100
120
130
140
150
170
180
20
10
30
40
50
60
70
80
90
30
10
20
30
10
20
• Afterwards:
20
–
–
10
25
30
CS 245 Notes 4 139
So, for B-trees:
MAX MIN
Tree Rec Keys Tree Rec Keys
Ptrs Ptrs Ptrs Ptrs
Non-leaf
non-root n+1 n n (n+1)/2 (n+1)/2-1 (n+1)/2-1
Leaf
non-root 1 n n 1 n/2 n/2
Root
non-leaf n+1 n n 2 1 1
Root
Leaf 1 n n 1 1 1
B+trees preferred!
B+ B
ooooooooooooo ooooooooo
156 records 108 records
Total = 116
B+ B
ooooooooooooo ooooooooo
156 records 108 records
Total = 116
• Conclusion:
– For fixed block size,
– B+ tree is better because it is bushier
CS 245 Notes 4 152
An Interesting Problem...
• What is a good index structure when:
– records tend to be inserted with keys
that are larger than existing values?
(e.g., banking records with growing data/time)
– we want to remove older data
•advantage: no deletions
•disadvantage: approximate windows