
Discussion Worksheet: https://tinyurl.com/yxltttyr

Discussion Slides: https://tinyurl.com/yy54ns56

Discussion 5
Sorting & Hashing

Please turn on your videos!


Announcements
Vitamin 5 (Sorting and Hashing) is due Friday at 11:59 PM.
● This is the last module in-scope for the midterm, so we strongly recommend
you work through the vitamin before the midterm.

Midterm 1 on Monday, Oct 5 at 5:30-7:30 PM (@174 for logistics)


● Review Session on Friday, Oct 2 from 7-9 PM on Zoom (will be recorded)
○ @194 on Piazza for more info
External Algorithms
● Traditional algorithms assume all data fit in memory
● External algorithms are designed for the case when there
is more data than space in memory
○ We can’t just access/modify values whenever we want:
disk accesses are very expensive
● Typical strategy is to divide and conquer - start with chunks
of data that do fit in memory, and work from there
○ For sorting: External Merge Sort
Sorting
2-Way External Merge Sort
● We first sort small amounts of
data into runs of sorted tuples
● Given 2 runs of sorted tuples,
we can merge them into 1
larger run of sorted tuples
○ Same as in-memory
mergesort
○ Stream in the two runs and
stream out the new run
2-Way External Merge Sort
● How many buffer pages do
we need?
○ 3 - we need 2 input buffers
(one for each input run)
and 1 output buffer (for the
new merged run)
2-Way External Merge Sort
● How many passes over the
data do we need?
○ 1 + ⌈log_2(N)⌉ - we make 1
pass to sort each page and
get N runs (“Pass 0”). We
merge 2 runs at a time, so
we need ⌈log_2(N)⌉ passes
to merge everything.
2-Way External Merge Sort
● How many I/Os does this take?
○ 2N(1 + ⌈log_2(N)⌉) - we have
to read and write the entire
file with each pass.
General External Merge Sort
● 2-Way external merge sort only uses 1 page of memory in
Pass 0, and 3 pages in the merge passes
● We can do better if we have more memory!
○ Pass 0: sort more pages at once → fewer runs to merge
■ If we have B buffer pages, we can sort B pages at
once!
○ Pass 1-n: merge more runs at once → finish faster
■ If we have B buffer pages, we can merge B-1 runs
at once!
General External Merge Sort
● How do we merge B-1 runs at once?
○ Look at the first tuple of each run that hasn’t been
written to output
■ Can use a min priority queue to do this efficiently
○ Output the tuple with the lowest value
○ Repeat until every run is exhausted.
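
The merge loop above can be sketched in Python. This is a minimal illustration, assuming each run is an in-memory iterable of integers rather than pages streamed from disk. The name merge_runs and the sample runs are hypothetical; the standard-library heapq module plays the role of the min priority queue.

```python
import heapq
from typing import Iterable, Iterator

_EXHAUSTED = object()  # sentinel marking a run with no tuples left


def merge_runs(runs: list[Iterable[int]]) -> Iterator[int]:
    """Merge already-sorted runs into one sorted stream (hypothetical helper)."""
    iterators = [iter(run) for run in runs]
    heap = []
    # Seed the priority queue with the first unwritten tuple of each run.
    for run_id, it in enumerate(iterators):
        first = next(it, _EXHAUSTED)
        if first is not _EXHAUSTED:
            heap.append((first, run_id))
    heapq.heapify(heap)
    while heap:
        # Output the tuple with the lowest value...
        value, run_id = heapq.heappop(heap)
        yield value
        # ...then look at the next tuple of the run it came from.
        nxt = next(iterators[run_id], _EXHAUSTED)
        if nxt is not _EXHAUSTED:
            heapq.heappush(heap, (nxt, run_id))


# Illustrative sample runs (each one already sorted).
run1 = [0, 1, 6, 9, 10, 17, 20, 25]
run2 = [2, 3, 4, 7, 8, 11, 12, 15]
print(list(merge_runs([run1, run2])))
```

The heap never holds more than one entry per run, which mirrors keeping one input buffer page per run in memory.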
General External Merge Sort
● How many passes do we need?
○ We sort B pages at once, so we have ⌈N/B⌉ runs after
Pass 0
○ We merge B-1 runs at once, so we have to do
⌈log_{B-1}(# runs)⌉ merge passes
○ So we have 1 + ⌈log_{B-1}(⌈N/B⌉)⌉ passes over the data
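
As a rough check of the pass count, here is a small Python sketch. It assumes the cost model used in these slides (every pass reads and writes all N pages) and uses integer ceiling division instead of floating-point logarithms. The function name external_sort_cost is hypothetical.

```python
def external_sort_cost(num_pages: int, num_buffers: int) -> tuple[int, int]:
    """Return (passes, total_ios), assuming 2N I/Os per pass (hypothetical helper)."""
    if num_pages < 1 or num_buffers < 3:
        raise ValueError("need at least 1 page and at least 3 buffer pages")
    runs = -(-num_pages // num_buffers)        # ceil(N / B) runs after Pass 0
    passes = 1                                 # Pass 0
    while runs > 1:
        runs = -(-runs // (num_buffers - 1))   # each merge pass combines B-1 runs
        passes += 1
    return passes, 2 * num_pages * passes


print(external_sort_cost(8, 4))   # B=4, N=8
```

Counting merge passes with repeated ceiling division gives the same result as ⌈log_{B-1}(⌈N/B⌉)⌉ without relying on floating-point rounding.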
General External Merge Sort
B=4, N=8
Sort 8 data pages (2 values per page). I/O Total (So Far): 0

Input data pages, in file order:
Page 1: 6, 1
Page 2: 25, 20
Page 3: 0, 10
Page 4: 9, 17
Page 5: 7, 8
Page 6: 12, 2
Page 7: 4, 11
Page 8: 15, 3

Goal (sorted output pages):
0, 1 | 2, 3 | 4, 6 | 7, 8 | 9, 10 | 11, 12 | 15, 17 | 20, 25
General External Merge Sort
B=4, N=8: Pass 0
Load B data pages into buffer pages in memory, and sort them all at once.

Pass 0, Run 1 (data pages 6, 1 | 25, 20 | 0, 10 | 9, 17)
● Read 4 pages into memory: 4 IOs. In memory: 6, 1, 25, 20, 0, 10, 9, 17. I/O Total (So Far): 4
● In-memory sort: 0, 1, 6, 9, 10, 17, 20, 25. I/O Total (So Far): 4
● Write 4 pages to disk: 4 IOs. 1 sorted run of 4 pages: 0, 1 | 6, 9 | 10, 17 | 20, 25. I/O Total (So Far): 8

Pass 0, Run 2 (data pages 7, 8 | 12, 2 | 4, 11 | 15, 3)
● Read 4 pages into memory: 4 IOs. In memory: 7, 8, 12, 2, 4, 11, 15, 3. I/O Total (So Far): 12
● In-memory sort: 2, 3, 4, 7, 8, 11, 12, 15. I/O Total (So Far): 12
● Write 4 pages to disk: 4 IOs. 1 sorted run of 4 pages: 2, 3 | 4, 7 | 8, 11 | 12, 15. I/O Total (So Far): 16
General External Merge Sort
B=4, N=8: Pass 1
Read 2 sorted runs of 4 pages into memory: 8 IOs; Write 1 sorted run of 8 pages to disk: 8 IOs

Reserve B-1 input buffers and 1 output buffer. Load 1 page from each run at a time. Store
sorted results in output buffer. Write to disk when output buffer is full.
With only 2 runs to merge, the third input buffer is unused.

Run 1: 0, 1 | 6, 9 | 10, 17 | 20, 25
Run 2: 2, 3 | 4, 7 | 8, 11 | 12, 15

Step                                                      I/O Total (So Far)
Start of Pass 1                                           16
Read 0, 1 from Run 1 and 2, 3 from Run 2                  18
Output 0, 1; write page 0, 1; read 6, 9 from Run 1        20
Output 2, 3; write page 2, 3; read 4, 7 from Run 2        22
Output 4, 6; write page 4, 6                              23
Output 7; read 8, 11 from Run 2                           24
Output 8; write page 7, 8                                 25
Output 9; read 10, 17 from Run 1                          26
Output 10; write page 9, 10                               27
Output 11; read 12, 15 from Run 2                         28
Output 12; write page 11, 12                              29
Output 15, 17; write page 15, 17; read 20, 25 from Run 1  31
Output 20, 25; write page 20, 25                          32

Result: 1 sorted run of 8 pages:
0, 1 | 2, 3 | 4, 6 | 7, 8 | 9, 10 | 11, 12 | 15, 17 | 20, 25
General External Merge Sort - Sanity Check
B=4, N=8

Cost = 2N * (1 + ⌈log_{B-1}(⌈N/B⌉)⌉)
= 2(8) * (1 + ⌈log_3(2)⌉)
= 16 * (1 + 1)
= 32 I/Os ✓

Sorted output: 0, 1 | 2, 3 | 4, 6 | 7, 8 | 9, 10 | 11, 12 | 15, 17 | 20, 25
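
The same sanity check can be run as a small simulation. The Python sketch below is an illustration under simplifying assumptions: the "file" is a list of in-memory pages, one I/O is counted for each page read or written per pass, and merged values are held in a Python list instead of a single output buffer page. The name external_merge_sort is hypothetical.

```python
import heapq
from itertools import chain


def external_merge_sort(pages, num_buffers):
    """Sort a list of pages; return (sorted_pages, page_ios). Hypothetical helper."""
    if not pages or num_buffers < 3:
        raise ValueError("need at least 1 page and at least 3 buffer pages")
    page_size = len(pages[0])   # assumes non-empty pages of equal capacity
    ios = 0

    def paginate(values):
        return [values[i:i + page_size] for i in range(0, len(values), page_size)]

    # Pass 0: read B pages, sort them in memory, write them back as one run.
    runs = []
    for start in range(0, len(pages), num_buffers):
        chunk = pages[start:start + num_buffers]
        ios += len(chunk)                               # page reads
        run = paginate(sorted(chain.from_iterable(chunk)))
        ios += len(run)                                 # page writes
        runs.append(run)

    # Merge passes: combine B-1 runs at a time until one run remains.
    fan_in = num_buffers - 1
    while len(runs) > 1:
        next_runs = []
        for start in range(0, len(runs), fan_in):
            group = runs[start:start + fan_in]
            ios += sum(len(run) for run in group)       # each input page read once
            streams = [chain.from_iterable(run) for run in group]
            merged = paginate(list(heapq.merge(*streams)))
            ios += len(merged)                          # each output page written once
            next_runs.append(merged)
        runs = next_runs
    return runs[0], ios


data = [[6, 1], [25, 20], [0, 10], [9, 17], [7, 8], [12, 2], [4, 11], [15, 3]]
sorted_pages, total_ios = external_merge_sort(data, num_buffers=4)
print(sorted_pages)
print(total_ios)
```

For the B=4, N=8 example this counts 16 I/Os in Pass 0 and 16 in Pass 1, matching the 32 I/Os from the formula.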
Worksheet
Worksheet - Sorting (a) - (c)
You have 4 buffer pages and your file has a total of 108 pages
of records to sort.

How many passes would it take to sort the file?

How many runs would each pass produce?

What is the total cost for this sort process in terms of I/O?
Worksheet - Sorting (a) - (c)
B=4, N=108: Pass 0 - 108 IOs (Read) + 108 IOs (Write)
Load B data pages into memory, sort all values in memory, write back to disk

4 data pages → 4 input buffers → 1 sorted run of 4 pages

This process happens once for each of our ceil(108/4) = 27 runs because we have 108
pages total, and we sort 4 pages each time.
Worksheet - Sorting (a) - (c)
B=4, N=108: Pass 1 - 108 IOs (Read) + 108 IOs (Write)
Load 1 page from each of B-1 sorted runs into the input buffers, merge them through the output buffer, and write the merged runs back to disk

3 sorted runs of 4 pages each → 3 input buffers → output buffer → 1 sorted run of 12 pages

We started off with 27 runs of 4 pages each. We can merge B-1 = 3 runs at a time, so we produce
ceil(27/3) = 9 runs of 4*3 = 12 pages at the end of Pass 1.
Worksheet - Sorting (a) - (c)
B=4, N=108: Pass 2 - 108 IOs (Read) + 108 IOs (Write)
Load 1 page from each of B-1 sorted runs into the input buffers, merge them through the output buffer, and write the merged runs back to disk

3 sorted runs of 12 pages each → 3 input buffers → output buffer → 1 sorted run of 36 pages

We started off with 9 runs of 12 pages each. We can merge B-1 = 3 runs at a time, so we
produce ceil(9/3) = 3 runs of 12*3 = 36 pages at the end of Pass 2.
Worksheet - Sorting (a) - (c)
B=4, N=108: Pass 3 - 108 IOs (Read) + 108 IOs (Write)
Load 1 page from each of B-1 sorted runs into the input buffers, merge them through the output buffer, and write the merged run back to disk

3 sorted runs of 36 pages each → 3 input buffers → output buffer → 1 sorted run of 108 pages

We started off with 3 runs of 36 pages each. We can merge B-1 = 3 runs at a time, so we
produce ceil(3/3) = 1 run of 36*3 = 108 pages at the end of Pass 3.
Since we’ve produced 1 sorted run containing all our data, external sorting is now complete.
Worksheet - Sorting (a)
You have 4 buffer pages and your file has a total of 108 pages
of records to sort.

How many passes would it take to sort the file?


Pass 0 - ceil(108/4) = 27 sorted runs of 4 pages each
Pass 1 - ceil(27/3) = 9 sorted runs of 12 pages each
Pass 2 - ceil(9/3) = 3 sorted runs of 36 pages each
Pass 3 - Sorted file (1 run)
Total = 4 passes
Worksheet - Sorting (b)
You have 4 buffer pages and your file has a total of 108 pages
of records to sort.

How many runs would each pass produce?


Pass 0 - 27 sorted runs (of 4 pages each)
Pass 1 - 9 sorted runs (of 12 pages each)
Pass 2 - 3 sorted runs (of 36 pages each)
Pass 3 - 1 sorted run (of 108 pages)
Worksheet - Sorting (c)
You have 4 buffer pages and your file has a total of 108 pages
of records to sort.

What is the total cost for this sort process in terms of I/O?

4 passes * 2 (read + write per pass) * 108 (pages in the file)


= 864 I/Os
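
The per-pass breakdown for parts (a) - (c) can be reproduced with a short Python sketch. It assumes perfectly full runs, as in the worksheet; the name pass_breakdown is hypothetical.

```python
def pass_breakdown(num_pages: int, num_buffers: int) -> list[tuple[int, int, int]]:
    """Return (pass_number, runs_produced, max_run_length_in_pages) per pass."""
    if num_pages < 1 or num_buffers < 3:
        raise ValueError("need at least 1 page and at least 3 buffer pages")
    runs = -(-num_pages // num_buffers)            # ceil(N / B) runs after Pass 0
    run_len = min(num_buffers, num_pages)
    rows = [(0, runs, run_len)]
    while runs > 1:
        runs = -(-runs // (num_buffers - 1))       # merge B-1 runs at a time
        run_len = min(run_len * (num_buffers - 1), num_pages)
        rows.append((len(rows), runs, run_len))
    return rows


rows = pass_breakdown(108, 4)
for pass_number, runs, run_len in rows:
    print(f"Pass {pass_number}: {runs} sorted run(s) of up to {run_len} pages")
print("Total I/Os:", 2 * 108 * len(rows))
```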
Worksheet - Sorting (d)
You have 4 buffer pages and your file has a total of 108 pages of records to sort.

If the pages were already sorted individually, how many passes would it take to sort the file
and how many IOs would it be instead?
Each page is sorted on its own, but we know nothing about how values are ordered across
pages, so the IO cost does not change! Pass 0 still needs to produce ceil(N/B) sorted
runs of B pages each, and the merge passes proceed as before. As a result, you would still
require 4 passes and 864 IOs.

Example: two individually sorted pages, (1, 3) and (2, 5), still have to be merged
through the output buffer to produce the sorted output 1, 2, 3, 5.
Worksheet - Sorting (e)
If we wanted to sort N pages with B buffer pages in at most p
total passes, write an expression relating the minimum buffer
pages B needed with N and p. What do you notice about B
when p = 1?
Worksheet - Sorting (e)
Since we want (# of passes after pass 0) ≤ p − 1, we set up the
inequality log_{B-1}(N/B) ≤ p − 1. Rearranging gives B(B-1)^{p-1} ≥ N.
If p = 1, this means that B ≥ N which, conceptually, means that
if we want to sort N pages in 1 pass, all of them must fit into
memory at the same time.
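
To get a feel for the inequality, the Python sketch below searches for the smallest B that satisfies B(B-1)^{p-1} ≥ N. It is a brute-force illustration, not part of the worksheet solution. The name min_buffers is hypothetical, and the extra check b >= num_pages covers the case where the whole file already fits in memory.

```python
def min_buffers(num_pages: int, max_passes: int) -> int:
    """Smallest B that sorts num_pages pages in at most max_passes passes."""
    if num_pages < 1 or max_passes < 1:
        raise ValueError("num_pages and max_passes must be at least 1")
    b = 1
    while True:
        fits_in_memory = b >= num_pages                       # Pass 0 alone suffices
        enough_fan_in = b * (b - 1) ** (max_passes - 1) >= num_pages
        if fits_in_memory or enough_fan_in:
            return b
        b += 1


for p in (1, 2, 3, 4):
    print(p, min_buffers(108, p))
```

With N = 108 and p = 4 the search stops at B = 4, which agrees with the 4-pass answer from parts (a) - (c); with p = 1 it returns B = N.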
Hashing
● We want to be able to group together tuples with the
same key value
● Partition the data with hash function(s) applied on the key -
all tuples with a certain key will be in the same partition
● Useful for removing duplicates (all duplicates will be
grouped together), grouping data (for GROUP BY)
● Also can be useful for looking up data (but not in-scope
for this class)
External Hashing
● We can’t build an in-memory hash table if there’s too much
data!
● Start by splitting up data into smaller pieces!
○ Use a hash function h_p to partition the data
■ Stream partitions to disk
○ If we have B pages of buffer, we can split the data into
B-1 partitions (1 buffer page reserved for streaming
data in)
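
The partitioning pass can be sketched in Python. This is a minimal illustration, assuming records fit in Python lists that stand in for partition files on disk. The function names and the sample records are hypothetical, and zlib.crc32 is only a stand-in for h_p, chosen because it is deterministic across runs.

```python
import zlib
from collections import defaultdict


def h_p(key) -> int:
    """Stand-in partitioning hash (an assumption; any deterministic hash works)."""
    return zlib.crc32(str(key).encode("utf-8"))


def partition(records, num_buffers, key=lambda record: record):
    """Split records into at most B-1 partitions by h_p(key) (hypothetical helper)."""
    if num_buffers < 2:
        raise ValueError("need 1 input buffer and at least 1 partition buffer")
    num_partitions = num_buffers - 1       # 1 buffer page is reserved for streaming input
    partitions = defaultdict(list)         # each list stands in for a partition on disk
    for record in records:
        partitions[h_p(key(record)) % num_partitions].append(record)
    return dict(partitions)


# Hypothetical input: 12 records identified by a color code.
squares = ["G", "P", "B", "R", "Y", "G", "B", "R", "P", "Y", "G", "B"]
for partition_id, members in sorted(partition(squares, num_buffers=4).items()):
    print(partition_id, members)
```

Whatever the hash values turn out to be, all records with the same key land in the same partition, which is the property the later in-memory pass relies on.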
External Hashing
● If the partitions are small enough to fit in memory (at most
B pages), we can load them in and make an in-memory
hash table for each one, one at a time
○ Then we can apply duplicate removal, aggregation,
etc. in memory
○ Every tuple in a partition has the same value when hp is
applied!
○ In-memory hash table must use a different hash
function (we call it hr) that is independent of hp
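
The divide and conquer phases described above can be sketched as follows. This is an in-memory illustration only: Python lists stand in for partitions on disk, SHA-256 is an assumed stand-in for hp, Python's built-in dict hashing plays the role of hr, and the square colors are made-up data.

```python
import hashlib
from collections import defaultdict

def hp(key, num_partitions):
    """Stand-in partitioning hash (assumption: SHA-256 of the key's text)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def partition_pass(records, key, num_buffers):
    """Divide: 1 input buffer streams records in, B-1 partitions stream out."""
    partitions = [[] for _ in range(num_buffers - 1)]
    for rec in records:
        partitions[hp(key(rec), num_buffers - 1)].append(rec)
    return partitions

def group_by_key(records, key, num_buffers):
    """Conquer: build one in-memory table per partition, one at a time."""
    groups = {}
    for part in partition_pass(records, key, num_buffers):
        table = defaultdict(list)  # dict's built-in hashing plays the role of hr
        for rec in part:
            table[key(rec)].append(rec)
        groups.update(table)  # safe: a key never appears in two partitions
    return groups

# Made-up data: 12 squares in five colors, with B = 4 buffer pages.
squares = list("GPBRYGGBRPYB")
groups = group_by_key(squares, key=lambda s: s, num_buffers=4)
print({color: len(members) for color, members in groups.items()})
```

Because every tuple with a given key lands in the same partition, the per-partition tables can be merged without ever combining two entries for the same key.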
External Hashing
● Hashing requires good hash functions that are not subject
to data skew
○ The hash function ideally distributes keys evenly
across all partitions - otherwise we might get a really
large partition (requiring recursive partitioning) and a
bunch of small ones
● Assume perfect hash functions in this class (distributes
data perfectly evenly) unless stated otherwise
External Hashing Example

• Goal: Group squares by color


• Setup: 12 squares, each page fits 2 squares. We can
hold 4 pages in memory.
• N = 6, B = 4
External Hashing Example: Pass 1

N=6, B=4
Assign colors to 3
partitions using our
hash function:
{G,P} → 1
{B} → 2
{R, Y} → 3
External Hashing Example: Pass 2
Create in-memory table for each partition.

N=6, B=4

In the partitioning pass, pages read (6) != pages written (7): a partition's last page can be partially filled.
[Figure: in-memory hash table built for one partition, grouping the Green and Purple squares]
External Hashing
● What if the partitions are too big after the first pass?
○ We apply recursive partitioning: for each partition
from the first pass, apply another hash function
(independent of hp and hr!) to split the partition into
even smaller partitions
■ Repeat this as many times as needed, until
partitions fit in memory
■ Every hash function used must be independent!
External Hashing
● Is recursive partitioning always enough?
■ No. If a single key has more than B pages of duplicates,
those tuples always hash to the same partition, so
we'll never get small enough partitions.
● Then what do we do?
■ Check if all values in the partition are the same; if so,
stop partitioning it (no hash function can split it)
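
Recursive partitioning with the duplicate check might look like the sketch below. It assumes tuples are represented by their keys alone, uses a hypothetical hash family `level_hash` (SHA-256 salted with the recursion level, so each pass gets a different function), and counts capacity in tuples as B pages × tuples per page.

```python
import hashlib

def level_hash(key, level, num_partitions):
    """Hypothetical hash family: salting with the level gives a new function per pass."""
    digest = hashlib.sha256(f"{level}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def recursive_partition(keys, num_buffers, page_size, level=0):
    """Split keys until every partition fits in B pages or holds a single key value."""
    capacity = num_buffers * page_size  # tuples that fit in B pages
    if len(keys) <= capacity or len(set(keys)) == 1:
        return [keys]  # fits in memory, or all duplicates: stop here
    parts = [[] for _ in range(num_buffers - 1)]
    for k in keys:
        parts[level_hash(k, level, num_buffers - 1)].append(k)
    done = []
    for part in parts:
        if part:
            done.extend(recursive_partition(part, num_buffers, page_size, level + 1))
    return done

# Made-up input: 50 copies of "a" can never fit in 4 pages of 2 tuples each.
keys = ["a"] * 50 + ["b"] * 3 + ["c"] * 2
parts = recursive_partition(keys, num_buffers=4, page_size=2)
print([(p[0], len(p)) for p in parts if len(set(p)) == 1])
```

Without the `len(set(keys)) == 1` check, the partition holding the 50 copies of "a" would be re-hashed forever, since every hash function sends identical keys to the same place.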
Worksheet
Worksheet - Hashing (a)
What are some use-cases in which hashing is preferred over
sorting?
Worksheet - Hashing (a)
Removing duplicates, when the partition phase can be omitted or
shortened.

Operations that require only data rendezvous (matching data
must be together) and no order requirements - such as
GROUP BY without ORDER BY.
Worksheet - Hashing (b)
We can process B × (B-1) pages of data with external hashing
in two passes. For this case, fill in the blanks with the
appropriate number of pages, where we have B pages of
available RAM (buffer pages).

____ input buffer(s)
____ partitions after Partitioning Pass 1
____ pages per partition


Worksheet - Hashing (b)
1 input buffer(s)
B-1 partitions after Partitioning Pass 1
B pages per partition


Worksheet - Hashing (c)
If you are processing exactly B × (B-1) pages of data with
external hashing, is it likely that you’ll have to perform
recursive external hashing in practice? Why or why not?
Worksheet - Hashing (c)
Yes. To avoid additional recursive external hashing, you would
have to have an absolutely perfect hash function that evenly
distributes records into the B - 1 partitions. This is almost
impossible in practice: some partitions may have more than
B pages after the partitioning pass.
Worksheet - Hashing (d)
We want to hash N = 100 pages using B = 10 buffer pages.
Suppose in the initial partitioning pass, the pages are
unevenly hashed into partitions of 10, 20, 20, and 50 pages.

Assuming uniform hash functions are used for every
partitioning pass after this pass, what is the total I/O cost for
External Hashing?
Worksheet - Hashing (d)
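N = 100, B = 10
● Pass 1: read 100, write 100 (partitions of 10, 20, 20, and 50 pages)
● 10-page partition: fits in memory; read 10, build, write 10
● Each 20-page partition: read 20, write 9 partitions × 3 pages = 27;
then read 27, build, write 27
● 50-page partition: read 50, write 9 partitions × 6 pages = 54;
then read 54, build, write 54
Total I/Os = reads + writes = 200 + 20 + 2 × 101 + 212 = 634 I/Os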
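
The same total can be reproduced with a short recursive calculation. This is a sketch under the worksheet's assumptions: the first pass produces the given uneven partitions, every later pass hashes uniformly into B-1 partitions of ceil(size / (B-1)) pages, and the build phase reads and writes each partition once. The function names are made up for illustration.

```python
import math

def conquer_cost(pages, n_buffers):
    """I/Os to finish one partition that is already on disk."""
    if pages <= n_buffers:
        return 2 * pages  # read it, build the in-memory table, write the result
    sub = math.ceil(pages / (n_buffers - 1))  # pages in each of the B-1 sub-partitions
    written = (n_buffers - 1) * sub
    # Read the partition, write its sub-partitions, then finish each sub-partition.
    return pages + written + (n_buffers - 1) * conquer_cost(sub, n_buffers)

def total_hash_cost(n_pages, first_pass_partitions, n_buffers):
    """Total I/Os: first partitioning pass plus the cost of finishing each partition."""
    first_pass = n_pages + sum(first_pass_partitions)  # read input, write partitions
    return first_pass + sum(conquer_cost(p, n_buffers) for p in first_pass_partitions)

print(total_hash_cost(100, [10, 20, 20, 50], 10))  # 634
```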
