
Database Management Systems (COP 5725)

Homework 3
Instructor: Dr. Daisy Zhe Wang
TAs:
Yang Chen, Kun Li, Yibin Wang
{yang, kli, yibwang}@cise.ufl.edu
November 30, 2012
Name:
UFID:
Email Address:

Pledge (must be signed according to the UF Honor Code)


On my honor, I have neither given nor received unauthorized aid in doing
this assignment.

Signature ______________________________________

For grading use only:

Question:   I    II   III  IV   V    Total
Points:     20   20   16   20   24   100
Score:
Q1 (Hash Index)

Consider the Linear Hashing index shown in the above figure. Assume that we split whenever an
overflow page is created. Answer the following questions about this index:

1. [5 points] Show the index after inserting an entry with hash value 4.

2. [5 points] Show the original index after inserting an entry with hash value 15.
3. [5 points] Show the original index after deleting the entries with hash values 36 and 44.
(Assume that the full deletion algorithm is used.)

4. [5 points] Find a list of entries whose insertion into the original index would lead to a bucket
with two overflow pages. Use as few entries as possible to accomplish this. What is the
maximum number of entries that can be inserted into this bucket before a split occurs that reduces
the length of this overflow chain?

The following is the minimal list of entries whose insertion causes two overflow pages in the
index: 63, 127, 255, 511, 1023

The first insertion causes a split and updates Next to 2. The insertion of 1023 causes a
subsequent split, and Next is updated to 3, which points to this bucket. This overflow chain will
not be redistributed until three more insertions (a total of 8 entries) are made. In principle, if we
choose data entries with key values of the form 2^k + 3 with sufficiently large k, we can make the
maximum number of entries that can be inserted before the overflow chain shrinks greater than
any arbitrary number. This is so because the initial index has 31 (binary 11111), 35 (binary
100011), 7 (binary 111), and 11 (binary 1011) in this bucket. By an appropriate choice of data
entries as above, a split of this bucket redistributes just two values (7 and 31) to the new bucket,
since only those two have low-order bits 111 rather than 011. By choosing a sufficiently large k
we can therefore delay the reduction of the length of the overflow chain through any number of
splits of this bucket.
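
For reference, the bucket-selection rule of Linear Hashing that the answer above relies on can
be sketched as follows (a minimal sketch assuming, as in the standard version of this figure, an
initial directory of N = 4 buckets; the function and parameter names are ours, not from the
assignment):

    def lh_bucket(h, level, next_bucket, n_initial=4):
        # h_level(h) = h mod (N * 2^level): the hash family for this round.
        b = h % (n_initial * 2 ** level)
        # Buckets below Next have already been split this round, so entries
        # that land there must be rehashed with h_{level+1}.
        if b < next_bucket:
            b = h % (n_initial * 2 ** (level + 1))
        return b

    # Example: with level = 0 and Next = 1, hash value 32 maps to bucket
    # 32 mod 4 = 0, which is below Next, so it is rehashed to 32 mod 8 = 0.
    print(lh_bucket(32, level=0, next_bucket=1))   # -> 0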
Q2 (External Sort)

Consider a disk with an average seek time of 10ms, average rotational delay of 5ms, and a
transfer time of 1ms for a 4k page. Assume that the cost of reading/writing a page is the sum of
these values (i.e., 16ms) unless a sequence of pages is read/written. In this case, the cost is the
average seek time plus the average rotational delay (to find the first page in the sequence) plus
1ms per page (to transfer data). You are given 320 buffer pages and asked to sort a file with
10,000,000 pages.

(a) [5 points] Why is it a bad idea to use the 320 pages to support virtual memory, that is, to
'new' 10,000,000 * 4k bytes of memory, and to use an in-memory sorting algorithm such as
Quicksort?

Because the 10,000,000-page (40 GB) file vastly exceeds the 320 pages of physical buffer space,
the virtual-memory pages would constantly be swapped to disk, and Quicksort's essentially
random access pattern would incur a page fault, and hence a random I/O of roughly 16ms, on
most accesses. External merge sort, analyzed below, performs large sequential reads and writes
instead.

In Pass 0, 31250 sorted runs of 320 pages each are created. For each run, we read and write 320
pages sequentially, so the I/O cost per run is 2 ∗ (10 + 5 + 1 ∗ 320) = 670ms. Thus, the I/O cost for
Pass 0 is 31250 ∗ 670 = 20937500ms. For each of the cases discussed below, this cost must be
added to the cost of the subsequent merging passes to get the total cost. Also, the calculations
below are slightly simplified by neglecting the effect of a final block that is slightly
smaller than the earlier blocks.

(b) Assume that you begin by creating sorted runs of 320 pages each in the first pass. Evaluate the
cost of the following approaches for the subsequent merging passes:

(i) [3 points] Do 319-way merges.

For 319-way merges, only 2 more passes are needed. The first pass will produce ⌈31250/319⌉ =
98 sorted runs; these can then be merged in the next pass. Every page is read and written
individually, at a cost of 16ms per read or write, in each of these two passes. The cost of these
merging passes is therefore 2∗(2∗16)∗10000000 = 640000000ms. (The formula can be read as
‘number of passes times cost of read and write per page times number of pages in file’.)

(ii) [3 points] Create 256 'input' buffers of 1 page each, create an 'output' buffer of 64 pages,
and do 256-way merges.

With 256-way merges, only two additional merging passes are needed. Every page in the file is
read and written in each pass, but the effect of blocking is different on reads and writes. For
reading, each page is read individually at a cost of 16ms. Thus, the cost of reads (over both
passes) is 2 ∗ 16 ∗ 10000000 = 320000000ms. For writing, pages are written out in blocks of 64
pages. The I/O cost per block is 10+5+1∗64 = 79ms. The number of blocks written out per pass is
10000000/64 = 156250, and the cost per pass is 156250∗79 = 12343750ms. The cost of writes
over both merging passes is therefore 2 ∗ 12343750 = 24687500ms. The total cost of reads and
writes for the two merging passes is 320000000 + 24687500 = 344687500ms.

(iii) [3 points] Create 16 'input' buffers of 16 pages each, create an 'output' buffer of 64 pages,
and do 16-way merges.

With 16-way merges, 4 additional merging passes are needed. For reading, pages are read in
blocks of 16 pages, at a cost per block of 10+5+1∗16 = 31ms. In each pass, 10000000/16 =
625000 blocks are read. The cost of reading over the 4 merging passes is therefore 4 ∗ 625000 ∗
31 = 77500000ms. For writing, pages are written in 64 page blocks, and the cost per pass is
12343750ms as before. The cost of writes over 4 merging passes is 4 ∗ 12343750 = 49375000ms,
and the total cost of the merging passes is 77500000 + 49375000 = 126875000ms.

(iv) [3 points] Create eight 'input' buffers of 32 pages each, create an 'output' buffer of 64
pages, and do eight-way merges.

With 8-way merges, 5 merging passes are needed. For reading, pages are read in blocks of 32
pages, at a cost per block of 10+5+1∗32 = 47ms. In each pass, 10000000/32 = 312500 blocks are
read. The cost of reading over the 5 merging passes is therefore 5 ∗ 312500 ∗ 47 = 73437500ms.
For writing, pages are written in 64 page blocks, and the cost per pass is 12343750ms as before.
The cost of writes over 5 merging passes is 5 ∗ 12343750 = 61718750ms, and the total cost of the
merging passes is 73437500 + 61718750 = 135156250ms.

(v) [3 points] Create four 'input' buffers of 64 pages each, create an 'output' buffer of 64 pages,
and do four-way merges.

With 4-way merges, 8 merging passes are needed. For reading, pages are read in blocks of 64
pages, at a cost per block of 10+5+1∗64 = 79ms. In each pass, 10000000/64 = 156250 blocks are
read. The cost of reading over the 8 merging passes is therefore 8 ∗ 156250 ∗ 79 = 98750000ms.
For writing, pages are written in 64 page blocks, and the cost per pass is 12343750ms as before.
The cost of writes over 8 merging passes is 8 ∗ 12343750 = 98750000ms, and the total cost of the
merging passes is 98750000 + 98750000 = 197500000ms.
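
The arithmetic above can be checked with a short script (a sketch; the helper names are ours,
and as in the text it ignores the slightly smaller final block and reports merging-pass costs
excluding Pass 0):

    import math

    SEEK, ROT, XFER = 10, 5, 1     # ms: average seek, rotational delay, per-page transfer
    N, B = 10_000_000, 320         # pages in the file, buffer pages

    def seq_io(pages):
        # Cost of one sequential read or write of `pages` contiguous pages.
        return SEEK + ROT + XFER * pages

    runs = math.ceil(N / B)                  # 31250 sorted runs after Pass 0
    pass0 = runs * 2 * seq_io(B)             # 31250 * 670 = 20937500 ms

    def merge_cost(fanin, read_blk, write_blk):
        # Count the merging passes, then charge blocked reads/writes per pass.
        r, passes = runs, 0
        while r > 1:
            r = math.ceil(r / fanin)
            passes += 1
        reads = passes * (N // read_blk) * seq_io(read_blk)
        writes = passes * (N // write_blk) * seq_io(write_blk)
        return reads + writes

    print(pass0)                     # 20937500
    print(merge_cost(319, 1, 1))     # (i)   640000000
    print(merge_cost(256, 1, 64))    # (ii)  344687500
    print(merge_cost(16, 16, 64))    # (iii) 126875000
    print(merge_cost(8, 32, 64))     # (iv)  135156250
    print(merge_cost(4, 64, 64))     # (v)   197500000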
Q3. Query Processing
Consider the following relations:

CREATE TABLE Employee (SSN integer, DepartmentID integer,
    PRIMARY KEY (SSN),
    FOREIGN KEY (DepartmentID) REFERENCES Department);
(100,000 tuples; 1100 pages)

CREATE TABLE Department (DepartmentID integer, Name char(40),
    PRIMARY KEY (DepartmentID));
(1000 tuples; 50 pages)

And consider the following join query:

SELECT SSN, DepartmentID, Name
FROM Employee, Department
WHERE Employee.DepartmentID = Department.DepartmentID

Assume there are no indexes available, and both relations are in arbitrary order on disk. Assume
that we use the refinement for sort-merge join that joins during the final merge phase. However,
assume that our implementation of hash join is simple: it cannot perform recursive partitioning.
The optimizer will not choose hash join if it would require recursive partitioning.

(a) [8 points] Assume you have B=3 memory buffers, enough to hold 3 disk pages in memory at
once. (Remember that one buffer must be used to buffer the output of a join.) What is the
best join algorithm to compute the result of this query, and what is its cost, measured in the
number of I/Os? You should also give the cost of each of the join algorithms: sort-merge join,
hash join, and block nested loops join.

• Sort-Merge Join: Let PE be the number of passes needed to sort Employee and PD the
number needed to sort Department. The cost is (2PE - 1)|Employee| +
(2PD - 1)|Department| with the refinement, and (2PE + 1)|Employee| +
(2PD + 1)|Department| without it. Either answer was accepted,
since the problem asked you to use the refinement but it is not technically
possible in this case (not enough buffers).
In this case PE = 10 (at the beginning of each pass there are 1100, 367, 184,
92, 46, 23, 12, 6, 3, and 2 sorted runs). Similarly PD = 6 (runs: 50, 17, 9, 5,
3, 2). Credit was given if you were off by one in either case. Consequently
the following are all valid answers for the cost in this problem:
19150, 19250, 19350, 21350, 21450, 21550, 21650, 23550, 23650, 23750,
23850, 25850, 25950, 26050
(The actual answer is 23750.)
• Hash Join: Not applicable, since B^2 = 9 < 50 = min(|Employee|,
|Department|). It could only be done with recursive partitioning, which was
excluded.
• Doubly-nested loop join: cost is NTuples(Department) ∗ |Employee| +
|Department| = (1000)(1100) + 50 = 1100050 > 23750.
• Page-oriented doubly-nested loop join: cost is |Department| ∗ |Employee| +
|Department| = (50)(1100) + 50 = 55050 > 23750.
• Block nested loops join: B - 2 = 1, so the cost is the same as page-oriented
doubly-nested loop join.

(b) [8 points] Suppose that instead of having 3 memory buffers we have B=11 memory buffers.
What is the best join algorithm now, and what is its cost (again, without writing the final output
or considering buffer hits)? You should also give the cost of each of the join algorithms:
sort-merge join, hash join, and block nested loops join.
• Hash Join: Since B^2 = 121 > 50 = min(|Employee|, |Department|), we can
use hash join in this problem. We partition Employee into 10 partitions,
then partition Department into 10 partitions (average size 5 pages), then load
each Department partition into memory while streaming through the
corresponding (larger) Employee partition. Since there is no recursive
partitioning, the total cost is 3(|Employee| + |Department|) = 3(1100 + 50) =
3450.
• Sort-Merge Join: As before, with PE passes to sort Employee and PD passes
to sort Department, the cost is (2PE - 1)|Employee| + (2PD - 1)|Department|
with the refinement and (2PE + 1)|Employee| + (2PD + 1)|Department|
without it. Either answer was accepted, since the problem asked you to use
the refinement but it is not technically possible in this case (not enough
buffers). Another option was to use the refinement on the Department
relation but not the Employee relation, for a cost of (2PE + 1)|Employee| +
(2PD - 1)|Department|.
In this case PE = 3 (at the beginning of each pass there are 1100, 100, and 10
sorted runs) and PD = 2 (runs: 50, 5). The following are valid answers
for the cost: 5650, 7850, 7950. These are all more than the cost of hash join,
but were still accepted because that solution was not known at the time of
grading.
• Block nested loops join: cost is ⌈|Department|/(B-2)⌉ ∗ |Employee|
+ |Department| = (6)(1100) + 50 = 6650. This is faster than sort-merge
join (since the refinement giving the 5650 cost cannot actually be used), but
slower than hash join.
• Doubly-nested loop join and page-oriented doubly-nested loop join: costs are
the same as in part (a) above, both far more than the options above.
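
For reference, the costs in both parts can be reproduced with a short script (a sketch; the
function names are ours, and the hash join formula assumes no recursive partitioning, as the
problem requires):

    import math

    E, D = 1100, 50            # pages: Employee, Department

    def sort_passes(pages, B):
        # Pass 0 builds ceil(pages/B) runs; each later pass merges B-1 runs.
        runs = math.ceil(pages / B)
        passes = 1
        while runs > 1:
            runs = math.ceil(runs / (B - 1))
            passes += 1
        return passes

    def sort_merge(B, refined):
        pe, pd = sort_passes(E, B), sort_passes(D, B)
        k = -1 if refined else 1   # refinement saves the final read and write
        return (2 * pe + k) * E + (2 * pd + k) * D

    def block_nested_loops(B):
        # Department is the outer relation, read in chunks of B-2 pages.
        return D + math.ceil(D / (B - 2)) * E

    def hash_join():
        # Partition both relations (read + write), then read both to match.
        return 3 * (E + D)

    for B in (3, 11):
        print(B, sort_merge(B, refined=False), block_nested_loops(B), hash_join())
    # -> 3 23750 55050 3450   (hash join is not actually applicable: B^2 < 50)
    # -> 11 7950 6650 3450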
Q4. Join Costs

Consider the following schema.

auctions (aid, minprice, description, seller, end_date)
members (mid, nickname, name, since)
bids (aid, buyerid, amount)

Assume there is an unclustered B-tree index on the key of each table. In answering questions, use
the summary statistics functions that we learned about in class: NPages(), NTuples(), Low(),
High(), NKeys(), IHeight(), INPages().

a. [4 points] Consider the query:

SELECT 'found it!' FROM members WHERE mid = 98765;

Given the information above, write a formula for the optimizer's lowest estimated cost for
this query.

IHeight(Btree on members.mid)

b. [5 points] Consider the query:

SELECT * FROM bids, members
WHERE bids.buyerid = members.mid AND members.since < '2001';

Write the formula the optimizer would use for the selectivity of the entire WHERE clause.

[1 / MAX(NKeys(bids.buyerid), NKeys(members.mid))]
∗ [2001 - Low(members.since)] / [High(members.since) - Low(members.since)]

c. [5 points] Consider the same query as in part (b). Now suppose the optimizer knows that
bids.buyerid is a "not null" foreign key referencing members. Write a simplified formula for
the selectivity of the entire WHERE clause.

[1 / NTuples(members)]
∗ [2001 - Low(members.since)] / [High(members.since) - Low(members.since)]
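
As a sketch of how these estimates combine (our own function names; the statistics values
passed in at the end are hypothetical placeholders, for illustration only):

    def equijoin_selectivity(nkeys_left, nkeys_right):
        # Textbook estimate for col1 = col2.
        return 1.0 / max(nkeys_left, nkeys_right)

    def fk_equijoin_selectivity(ntuples_referenced):
        # Part (c): a non-null foreign key matches exactly one members tuple.
        return 1.0 / ntuples_referenced

    def range_selectivity(value, low, high):
        # Textbook estimate for col < value, assuming uniformly spread values.
        return (value - low) / (high - low)

    # Selectivities of independent predicates multiply:
    sel = (fk_equijoin_selectivity(ntuples_referenced=50_000)
           * range_selectivity(2001, low=1995, high=2012))
    print(sel)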

d. [6 points] Consider the following query:

SELECT R.*
FROM R, S, T
WHERE R.a = S.b
AND S.b = T.c

The following plans are generated during an intermediate pass of the Selinger (System R)
optimizer algorithm. For each plan, write down the ordering column(s) of its output if any, and
whether the plan would get pruned (P) or kept (K) at the end of the pass. If there is no clear
ordering on the output, write “none”.

PLAN                                      Cost in I/Os   Ordering Columns   Prune or
                                                         of Output          Keep
IndexNestedLoops(                         2010           None               P
    FileScan(R),
    IndexOnlyScan(Btree on S.b))

SortMergeJoin(                            3010           (R.a, S.b)         K
    FileScan(R),
    FileScan(S))

BlockNestedLoops(                         1010           None               K
    FileScan(R),
    FileScan(S))

IndexNestedLoops(                         30000          (R.d, R.a)         P
    IndexOnlyScan(Btree on (R.d, R.a)),
    IndexScan(Btree on S.b))
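
The keep/prune decisions above follow the standard Selinger rule: keep the cheapest plan
overall, plus the cheapest plan producing each interesting order. A sketch (it assumes the only
interesting order here is on the join column R.a = S.b, since R.d appears nowhere in the query):

    def selinger_prune(plans, interesting):
        # plans: list of (name, cost, output_order) for one optimizer pass.
        kept = {min(plans, key=lambda p: p[1])[0]}      # cheapest overall
        for order in interesting:
            candidates = [p for p in plans if p[2] == order]
            if candidates:                              # cheapest per interesting order
                kept.add(min(candidates, key=lambda p: p[1])[0])
        return kept

    plans = [("IndexNestedLoops",  2010,  None),
             ("SortMergeJoin",     3010,  ("R.a", "S.b")),
             ("BlockNestedLoops",  1010,  None),
             ("IndexNestedLoops2", 30000, ("R.d", "R.a"))]
    print(selinger_prune(plans, interesting=[("R.a", "S.b")]))
    # -> {'BlockNestedLoops', 'SortMergeJoin'}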
Q5. Concurrency
Consider the following schedule of accesses by three transactions. The labels R and W indicate
reads and writes, and the labels A, B, and C indicate distinct elements of data.

The lock (L) and unlock (U) actions shown in the table are one correct answer to part (e).
There are many correct answers to that question: as long as each piece of data is locked before it
is used, each row contains at most one action, and no transaction locks data after it begins
unlocking, the answer is correct.

Time   Action
 1     L(A)
 2     R(A)
 3     L(C)
 4     R(C)
 5     L(B)
 6     R(B)
 7
 8     W(B)
 9     U(B)
10     U(A)
11     L(B)
12     R(B)
13     L(A)
14     R(A)
15     U(C)
16     L(C)
17     R(C)
18
19     W(C)
20     U(C)
21     W(A)
22     U(A)
23     U(B)

[In the original table each action sat in a column for its transaction (T1, T2, or T3); that column
assignment did not survive extraction. Part (c) indicates that the R(B) at time 12 belongs to T3.]
(a) [4 points] Recall the definition of a precedence graph: "A precedence graph has a node for
each committed transaction, and an arc from Ti to Tj if an action of Ti precedes and conflicts with
one of Tj's actions." Draw a precedence graph for the schedule above.

[Precedence graph with nodes T1, T2, T3; the arrows did not survive extraction. Per part (b),
the graph is acyclic and yields the serial order T2, T1, T3.]

(b) [4 points] Is the schedule above conflict-serializable? If so, what order should
the transactions be executed in to produce a conflict-equivalent serial schedule?

The precedence graph contains no cycles, so the schedule is conflict-serializable. The only
possible conflict-equivalent serial schedule is T2, T1, T3.

(c) [4 point] Suppose instead of reading B at time 12, transaction 3 reads B at time 7. Draw a
precedence graph for this modified schedule.

[Precedence graph with nodes T1, T2, T3; the arrows did not survive extraction. Per part (d),
this graph contains a cycle.]

(d) [4 point] Is the schedule of part (c) conflict-serializable? If so, what order should the
transactions be executed in to produce a conflict-equivalent serial schedule?

The precedence graph contains a cycle, so the schedule is not conflict-serializable.
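
The reasoning in parts (a) through (d) can be mechanized with a small sketch that builds the
precedence graph and searches for a topological order (the schedule encoding and the example
transaction assignment below are illustrative only, since the original table's per-transaction
columns are ambiguous in this copy):

    def precedence_edges(schedule):
        # schedule: list of (txn, op, item), op in {'R', 'W'}, in time order.
        # Two actions conflict if they come from different transactions, touch
        # the same item, and at least one of them is a write.
        edges = set()
        for i, (t1, op1, x1) in enumerate(schedule):
            for t2, op2, x2 in schedule[i + 1:]:
                if t1 != t2 and x1 == x2 and 'W' in (op1, op2):
                    edges.add((t1, t2))
        return edges

    def serial_order(schedule):
        # Kahn's algorithm: a topological order exists iff the precedence graph
        # is acyclic, i.e. iff the schedule is conflict-serializable.
        edges = precedence_edges(schedule)
        nodes = {t for t, _, _ in schedule}
        order = []
        while nodes:
            src = next((n for n in nodes if not any(v == n for _, v in edges)), None)
            if src is None:
                return None            # cycle: not conflict-serializable
            order.append(src)
            nodes.remove(src)
            edges = {(u, v) for u, v in edges if u != src}
        return order

    # An illustrative three-transaction schedule:
    s = [('T2', 'R', 'C'), ('T1', 'W', 'B'), ('T3', 'R', 'B'), ('T3', 'W', 'C')]
    print(serial_order(s))   # e.g. ['T2', 'T1', 'T3']; T3 must come last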

(e) [8 points] Add lock/unlock actions into the schedule above in a way compliant
with (non-strict) two-phase locking. Use L(X) to lock a data element X, and U(X) to unlock it. At
most one box on each row may contain an action, and only one action. You
should only use exclusive locks, not shared (read) locks. No locks should remain held at the end
of the schedule.

See the lock/unlock actions already included in the schedule above.
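
A proposed answer to this part can be validated with a small checker along these lines (a
sketch assuming exclusive locks only; the encoding is ours):

    def obeys_2pl(actions):
        # actions: list of (txn, op, item), op in {'L', 'U', 'R', 'W'},
        # in time order, using exclusive locks only.
        held = {}            # item -> transaction currently holding its lock
        shrinking = set()    # transactions that have released at least one lock
        for txn, op, item in actions:
            if op == 'L':
                if item in held or txn in shrinking:
                    return False     # conflicting lock, or a lock after an unlock
                held[item] = txn
            elif op == 'U':
                if held.get(item) != txn:
                    return False     # unlocking an item it does not hold
                del held[item]
                shrinking.add(txn)
            else:                    # 'R' or 'W'
                if held.get(item) != txn:
                    return False     # data access without holding the lock
        return not held              # no locks may remain held at the end

    # Example: lock, read, write, then unlock a single element.
    print(obeys_2pl([('T1', 'L', 'A'), ('T1', 'R', 'A'),
                     ('T1', 'W', 'A'), ('T1', 'U', 'A')]))   # -> True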
