Chapter 1: Query Processing and Optimization: Slides By: Ms. Shree Jaswal

St. Francis Institute of Technology, Department of Information Technology

The material in this presentation belongs to St. Francis Institute of Technology and is solely for educational purposes. Distribution and modification of the content are prohibited.
Which chapter? Which Book?

• Chapter 12: Query Processing, Korth, Silberschatz, Sudarshan, "Database System Concepts", 6th Edition, McGraw-Hill
• Chapter 13: Query Optimization, Korth, Silberschatz, Sudarshan, "Database System Concepts", 6th Edition, McGraw-Hill
• Chapter 3: The Relational Data Model and Relational Database Constraints, Elmasri and Navathe, "Fundamentals of Database Systems", 6th Edition, Pearson Education
• Chapter 4: Relational Algebra and Calculus, Raghu Ramakrishnan and Johannes Gehrke, "Database Management Systems", 3rd Edition, McGraw-Hill
ADMT chp1
• The slides in this presentation were prepared with reference to the above-mentioned authors' slides.
Topics to be covered
• Overview: Introduction, Query processing in DBMS,
Steps of Query Processing
• Measures of Query Cost: Selection Operation, Sorting,
Join Operation, Other Operations
• Evaluation of Expressions.

• Query Optimization Overview: Goals of Query
Optimization, Approaches of Query Optimization
• Transformation of Relational Expressions
• Estimating Statistics of Expression Results
• Choice of Evaluation Plans
• Self-learning Topics: Solve problems on query
optimization.
Prerequisite
Topics
• Reviewing basic concepts of a Relational database,
• SQL concepts

Relational Database
• Entity
• Relationship

• Attributes
Relational Algebra Overview

• Relational algebra consists of several groups of operations:
  - Unary relational operations
    - SELECT (symbol: σ (sigma))
    - PROJECT (symbol: π (pi))
    - RENAME (symbol: ρ (rho))
  - Relational algebra operations from set theory
    - UNION (∪), INTERSECTION (∩), DIFFERENCE (or MINUS, −)
    - CARTESIAN PRODUCT (×)
  - Binary relational operations
    - JOIN (several variations of JOIN exist)
    - DIVISION
  - Additional relational operations
    - OUTER JOINS
    - AGGREGATE FUNCTIONS (these compute summaries of information, for example SUM, COUNT, AVG, MIN, MAX)
Examples on Relational Algebra
Operations
• Consider the following schema
• Sailors(sid: integer, sname: string, rating: integer, age: real)
• Boats( bid: integer, bname: string, color: string)
• Reserves(sid: integer, bid: integer, day: date)

Instance S1 of Sailors
sid  sname   rating  age
22   Dustin  7       45.0
31   Lubber  8       55.5
58   Rusty   10      35.0

Instance S2 of Sailors
sid  sname   rating  age
28   Yuppy   9       35.0
31   Lubber  8       55.5
44   Guppy   5       35.0
58   Rusty   10      35.0

Instance R1 of Reserves
sid  bid  day
22   101  10/10/96
58   103  11/12/96
Unary Relational Operations

• Selection and Projection
  - Retrieve expert sailors with rating greater than 8:
    σ rating>8 (S2)
    sid  sname  rating  age
    28   Yuppy  9       35.0
    58   Rusty  10      35.0
  - Compute the names and ratings of highly rated sailors:
    π sname,rating (σ rating>8 (S2))
    sname  rating
    Yuppy  9
    Rusty  10
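The selection and projection above can be sketched in Python. This is only an illustration: the `Sailor` type and the helper names `select` and `project` are assumptions made for this example, not part of the slides or of any DBMS API.

```python
from typing import NamedTuple


class Sailor(NamedTuple):
    sid: int
    sname: str
    rating: int
    age: float


# Instance S2 of Sailors, as given on the slides.
S2 = [
    Sailor(28, "Yuppy", 9, 35.0),
    Sailor(31, "Lubber", 8, 55.5),
    Sailor(44, "Guppy", 5, 35.0),
    Sailor(58, "Rusty", 10, 35.0),
]


def select(relation, predicate):
    """sigma: keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, fields):
    """pi: keep only the named fields and drop duplicate rows."""
    seen, result = set(), []
    for t in relation:
        row = tuple(getattr(t, f) for f in fields)
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


experts = select(S2, lambda t: t.rating > 8)
names_and_ratings = project(experts, ["sname", "rating"])
print(names_and_ratings)
```

Composing `project` over `select` mirrors the nesting of π around σ in the algebra expression.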
Set Operations
• Union (either-or): two relation instances are said to be union-compatible if the following conditions hold:
  - they have the same number of fields,
  - and corresponding fields, taken in order from left to right, have the same domains.
• S1 ∪ S2
  sid  sname   rating  age
  22   Dustin  7       45.0
  31   Lubber  8       55.5
  58   Rusty   10      35.0
  28   Yuppy   9       35.0
  44   Guppy   5       35.0
• Is R1 ∪ S1 possible?
• No: it is not a valid operation because the two relations are not union-compatible.
Set Operations
• Intersection (both): S1 ∩ S2
  sid  sname   rating  age
  31   Lubber  8       55.5
  58   Rusty   10      35.0
• Set-difference: S1 − S2 returns a relation instance containing all tuples that occur in S1 but not in S2. The relations S1 and S2 must be union-compatible, and the schema of the result is defined to be identical to the schema of S1.
  sid  sname   rating  age
  22   Dustin  7       45.0
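As a rough illustration, the three set operations map directly onto Python sets of tuples. The `union_compatible` helper is a hypothetical stand-in that approximates "same domains" by comparing Python types position by position, and it assumes non-empty relations with uniformly typed rows.

```python
# Instances from the slides, modelled as sets of tuples.
S1 = {(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5), (58, "Rusty", 10, 35.0)}
S2 = {(28, "Yuppy", 9, 35.0), (31, "Lubber", 8, 55.5),
      (44, "Guppy", 5, 35.0), (58, "Rusty", 10, 35.0)}
R1 = {(22, 101, "10/10/96"), (58, 103, "11/12/96")}


def schema(relation):
    """Field types of one row, used here as a stand-in for field domains."""
    return tuple(type(v) for v in next(iter(relation)))


def union_compatible(r, s):
    """Same number of fields and the same type in each position."""
    return schema(r) == schema(s)


union = S1 | S2
intersection = S1 & S2
difference = S1 - S2

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
print(union_compatible(S1, S2), union_compatible(R1, S1))
```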
Set Operations
• Cross-product (Cartesian product): S1 x R1 returns a relation instance
whose schema contains all the fields of S1 (in the same order as they appear
in S1) followed by all the fields of R1 (in the same order as they appear in R1).
• S1 x R1
  sid  sname   rating  age   sid  bid  day
  22   Dustin  7       45.0  22   101  10/10/96
  22   Dustin  7       45.0  58   103  11/12/96
  31   Lubber  8       55.5  22   101  10/10/96
  31   Lubber  8       55.5  58   103  11/12/96
  58   Rusty   10      35.0  22   101  10/10/96
  58   Rusty   10      35.0  58   103  11/12/96
Binary Relational Operations

• Joins: the most commonly used way to combine information from two or more relations.
  - Condition joins: the most general version of the join operation accepts a join condition c and a pair of relation instances as arguments and returns a relation instance.
    S1 ⋈ S1.sid<R1.sid R1
    sid  sname   rating  age   sid  bid  day
    22   Dustin  7       45.0  58   103  11/12/96
    31   Lubber  8       55.5  58   103  11/12/96
  - Equijoin: when the join condition consists solely of equalities of the form R.name1 = S.name2, that is, equalities between two fields in R and S.
    S1 ⋈ S1.sid=R1.sid R1
    sid  sname   rating  age   bid  day
    22   Dustin  7       45.0  101  10/10/96
    58   Rusty   10      35.0  103  11/12/96

• Natural join: an equijoin R ⋈ S in which equalities are specified on all fields having the same name in R and S.
• In this case, we can simply omit the join condition; the default is that the join condition is a collection of equalities on all common fields.
• The result is guaranteed not to have two fields with the same name.
• If the two relations have no attributes in common, R ⋈ S is simply the cross-product.
• S1 ⋈ S1.sid=R1.sid R1 is actually a natural join.
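A minimal sketch of the condition join and the natural join on the instances S1 and R1, with rows modelled as dictionaries. The function names `condition_join` and `natural_join` are illustrative assumptions, not taken from the slides.

```python
S1 = [
    {"sid": 22, "sname": "Dustin", "rating": 7, "age": 45.0},
    {"sid": 31, "sname": "Lubber", "rating": 8, "age": 55.5},
    {"sid": 58, "sname": "Rusty", "rating": 10, "age": 35.0},
]
R1 = [
    {"sid": 22, "bid": 101, "day": "10/10/96"},
    {"sid": 58, "bid": 103, "day": "11/12/96"},
]


def condition_join(r, s, cond):
    """Return the (tr, ts) pairs of the cross-product that satisfy cond."""
    return [(tr, ts) for tr in r for ts in s if cond(tr, ts)]


def natural_join(r, s):
    """Equate all same-named fields; each common field appears once."""
    result = []
    for tr in r:
        for ts in s:
            common = tr.keys() & ts.keys()
            # With no common fields this is vacuously true: a cross-product.
            if all(tr[k] == ts[k] for k in common):
                result.append({**tr, **ts})
    return result


less_than = condition_join(S1, R1, lambda tr, ts: tr["sid"] < ts["sid"])
natural = natural_join(S1, R1)
print([(tr["sid"], ts["sid"]) for tr, ts in less_than])
print(natural)
```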

Outer Joins:
• The left outer join operation (R ⟕ S) keeps every tuple in the first or left relation R; if no matching tuple is found in S, then the attributes of S in the join result are filled or "padded" with null values.
• A similar operation, right outer join (R ⟖ S), keeps every tuple in the second or right relation S in the result.
• A third operation, full outer join, denoted by R ⟗ S, keeps all tuples in both the left and the right relations when no matching tuples are found, padding them with null values as needed.

• Left outer join: S1 ⟕ R1
  22   Dustin  7   45.0  22    101   10/10/96
  31   Lubber  8   55.5  Null  Null  Null
  58   Rusty   10  35.0  58    103   11/12/96

• Right outer join: S1 ⟖ R1
  22   Dustin  7   45.0  22    101   10/10/96
  58   Rusty   10  35.0  58    103   11/12/96

• Full outer join: S1 ⟗ R1
  22   Dustin  7   45.0  22    101   10/10/96
  31   Lubber  8   55.5  Null  Null  Null
  58   Rusty   10  35.0  58    103   11/12/96

• Division: For a tuple t to appear in the result T of the DIVISION, the
values in t must appear in R in combination with every tuple in S.

Aggregate Functions
• Use of the Aggregate Functional operation ℱ
 ℱMAX Salary (EMPLOYEE) retrieves the maximum salary value from the
EMPLOYEE relation
 ℱMIN Salary (EMPLOYEE) retrieves the minimum Salary value from the

EMPLOYEE relation
 ℱSUM Salary (EMPLOYEE) retrieves the sum of the Salary from the
EMPLOYEE relation
 ℱCOUNT SSN, AVERAGE Salary (EMPLOYEE) computes the count (number)
of employees and their average salary
 Note: count just counts the number of rows, without removing duplicates
Recap of
Relational
Algebra
Operations

Query Tree
• A query tree is a tree data structure representing a relational
algebra expression.
• The tables of the query are represented as leaf nodes.

• The relational algebra operations are represented as the internal
nodes.
• The root represents the query as a whole.
• During execution, an internal node is executed whenever its operand
tables are available.
• The node is then replaced by the result table. This process continues
for all internal nodes until the root node is executed and replaced by
the result table.
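The bottom-up execution described above can be sketched as a small recursive evaluator. The `Node` class, the `evaluate` function and the two sample rows are assumptions for illustration only; the query is the selection-projection example used in these slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Table = List[dict]


@dataclass
class Node:
    """Query-tree node: a leaf holds a table, an internal node an operation."""
    label: str
    op: Optional[Callable[..., Table]] = None
    children: List["Node"] = field(default_factory=list)
    table: Optional[Table] = None


def evaluate(node: Node) -> Table:
    """Execute a node once its operand tables are available."""
    if node.table is None:
        operands = [evaluate(child) for child in node.children]
        # The internal node is replaced by its result table.
        node.table = node.op(*operands)
    return node.table


# Hypothetical EMPLOYEE rows; only the columns the query needs are shown.
employee = Node("EMPLOYEE", table=[
    {"EmpID": 1, "EName": "ArunKumar"},
    {"EmpID": 2, "EName": "Meera"},
])
selection = Node(
    "sigma EName=ArunKumar",
    op=lambda t: [row for row in t if row["EName"] == "ArunKumar"],
    children=[employee],
)
root = Node(
    "pi EmpID",
    op=lambda t: [{"EmpID": row["EmpID"]} for row in t],
    children=[selection],
)

result = evaluate(root)
print(result)
```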
Query Tree
• A B C

Example
• Employee
EmpID EName Salary DeptNo DateOfJoining

• Department
DNo DName Location
Example
• Let us consider the query as the following.
• πEmpID (σEName="ArunKumar" (EMPLOYEE))
• The corresponding query tree will be

Example
• Let us consider another query involving a join.
• πEName,Salary (σDName="Marketing" (DEPARTMENT) ⋈DNo=DeptNo (EMPLOYEE))
• Following is the query tree for the above query.

Query Processing
Basic Steps in Query Processing

1. Parsing and translation
2. Optimization
3. Evaluation

Basic Steps in Query Processing

(Cont.)
• Parsing and translation
 translate the query into its internal form. This is then
translated into relational algebra.
 Parser checks syntax, verifies relations

• Evaluation
 The query-execution engine takes a query-evaluation
plan, executes that plan, and returns the answers to the
query.
Basic Steps in Query Processing :

Optimization
• Consider a query: select salary from instructor where salary < 75000
• A relational algebra expression may have many equivalent expressions
  - E.g., σ salary<75000 (π salary (instructor)) is equivalent to π salary (σ salary<75000 (instructor))
• Each relational algebra operation can be evaluated using one of several different algorithms
  - Correspondingly, a relational-algebra expression can be evaluated in many ways.
Basic Steps: Optimization (Cont.)

• To specify how to evaluate a query, we need to provide a relational algebra expression and annotate it with instructions (algorithms or indices to be used) specifying how to evaluate each operation
• A relational algebra operation annotated with instructions on how to evaluate it is called an evaluation primitive

• An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
  - E.g., can use an index on salary to find instructors with salary < 75000,
  - or can perform a complete relation scan and discard instructors with salary ≥ 75000

• Query Optimization: Amongst all equivalent
evaluation plans choose the one with lowest cost.
 Cost is estimated using statistical information from the
database catalog

 e.g. number of tuples in each relation, size of tuples,
etc.
• It is the responsibility of the system to construct a query evaluation
plan that minimizes the cost of query evaluation
Disk
structure

Sectors and Blocks

Measures of Query Cost

• Cost is generally measured as the total elapsed time for answering a query
  - Many factors contribute to time cost: disk accesses, CPU, or even network communication
• Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account
  - Number of seeks * average-seek-cost
  - Number of blocks read * average-block-read-cost
  - Number of blocks written * average-block-write-cost
  - The cost to write a block is greater than the cost to read a block (typically about twice as expensive), because data is read back after being written to ensure that the write was successful
Measures of Query Cost (Cont.)

• For simplicity we just use the number of block transfers from
disk and the number of seeks as the cost measures
 tT – time to transfer one block
 tS – time for one seek

 Cost for b block transfers plus S seeks
b * tT + S * tS
• We ignore CPU costs for simplicity
 Real systems do take CPU cost into account
• We do not include the cost of writing output to disk in our cost formulae
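The cost measure b * tT + S * tS is simple enough to write as a one-line function. The device timings below (100 microseconds per block transfer, 4000 microseconds per seek) are assumed values chosen only to make the arithmetic concrete; they do not come from the slides.

```python
def io_cost(block_transfers, seeks, t_T, t_S):
    """Estimated I/O time for b block transfers plus S seeks: b*tT + S*tS."""
    return block_transfers * t_T + seeks * t_S


# Assumed timings in microseconds (illustrative only).
T_T = 100    # time to transfer one block
T_S = 4000   # time for one seek

scan_cost = io_cost(block_transfers=100, seeks=1, t_T=T_T, t_S=T_S)
print(scan_cost)
```

With these assumed timings a single seek costs as much as 40 block transfers, which is why the later formulas track seeks separately.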
Measures of Query Cost (Cont.)

• Several algorithms can reduce disk IO by using extra
buffer space
 Amount of real memory available to buffer depends on
other concurrent queries and OS processes, known only

during execution
 We often use worst case estimates, assuming only the
minimum amount of memory needed for the operation
is available
• Required data may be buffer resident already, avoiding disk I/O
  - But hard to take into account for cost estimation
Indexes as Access Paths

• A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file.
• The index is usually specified on one field of the file (although it could be specified on several fields)
• One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value
• The index is called an access path on the field.
Primary Index
  - Also referred to as a clustering index
  - Defined on an ordered data file
  - The data file is ordered on a key field
  - Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor
  - In other words, it allows the records of a file to be read in an order that corresponds to the physical order in the file
  - Further classified into dense and sparse index
  - An index that is not a primary index is called a secondary index
Primary Index

Secondary Index

B+ tree
• A B+ tree is an n-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.
• In a B+ tree, data is stored only in leaf nodes.
• The leaf nodes of the tree store the actual records rather than pointers to records.
• These trees do not waste space.
• In a B+ tree, leaf node data are ordered in a sequential linked list.
• Searching for any data in a B+ tree is very easy because all data is found in leaf nodes.
• They store redundant search keys.
• Many database system implementers prefer the structural simplicity of a B+ tree.
B+ tree

Selection Operation
• File scan
• Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
  - Cost estimate = br block transfers + 1 seek
    - br denotes the number of blocks containing records from relation r
  - If the selection is on a key attribute, the scan can stop on finding the record
    - average cost = (br / 2) block transfers + 1 seek
  - Linear search can be applied regardless of
    - the selection condition,
    - the ordering of records in the file, or
    - the availability of indices
• Note: binary search generally does not make sense since data is not stored consecutively
  - except when there is an index available,
  - and binary search requires more seeks than index search
Selections Using Indices

• Index scan: search algorithms that use an index
  - the selection condition must be on the search-key of the index.
• A2 (primary B+ tree index, equality on key). Retrieve a single record that satisfies the corresponding equality condition. The index lookup traverses the height hi of the tree, plus 1 I/O to fetch the record; each of these I/O operations requires a seek and a block transfer.
  - Cost = (hi + 1) * (tT + tS)
• A3 (primary B+ tree index, equality on nonkey). Retrieve multiple records.
  - Records will be on consecutive blocks
  - Let b = number of blocks (leaf blocks) containing matching records, all of which are read
  - 1 seek for each level of the tree and one seek for the first block
  - Cost = hi * (tT + tS) + tT * b
Selections Using Indices

• A4 (secondary B+ tree index, equality on key and nonkey).
  - Retrieve a single record if the search-key is a candidate key. Cost is the same as A2
    - Cost = (hi + 1) * (tT + tS)
  - Retrieve multiple records if the search-key is not a candidate key
    - each of n matching records may be on a different block, which may result in 1 I/O operation per retrieved record, with each I/O operation requiring a seek and a block transfer
    - Cost = (hi + n) * (tT + tS)
    - Can be very expensive!
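To compare the selection algorithms, the cost formulas for A1 through A4 can be evaluated side by side. The function names and every input value below (timings in microseconds, tree height, block and record counts) are assumptions chosen for illustration, not figures from the slides.

```python
def cost_a1_linear(b_r, t_T, t_S):
    """A1: br block transfers + 1 seek."""
    return b_r * t_T + t_S


def cost_a2_primary_key(h_i, t_T, t_S):
    """A2: (hi + 1) * (tT + tS)."""
    return (h_i + 1) * (t_T + t_S)


def cost_a3_primary_nonkey(h_i, b, t_T, t_S):
    """A3: hi * (tT + tS) + tT * b."""
    return h_i * (t_T + t_S) + t_T * b


def cost_a4_secondary_nonkey(h_i, n, t_T, t_S):
    """A4, nonkey case: (hi + n) * (tT + tS)."""
    return (h_i + n) * (t_T + t_S)


# Assumed values: tT = 100 us, tS = 4000 us, index height 4,
# a 400-block relation, 10 matching blocks, 50 matching records.
T_T, T_S, H = 100, 4000, 4
costs = {
    "A1": cost_a1_linear(400, T_T, T_S),
    "A2": cost_a2_primary_key(H, T_T, T_S),
    "A3": cost_a3_primary_nonkey(H, 10, T_T, T_S),
    "A4": cost_a4_secondary_nonkey(H, 50, T_T, T_S),
}
print(costs)
```

Under these assumptions the secondary-index lookup for 50 scattered records costs about five times as much as a full linear scan, which illustrates why A4 on a nonkey can be very expensive.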
Selections Involving Comparisons

• Can implement selections of the form σ A≤V (r) or σ A≥V (r) by using
  - a linear file scan,
  - or by using indices in the following ways:
• A5 (primary index, comparison). (Relation is sorted on A)
  - For σ A≥V (r), use the index to find the first tuple ≥ V and scan the relation sequentially from there
  - For σ A≤V (r), just scan the relation sequentially till the first tuple > V; do not use the index
  - Cost is identical to A3
Selections Involving Comparisons

• A6 (secondary index, comparison).
  - For σ A≥V (r), use the index to find the first index entry ≥ V and scan the index sequentially from there, to find pointers to records.
  - For σ A≤V (r), just scan the leaf pages of the index finding pointers to records, till the first entry > V
  - In either case, retrieve the records that are pointed to
    - requires an I/O for each record
    - Linear file scan may be cheaper
  - Cost is identical to A4
Implementation of Complex Selections

• Conjunction: σ θ1 ∧ θ2 ∧ ... ∧ θn (r)
• A7 (conjunctive selection using one index).
  - Select a combination of θi and algorithms A1 through A6 that results in the least cost for σ θi (r).
  - Test the other conditions on each tuple after fetching it into the memory buffer.
• A8 (conjunctive selection using composite index).
  - Use an appropriate composite (multiple-key) index if available.
• A9 (conjunctive selection by intersection of identifiers).
  - Requires indices with record pointers.
  - Use the corresponding index for each condition, and take the intersection of all the obtained sets of record pointers.
  - Then fetch records from the file
  - If some conditions do not have appropriate indices, apply the test on the retrieved records in memory.
Implementation of Complex
Selections
• Disjunction: σ θ1 ∨ θ2 ∨ ... ∨ θn (r)
• A10 (disjunctive selection by union of identifiers).
  - Applicable if all conditions have available indices.
    - Otherwise use linear scan.
  - Use the corresponding index for each condition, and take the union of all the obtained sets of record pointers.
  - Then fetch records from the file
• Negation: σ ¬θ (r)
  - Use linear scan on the file
  - If very few records satisfy ¬θ, and an index is applicable to θ
    - Find the satisfying records using the index and fetch them from the file
Summary of costs for selections
Sorting
• Sorting in DBMS may be required for 2 reasons:
 SQL queries specify that the output be sorted, or
 Several operations like joins can be implemented efficiently if
input relations are sorted first

• We may build an index on the relation, and then use the index to read the relation in sorted order. However, such a process orders the relation logically rather than physically, and may lead to one disk block access for each tuple, which can be expensive.
• For relations that fit in memory, techniques like quicksort can be used. For relations that don't fit in memory, external sort-merge is a good choice.
Quick Sort example

External Sort-Merge
• Let M denote memory size (in pages).
1. Create sorted runs. Let i be 0 initially.
Repeatedly do the following till the end of the relation:

(a) Read M blocks of relation into memory
(b) Sort the in-memory blocks
(c) Write sorted data to run Ri; increment i.
Let the final value of i be N
2. Merge the runs (next slide)…..
External Sort-Merge (Cont.)

• Example: File size = 5 GB
• Memory size =1000 MB = 1GB

Relation
R0 R1 R2 R3 R4

2. Merge the runs (N-way merge). We assume (for now) that N < M.
   1. Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer page.
   2. repeat
      1. Select the first record (in sort order) among all buffer pages.
      2. Write the record to the output buffer. If the output buffer is full, write it to disk.
      3. Delete the record from its input buffer page. If the buffer page becomes empty, then read the next block (if any) of the run into the buffer.
   3. until all input buffer pages are empty.
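The two phases can be sketched in memory as follows. This is a simplified illustration, not a disk-based implementation: blocks are modelled as small Python lists, the sample keys are made up, and `heapq.merge` from the standard library stands in for the buffer-by-buffer N-way merge loop.

```python
import heapq


def create_runs(blocks, M):
    """Phase 1: read M blocks at a time, sort them, emit one sorted run."""
    runs = []
    for i in range(0, len(blocks), M):
        records = [rec for block in blocks[i:i + M] for rec in block]
        runs.append(sorted(records))
    return runs


def merge_runs(runs):
    """Phase 2: N-way merge that repeatedly takes the smallest head record."""
    return list(heapq.merge(*runs))


# Hypothetical relation: 6 blocks of 2 records each, memory M = 3 blocks,
# so N = 2 runs and the single-pass assumption N < M holds.
blocks = [[7, 3], [9, 1], [4, 8], [2, 6], [5, 0], [11, 10]]
runs = create_runs(blocks, M=3)
output = merge_runs(runs)
print(runs)
print(output)
```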

• If N ≥ M, several merge passes are required.
  - In each pass, contiguous groups of M − 1 runs are merged.
  - A pass reduces the number of runs by a factor of M − 1, and creates runs longer by the same factor.
  - Repeated passes are performed till all runs have been merged into one.
• No. of passes is 1 + ⌈log_(M−1) ⌈N / M⌉⌉, where N here is the number of pages in the file (so ⌈N / M⌉ is the number of initial runs)
• Cost = 2N * no. of passes

• Read 150 MB of data from each run file
• 150 * 5 = 750 MB for data from run files and 250 MB for writing output
• Perform a 5-way merge
• Write to disk when the output buffer is full
• Read the next block of data (150 MB or the rest of the run)
• Repeat until all input buffers are empty
[Figure: runs R0, R1, R2, R3, R4 feeding a 5-way merge whose output is written to disk]
Example: External Sorting Using Sort-Merge
[Figure: the initial relation is split into sorted runs (create runs), which are then combined by merge pass 1 and merge pass 2 into the sorted output]
Example: External Sorting Using

Sort-Merge
• No. of passes is 1 + ⌈log_(M−1) ⌈N / M⌉⌉, with N pages in the file and M buffers
• 3 buffers used to sort a 12-page file
  - In pass 0: ⌈12/3⌉ = 4 sorted runs of 3 pages are produced
  - In pass 1: ⌈4/2⌉ = 2 sorted runs of 6 pages are produced
  - In pass 2: 1 complete sorted file is produced
• Thus no. of passes = 1 + ⌈log_(3−1)(12 / 3)⌉ = 3
• Another example: suppose 5 buffers are available to sort a 108-page file
  - In pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages (last run = 3 pages)
  - In pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages (last run = 8 pages)
  - In pass 2: ⌈6/4⌉ = 2 sorted runs of 80 pages and 28 pages
  - In pass 3: sorted file of 108 pages
• Thus no. of passes = 1 + ⌈log_(5−1) ⌈108 / 5⌉⌉ = 4
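The pass counts in both examples can be checked by simulating the passes rather than evaluating the logarithm. The function name `sort_passes` is an assumption for this sketch; it uses integer ceiling division to avoid floating-point rounding.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def sort_passes(n_pages, n_buffers):
    """Count passes: pass 0 builds runs, later passes merge M-1 runs each."""
    if n_buffers < 3:
        raise ValueError("need at least 3 buffers (2 inputs + 1 output)")
    runs = ceil_div(n_pages, n_buffers)   # pass 0
    passes = 1
    while runs > 1:
        runs = ceil_div(runs, n_buffers - 1)
        passes += 1
    return passes


print(sort_passes(12, 3))
print(sort_passes(108, 5))
```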
Simple Example

External Merge Sort (Cont.)

Cost of block transfers:
• Number of block transfers (read and write) for initial run creation: 2br
• Total number of merge passes required: ⌈log_(M−1)(br / M)⌉
• Number of block transfers (read and write) in each pass: 2br
• For the final pass, we don't count the write cost: − br (i.e., subtract br)
  - we ignore the final write cost for all operations since the output of an operation may be sent to the parent operation without being written to disk
• Thus, total number of block transfers for external sorting:
  - 2br + 2br ⌈log_(M−1)(br / M)⌉ − br
  - = br (2 ⌈log_(M−1)(br / M)⌉ + 1) = 12 (2*2 + 1) = 60 block transfers
  - where br = 12 and M = 3
• Seeks: next slide
External Merge Sort (Cont.)

Cost of seeks
• During run generation: one seek to read each run and one seek to write each run
  - 2 ⌈br / M⌉
• Total number of merge passes required: ⌈log_(M−1)(br / M)⌉
• During each merge pass
  - 1 seek for reading each block and 1 seek for writing each block, i.e. 2br seeks for each merge pass
  - except the final one, which does not require a write (i.e., subtract br)
• Total number of seeks:
  - 2 ⌈br / M⌉ + 2br ⌈log_(M−1)(br / M)⌉ − br
  - = 2 ⌈br / M⌉ + br (2 ⌈log_(M−1)(br / M)⌉ − 1)
  - = 2(4) + 12(2(2) − 1) = 44 disk seeks
External Merge Sort (Cont.) Example

• Select * from Employee order by name
• Assumptions: No. of buffers fit in memory =3, no. of tuples fit in one buffer =1
EmpId Name
1001 Jayanti
1002 Pramod
1003 Neha

1004 Nilesh
1005 Mayur
1006 Vinayak
1007 Shree
1008 Akshata
1009 Jaya
1010 Abhishek
1011 Santosh
Example –Stage 1
Input, one tuple per buffer (EmpId shown by its last digits):
1, Jayanti   2, Pramod   3, Neha   4, Nilesh   5, Mayur   6, Vinayak   7, Shree   8, Akshata   9, Jaya   10, Abhishek   11, Santosh

Sorted runs after stage 1:
R0: 1, Jayanti   3, Neha   2, Pramod
R1: 5, Mayur   4, Nilesh   6, Vinayak
R2: 8, Akshata   9, Jaya   7, Shree
R3: 10, Abhishek   11, Santosh
Example –Stage 2
• Use M − 1 input buffers to read the runs and 1 output buffer to write the sorted output

Merging R0 (1, Jayanti; 3, Neha; 2, Pramod) and R1 (5, Mayur; 4, Nilesh; 6, Vinayak):
EmpId  Name
1001   Jayanti
1005   Mayur
1003   Neha
1004   Nilesh
1002   Pramod
1006   Vinayak

Merging R2 (8, Akshata; 9, Jaya; 7, Shree) and R3 (10, Abhishek; 11, Santosh):
EmpId  Name
1010   Abhishek
1008   Akshata
1009   Jaya
1011   Santosh
1007   Shree
Example –Stage 2
EmpId Name
1010 Abhishek
1008 Akshata
1009 Jaya

1001 Jayanti
1005 Mayur
1003 Neha
1004 Nilesh
1002 Pramod
1011 Santosh
1007 Shree
1006 Vinayak
Join Operation
• Several different algorithms to implement joins
 Nested-loop join
 Block nested-loop join

• Choice based on cost estimate
• Examples use the following information
 Number of records of student: 5,000 and of takes: 10,000
 Number of blocks of student: 100 and of takes: 400
Nested-Loop Join (brute force)

• To compute the theta join r ⋈θ s
    for each tuple tr in r do begin
        for each tuple ts in s do begin
            test pair (tr, ts) to see if they satisfy the join condition θ
            if they do, add tr • ts to the result.
        end
    end
• r is called the outer relation and s the inner relation of the join.
• Requires no indices and can be used with any kind of join condition.
• Expensive since it examines every pair of tuples in the two relations.
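The pseudocode translates almost line for line into Python. The sample `student` and `takes` tuples below are hypothetical and exist only to exercise the function; `theta` can be any predicate over a pair of tuples.

```python
def nested_loop_join(r, s, theta):
    """Brute-force theta join: test every (tr, ts) pair."""
    result = []
    for tr in r:                  # r is the outer relation
        for ts in s:              # s is the inner relation
            if theta(tr, ts):
                result.append(tr + ts)   # tr . ts: concatenate the tuples
    return result


# Hypothetical sample data: (ID, name) and (ID, course).
student = [(1, "Asha"), (2, "Ravi")]
takes = [(1, "DB"), (1, "OS"), (3, "AI")]

joined = nested_loop_join(student, takes, lambda tr, ts: tr[0] == ts[0])
print(joined)
```

Because `theta` is arbitrary, the same function also handles inequality conditions such as `tr[0] < ts[0]`, at the price of examining every pair.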
Nested-Loop Join (Cont.)
• In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is
    nr * bs + br block transfers, plus
    nr + br seeks
• If the smaller relation fits entirely in memory, use that as the inner relation.
  - Reduces cost to br + bs block transfers and 2 seeks
• Assuming worst case memory availability, the cost estimate is
  - with student as the outer relation:
    - 5000 * 400 + 100 = 2,000,100 block transfers,
    - 5000 + 100 = 5100 seeks
  - with takes as the outer relation:
    - 10000 * 100 + 400 = 1,000,400 block transfers and 10,400 seeks
• In the best case scenario, we can read both relations only once, and the cost estimate will be 100 + 400 = 500 block transfers.
• Block nested-loops algorithm (next slide) is preferable.
Block Nested-Loop Join

• Variant of nested-loop join in which every block of the inner relation is paired with every block of the outer relation.
    for each block Br of r do begin
        for each block Bs of s do begin
            for each tuple tr in Br do begin
                for each tuple ts in Bs do begin
                    check if (tr, ts) satisfy the join condition
                    if they do, add tr • ts to the result.
                end
            end
        end
    end
Block Nested-Loop Join (Cont.)

• Worst case estimate: br * bs + br block transfers + 2 * br seeks
  - Each block in the inner relation s is read once for each block in the outer relation
  - It is more efficient to use the smaller relation as the outer relation, in case neither of the relations fits in memory
• Best case: br + bs block transfers + 2 seeks.
• Improvements to nested-loop and block nested-loop algorithms:
  - In block nested-loop, use M − 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
    - Cost = ⌈br / (M − 2)⌉ * bs + br block transfers + 2 ⌈br / (M − 2)⌉ seeks
  - If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match
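The worst-case formulas for both variants can be compared numerically using the student and takes statistics given earlier (5,000 and 10,000 records; 100 and 400 blocks). The function names are assumptions for this sketch; passing `M=None` models the worst case of one block per relation.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def nested_loop_cost(n_r, b_r, b_s):
    """Worst case: (nr*bs + br block transfers, nr + br seeks)."""
    return n_r * b_s + b_r, n_r + b_r


def block_nested_loop_cost(b_r, b_s, M=None):
    """(transfers, seeks); with M buffers the outer is read M-2 blocks at a time."""
    chunks = b_r if M is None else ceil_div(b_r, M - 2)
    return chunks * b_s + b_r, 2 * chunks


# student: 5000 records in 100 blocks; takes: 10000 records in 400 blocks.
nl_student_outer = nested_loop_cost(5000, 100, 400)
nl_takes_outer = nested_loop_cost(10000, 400, 100)
bnl_student_outer = block_nested_loop_cost(100, 400)
print(nl_student_outer, nl_takes_outer, bnl_student_outer)
```

Raising `M` shrinks the number of outer chunks, so the same function also shows how extra buffer space reduces both transfers and seeks.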
Block Nested-Loop Join (Cont.)

• Example of student ⋈ takes
• In the worst case, we have to read each block of takes once for each block of
student. Thus, in the worst case, a total of 100 ∗ 400 + 100 = 40,100 block
transfers plus 2∗100 = 200 seeks are required.

• This cost is a significant improvement over the 5000∗400+100 = 2,000,100
block transfers plus 5100 seeks needed in the worst case for the basic
nested-loop join.
• The best-case cost remains the same—namely, 100 + 400 = 500 block
transfers and 2 seeks.
Indexed Nested-Loop Join

• Index lookups can replace file scans if
 join is an equi-join or natural join and
 an index is available on the inner relation’s join attribute
 Can construct an index just to compute a join.
• For each tuple tr in the outer relation r, use the index to look up tuples in s

that satisfy the join condition with tuple tr.
• Worst case: buffer has space for only one page of r, and, for each tuple in r, we perform an index lookup on s.
• Cost of the join: br * (tT + tS) + nr * c
  - where c is the cost of traversing the index and fetching all matching s tuples for one tuple of r
  - c can be estimated as the cost of a single selection on s using the join condition.
• If indices are available on the join attributes of both r and s, use the relation with fewer tuples as the outer relation.
Example of Nested-Loop Join Costs

• Compute student ⋈ takes, with student as the outer relation.
• Let takes have a primary B+-tree index on the attribute ID, which contains 20 entries in each index node.
• Since takes has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data
• student has 5000 tuples
• Cost of block nested-loop join
  - 400 * 100 + 100 = 40,100 block transfers + 2 * 100 = 200 seeks
    - assuming worst case memory
    - may be significantly less with more memory
• Cost of indexed nested-loop join: br * (tT + tS) + nr * c
  - 100 + 5000 * 5 = 25,100 block transfers and seeks.
  - c is computed by applying the Algorithm A2 cost (hi + 1) * (tT + tS), i.e. hi + 1 = 4 + 1 = 5 block transfers and seeks per lookup
  - CPU cost likely to be less than that for block nested-loop join
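The indexed nested-loop figure can be reproduced by counting I/O operations, where each operation is one seek plus one block transfer. The function name is an assumption, and the sketch takes c = hi + 1 as in Algorithm A2.

```python
def indexed_nested_loop_ios(b_r, n_r, h_i):
    """br + nr * c I/O operations, with c = hi + 1 (Algorithm A2)."""
    c = h_i + 1
    return b_r + n_r * c


# student is the outer relation: 100 blocks, 5000 tuples; index height 4.
ios = indexed_nested_loop_ios(b_r=100, n_r=5000, h_i=4)
print(ios)
```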
Merge-Join( sort-merge-join)
1. Sort both relations on their join attribute (if not already sorted on
the join attributes).
2. Merge the sorted relations to join them
1. Join step is similar to the merge stage of the sort-merge
algorithm.
2. Main difference is handling of duplicate values in join attribute

— every pair with same value on join attribute must be matched
3. Detailed algorithm in book
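A minimal in-memory sketch of the merge step, assuming both inputs are already sorted on the join attribute. It shows the duplicate handling mentioned above: every r tuple with a given join value is paired with the whole group of s tuples sharing that value. The function name and the three-row samples (taken from the employee and department example in these slides) are for illustration only.

```python
def merge_join(r, s, key_r, key_s):
    """Equi-join of two lists of tuples sorted on their join attribute."""
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        kr, ks = r[i][key_r], s[j][key_s]
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Collect the group of s tuples that share this join value.
            j_end = j
            while j_end < len(s) and s[j_end][key_s] == kr:
                j_end += 1
            # Pair every r tuple with the same value against that group.
            while i < len(r) and r[i][key_r] == kr:
                for ts in s[j:j_end]:
                    result.append(r[i] + ts)
                i += 1
            j = j_end
    return result


employees = [(1001, "Jayanti", "01"), (1004, "Nilesh", "02"), (1005, "Mayur", "02")]
departments = [("01", "Purchase"), ("02", "Sales"), ("03", "Production")]

joined = merge_join(employees, departments, key_r=2, key_s=0)
print(joined)
```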
Merge-Join (Cont.) Example

Employee (scanned with pointer pr)
EmpId  Name      Dept ID
1001   Jayanti   01
1002   Pramod    01
1003   Neha      01
1004   Nilesh    02
1005   Mayur     02
1006   Vinayak   02
1007   Shree     03
1008   Akshata   03
1009   Jaya      03
1010   Abhishek  04
1011   Santosh   04

Department (scanned with pointer ps; ts is the current tuple)
Dept ID  Dname
01       Purchase
02       Sales
03       Production
04       Marketing

S (Department tuples matching the current Dept ID): 01 Purchase
Merge-Join (Cont.) Example

EmpId Name Dept ID Dname
1001 Jayanti 01 Purchase
1002 Pramod 01 Purchase
1003 Neha 01 Purchase

1004 Nilesh 02 Sales
1005 Mayur 02 Sales
1006 Vinayak 02 Sales
1007 Shree 03 Production
1008 Akshata 03 Production
1009 Jaya 03 Production
1010 Abhishek 04 Marketing
1011 Santosh 04 Marketing
Merge-Join (Cont.)
• Can be used only for equi-joins and natural joins
• Each block needs to be read only once (assuming all
tuples for any given value of the join attributes fit in

memory)
• Thus the cost of merge join is:
    br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
  - plus the cost of sorting if the relations are unsorted
  - bb is the number of buffer blocks allocated to each relation
Merge-Join (Cont.)
• Example of student ⋈ takes
  - If the relations are already sorted on the join attribute ID, then merge join takes a total of 400 + 100 = 500 block transfers.
  - If we assume that in the worst case only one buffer block is allocated to each input relation (that is, bb = 1), a total of 400 + 100 = 500 seeks are required.
• Suppose the relations are not sorted, and the memory size is the worst case, only three blocks. The cost is as follows:
  1. Sorting relation takes requires ⌈log_(M−1)(br / M)⌉ = ⌈log_(3−1)(400/3)⌉ = 8 merge passes.
     - Block transfers = br (2 ⌈log_(M−1)(br / M)⌉ + 1) = 400 * (2 * 8 + 1) = 6800, with 400 more transfers to write out the result
     - Seeks = 2 ⌈br / M⌉ + br (2 ⌈log_(M−1)(br / M)⌉ − 1) = 2 * ⌈400/3⌉ + 400 * (2 * 8 − 1) = 6268 seeks for sorting, and 400 seeks for writing the output, for a total of 6668 seeks
  2. Sorting relation student takes ⌈log_(3−1)(100/3)⌉ = 6 merge passes
     - Block transfers = 100 * (2 * 6 + 1) = 1300, with 100 more transfers to write it out
     - Seeks = 2 * ⌈100/3⌉ + 100 * (2 * 6 − 1) = 68 + 1100 = 1168, and 100 seeks are required for writing the output, for a total of 1268 seeks
  3. Merging the two relations takes 400 + 100 = 500 block transfers and 500 seeks.
• Thus, the total cost is 9100 block transfers plus 6668 + 1268 + 500 = 8436 seeks if the relations are not sorted and the memory size is just 3 blocks.
Example
• Suppose that the relation R (with 150 pages) consists of one attribute a and S (with 90
pages) also consists of one attribute a. Determine the optimal join method for processing the
following query:
• select * from R, S where R.a > S.a

• Assume there are 10 buffer pages available to process the query and there are no indexes
available.
• Assume also that the DBMS only has available the following join methods: nested-loop,
block nested loop and sort-merge.
• Determine the number of page I/Os required by each method to work out which is the
cheapest.
Example Solution
• Simple Nested Loops: $n_r \times b_s + b_r$ (counted page by page here, since tuple counts are not given)
 We use relation S as the outer loop. Total Cost = 90 + (90×150) = 13590
• Block Nested Loops: $\lceil b_r / (M-2) \rceil \times b_s + b_r$ block transfers + $2 \lceil b_r / (M-2) \rceil$ seeks
 If R is outer: Total Cost = 150 + (90×ceil(150/(10-2))) = 1860
 If S is outer: Total Cost = 90 + (150×ceil(90/(10-2))) = 1890
• Sort-Merge:
 Denote B as the number of buffer pages, where B = 10; denote M as the number of pages in the larger relation, where M = 150. Since B < M, the cost of sort-merge is:
 Sorting R: 150×(2×ceil(log_{10-1}(150/10))+1) = 150×(2×2+1) = 750
 Sorting S: 90×(2×ceil(log_{10-1}(90/10))+1) = 90×(2×1+1) = 270
 Merge: 150 + 90 = 240
• Total Cost = 750 + 270 + 240 = 1260
• Sort-merge has the lowest estimate, but merge join applies only to equi-joins and natural joins, and the condition R.a > S.a is not an equality. Among the methods that can evaluate this query, the optimal one is therefore Block Nested Loop join with R as the outer relation (1860 page I/Os).
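The nested-loop figures in the solution can be reproduced with a short sketch. It assumes, as the solution does, that costs are counted in page I/Os and that the simple nested loop rescans the inner relation once per outer page; the function names are illustrative only.

```python
import math


def simple_nested_loops(outer_pages, inner_pages):
    # One scan of the outer relation, plus one scan of the inner
    # relation for every page of the outer relation.
    return outer_pages + outer_pages * inner_pages


def block_nested_loops(outer_pages, inner_pages, buffers):
    # The outer relation is read in chunks of (buffers - 2) pages;
    # the inner relation is scanned once per chunk.
    chunks = math.ceil(outer_pages / (buffers - 2))
    return outer_pages + chunks * inner_pages


R, S, B = 150, 90, 10
print(simple_nested_loops(S, R))
print(block_nested_loops(R, S, B), block_nested_loops(S, R, B))
```

With only 10 buffer pages the choice of outer relation changes the block nested-loop cost only slightly, because both relations need a similar number of chunks relative to the size of the other relation.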
Hash-Join
• Applicable for equi-joins and natural joins.
• A hash function h is used to partition tuples of both relations
• h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
 $r_0, r_1, \ldots, r_n$ denote partitions of r tuples
 Each tuple $t_r \in r$ is put in partition $r_i$ where $i = h(t_r[JoinAttrs])$.
 $s_0, s_1, \ldots, s_n$ denote partitions of s tuples
 Each tuple $t_s \in s$ is put in partition $s_i$, where $i = h(t_s[JoinAttrs])$.
• Note: In the book, $r_i$ is denoted as $H_{r_i}$, $s_i$ is denoted as $H_{s_i}$ and n is denoted as $n_h$.
Hash-Join (Cont.)
[Figure: hash partitioning of relations r (probe input) and s (build input)]
Hash-Join (Cont.)
• r tuples in $r_i$ need only be compared with s tuples in $s_i$. They need not be compared with s tuples in any other partition, since:
 an r tuple and an s tuple that satisfy the join condition will have the same value for the join attributes.
 If that value is hashed to some value i, the r tuple has to be in $r_i$ and the s tuple in $s_i$.
Hash-Join Algorithm
• The hash-join of r and s is computed as follows.
1. Partition the relation s using hashing function h. When partitioning a relation, one block of memory is reserved as the output buffer for each partition.
2. Partition r similarly.
3. For each i:
(a) Load $s_i$ into memory and build an in-memory hash index on it using the join attribute. This hash index uses a different hash function than the earlier one h.
(b) Read the tuples in $r_i$ from the disk one by one. For each tuple $t_r$, locate each matching tuple $t_s$ in $s_i$ using the in-memory hash index. Output the concatenation of their attributes.
• Relation s is called the build input and r is called the probe input.
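The three steps of the algorithm can be sketched in memory as follows. This is an illustration under simplifying assumptions, not a disk-based implementation: tuples are Python dicts, partitions are plain lists rather than files, the partitioning function h is taken to be `hash(value) % n`, and a `dict` plays the role of the in-memory hash index (it hashes internally, which stands in for the "different hash function"). The sample rows are invented.

```python
from collections import defaultdict


def hash_join(r, s, key_r, key_s, n=4):
    """Join probe input r with build input s on r[key_r] == s[key_s]."""

    def partition(rel, key):
        parts = defaultdict(list)
        for t in rel:
            parts[hash(t[key]) % n].append(t)  # partitioning function h
        return parts

    s_parts = partition(s, key_s)  # step 1: partition the build input
    r_parts = partition(r, key_r)  # step 2: partition the probe input

    result = []
    for i in range(n):  # step 3: handle each pair (r_i, s_i)
        index = defaultdict(list)  # (a) in-memory hash index on s_i
        for ts in s_parts.get(i, []):
            index[ts[key_s]].append(ts)
        for tr in r_parts.get(i, []):  # (b) probe with tuples of r_i
            for ts in index.get(tr[key_r], []):
                result.append({**tr, **ts})
    return result


student = [{"ID": 1, "name": "Ana"}, {"ID": 2, "name": "Bo"}, {"ID": 3, "name": "Cy"}]
takes = [
    {"ID": 1, "course_id": "CS101"},
    {"ID": 1, "course_id": "CS102"},
    {"ID": 3, "course_id": "MU199"},
]
joined = hash_join(takes, student, "ID", "ID")
print(joined)
```

Only tuples that land in the same partition are ever compared, which is the property argued on the previous slide.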
Hash-Join algorithm (Cont.)

• The value n and the hash function h are chosen such that each $s_i$ fits in memory.
 Typically n is chosen as $\lceil b_s / M \rceil \times f$, where f is a “fudge factor”, typically around 1.2 (that is, about 20% more partitions than the minimum)
 The probe relation partitions $r_i$ need not fit in memory
• Recursive partitioning is required if the number of partitions n is greater than the number of pages M of memory.
 Instead of partitioning n ways, use M – 1 partitions for s
 Further partition the M – 1 partitions using a different hash function
 Use the same partitioning method on r
 Rarely required
Handling of Overflows
• Partitioning is said to be skewed if some partitions have significantly more tuples than some others
• Hash-table overflow occurs in partition $s_i$ if $s_i$ does not fit in memory. Reasons could be
 Many tuples in s with the same value for join attributes
 Bad hash function
• Overflow resolution can be done in the build phase
 Partition $s_i$ is further partitioned using a different hash function.
 Partition $r_i$ must be similarly partitioned.
• Overflow avoidance performs partitioning carefully to avoid overflows during the build phase
 E.g. partition build relation into many partitions, then combine them
• Both approaches fail with large numbers of duplicates
 Fallback option: use block nested loops join on overflowed partitions
Cost of Hash-Join
• If recursive partitioning is not required: cost of hash join is
$3(b_r + b_s) + 4 n_h$ block transfers + $2(\lceil b_r / b_b \rceil + \lceil b_s / b_b \rceil)$ seeks
 The $4 n_h$ term is the overhead for partially filled blocks, which can be ignored
• If recursive partitioning is required:
 number of passes required for partitioning build relation s to less than M blocks per partition is $\lceil \log_{\lfloor M/b_b \rfloor - 1}(b_s / M) \rceil$
 best to choose the smaller relation as the build relation.
 Total cost estimate is:
$2(b_r + b_s) \lceil \log_{\lfloor M/b_b \rfloor - 1}(b_s / M) \rceil + b_r + b_s$ block transfers + $2(\lceil b_r / b_b \rceil + \lceil b_s / b_b \rceil) \lceil \log_{\lfloor M/b_b \rfloor - 1}(b_s / M) \rceil$ seeks
• If the entire build input can be kept in main memory, no partitioning is required
 Cost estimate goes down to $b_r + b_s$.
Example of Cost of Hash-Join

• takes ⋈ student
• Assume that memory size is 20 blocks, with $b_b = 3$ buffer blocks
• $b_{student} = 100$ and $b_{takes} = 400$.
• student is to be used as build input. Partition it into five partitions, each of size 20 blocks. This partitioning can be done in one pass.
• Similarly, partition takes into five partitions, each of size 80. This is also done in one pass.
• Therefore total cost, ignoring cost of writing partially filled blocks:
 3(100 + 400) = 1500 block transfers + 2(⌈100/3⌉ + ⌈400/3⌉) = 336 seeks
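The example's arithmetic can be checked with a small helper. It is a sketch of the cost formula for the case without recursive partitioning, with the $4 n_h$ term for partially filled blocks dropped as in the example; the function name is illustrative.

```python
import math


def hash_join_cost(b_r, b_s, b_b):
    """Block transfers and seeks for a hash join without recursive partitioning.

    The overhead for partially filled blocks is ignored.
    """
    transfers = 3 * (b_r + b_s)
    seeks = 2 * (math.ceil(b_r / b_b) + math.ceil(b_s / b_b))
    return transfers, seeks


# takes (400 blocks) is the probe input, student (100 blocks) the build input
print(hash_join_cost(400, 100, 3))
```

Each relation is read once and written once during partitioning and read once more during the build and probe phases, which is where the factor 3 comes from.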
The Join Operation

• Join algorithms which are applicable to any kind of join:
 Simple Nested Loops Join
 Block Nested Loops Join

• Join algorithms which are applicable to equi and natural
join :
 Index Nested Loops Join
 Sort-Merge Join
 Hash Join
Other Operations
• Duplicate elimination can be implemented via hashing or sorting.
 On sorting, duplicates will come adjacent to each other, and all but one of a set of duplicates can be deleted.
 Optimization: duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge.
 Hashing is similar: duplicates will come into the same bucket.
• Projection:
 perform projection on each tuple, followed by duplicate elimination.
 If the projection includes a key, no duplicates will exist
Hash Index

Other Operations : Aggregation

• Example: Compute average salary in each university department
select dept_name, avg(salary)
from instructor
group by dept_name;
Other Operations : Aggregation

• Aggregation can be implemented in a manner similar to duplicate elimination.
 Sorting or hashing can be used to bring tuples in the same group together, and then the aggregate functions can be applied on each group.
 Optimization: combine tuples in the same group during run generation and intermediate merges, by computing partial aggregate values
 For count, min, max, sum: keep aggregate values on tuples found so far in the group.
 When combining partial aggregates for count, add up the partial counts
 For avg, keep sum and count, and divide sum by count at the end
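The partial-aggregate optimization for avg can be sketched as below. The sketch assumes each "run" is a small in-memory list of (dept_name, salary) pairs; the function names and sample salaries are made up for illustration.

```python
def partial_aggregate(rows):
    """Collapse one run into {dept_name: (sum, count)}."""
    acc = {}
    for dept, salary in rows:
        s, c = acc.get(dept, (0, 0))
        acc[dept] = (s + salary, c + 1)
    return acc


def merge_partials(partials):
    """Combine partial aggregates: sums are added, and so are counts."""
    total = {}
    for partial in partials:
        for dept, (s, c) in partial.items():
            ts, tc = total.get(dept, (0, 0))
            total[dept] = (ts + s, tc + c)
    return total


def finalize_avg(total):
    """Divide sum by count only at the very end."""
    return {dept: s / c for dept, (s, c) in total.items()}


runs = [
    [("Music", 40000), ("Physics", 90000)],
    [("Music", 60000)],
]
averages = finalize_avg(merge_partials(partial_aggregate(run) for run in runs))
print(averages)
```

Keeping (sum, count) rather than a running average is what makes the partial results mergeable in any order.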
Other Operations : Set Operations

• Set operations (∪, ∩ and −): can either use a variant of merge-join after sorting, or a variant of hash-join.
• E.g., Set operations using hashing:
1. Partition both relations using the same hash function
2. Process each partition i as follows.
  1. Using a different hashing function, build an in-memory hash index on $r_i$.
  2. Process $s_i$ as follows
  r ∪ s:
    1. Add tuples in $s_i$ to the hash index if they are not already in it.
    2. At end of $s_i$ add the tuples in the hash index to the result.
Other Operations : Set Operations

• E.g., Set operations using hashing:
1. As before, partition r and s.
2. As before, process each partition i as follows
  1. Build a hash index on $r_i$
  2. Process $s_i$ as follows
  r ∩ s:
    1. Output tuples in $s_i$ to the result if they are already there in the hash index
  r – s:
    1. For each tuple in $s_i$, if it is there in the hash index, delete it from the index.
    2. At end of $s_i$ add remaining tuples in the hash index to the result.
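The hashing-based set operations can be sketched in memory as follows. The assumptions are stated up front: tuples are hashable Python tuples, `hash(t) % n` stands in for the partitioning hash function, and a Python `set` stands in for the in-memory hash index on $r_i$. The function name `hashed_set_op` is hypothetical.

```python
def hashed_set_op(r, s, op, n=4):
    """Compute r UNION s, r INTERSECT s or r MINUS s partition by partition."""
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(t) % n].append(t)  # same hash function for both relations
    for t in s:
        s_parts[hash(t) % n].append(t)

    result = []
    for r_i, s_i in zip(r_parts, s_parts):
        index = set(r_i)  # in-memory hash index on r_i
        if op == "union":
            for t in s_i:
                index.add(t)  # added only if not already present
            result.extend(index)
        elif op == "intersection":
            # A set is used so that duplicates within s_i are output once.
            result.extend({t for t in s_i if t in index})
        elif op == "difference":
            for t in s_i:
                index.discard(t)  # delete matching tuples from the index
            result.extend(index)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


r = [(1,), (2,), (3,)]
s = [(2,), (3,), (4,)]
print(sorted(hashed_set_op(r, s, "union")))
print(sorted(hashed_set_op(r, s, "intersection")))
print(sorted(hashed_set_op(r, s, "difference")))
```

Because equal tuples always hash to the same partition, each partition pair can be processed independently of the others.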
Other Operations : Outer Join

• Outer join can be computed either as
 a join followed by addition of null-padded non-participating tuples, or
 by modifying the join algorithms.
1. Modifying merge join to compute r ⟕ s
 In r ⟕ s, non-participating tuples are those in $r - \Pi_R(r \bowtie s)$
 Modify merge-join to compute r ⟕ s:
 During merging, for every tuple $t_r$ from r that does not match any tuple in s, output $t_r$ padded with nulls.
 Right outer-join and full outer-join can be computed similarly.
Example
• SELECT FNAME, DNAME
FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT
ON DNO = DNUMBER);
• Note: The result of this query is a table of employee names and their associated departments. It is
similar to a regular join result, with the exception that if an employee does not have an associated
department, the employee's name will still appear in the resulting table, although the department name
would be indicated as null.

• Implementation of the left outer join example
 {Compute the JOIN of the EMPLOYEE and DEPARTMENT tables}
 $TEMP1 \leftarrow \Pi_{FNAME, DNAME}(EMPLOYEE \bowtie_{DNO = DNUMBER} DEPARTMENT)$
 {Find the EMPLOYEEs that do not appear in the JOIN}
 $TEMP2 \leftarrow \Pi_{FNAME}(EMPLOYEE) - \Pi_{FNAME}(TEMP1)$
 {Pad each tuple in TEMP2 with a null DNAME field}
 $TEMP2 \leftarrow TEMP2 \times \text{'null'}$
 {UNION the temporary tables to produce the LEFT OUTER JOIN}
 $RESULT \leftarrow TEMP1 \cup TEMP2$
• The cost of the outer join, as computed above, would include the cost of the associated steps (i.e., join, projections, set difference and union).
Other Operations : Outer Join

2. Modifying nested-loop join to compute r ⟕ s
• Modify the nested-loop join algorithm to compute the left outer join.
• Tuples in the outer relation that do not match any tuple in the inner relation are written to the output after being padded with null values.
• However, it is hard to extend the nested-loop join to compute the full outer join
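The modified nested-loop join described above can be sketched like this. Tuples are modelled as dicts, `None` plays the role of null, and the EMPLOYEE and DEPARTMENT rows are invented sample data; the function name and its `inner_attrs` parameter are assumptions made for the illustration.

```python
def left_outer_nested_loop(outer, inner, match, inner_attrs):
    """Nested-loop left outer join; unmatched outer tuples are null-padded."""
    result = []
    for t_outer in outer:
        matched = False
        for t_inner in inner:
            if match(t_outer, t_inner):
                result.append({**t_outer, **t_inner})
                matched = True
        if not matched:
            # No partner in the inner relation: pad with nulls.
            padding = {attr: None for attr in inner_attrs}
            result.append({**t_outer, **padding})
    return result


employee = [{"FNAME": "John", "DNO": 5}, {"FNAME": "Alicia", "DNO": None}]
department = [{"DNUMBER": 5, "DNAME": "Research"}]

rows = left_outer_nested_loop(
    employee,
    department,
    match=lambda e, d: e["DNO"] == d["DNUMBER"],
    inner_attrs=["DNUMBER", "DNAME"],
)
print([(t["FNAME"], t["DNAME"]) for t in rows])
```

The `matched` flag is per outer tuple, which is why the same trick does not directly give a full outer join: unmatched inner tuples are only known after the whole outer relation has been scanned.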
Evaluation of Expressions
• So far: we have seen algorithms for individual operations
• Alternatives for evaluating an entire expression tree
 Materialization: generate results of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat.
 Pipelining: pass on tuples to parent operations even as an operation is being executed
• We study the above alternatives in more detail
Materialization
• Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations.
• E.g., for the expression $\Pi_{name}(\sigma_{building = \text{“Watson”}}(department) \bowtie instructor)$, compute and store $\sigma_{building = \text{“Watson”}}(department)$, then compute and store its join with instructor, and finally compute the projection on name.
Materialization (Cont.)
• Materialized evaluation is always applicable
• Cost of writing results to disk and reading them back can be quite high
 Our cost formulas for operations ignore the cost of writing results to disk, so
 Overall cost = Sum of costs of individual operations + cost of writing intermediate results to disk
• Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is getting filled
 Allows overlap of disk writes with computation and reduces execution time
Pipelining
• Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
• E.g., in the previous expression tree, don’t store the result of $\sigma_{building = \text{“Watson”}}(department)$
 Instead, pass tuples directly to the join. Similarly, don’t store the result of the join; pass tuples directly to the projection.
• Much cheaper than materialization: no need to store a temporary relation to disk.
• Pipelining may not always be possible, e.g., for sort or hash-join.
• For pipelining to be effective, use evaluation algorithms that generate output tuples even as tuples are received for inputs to the operation.
• Pipelines can be executed in two ways: demand driven and producer driven
Pipelining (Cont.)
• In demand-driven or lazy evaluation
 The system repeatedly requests the next tuple from the top-level operation
 Each operation requests the next tuple from its children operations as required, in order to output its next tuple
 In between calls, an operation has to maintain “state” so it knows what to return next
• Alternative name: pull model of pipelining
• More commonly used because it is easier to implement
Pipelining (Cont.)
• Implementation of demand-driven pipelining
 Each operation is implemented as an iterator implementing the following operations
 open()
 E.g. file scan: initialize file scan
 state: pointer to beginning of file
 E.g. merge join: sort relations;
 state: pointers to beginning of sorted relations
 next()
 E.g. for file scan: Output next tuple, and advance and store file pointer
 E.g. for merge join: continue with merge from earlier state till next output tuple is found. Save pointers as iterator state.
 close(): tells iterator that no more tuples are required
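The iterator interface above can be illustrated with a minimal pull-based pipeline. This is a sketch, not a real engine: the "file" is a Python list, `next()` returns `None` when the input is exhausted (a convention chosen here), and the class names and sample rows are made up.

```python
class FileScan:
    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0  # state: pointer to the beginning of the "file"

    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1  # advance and store the file pointer
        return row

    def close(self):
        self.pos = None


class Select:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def open(self):
        self.child.open()

    def next(self):
        # Pull from the child until a tuple satisfies the predicate.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self):
        self.child.close()


class Project:
    def __init__(self, child, attrs):
        self.child, self.attrs = child, attrs

    def open(self):
        self.child.open()

    def next(self):
        row = self.child.next()
        return None if row is None else {a: row[a] for a in self.attrs}

    def close(self):
        self.child.close()


def run(plan):
    """The 'system': repeatedly request the next tuple from the top operator."""
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out


department = [
    {"dept_name": "Physics", "building": "Watson"},
    {"dept_name": "Music", "building": "Packard"},
    {"dept_name": "Biology", "building": "Watson"},
]
plan = Project(Select(FileScan(department), lambda t: t["building"] == "Watson"), ["dept_name"])
print(run(plan))
```

No intermediate result is ever stored: each call to `next()` on the projection pulls exactly as many tuples from the scan as are needed to produce one output tuple.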
Pipelining (Cont.)
• In producer-driven or eager pipelining
 Operators produce tuples eagerly and pass them up to their parents
 A buffer is maintained between operators; the child puts tuples in the buffer, the parent removes tuples from the buffer
 If the buffer is full, the child waits till there is space in the buffer, and then generates more tuples
 The system schedules operations that have space in their output buffer and can process more input tuples
• Alternative name: push model of pipelining
• Useful in parallel processing systems
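A producer-driven pipeline can be sketched with bounded queues and threads. The assumptions here are that each operator runs in its own thread, that `queue.Queue(maxsize=...)` models the buffer between operators (its `put` blocks while the buffer is full), and that a sentinel object marks the end of the stream; all names are illustrative.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a tuple stream


def scan_op(rows, out_buf):
    for row in rows:
        out_buf.put(row)  # blocks while the buffer is full
    out_buf.put(DONE)


def select_op(predicate, in_buf, out_buf):
    while (row := in_buf.get()) is not DONE:
        if predicate(row):
            out_buf.put(row)  # push the tuple up to the parent
    out_buf.put(DONE)


def run_pipeline(rows, predicate, buffer_size=2):
    scan_to_select = queue.Queue(maxsize=buffer_size)
    select_to_root = queue.Queue(maxsize=buffer_size)
    workers = [
        threading.Thread(target=scan_op, args=(rows, scan_to_select)),
        threading.Thread(target=select_op, args=(predicate, scan_to_select, select_to_root)),
    ]
    for w in workers:
        w.start()
    result = []
    while (row := select_to_root.get()) is not DONE:
        result.append(row)
    for w in workers:
        w.join()
    return result


print(run_pipeline(list(range(10)), lambda x: x % 2 == 0))
```

Because the operators run concurrently and only synchronize through their buffers, the same structure carries over naturally to parallel systems, as the slide notes.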
Query Optimization
• DBMS Structure

Introduction
• Alternative ways of evaluating a given query
 Equivalent expressions
 Different algorithms for each operation

Introduction (Cont.)
• An evaluation plan defines exactly what algorithm is used for
each operation, and how the execution of the operations is
coordinated.

• Cost difference between evaluation plans for a query can be enormous
 E.g. seconds vs. days in some cases
• Steps in cost-based query optimization
1. Generate logically equivalent expressions using equivalence rules
2. Annotate resultant expressions to get alternative query plans
3. Choose the cheapest plan based on estimated cost
• Estimation of plan cost based on:
 Statistical information about relations. Examples:
 number of tuples, number of distinct values for an attribute
 Statistics estimation for intermediate results
 to compute cost of complex expressions
 Cost formulae for algorithms, computed using statistics
Transformation of Relational Expressions
• Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance
 Note: order of tuples is irrelevant
 We don’t care if they generate different results on databases that violate integrity constraints
Transformation of Relational Expressions (Cont.)
• In SQL, inputs and outputs are multisets of tuples
 Two expressions in the multiset version of the relational algebra are said to be equivalent if the two expressions generate the same multiset of tuples on every legal database instance.
• An equivalence rule says that expressions of two forms are equivalent
 Can replace an expression of the first form by the second, or vice versa
Equivalence Rules
[Figures: equivalence rules]
Pictorial Depiction of Equivalence Rules
[Figure: pictorial depiction of equivalence rules]
Equivalence Rules (Cont.)
[Figures: equivalence rules, continued]
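The examples that follow cite Rules 1, 6a, 7a, 8a and 8b. As a reading aid, these can be written out as below, assuming the numbering follows the "Database System Concepts" chapter on query optimization listed at the start of the deck; $E_i$ are relational-algebra expressions, $\theta$ are predicates and $L_i$ are attribute lists.

```latex
\begin{align*}
\text{Rule 1:}\quad  & \sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E)) \\
\text{Rule 6a:}\quad & (E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3) \\
\text{Rule 7a:}\quad & \sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2
  \quad \text{if } \theta_0 \text{ involves only attributes of } E_1 \\
\text{Rule 8a:}\quad & \Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))
  \quad \text{if } \theta \text{ involves only attributes in } L_1 \cup L_2 \\
\text{Rule 8b:}\quad & \Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) =
  \Pi_{L_1 \cup L_2}\big((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2))\big)
\end{align*}
```

In Rules 8a and 8b, $L_1$ and $L_2$ are attributes of $E_1$ and $E_2$ respectively; in Rule 8b, $L_3$ (respectively $L_4$) denotes the attributes of $E_1$ (respectively $E_2$) that appear in $\theta$ but not in $L_1 \cup L_2$.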
Example
• Consider the following University example with relation schemas:
• Instructor(ID, name, dept_name, salary)
• Teaches(ID, course_id, sec_id, semester, year)
• Course(course_id, title, dept_name, credit)
Transformation Example: Pushing Selections
• Query: Find the names of all instructors in the Music department, along with the titles of the courses that they teach
 $\Pi_{name, title}(\sigma_{dept\_name = \text{“Music”}}(instructor \bowtie (teaches \bowtie \Pi_{course\_id, title}(course))))$
• Transformation using rule 7a.
 $\Pi_{name, title}((\sigma_{dept\_name = \text{“Music”}}(instructor)) \bowtie (teaches \bowtie \Pi_{course\_id, title}(course)))$
• Performing the selection as early as possible reduces the size of the relation to be joined.

Transformation Example: Pushing Selections (Cont.)
• Query: Find the names of all instructors in the Music department who have taught a course in 2009, along with the titles of the courses that they taught
 $\Pi_{name, title}(\sigma_{dept\_name = \text{“Music”} \wedge year = 2009}(instructor \bowtie (teaches \bowtie \Pi_{course\_id, title}(course))))$
• Transformation using join associativity (Rule 6a):
 $\Pi_{name, title}(\sigma_{dept\_name = \text{“Music”} \wedge year = 2009}((instructor \bowtie teaches) \bowtie \Pi_{course\_id, title}(course)))$

Transformation Example: Pushing Selections (Cont.)
• Using rule 7a we can rewrite the query as
 $\Pi_{name, title}((\sigma_{dept\_name = \text{“Music”} \wedge year = 2009}(instructor \bowtie teaches)) \bowtie \Pi_{course\_id, title}(course))$
• The second form provides an opportunity to split the conjunctive selection (Rule 1) and then apply the “perform selections early” rule to each part, resulting in the subexpression
 $\sigma_{dept\_name = \text{“Music”}}(instructor) \bowtie \sigma_{year = 2009}(teaches)$

Example
• An un-optimized relational algebra expression:
 $\Pi_{Name}(\sigma_{GPA \geq 3.5 \text{ and } Title = \text{'Ada Programming Language'} \text{ and } Students.SSN = Enrollment.SSN \text{ and } Enrollment.Course\_no = Courses.Course\_no}(Students \times Enrollment \times Courses))$
Example contd.
• Initial query tree (root at the top, children indented):
$\Pi_{Name}$
  $\sigma_{GPA \geq 3.5 \text{ and } Title = \text{'Ada Programming Language'} \text{ and } Students.SSN = Enrollment.SSN \text{ and } Enrollment.Course\_no = Courses.Course\_no}$
    $\times$
      $\times$
        Students
        Enrollment
      Courses
Example contd.
• Perform selections as early as possible (root at the top, children indented):
$\Pi_{Name}$
  $\bowtie_{Enrollment.Course\_no = Courses.Course\_no}$
    $\bowtie_{Students.SSN = Enrollment.SSN}$
      $\sigma_{GPA \geq 3.5}$
        Students
      Enrollment
    $\sigma_{Title = \text{'Ada Programming Language'}}$
      Courses
Projections
• Consider: $\Pi_{name, title}((\sigma_{dept\_name = \text{“Music”}}(instructor) \bowtie teaches) \bowtie \Pi_{course\_id, title}(course))$
• When we compute $\sigma_{dept\_name = \text{“Music”}}(instructor) \bowtie teaches$ we obtain a relation whose schema is:
(ID, name, dept_name, salary, course_id, sec_id, semester, year)
• Push projections using equivalence rules 8a and 8b; eliminate unneeded attributes from intermediate results to get:
$\Pi_{name, title}((\Pi_{name, course\_id}(\sigma_{dept\_name = \text{“Music”}}(instructor) \bowtie teaches)) \bowtie \Pi_{course\_id, title}(course))$
• Performing the projection as early as possible reduces the size of the relation to be joined.
Join Ordering Example

• For all relations $r_1$, $r_2$, and $r_3$,
$(r_1 \bowtie r_2) \bowtie r_3 = r_1 \bowtie (r_2 \bowtie r_3)$
(Join Associativity, Rule 6a)
• If $r_2 \bowtie r_3$ is quite large and $r_1 \bowtie r_2$ is small, we choose
$(r_1 \bowtie r_2) \bowtie r_3$
so that we compute and store a smaller temporary relation.
Join Ordering Example (Cont.)

• Consider the expression
$\Pi_{name, title}((\sigma_{dept\_name = \text{“Music”}}(instructor) \bowtie teaches) \bowtie \Pi_{course\_id, title}(course))$
• Could compute $teaches \bowtie \Pi_{course\_id, title}(course)$ first, and join the result with $\sigma_{dept\_name = \text{“Music”}}(instructor)$, but the result of the first join is likely to be a large relation.
• Only a small fraction of the university’s instructors are likely to be from the Music department
 It is better to compute $\sigma_{dept\_name = \text{“Music”}}(instructor) \bowtie teaches$ first.
Enumeration of Equivalent Expressions
• Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression
• Can generate all equivalent expressions as follows:
 Repeat
 apply all applicable equivalence rules on every subexpression of every equivalent expression found so far
 add newly generated expressions to the set of equivalent expressions
 Until no new equivalent expressions are generated above
• The above approach is very expensive in space and time
 Two approaches
 Optimized plan generation based on transformation rules
 Special case approach for queries with only selections, projections and joins
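The repeat-until-no-change procedure above can be sketched for a restricted case. The sketch assumes expressions contain only joins, written as nested pairs such as `(("r1", "r2"), "r3")`, and uses just two rules, join commutativity and join associativity; the function names are illustrative.

```python
def rewrites(expr):
    """Yield every expression reachable by applying one rule to one subexpression."""
    if isinstance(expr, str):  # a base relation: nothing to rewrite
        return
    left, right = expr
    yield (right, left)  # join commutativity
    if not isinstance(left, str):  # associativity: (a JOIN b) JOIN c -> a JOIN (b JOIN c)
        a, b = left
        yield (a, (b, right))
    if not isinstance(right, str):  # associativity in the other direction
        b, c = right
        yield ((left, b), c)
    for new_left in rewrites(left):  # apply rules inside the subexpressions
        yield (new_left, right)
    for new_right in rewrites(right):
        yield (left, new_right)


def enumerate_equivalent(expr):
    """Repeat rule application until no new equivalent expression appears."""
    found = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for candidate in rewrites(current):
            if candidate not in found:
                found.add(candidate)
                frontier.append(candidate)
    return found


print(len(enumerate_equivalent((("r1", "r2"), "r3"))))
```

Even with two rules the set grows quickly with the number of relations, which illustrates why exhaustive enumeration is expensive in space and time.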
Estimating Statistics of Expression Results
• Statistical Information for Cost Estimation using Catalog information: DBS catalog stores info about DB relations…
• $n_r$: number of tuples in a relation r.
• $b_r$: number of blocks containing tuples of r.
• $l_r$: size of a tuple of r.
• $f_r$: blocking factor of r, i.e., the number of tuples of r that fit into one block.
• $V(A, r)$: number of distinct values that appear in r for attribute A; this is the same as the size of $\Pi_A(r)$.
• If tuples of r are stored together physically in a file, then: $b_r = \lceil n_r / f_r \rceil$
Histograms
• Statistical Information for Cost Estimation using Catalog information: Most real-world optimizers in DBS store the distribution of values for each attribute as a histogram
[Figure: histogram on attribute age of relation person; x-axis: value (ranges 1–5, 6–10, 11–15, 16–20, 21–25), y-axis: frequency (10 to 50)]
• Equi-width histograms
• Equi-depth histograms
Selection Size Estimation
• $\sigma_{A = v}(r)$
 $n_r / V(A, r)$: number of records that will satisfy the selection
 Equality condition on a key attribute: size estimate = 1
• $\sigma_{A \leq v}(r)$ (the case of $\sigma_{A \geq v}(r)$ is symmetric)
 Let c denote the estimated number of tuples satisfying the condition.
 If min(A, r) and max(A, r) are available in the catalog:
 c = 0 if v < min(A, r)
 $c = n_r \cdot \dfrac{v - \min(A, r)}{\max(A, r) - \min(A, r)}$ otherwise
 If histograms are available, the above estimate can be refined
 In the absence of statistical information, c is assumed to be $n_r / 2$.
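The estimates above translate directly into code. The sketch below uses illustrative function names; the clamp to $n_r$ when v is at or above max(A, r) is an added assumption that the slide does not state, included so the estimate never exceeds the relation size.

```python
def estimate_equality(n_r, distinct_values, is_key=False):
    """Estimated size of sigma_{A = v}(r)."""
    return 1 if is_key else n_r / distinct_values


def estimate_less_equal(n_r, v, min_a=None, max_a=None):
    """Estimated size of sigma_{A <= v}(r)."""
    if min_a is None or max_a is None:
        return n_r / 2  # no statistics available
    if v < min_a:
        return 0
    if v >= max_a:
        return n_r  # assumption: clamp the estimate to the relation size
    return n_r * (v - min_a) / (max_a - min_a)


print(estimate_equality(1000, 50))
print(estimate_less_equal(1000, 30, min_a=10, max_a=50))
```

Both estimates assume values are uniformly distributed; a histogram replaces that assumption with per-range frequencies.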
Choice of Evaluation Plans

• Must consider the interaction of evaluation techniques when choosing evaluation plans
 Choosing the cheapest algorithm for each operation independently may not yield the best overall algorithm.
 E.g. merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer-level aggregation.
 Nested-loop join may provide an opportunity for pipelining
• Practical query optimizers incorporate elements of the following two broad approaches:
1. Search all the plans and choose the best plan in a cost-based fashion.
2. Use heuristics to choose a plan.
Cost based join order selection

• Consider finding the best join-order for $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$.
• For n = 3, there are 12 different join orderings.
• There are $(2(n-1))! / (n-1)!$ different join orders for the above expression. With n = 7, the number is 665280; with n = 10, the number is greater than 17.6 billion!
• No need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of $\{r_1, r_2, \ldots, r_n\}$ is computed only once and stored for future use.
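Both points above can be sketched in a few lines: the counting formula, and dynamic programming over subsets with memoization. The cost model is a deliberately crude assumption made only for this illustration (a join of k relations is taken to produce the product of their cardinalities divided by `divisor**(k-1)` tuples, and a plan costs the sum of the sizes of the join results it produces); the function names and cardinalities are invented.

```python
from functools import lru_cache
from itertools import combinations
from math import factorial, prod


def count_join_orders(n):
    """Number of join orders for n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


def best_join_order(cards, divisor=100):
    """Least-cost join plan; each subset of relations is solved once and cached."""

    def size(subset):
        # Toy size estimate (an assumption, not a real optimizer's model).
        return prod(cards[name] for name in subset) // divisor ** (len(subset) - 1)

    @lru_cache(maxsize=None)
    def best(subset):
        if len(subset) == 1:
            return 0, subset[0]
        best_cost, best_plan = None, None
        for k in range(1, len(subset)):
            for left in combinations(subset, k):
                right = tuple(x for x in subset if x not in left)
                left_cost, left_plan = best(left)
                right_cost, right_plan = best(right)
                cost = left_cost + right_cost + size(subset)
                if best_cost is None or cost < best_cost:
                    best_cost, best_plan = cost, (left_plan, right_plan)
        return best_cost, best_plan

    return best(tuple(sorted(cards)))


print(count_join_orders(3), count_join_orders(7), count_join_orders(10))
cost, plan = best_join_order({"r1": 1000, "r2": 10, "r3": 10000})
print(cost, plan)
```

The memoized `best` function is where the saving comes from: the best plan for a subset such as {r1, r2} is computed once and reused by every larger subset that contains it.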
Heuristic Optimization
• Cost-based optimization is expensive
• Systems may use heuristics to reduce the number of choices that must
be made in a cost-based fashion.

• Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:
 Perform selection early (reduces the number of tuples)
 Perform projection early (reduces the number of attributes)
 Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar operations.
 Some systems use only heuristics, others combine heuristics with partial cost-based optimization.
Heuristics based optimisation

• In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.
• Many optimizers consider only left-deep join orders, e.g. the System R optimizer
 Plus heuristics to push selections and projections down the query tree
 Reduces optimization complexity and generates plans amenable to pipelined evaluation.
Questions from MU papers

• Explain sort-merge join and hash join (Dec 2019) … 10M … Ans: Chp12, pg 553, pg 557
• Explain brute force nested loop join algorithm (Dec 2018, 2020) … 10M … Ans: Chp12, pg 550
• Write short note on:
 Query optimization (Dec 2018) … 5M … Ans: Chp12, pg 539, Chp13, pg 580
 Query evaluation plan (May 2019) … 5M … Ans: Chp12, pg 539
 Measures of query cost (May 2019) … 5M … Ans: Chp12, pg 540
• Why is BCNF called stricter than 3NF? Justify your answer. (Dec 2019) … 5M … out of syllabus … Ans: Chp8, pg 333 to pg 336, watch video https://www.youtube.com/watch?time_continue=18&v=NNjUhvvwOrk&feature=emb_logo
• What is a materialized view? What is its utility? (Dec 2019) … 5M … out of syllabus … Ans: Chp13, pg 607, watch video https://www.youtube.com/watch?v=06HlvmB8mDk
Note: Chapter numbers and page numbers are from the book Korth, Silberschatz, Sudarshan, “Database System Concepts”, 6th Edition, McGraw-Hill
Questions from MU papers

• MCQ from Dec 2020
• Which algorithm uses equality comparison on a key attribute with a primary index to retrieve a single record that satisfies the corresponding equality condition?
1. Primary index, equality on non-key attribute
2. Primary index, equality on key attribute
3. Secondary index, equality on non-key attribute
4. Secondary index, equality on key attribute
• One of the main heuristic rules in query optimization is
1. Apply SELECT operation at the earliest.
2. Apply PROJECT operation at the earliest.
3. Apply SELECT and PROJECT operation at the earliest.
4. Apply SELECT and PROJECT operations before applying the JOIN operation at the earliest
