
Query Optimization

These slides are prepared based on the Lecture Notes of CS186, Berkeley 2021

Yücel Saygın
Sample SQL Query
SELECT S.sname, S.sid
FROM Sailors S, Reserves R
WHERE S.sid = R.sid AND date = '28.05.2024'

Let's write the RA expression and an execution plan
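One possible relational algebra expression for the query, written as a direct translation (join first, then select, then project). The second form pushes the selection below the join and assumes that date is a column of Reserves, which the query itself does not state.

```latex
% Direct translation of the SQL query
\pi_{sname,\, sid}\Big(\sigma_{date = \text{'28.05.2024'}}\big(Sailors \bowtie_{S.sid = R.sid} Reserves\big)\Big)

% Equivalent form with the selection pushed down (assumes date belongs to Reserves)
\pi_{sname,\, sid}\Big(Sailors \bowtie_{S.sid = R.sid} \sigma_{date = \text{'28.05.2024'}}(Reserves)\Big)
```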


Different Execution Plans
• Many execution plans!
• Which one to choose?
• What are the criteria for choosing an execution plan?
• Is it possible to find the optimal query execution plan?
Heuristics and Estimation
• Estimation of the Cost of a given execution plan.
• The selectivity of an operator is an approximation for the
percentage of pages it will return (or pass to the next
operator).
• Why is it important to estimate the selectivity of an
operator?
Selectivity of some operators
• Ex. rating = 5
• With uniform distribution assumption, what would be the
selectivity of rating = 5 assuming that rating is between 1
and 10?
Selectivity of some operators
• In general selectivity of X=a: 1/(unique vals in X)
Selectivity of some operators
• Ex. rating > 5
• What would be the selectivity of the above condition?
Selectivity of some operators
• In general selectivity of X>a :
(max(X)- a) / (max(X)- min(X) + 1)
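A minimal sketch of the two formulas so far, applied to the rating examples (rating is an integer between 1 and 10, uniformly distributed). The function names are illustrative only and do not come from the lecture notes.

```python
def eq_selectivity(num_distinct: int) -> float:
    """Selectivity of X = a: 1 / (unique vals in X)."""
    return 1 / num_distinct


def gt_selectivity(a: int, min_x: int, max_x: int) -> float:
    """Selectivity of X > a for integer X: (max(X) - a) / (max(X) - min(X) + 1)."""
    return (max_x - a) / (max_x - min_x + 1)


# rating = 5, with 10 distinct rating values
print(eq_selectivity(10))        # 0.1
# rating > 5, with rating between 1 and 10
print(gt_selectivity(5, 1, 10))  # 0.5
```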
Selectivity of some operators
• Ex. Sailors.sid = Reserves.sid
• What would be the selectivity of the above condition?
Selectivity of some operators
• In general selectivity of X=Y :
1/max(unique vals in X, unique vals in Y)
Selectivity of some operators
• Ex. rating = 5 AND age = 20
• What would be the selectivity of the above condition
assuming that the age is between 0 and 99?
Selectivity of some operators
• In general selectivity of cond1 AND cond2:
Selectivity(cond1) * Selectivity(cond2)
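As a sketch, the product rule applied to rating = 5 AND age = 20, assuming rating has 10 distinct values (1 to 10), age has 100 distinct values (0 to 99), and the two conditions are independent. Exact fractions are used to avoid floating-point noise; the helper names are illustrative.

```python
from fractions import Fraction


def eq_selectivity(num_distinct: int) -> Fraction:
    """Selectivity of X = a: 1 / (unique vals in X)."""
    return Fraction(1, num_distinct)


def and_selectivity(*selectivities: Fraction) -> Fraction:
    """Selectivity of cond1 AND cond2 AND ...: product of the individual selectivities."""
    result = Fraction(1)
    for s in selectivities:
        result *= s
    return result


rating_sel = eq_selectivity(10)   # rating = 5
age_sel = eq_selectivity(100)     # age = 20
combined = and_selectivity(rating_sel, age_sel)
print(combined)                   # 1/1000
```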
Selectivity of Joins
• If we join tables A and B on the condition A.id = B.id,
the selectivity is 1/max(unique vals in A.id, unique vals in B.id),
so the estimated number of tuples in the join result is:
|A|∗|B|/max(unique vals in A.id, unique vals in B.id)
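A small sketch of the expression |A|∗|B|/max(...): it scales the size of the cross product |A|∗|B| by the equality selectivity 1/max(...). The table sizes and distinct counts below are made-up numbers for illustration, not statistics from the slides.

```python
def join_selectivity(distinct_a: int, distinct_b: int) -> float:
    """Selectivity of A.id = B.id: 1 / max(unique vals in A.id, unique vals in B.id)."""
    return 1 / max(distinct_a, distinct_b)


def join_output_estimate(num_a: int, num_b: int,
                         distinct_a: int, distinct_b: int) -> float:
    """Estimated tuples produced by the join: |A| * |B| / max(distinct counts)."""
    return num_a * num_b / max(distinct_a, distinct_b)


# Hypothetical statistics: 1000 Sailors (1000 distinct sids),
# 5000 Reserves (800 distinct sids).
print(join_selectivity(1000, 800))                  # 0.001
print(join_output_estimate(1000, 5000, 1000, 800))  # 5000.0
```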
Basic Heuristics
• There are many possible query plans for a reasonably complex
query
• We need some way to reduce the number of plans that we
actually consider
1. Push down projects (π) and selects (σ) as far down as they can go
2. Only consider left deep plans
3. Do not consider cross products unless they are the only option
Benefits of Pushing down
selection and projection operators

• Pushing down selection is an obvious choice.


• How about the projection operator?
• Can you push down all projection operators?
Considering left deep plans

• Aim: reducing the search space and pipelining


• Which one of the above allows pipelining?
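To get a feel for how much the left deep restriction shrinks the search space, the sketch below counts join orderings for n tables. The closed forms are standard combinatorics rather than something stated in the slides: n! left deep orderings versus (2(n-1))!/(n-1)! binary join trees of any shape. Both counts ignore the choice of join algorithm and access method, and neither excludes cross products.

```python
from math import factorial


def left_deep_plans(n: int) -> int:
    """Number of left deep join orderings of n tables: n!."""
    return factorial(n)


def all_join_trees(n: int) -> int:
    """Number of ordered binary join trees over n tables: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8):
    print(n, left_deep_plans(n), all_join_trees(n))
```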
Avoiding Cross Products
• Cross products are the worst!
System R Query Optimizer
• System R uses all the heuristics described in the previous
slides.
• The first pass of System R determines how to access
tables optimally or interestingly (we will define what we
mean by interesting later on).
System R Query Optimizer
• We have two options for how to access a table during the
first pass:
1. Full Scan
2. Index Scan (for every index the table has built on it)
System R Query Optimizer
• The cost of a full scan of table A is [A] I/O operations
(the number of pages in A), since it must read every page.
System R Query Optimizer
• The cost of an index scan (the number of I/O operations)
depends on how the records are stored and whether or
not the index is clustered.
Let's remember the alternatives for
Data Entry k* in Index
Three alternatives:
1. Data record with key value k
2. <k, rid of data record with search key value k>
3. <k, list of rids of data records with search key k>
System R Query Optimizer
• If the data entries are the data records themselves (i.e.
Alternative 1), then an index scan has an I/O cost of:
(cost to reach level above leaf) + (num leaves read)
Example
• Table A has [A] pages
• There is an alternative 1 index built on C1 of height 2
• There are 2 conditions in our query: C1 > 5 and C2 < 6
• C1 and C2 both have values in the range 1-10

• What is the selectivity of C1 > 5 and C2 < 6 ?


• We can do an index scan on C1: 2 I/O operations to read
the internal nodes (since the index is of height 2), then,
because the selectivity of C1 > 5 is 0.5, we read half of
the leaf pages, adding 0.5[A] I/O operations. The total is
2 + 0.5[A] I/Os.
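A sketch of the Alternative 1 cost for this example next to the full scan cost. The page count [A] = 1000 is an assumed value chosen only to make the numbers concrete, and the function names are illustrative.

```python
def full_scan_cost(num_pages: int) -> float:
    """Full scan reads every page of the table: [A] I/Os."""
    return float(num_pages)


def alt1_index_scan_cost(height: int, num_pages: int, selectivity: float) -> float:
    """(cost to reach level above leaf) + (num leaves read).

    With Alternative 1 the leaves hold the data records themselves,
    so the leaf level has [A] pages and we read selectivity * [A] of them.
    """
    return height + selectivity * num_pages


pages_a = 1000                       # assumed value of [A]
sel_c1 = (10 - 5) / (10 - 1 + 1)     # selectivity of C1 > 5 with C1 in 1..10
print(full_scan_cost(pages_a))                    # 1000.0
print(alt1_index_scan_cost(2, pages_a, sel_c1))   # 502.0
```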
System R Query Optimizer
• For alternative 2 and 3 indexes, the formula is a little different:
(cost to reach level above leaf) +
(num of leaf nodes read) +
(num of data pages read)
For alternatives 2 and 3:
The index could be Clustered or
Unclustered

[Figure: CLUSTERED vs. UNCLUSTERED index. In both panels, index entries direct the search for data entries; the data entries are in the index file and the data records are in the data file.]

Example
• Table B with [B] data pages and |B| records
• Alt 2 index on column C1, with a height of 2 and [L] leaf
pages
• There are two conditions: C1 > 5 and C2 < 6
• C1 and C2 both have values in the range 1-10

• The selectivity of C1 > 5 is 0.5. If the index is clustered, the
scan takes 2 I/Os to reach the index node above the leaf level,
then reads 0.5[L] leaf pages and 0.5[B] data pages. Therefore, the
total is 2 + 0.5[L] + 0.5[B]. If the index is unclustered, the
formula is the same except that each matching record may need its
own data page I/O, so we count 0.5|B| I/Os instead of 0.5[B]. The
total number of I/Os is then 2 + 0.5[L] + 0.5|B|.
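The two totals can be compared with a short sketch. The statistics below ([B] = 1000 data pages, |B| = 50000 records, [L] = 100 leaf pages) are assumed values for illustration only; the function name and parameters are likewise hypothetical.

```python
def alt2_index_scan_cost(height: int, leaf_pages: int, data_pages: int,
                         num_records: int, selectivity: float,
                         clustered: bool) -> float:
    """(cost to reach level above leaf) + (leaf nodes read) + (data page I/Os).

    Clustered: matching records sit on consecutive pages, so we read
    selectivity * [B] data pages.
    Unclustered: each matching record may cost one I/O, so we count
    selectivity * |B| I/Os.
    """
    leaf_cost = selectivity * leaf_pages
    data_cost = selectivity * (data_pages if clustered else num_records)
    return height + leaf_cost + data_cost


# Assumed statistics for table B
pages_b, records_b, leaves = 1000, 50_000, 100
sel = 0.5  # selectivity of C1 > 5 with C1 in 1..10

clustered_cost = alt2_index_scan_cost(2, leaves, pages_b, records_b, sel, True)
unclustered_cost = alt2_index_scan_cost(2, leaves, pages_b, records_b, sel, False)
print(clustered_cost)    # 552.0
print(unclustered_cost)  # 25052.0
```

With these assumed numbers the unclustered index scan costs far more than a full scan of [B] = 1000 pages, which is why the optimizer estimates each access path rather than always preferring an index.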
