Chapter 1: Query Processing and Optimization: Slides By: Ms. Shree Jaswal


Chapter 1: Query Processing and Optimization
Slides by: Ms. Shree Jaswal

St. Francis Institute of Technology, Department of Information Technology


The material in this presentation belongs to St. Francis Institute of Technology and is solely for educational purposes. Distribution and modifications of the content is prohibited.

Which chapter? Which Book?


• Chapter 12: Query Processing, Korth, Silberschatz, Sudarshan, "Database System Concepts", 6th Edition, McGraw-Hill
• Chapter 13: Query Optimization, Korth, Silberschatz, Sudarshan, "Database System Concepts", 6th Edition, McGraw-Hill
• Chapter 3: The Relational Data Model and Relational Database Constraints, Elmasri and Navathe, "Fundamentals of Database Systems", 6th Edition, Pearson Education
• Chapter 4: Relational Algebra and Calculus, Raghu Ramakrishnan and Johannes Gehrke, "Database Management Systems", 3rd Edition, McGraw-Hill

ADMT chp1
• The slides in this presentation were prepared with reference to the slides of the authors listed above.

Topics to be covered
• Overview: Introduction, Query Processing in DBMS, Steps of Query Processing
• Measures of Query Cost: Selection Operation, Sorting, Join Operation, Other Operations
• Evaluation of Expressions
• Query Optimization Overview: Goals of Query Optimization, Approaches of Query Optimization
• Transformation of Relational Expressions
• Estimating Statistics of Expression Results
• Choice of Evaluation Plans
• Self-learning Topics: Solve problems on query optimization.
Prerequisite


Topics
• Reviewing basic concepts of a relational database
• SQL concepts

Relational Database
• Entity
• Relationship
• Attributes

Relational Algebra Overview


• Relational algebra consists of several groups of operations
  - Unary relational operations
    - SELECT (symbol: σ (sigma))
    - PROJECT (symbol: π (pi))
    - RENAME (symbol: ρ (rho))
  - Relational algebra operations from set theory
    - UNION (∪), INTERSECTION (∩), DIFFERENCE (or MINUS, -)
    - CARTESIAN PRODUCT (×)
  - Binary relational operations
    - JOIN (several variations of JOIN exist)
    - DIVISION
  - Additional relational operations
    - OUTER JOINS
    - AGGREGATE FUNCTIONS (these compute a summary of information: for example, SUM, COUNT, AVG, MIN, MAX)
Examples on Relational Algebra Operations
• Consider the following schema:
  - Sailors(sid: integer, sname: string, rating: integer, age: real)
  - Boats(bid: integer, bname: string, color: string)
  - Reserves(sid: integer, bid: integer, day: date)

Instance S1 of Sailors

sid  sname   rating  age
22   Dustin  7       45.0
31   Lubber  8       55.5
58   Rusty   10      35.0

Instance S2 of Sailors

sid  sname   rating  age
28   Yuppy   9       35.0
31   Lubber  8       55.5
44   Guppy   5       35.0
58   Rusty   10      35.0

Instance R1 of Reserves

sid  bid  day
22   101  10/10/96
58   103  11/12/96

Unary Relational Operations


• Selection and Projection
  - Retrieve expert sailors with rating greater than 8:
    σ_{rating>8}(S2)

    sid  sname  rating  age
    28   Yuppy  9       35.0
    58   Rusty  10      35.0

  - Compute the names and ratings of highly rated sailors:
    π_{sname,rating}(σ_{rating>8}(S2))

    sname  rating
    Yuppy  9
    Rusty  10
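As a rough illustration that is not part of the original slides, selection and projection can be sketched in Python by modelling a relation as a list of dictionaries. The helper names `select` and `project` are assumptions made for this sketch only.

```python
# Instance S2 of Sailors, modelled as a list of dicts (one dict per tuple).
S2 = [
    {"sid": 28, "sname": "Yuppy", "rating": 9, "age": 35.0},
    {"sid": 31, "sname": "Lubber", "rating": 8, "age": 55.5},
    {"sid": 44, "sname": "Guppy", "rating": 5, "age": 35.0},
    {"sid": 58, "sname": "Rusty", "rating": 10, "age": 35.0},
]


def select(relation, predicate):
    """Selection (sigma): keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attrs):
    """Projection (pi): keep only the listed attributes, dropping duplicates."""
    seen, result = set(), []
    for t in relation:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result


# sigma_{rating>8}(S2), then pi_{sname,rating} of that result.
experts = select(S2, lambda t: t["rating"] > 8)
names_and_ratings = project(experts, ["sname", "rating"])
print(names_and_ratings)
```

Nesting the two calls mirrors the nesting of the algebra expression: the inner selection runs first and its result feeds the projection.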

Set Operations
• Union (either-or): S1 ∪ S2 is defined only when the two relation instances are union-compatible, that is, when the following conditions hold:
  - they have the same number of fields,
  - and corresponding fields, taken in order from left to right, have the same domains.

• S1 ∪ S2

sid  sname   rating  age
22   Dustin  7       45.0
31   Lubber  8       55.5
58   Rusty   10      35.0
28   Yuppy   9       35.0
44   Guppy   5       35.0

• Is R1 ∪ S1 possible?
• No: it is not a valid operation because the two relations are not union-compatible.

Set Operations
• Intersection (both): S1 ∩ S2

sid  sname   rating  age
31   Lubber  8       55.5
58   Rusty   10      35.0

• Set-difference: S1 - S2 returns a relation instance containing all tuples that occur in S1 but not in S2. The relations S1 and S2 must be union-compatible, and the schema of the result is defined to be identical to the schema of S1.

sid  sname   rating  age
22   Dustin  7       45.0
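The set operations map directly onto Python's built-in set type. The following is a minimal sketch, assuming each tuple is stored as a plain Python tuple and each schema as an ordered list of domain names; `union_compatible` is a hypothetical helper, not something defined in the slides.

```python
# Sailors instances as sets of (sid, sname, rating, age) tuples.
S1 = {(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5), (58, "Rusty", 10, 35.0)}
S2 = {(28, "Yuppy", 9, 35.0), (31, "Lubber", 8, 55.5),
      (44, "Guppy", 5, 35.0), (58, "Rusty", 10, 35.0)}

# Schemas as ordered lists of field domains (field names do not matter here).
SAILORS_DOMAINS = ["integer", "string", "integer", "real"]
RESERVES_DOMAINS = ["integer", "integer", "date"]


def union_compatible(domains_a, domains_b):
    """Same number of fields, and the same domain at every position."""
    return domains_a == domains_b


union = S1 | S2          # tuples in S1, in S2, or in both
intersection = S1 & S2   # tuples in both S1 and S2
difference = S1 - S2     # tuples in S1 but not in S2

print(sorted(union))
print(union_compatible(RESERVES_DOMAINS, SAILORS_DOMAINS))
```

Because Python sets discard duplicates, Lubber and Rusty appear only once in the union, matching the set semantics of relational algebra.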

Set Operations
• Cross-product (Cartesian product): S1 × R1 returns a relation instance whose schema contains all the fields of S1 (in the same order as they appear in S1) followed by all the fields of R1 (in the same order as they appear in R1).
• S1 × R1

sid  sname   rating  age   sid  bid  day
22   Dustin  7       45.0  22   101  10/10/96
22   Dustin  7       45.0  58   103  11/12/96
31   Lubber  8       55.5  22   101  10/10/96
31   Lubber  8       55.5  58   103  11/12/96
58   Rusty   10      35.0  22   101  10/10/96
58   Rusty   10      35.0  58   103  11/12/96

Binary Relational Operations


• Joins: the most commonly used way to combine information from two or more relations.
  - Condition joins: the most general version of the join operation accepts a join condition c and a pair of relation instances as arguments and returns a relation instance.
  - S1 ⋈_{S1.sid<R1.sid} R1

    sid  sname   rating  age   sid  bid  day
    22   Dustin  7       45.0  58   103  11/12/96
    31   Lubber  8       55.5  58   103  11/12/96

  - Equijoin: when the join condition consists solely of equalities of the form R.name1 = S.name2, that is, equalities between two fields in R and S.
  - S1 ⋈_{S1.sid=R1.sid} R1

    sid  sname   rating  age   bid  day
    22   Dustin  7       45.0  101  10/10/96
    58   Rusty   10      35.0  103  11/12/96
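A condition join can be read as a cross-product followed by a selection. The sketch below makes that reading concrete for S1 and R1; `condition_join` is an illustrative helper name, and a real system would normally avoid building the full cross-product.

```python
from itertools import product

S1 = [(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5), (58, "Rusty", 10, 35.0)]
R1 = [(22, 101, "10/10/96"), (58, 103, "11/12/96")]


def condition_join(left, right, condition):
    """Condition join: cross-product filtered by the join condition."""
    return [l + r for l, r in product(left, right) if condition(l, r)]


# Field 0 is sid in both relations.
less_than_join = condition_join(S1, R1, lambda s, r: s[0] < r[0])
equijoin = condition_join(S1, R1, lambda s, r: s[0] == r[0])

# Note: this sketch keeps both sid columns in every result row.
for row in less_than_join:
    print(row)
for row in equijoin:
    print(row)
```

The six pairs produced by `product(S1, R1)` are exactly the rows of S1 × R1; the condition then keeps two of them in each of the joins shown.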

Binary Relational Operations


• Natural join: an equijoin R ⋈ S in which equalities are specified on all fields having the same name in R and S.
• In this case, we can simply omit the join condition; the default is that the join condition is a collection of equalities on all common fields.
• The result is guaranteed not to have two fields with the same name.
• If the two relations have no attributes in common, R ⋈ S is simply the cross-product.
• S1 ⋈_{S1.sid=R1.sid} R1 is actually a natural join, because sid is the only field common to S1 and R1.

Binary Relational Operations


Outer Joins:
• The left outer join operation keeps every tuple in the first or left relation R in R ⟕ S; if no matching tuple is found in S, then the attributes of S in the join result are filled or "padded" with null values.
• A similar operation, right outer join, keeps every tuple in the second or right relation S in the result of R ⟖ S.
• A third operation, full outer join, denoted by ⟗, keeps all tuples in both the left and the right relations when no matching tuples are found, padding them with null values as needed.

Binary Relational Operations


• Left outer join: S1 ⟕ R1

sid  sname   rating  age   sid   bid   day
22   Dustin  7       45.0  22    101   10/10/96
31   Lubber  8       55.5  Null  Null  Null
58   Rusty   10      35.0  58    103   11/12/96

Binary Relational Operations


• Right outer join: S1 ⟖ R1

sid  sname   rating  age   sid  bid  day
22   Dustin  7       45.0  22   101  10/10/96
58   Rusty   10      35.0  58   103  11/12/96

Binary Relational Operations


• Full outer join: S1 ⟗ R1

sid  sname   rating  age   sid   bid   day
22   Dustin  7       45.0  22    101   10/10/96
31   Lubber  8       55.5  Null  Null  Null
58   Rusty   10      35.0  58    103   11/12/96
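The three outer joins on sid can be sketched as follows. This is an illustrative version only: Python's `None` stands in for null, the function names are assumptions, and the join attribute is hard-wired to field 0 (sid) of each relation.

```python
S1 = [(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5), (58, "Rusty", 10, 35.0)]
R1 = [(22, 101, "10/10/96"), (58, 103, "11/12/96")]

NULL_SAILOR = (None, None, None, None)   # padding for a missing Sailors tuple
NULL_RESERVE = (None, None, None)        # padding for a missing Reserves tuple


def left_outer_join(left, right, right_pad):
    """Keep every left tuple; pad with nulls when no right tuple matches on sid."""
    result = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            result.extend(l + r for r in matches)
        else:
            result.append(l + right_pad)
    return result


def right_outer_join(left, right, left_pad):
    """Keep every right tuple; pad with nulls when no left tuple matches on sid."""
    result = []
    for r in right:
        matches = [l for l in left if l[0] == r[0]]
        if matches:
            result.extend(l + r for l in matches)
        else:
            result.append(left_pad + r)
    return result


def full_outer_join(left, right, left_pad, right_pad):
    """Left outer join plus the right tuples that matched nothing on the left."""
    result = left_outer_join(left, right, right_pad)
    left_sids = {l[0] for l in left}
    result.extend(left_pad + r for r in right if r[0] not in left_sids)
    return result


left_result = left_outer_join(S1, R1, NULL_RESERVE)
right_result = right_outer_join(S1, R1, NULL_SAILOR)
full_result = full_outer_join(S1, R1, NULL_SAILOR, NULL_RESERVE)
```

For these instances every Reserves tuple has a matching sailor, so the full outer join adds nothing beyond the left outer join.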

Binary Relational Operations


• Division: For a tuple t to appear in the result T of the DIVISION, the
values in t must appear in R in combination with every tuple in S.
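A minimal sketch of division, assuming R holds (sid, bid) pairs and S holds bid values. The data below is hypothetical and was invented for this example; it does not come from the slides.

```python
# Hypothetical data: which sailor (sid) has reserved which boat (bid).
R = {(22, 101), (22, 103), (31, 101), (31, 103), (58, 103)}
# Hypothetical divisor: the set of all boats of interest.
S = {101, 103}


def divide(r, s):
    """R(x, y) divided by S(y): the x values paired in R with every y in S."""
    candidates = {x for x, _ in r}
    return {x for x in candidates if all((x, y) in r for y in s)}


sailors_with_all_boats = divide(R, S)
print(sorted(sailors_with_all_boats))
```

Sailor 58 is excluded because the pair (58, 101) is missing from R, so 58 does not appear in combination with every tuple of S.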


Aggregate Functions
• Use of the aggregate function operation ℱ
  - ℱ_{MAX Salary}(EMPLOYEE) retrieves the maximum Salary value from the EMPLOYEE relation
  - ℱ_{MIN Salary}(EMPLOYEE) retrieves the minimum Salary value from the EMPLOYEE relation
  - ℱ_{SUM Salary}(EMPLOYEE) retrieves the sum of the Salary values from the EMPLOYEE relation
  - ℱ_{COUNT SSN, AVERAGE Salary}(EMPLOYEE) computes the count (number) of employees and their average salary
  - Note: COUNT just counts the number of rows, without removing duplicates
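The aggregate functions above correspond closely to Python built-ins. The EMPLOYEE rows in this sketch are hypothetical values chosen only to show the behaviour, including the point that COUNT does not remove duplicates.

```python
# Hypothetical EMPLOYEE relation; two employees share the same salary.
EMPLOYEE = [
    {"SSN": "111", "Salary": 30000},
    {"SSN": "222", "Salary": 40000},
    {"SSN": "333", "Salary": 40000},
    {"SSN": "444", "Salary": 55000},
]

salaries = [e["Salary"] for e in EMPLOYEE]

max_salary = max(salaries)              # F_{MAX Salary}(EMPLOYEE)
min_salary = min(salaries)              # F_{MIN Salary}(EMPLOYEE)
sum_salary = sum(salaries)              # F_{SUM Salary}(EMPLOYEE)
count_ssn = len(EMPLOYEE)               # F_{COUNT SSN}(EMPLOYEE): counts rows
average_salary = sum_salary / count_ssn

# COUNT over Salary counts all four rows, although only three values are distinct.
count_salary = len(salaries)
distinct_salaries = len(set(salaries))
```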

Recap of Relational Algebra Operations

Query Tree
• A query tree is a tree data structure representing a relational algebra expression.
• The tables of the query are represented as leaf nodes.
• The relational algebra operations are represented as the internal nodes.
• The root represents the query as a whole.
• During execution, an internal node is executed whenever its operand tables are available.
• The node is then replaced by the result table. This process continues for all internal nodes until the root node is executed and replaced by the result table.
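The bottom-up execution described above can be sketched with a small recursive evaluator. The `Node` class, its fields, and the sample EMPLOYEE rows are all assumptions made for illustration; only selection and projection nodes are supported, and projection here does not remove duplicates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    op: str                                 # "table", "select" or "project"
    child: Optional["Node"] = None          # operand of a unary operation
    rows: Optional[List[dict]] = None       # stored table, leaf nodes only
    predicate: Optional[Callable] = None    # select nodes only
    attrs: Optional[List[str]] = None       # project nodes only


def evaluate(node):
    """Evaluate a query tree bottom-up: operands first, then the node itself."""
    if node.op == "table":
        return node.rows
    operand = evaluate(node.child)          # operand table is now available
    if node.op == "select":
        return [t for t in operand if node.predicate(t)]
    if node.op == "project":
        return [{a: t[a] for a in node.attrs} for t in operand]
    raise ValueError(f"unknown operation: {node.op}")


# Hypothetical EMPLOYEE rows.
employee = Node("table", rows=[
    {"EmpID": 1, "EName": "Meera", "Salary": 50000},
    {"EmpID": 2, "EName": "ArunKumar", "Salary": 60000},
])

# Tree for pi_{EmpID}(sigma_{EName="ArunKumar"}(EMPLOYEE)): project is the root.
tree = Node(
    "project",
    attrs=["EmpID"],
    child=Node("select", predicate=lambda t: t["EName"] == "ArunKumar", child=employee),
)

result = evaluate(tree)
print(result)
```

Each recursive call returns the result table that, in the description above, replaces the node once it has been executed.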

Query Tree
• A B C


Example
• Employee(EmpID, EName, Salary, DeptNo, DateOfJoining)
• Department(DNo, DName, Location)

Example
• Let us consider the following query.

• π_{EmpID}(σ_{EName="ArunKumar"}(EMPLOYEE))

• The corresponding query tree will be

    π_{EmpID}
        |
    σ_{EName="ArunKumar"}
        |
    EMPLOYEE

Example
• Let us consider another query involving a join.

• π_{EName,Salary}(σ_{DName="Marketing"}(DEPARTMENT) ⋈_{DNo=DeptNo} EMPLOYEE)

• Following is the query tree for the above query.

              π_{EName,Salary}
                     |
              ⋈_{DNo=DeptNo}
              /            \
    σ_{DName="Marketing"}   EMPLOYEE
              |
         DEPARTMENT
Query Processing

Basic Steps in Query Processing


1. Parsing and translation
2. Optimization
3. Evaluation

Basic Steps in Query Processing (Cont.)

• Parsing and translation
  - Translate the query into its internal form. This is then translated into relational algebra.
  - The parser checks syntax and verifies that the relations named in the query exist.
• Evaluation
  - The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

Basic Steps in Query Processing: Optimization

• Consider a query: select salary from instructor where salary < 75000
• A relational algebra expression may have many equivalent expressions
  - E.g., σ_{salary<75000}(π_{salary}(instructor)) is equivalent to π_{salary}(σ_{salary<75000}(instructor))
• Each relational algebra operation can be evaluated using one of several different algorithms
  - Correspondingly, a relational-algebra expression can be evaluated in many ways.
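The equivalence of the two expressions can be checked on a small example. The instructor rows below are hypothetical, and the two helper functions are named only for this sketch.

```python
# Hypothetical instructor relation.
instructor = [
    {"name": "Asha", "salary": 60000},
    {"name": "Ravi", "salary": 90000},
    {"name": "Neha", "salary": 72000},
]


def select_salary_below(relation, limit):
    """sigma_{salary<limit}: keep tuples whose salary is below the limit."""
    return [t for t in relation if t["salary"] < limit]


def project_salary(relation):
    """pi_{salary}: keep only the salary attribute."""
    return [{"salary": t["salary"]} for t in relation]


# Order 1: project first, then select.
plan_a = select_salary_below(project_salary(instructor), 75000)
# Order 2: select first, then project.
plan_b = project_salary(select_salary_below(instructor, 75000))

print(plan_a == plan_b)
```

Both orders give the same answer, but they do different amounts of intermediate work, which is why the optimizer has a choice to make.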

Basic Steps: Optimization (Cont.)


• To specify how to evaluate a query, we need to provide a relational algebra expression and annotate it with instructions (algorithms or indices to be used) specifying how to evaluate each operation
• A relational algebra operation annotated with instructions on how to evaluate it is called an evaluation primitive

Basic Steps: Optimization (Cont.)


• An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
  - E.g., we can use an index on salary to find instructors with salary < 75000,
  - or we can perform a complete relation scan and discard instructors with salary ≥ 75000

Basic Steps: Optimization (Cont.)


• Query optimization: amongst all equivalent evaluation plans, choose the one with the lowest cost.
  - Cost is estimated using statistical information from the database catalog
    - e.g. number of tuples in each relation, size of tuples, etc.
• It is the responsibility of the system to construct a query evaluation plan that minimizes the cost of query evaluation

Disk structure

Sectors and Blocks


Measures of Query Cost


• Cost is generally measured as the total elapsed time for answering a query
  - Many factors contribute to time cost
    - disk accesses, CPU, or even network communication
• Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account
  - Number of seeks * average-seek-cost
  - Number of blocks read * average-block-read-cost
  - Number of blocks written * average-block-write-cost
    - The cost to write a block is greater than the cost to read a block (typically about twice as expensive)
    - data is read back after being written to ensure that the write was successful

Measures of Query Cost (Cont.)


• For simplicity we just use the number of block transfers from disk and the number of seeks as the cost measures
  - t_T: time to transfer one block
  - t_S: time for one seek
  - Cost for b block transfers plus S seeks:
    b * t_T + S * t_S
• We ignore CPU costs for simplicity
  - Real systems do take CPU cost into account
• We do not include the cost of writing output to disk in our cost formulae
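The cost measure b * t_T + S * t_S is simple enough to express as a one-line function. The timing values used below (0.1 ms per block transfer, 4 ms per seek) are assumed figures chosen for illustration, not measurements from the slides.

```python
def io_cost(block_transfers, seeks, t_transfer, t_seek):
    """Estimated I/O time: b * t_T + S * t_S (CPU cost is ignored)."""
    return block_transfers * t_transfer + seeks * t_seek


# Assumed timings in milliseconds.
T_TRANSFER_MS = 0.1
T_SEEK_MS = 4.0

# Example: reading 100 consecutive blocks after a single seek.
sequential = io_cost(100, 1, T_TRANSFER_MS, T_SEEK_MS)
# Example: reading 100 blocks scattered over the disk, one seek per block.
scattered = io_cost(100, 100, T_TRANSFER_MS, T_SEEK_MS)

print(f"sequential: {sequential:.1f} ms, scattered: {scattered:.1f} ms")
```

With these assumed timings the seek term dominates, which is why the number of seeks is tracked separately from the number of block transfers.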

Measures of Query Cost (Cont.)


• Several algorithms can reduce disk I/O by using extra buffer space
  - The amount of real memory available to the buffer depends on other concurrent queries and OS processes, and is known only during execution
  - We often use worst-case estimates, assuming only the minimum amount of memory needed for the operation is available
• Required data may be buffer resident already, avoiding disk I/O
  - But this is hard to take into account for cost estimation

Indexes as Access Paths


• A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file.
• The index is usually specified on one field of the file (although it could be specified on several fields)
• One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value
• The index is called an access path on the field.

Primary Index
• Also referred to as a clustering index
• Defined on an ordered data file
• The data file is ordered on a key field
• Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor
• In other words, it allows the records of a file to be read in an order that corresponds to the physical order in the file
• Further classified into dense and sparse indexes
• An index that is not a primary index is called a secondary index

Primary Index


Secondary Index


B+ tree
• A B+ tree is an n-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.
• In a B+ tree, data is stored only in the leaf nodes.
• In a B+ tree file organization, the leaf nodes store the actual records rather than pointers to records; in a B+ tree index, the leaf nodes store search-key values together with pointers to the records.
• These trees do not waste space.
• In a B+ tree, the leaf node data is ordered in a sequential linked list.
• Searching for any data in a B+ tree is straightforward because all data is found in the leaf nodes.
• Internal nodes store redundant copies of search keys.
• Many database system implementers prefer the structural simplicity of a B+ tree.

B+ tree


Selection Operation
• File scan
• Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
  - Cost estimate = b_r block transfers + 1 seek
    - b_r denotes the number of blocks containing records from relation r
  - If the selection is on a key attribute, the scan can stop on finding the record
    - cost = (b_r / 2) block transfers + 1 seek
  - Linear search can be applied regardless of
    - the selection condition, or
    - the ordering of records in the file, or
    - the availability of indices
• Note: binary search generally does not make sense since data is not stored consecutively
  - except when there is an index available,
  - and binary search requires more seeks than index search

Selections Using Indices


• Index scan: search algorithms that use an index
  - the selection condition must be on the search-key of the index.
• A2 (primary B+ tree index, equality on key). Retrieve a single record that satisfies the corresponding equality condition. The index lookup traverses the height h_i of the tree, plus 1 I/O to fetch the record; each of these I/O operations requires a seek and a block transfer.
  - Cost = (h_i + 1) * (t_T + t_S)
• A3 (primary B+ tree index, equality on nonkey). Retrieve multiple records.
  - Records will be on consecutive blocks
  - Let b = number of blocks (leaf blocks) containing matching records, all of which are read
  - 1 seek for each level of the tree and one seek for the first block
  - Cost = h_i * (t_T + t_S) + t_T * b

Selections Using Indices


• A4 (secondary B+ tree index, equality on key and nonkey).
  - Retrieve a single record if the search-key is a candidate key. Cost is the same as A2
    - Cost = (h_i + 1) * (t_T + t_S)
  - Retrieve multiple records if the search-key is not a candidate key
    - each of n matching records may be on a different block, which may result in 1 I/O operation per retrieved record, with each I/O operation requiring a seek and a block transfer
    - Cost = (h_i + n) * (t_T + t_S)
    - Can be very expensive!
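The cost formulas for algorithms A1 to A4 can be compared side by side with a short sketch. The function names, the timing constants (0.1 ms transfer, 4 ms seek), and the sample sizes are assumptions chosen for illustration; the formulas themselves are the ones stated in these slides.

```python
T_T = 0.1   # assumed time to transfer one block, in ms
T_S = 4.0   # assumed time for one seek, in ms


def a1_linear_scan(b_r, key_equality=False):
    """A1: b_r transfers + 1 seek, or b_r / 2 transfers for equality on a key."""
    blocks = b_r / 2 if key_equality else b_r
    return blocks * T_T + T_S


def a2_primary_index_key(h_i):
    """A2: (h_i + 1) * (t_T + t_S)."""
    return (h_i + 1) * (T_T + T_S)


def a3_primary_index_nonkey(h_i, b):
    """A3: h_i * (t_T + t_S) + t_T * b."""
    return h_i * (T_T + T_S) + T_T * b


def a4_secondary_index(h_i, n=1):
    """A4: (h_i + n) * (t_T + t_S); n = 1 when the search-key is a candidate key."""
    return (h_i + n) * (T_T + T_S)


# Assumed sizes: 1000-block relation, index of height 3.
print("A1 full scan      :", a1_linear_scan(1000))
print("A1 key equality   :", a1_linear_scan(1000, key_equality=True))
print("A2                :", a2_primary_index_key(3))
print("A3, 10 leaf blocks:", a3_primary_index_nonkey(3, 10))
print("A4, 100 matches   :", a4_secondary_index(3, n=100))
```

Under these assumed numbers the secondary index with 100 matching records costs more than a full linear scan, which illustrates the "can be very expensive" remark for A4.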

Selections Involving Comparisons


• Can implement selections of the form σ_{A≤V}(r) or σ_{A≥V}(r) by using
  - a linear file scan,
  - or by using indices in the following ways:
• A5 (primary index, comparison). (Relation is sorted on A)
  - For σ_{A≥V}(r), use the index to find the first tuple ≥ V and scan the relation sequentially from there
  - For σ_{A≤V}(r), just scan the relation sequentially till the first tuple > V; do not use the index
  - Cost is identical to A3

Selections Involving Comparisons


• A6 (secondary index, comparison).
  - For σ_{A≥V}(r), use the index to find the first index entry ≥ V and scan the index sequentially from there, to find pointers to records.
  - For σ_{A≤V}(r), just scan the leaf pages of the index finding pointers to records, till the first entry > V
  - In either case, retrieve the records that are pointed to
    - requires an I/O for each record
    - a linear file scan may be cheaper
  - Cost is identical to A4

Implementation of Complex Selections


• Conjunction: σ_{θ1 ∧ θ2 ∧ ... ∧ θn}(r)
• A7 (conjunctive selection using one index).
  - Select a combination of θi and algorithms A1 through A6 that results in the least cost for σ_{θi}(r).
  - Test the other conditions on each tuple after fetching it into the memory buffer.
• A8 (conjunctive selection using composite index).
  - Use an appropriate composite (multiple-key) index if available.
• A9 (conjunctive selection by intersection of identifiers).
  - Requires indices with record pointers.
  - Use the corresponding index for each condition, and take the intersection of all the obtained sets of record pointers.
  - Then fetch the records from the file
  - If some conditions do not have appropriate indices, apply the test on the retrieved records in memory.
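Algorithm A9 can be sketched with dictionaries standing in for indices. Everything here is hypothetical: the records, the attribute names, and the helpers `build_index` and `conjunctive_select` exist only for this illustration, and integer record ids play the role of record pointers.

```python
from collections import defaultdict

# Hypothetical file: record id (pointer) -> record.
RECORDS = {
    1: {"dept": "IT", "city": "Pune", "salary": 40000},
    2: {"dept": "IT", "city": "Mumbai", "salary": 70000},
    3: {"dept": "HR", "city": "Pune", "salary": 65000},
    4: {"dept": "IT", "city": "Pune", "salary": 80000},
}


def build_index(records, attr):
    """Build an index mapping each attribute value to a set of record ids."""
    index = defaultdict(set)
    for rid, record in records.items():
        index[record[attr]].add(rid)
    return index


dept_index = build_index(RECORDS, "dept")
city_index = build_index(RECORDS, "city")


def conjunctive_select(indexed_conditions, residual=lambda record: True):
    """A9 sketch: intersect pointer sets, fetch records, test leftover conditions."""
    pointer_sets = [index.get(value, set()) for index, value in indexed_conditions]
    rids = set.intersection(*pointer_sets)
    # Fetch each surviving record and apply the conditions that had no index.
    return [rid for rid in sorted(rids) if residual(RECORDS[rid])]


# dept = "IT" AND city = "Pune" use indices; salary > 50000 is tested in memory.
matches = conjunctive_select(
    [(dept_index, "IT"), (city_index, "Pune")],
    residual=lambda record: record["salary"] > 50000,
)
print(matches)
```

Only the records whose ids survive the intersection are fetched, so the unindexed salary test runs on two records instead of four.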

Implementation of Complex Selections
• Disjunction: σ_{θ1 ∨ θ2 ∨ ... ∨ θn}(r)
• A10 (disjunctive selection by union of identifiers).
  - Applicable if all conditions have available indices.
    - Otherwise use a linear scan.
  - Use the corresponding index for each condition, and take the union of all the obtained sets of record pointers.
  - Then fetch the records from the file
• Negation: σ_{¬θ}(r)
  - Use a linear scan on the file
  - If very few records satisfy ¬θ, and an index is applicable to θ
    - Find the satisfying records using the index and fetch them from the file

Summary of costs for selections

Sorting
• Sorting in a DBMS may be required for two reasons:
  - SQL queries may specify that the output be sorted, or
  - Several operations, like joins, can be implemented efficiently if the input relations are sorted first
• We may build an index on the relation, and then use the index to read the relation in sorted order. However, such a process orders the relation logically rather than physically, and may lead to one disk block access for each tuple, which can be expensive.
• For relations that fit in memory, techniques like quicksort can be used. For relations that don't fit in memory, external sort-merge is a good choice.

Quick Sort example


External Sort-Merge
• Let M denote the memory size (in pages).
1. Create sorted runs. Let i be 0 initially.
   Repeatedly do the following till the end of the relation:
   (a) Read M blocks of the relation into memory
   (b) Sort the in-memory blocks
   (c) Write the sorted data to run R_i; increment i.
   Let the final value of i be N
2. Merge the runs (next slide).

External Sort-Merge (Cont.)


• Example: File size = 5 GB
• Memory size = 1000 MB = 1 GB
• The relation is split into five sorted runs of 1 GB each: R0, R1, R2, R3, R4

External Sort-Merge (Cont.)


2. Merge the runs (N-way merge). We assume (for now) that N < M.
   1. Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer page.
   2. repeat
      1. Select the first record (in sort order) among all buffer pages
      2. Write the record to the output buffer. If the output buffer is full, write it to disk.
      3. Delete the record from its input buffer page. If the buffer page becomes empty, then read the next block (if any) of the run into the buffer.
   3. until all input buffer pages are empty
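The two phases of external sort-merge can be sketched in a few lines. This is a simplified in-memory model under stated assumptions: one list element stands in for one block, whole runs are held in Python lists rather than being read one buffer page at a time, and `heapq.merge` from the standard library performs the N-way merge. The function names and the sample data are illustrative only.

```python
import heapq


def create_runs(relation, m):
    """Phase 1: read m blocks at a time, sort them, and emit one run per chunk."""
    return [sorted(relation[i:i + m]) for i in range(0, len(relation), m)]


def merge_runs(runs):
    """Phase 2: N-way merge; heapq.merge repeatedly yields the smallest head."""
    return list(heapq.merge(*runs))


# Assumed input: 12 single-record "blocks" and a memory of M = 5 blocks,
# which gives N = 3 runs, so N < M and a single merge pass suffices.
relation = [24, 19, 31, 33, 14, 16, 16, 21, 3, 2, 7, 14]
M = 5

runs = create_runs(relation, M)
output = merge_runs(runs)

print(runs)
print(output)
```

A disk-based version would keep only one buffer page per run in memory and refill it when it empties, as in the steps above; the order in which records are emitted would be the same.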

External Sort-Merge (Cont.)


• If N ≥ M, several merge passes are required.
  - In each pass, contiguous groups of M - 1 runs are merged.
  - A pass reduces the number of runs by a factor of M - 1, and creates runs longer by the same factor.
  - Repeated passes are performed till all runs have been merged into one.
• No. of passes is 1 + ⌈log_{M-1}(⌈b_r / M⌉)⌉, where b_r is the number of blocks in the relation, so that ⌈b_r / M⌉ is the number N of initial runs; the leading 1 counts the run-creation pass.
• Cost = 2 * b_r * (no. of passes) block transfers, since every pass reads and writes each block once.

External Sort-Merge (Cont.)


• Read 150 MB of data from each run file
• 150 × 5 = 750 MB for data from run files and 250 MB for writing output
• Perform 5-way merge
• Write to disk when output buffer is full
• Read next block of data (150 MB or rest of the run)
• Repeat until all input buffers are empty

[Figure: runs R0, R1, R2, R3, R4 feed a 5-way merge whose output is written to disk]

Example: External Sorting Using Sort-Merge

[Figure: external sort-merge of a 12-record relation]

initial relation:             g 24, a 19, d 31, c 33, b 14, e 16, r 16, d 21, m 3, p 2, d 7, a 14
runs (after create runs):     (a 19, d 31, g 24)  (b 14, c 33, e 16)  (d 21, m 3, r 16)  (a 14, d 7, p 2)
runs (after merge pass–1):    (a 19, b 14, c 33, d 31, e 16, g 24)  (a 14, d 7, d 21, m 3, p 2, r 16)
sorted output (merge pass–2): a 14, a 19, b 14, c 33, d 7, d 21, d 31, e 16, g 24, m 3, p 2, r 16
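
The run-creation and merge passes of this example can be reproduced with a short Python sketch. It assumes one record per block and M = 3 memory blocks, so runs hold 3 records and each pass merges M − 1 = 2 runs; the function name `external_sort_merge` is hypothetical and lists stand in for files on disk.

```python
import heapq

def external_sort_merge(records, M):
    """Return the list of runs after run creation and after every merge pass."""
    # Run creation: sort M records at a time (one record per block assumed).
    runs = [sorted(records[i:i + M]) for i in range(0, len(records), M)]
    passes = [runs]
    # Merge passes: merge contiguous groups of M - 1 runs until one run is left.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + M - 1]))
                for i in range(0, len(runs), M - 1)]
        passes.append(runs)
    return passes

relation = [("g", 24), ("a", 19), ("d", 31), ("c", 33), ("b", 14), ("e", 16),
            ("r", 16), ("d", 21), ("m", 3), ("p", 2), ("d", 7), ("a", 14)]
stages = external_sort_merge(relation, M=3)
```

Here `stages[0]` holds the four initial runs, `stages[1]` the two runs after the first merge pass, and `stages[2]` the single sorted run.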

Example: External Sorting Using Sort-Merge
• No. of passes is 1 + ⌈log_{M−1} ⌈N / M⌉⌉ (N = pages in the file, M = buffer pages)
• 3 buffers used to sort a 12-page file
  ◦ In pass 0: 12/3 = 4 sorted runs of 3 pages are produced
  ◦ In pass 1: 4/2 = 2 sorted runs of 6 pages are produced
  ◦ In pass 2: 1 completely sorted file is produced
• Thus no. of passes = 1 + ⌈log_{3−1}(12 / 3)⌉ = 1 + 2 = 3
• Another example: suppose 5 buffers are available to sort a 108-page file
  ◦ In pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages (last run = 3 pages)
  ◦ In pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages (last run = 8 pages)
  ◦ In pass 2: ⌈6/4⌉ = 2 sorted runs of 80 pages and 28 pages
  ◦ In pass 3: sorted file of 108 pages
• Thus no. of passes = 1 + ⌈log_{5−1} ⌈108 / 5⌉⌉ = 1 + 3 = 4
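
The pass counts in both examples can be checked with a small Python sketch. The function names are hypothetical; the loop uses integer ceiling division instead of a floating-point logarithm so that exact powers are not affected by rounding.

```python
def sort_passes(pages, buffers):
    """Number of passes: 1 (run creation) plus the merge passes."""
    runs = -(-pages // buffers)              # ceil(pages / buffers) initial runs
    passes = 1
    while runs > 1:
        runs = -(-runs // (buffers - 1))     # each merge pass merges (buffers - 1) runs
        passes += 1
    return passes

def sort_io_cost(pages, buffers):
    """Cost = 2N x no. of passes: every pass reads and writes each page once."""
    return 2 * pages * sort_passes(pages, buffers)
```

For the two examples, `sort_passes(12, 3)` and `sort_passes(108, 5)` give the pass counts worked out by hand above.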

Simple Example


External Merge Sort (Cont.)


Cost of block transfers:
• Number of block transfers (read and write) for initial run creation: 2br
• Total number of merge passes required: ⌈log_{M−1}(br / M)⌉
• Number of block transfers (read and write) in each pass: 2br
• For the final pass, we don't count the write cost: − br (i.e., subtract br)
  ◦ we ignore the final write cost for all operations, since the output of an operation may be sent to the parent operation without being written to disk
• Thus, total number of block transfers for external sorting:
    2br + 2br ⌈log_{M−1}(br / M)⌉ − br
    = br (2 ⌈log_{M−1}(br / M)⌉ + 1) = 12 (2 × 2 + 1) = 60 block transfers
  ◦ where br = 12 and M = 3
• Seeks: next slide

External Merge Sort (Cont.)


Cost of seeks
• During run generation: one seek to read each run and one seek to write each run
  ◦ 2 ⌈br / M⌉
• Total number of merge passes required: ⌈log_{M−1}(br / M)⌉
• During each merge pass
  ◦ 1 seek for reading each block and 1 seek for writing each block, i.e. 2br seeks for each merge pass
  ◦ except the final one, which does not require a write (i.e., subtract br)
• Total number of seeks:
    2 ⌈br / M⌉ + 2br ⌈log_{M−1}(br / M)⌉ − br
    = 2 ⌈br / M⌉ + br (2 ⌈log_{M−1}(br / M)⌉ − 1)
    = 2(4) + 12(2(2) − 1) = 44 disk seeks (with br = 12 and M = 3)
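
As a quick check of the two cost formulas, the following Python sketch evaluates them for given br and M. The function names are hypothetical, and one buffer block per run during merging is assumed, as in the formulas above.

```python
def merge_passes(b_r, M):
    """ceil(log_{M-1}(b_r / M)), computed with integer arithmetic."""
    runs = -(-b_r // M)                  # ceil(b_r / M) initial runs
    passes = 0
    while runs > 1:
        runs = -(-runs // (M - 1))       # each pass merges M - 1 runs
        passes += 1
    return passes

def external_sort_cost(b_r, M):
    """Return (block transfers, seeks) for external sort-merge."""
    p = merge_passes(b_r, M)
    transfers = b_r * (2 * p + 1)
    seeks = 2 * -(-b_r // M) + b_r * (2 * p - 1)
    return transfers, seeks
```

With br = 12 and M = 3 this gives the 60 block transfers and 44 seeks computed above.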

External Merge Sort (Cont.) Example


• Select * from Employee order by name
• Assumptions: no. of buffers that fit in memory = 3, no. of tuples that fit in one buffer = 1

EmpId   Name
1001    Jayanti
1002    Pramod
1003    Neha
1004    Nilesh
1005    Mayur
1006    Vinayak
1007    Shree
1008    Akshata
1009    Jaya
1010    Abhishek
1011    Santosh

Example –Stage 1
Input (EmpId, Name):
1 Jayanti, 2 Pramod, 3 Neha, 4 Nilesh, 5 Mayur, 6 Vinayak, 7 Shree, 8 Akshata, 9 Jaya, 10 Abhishek, 11 Santosh

Sorted runs (up to 3 tuples each, sorted on Name):
R0: 1 Jayanti, 3 Neha, 2 Pramod
R1: 5 Mayur, 4 Nilesh, 6 Vinayak
R2: 8 Akshata, 9 Jaya, 7 Shree
R3: 10 Abhishek, 11 Santosh

Example –Stage 2
• Use M − 1 input buffers to read the runs and 1 output buffer to write the sorted output

Merge of R0 (1 Jayanti, 3 Neha, 2 Pramod) and R1 (5 Mayur, 4 Nilesh, 6 Vinayak):
EmpId   Name
1001    Jayanti
1005    Mayur
1003    Neha
1004    Nilesh
1002    Pramod
1006    Vinayak

Merge of R2 (8 Akshata, 9 Jaya, 7 Shree) and R3 (10 Abhishek, 11 Santosh):
EmpId   Name
1010    Abhishek
1008    Akshata
1009    Jaya
1011    Santosh
1007    Shree

Example –Stage 2
EmpId   Name
1010    Abhishek
1008    Akshata
1009    Jaya
1001    Jayanti
1005    Mayur
1003    Neha
1004    Nilesh
1002    Pramod
1011    Santosh
1007    Shree
1006    Vinayak

Join Operation
• Several different algorithms to implement joins
  ◦ Nested-loop join
  ◦ Block nested-loop join
• Choice based on cost estimate
• Examples use the following information
  ◦ Number of records of student: 5,000 and of takes: 10,000
  ◦ Number of blocks of student: 100 and of takes: 400

Nested-Loop Join (brute force)


• To compute the theta join r ⋈θ s
    for each tuple tr in r do begin
        for each tuple ts in s do begin
            test pair (tr, ts) to see if they satisfy the join condition θ
            if they do, add tr • ts to the result.
        end
    end
• r is called the outer relation and s the inner relation of the join.
• Requires no indices and can be used with any kind of join condition.
• Expensive since it examines every pair of tuples in the two relations.
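
A direct Python rendering of the pseudocode above may help. It is a sketch only: relations are assumed to be lists of tuples held in memory, `theta` is any predicate on a pair of tuples, and the function name is hypothetical.

```python
def nested_loop_join(r, s, theta):
    """Theta join of r (outer) and s (inner) by examining every pair of tuples."""
    result = []
    for tr in r:                      # outer relation
        for ts in s:                  # inner relation, rescanned for every tr
            if theta(tr, ts):         # test the join condition on the pair
                result.append(tr + ts)
    return result

r = [(1, "a"), (2, "b")]
s = [(1, "x"), (3, "y")]
equi = nested_loop_join(r, s, lambda tr, ts: tr[0] == ts[0])
greater = nested_loop_join(r, s, lambda tr, ts: tr[0] > ts[0])
```

Because the condition is an arbitrary predicate, the same function handles the equality join and the inequality join.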


Nested-Loop Join (Cont.)
• In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is
    nr × bs + br block transfers, plus
    nr + br seeks
• If the smaller relation fits entirely in memory, use that as the inner relation.
  ◦ Reduces cost to br + bs block transfers and 2 seeks
• Assuming worst-case memory availability, the cost estimate is
  ◦ with student as the outer relation:
    5000 × 400 + 100 = 2,000,100 block transfers,
    5000 + 100 = 5100 seeks
  ◦ with takes as the outer relation:
    10000 × 100 + 400 = 1,000,400 block transfers and 10,400 seeks
• In the best-case scenario, we can read both relations only once, and the cost estimate will be 100 + 400 = 500 block transfers.
• Block nested-loops algorithm (next slide) is preferable.

Block Nested-Loop Join


• Variant of nested-loop join in which every block of inner relation is paired with every block of outer relation.
    for each block Br of r do begin
        for each block Bs of s do begin
            for each tuple tr in Br do begin
                for each tuple ts in Bs do begin
                    check if (tr, ts) satisfy the join condition
                    if they do, add tr • ts to the result.
                end
            end
        end
    end
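
The four nested loops can be sketched in Python as below. This is an illustration under assumptions: relations are in-memory lists of tuples, a "block" is simulated as a slice of `tuples_per_block` tuples, and the helper names are hypothetical. The counter shows that the inner relation is read once per outer block rather than once per outer tuple.

```python
def blocks(relation, tuples_per_block):
    """Yield the relation one block (list of tuples) at a time."""
    for i in range(0, len(relation), tuples_per_block):
        yield relation[i:i + tuples_per_block]

def block_nested_loop_join(r, s, theta, tuples_per_block=2):
    """Return (join result, number of inner-relation block reads)."""
    result = []
    inner_block_reads = 0
    for Br in blocks(r, tuples_per_block):          # each block of the outer relation
        for Bs in blocks(s, tuples_per_block):      # each block of the inner relation
            inner_block_reads += 1
            for tr in Br:
                for ts in Bs:
                    if theta(tr, ts):
                        result.append(tr + ts)
    return result, inner_block_reads

r = [(1,), (2,), (3,), (4,)]                        # 2 blocks
s = [(2,), (4,), (6,), (8,), (10,), (12,)]          # 3 blocks
joined, reads = block_nested_loop_join(r, s, lambda tr, ts: tr[0] == ts[0])
```

With 2 outer blocks and 3 inner blocks the inner relation is read 2 × 3 = 6 block times.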

Block Nested-Loop Join (Cont.)


• Worst case estimate: br × bs + br block transfers + 2 × br seeks
  ◦ Each block in the inner relation s is read once for each block in the outer relation
  ◦ It is more efficient to use the smaller relation as the outer relation, in case neither of the relations fits in memory
• Best case: br + bs block transfers + 2 seeks.
• Improvements to nested-loop and block nested-loop algorithms:
  ◦ In block nested-loop, use M − 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
    Cost = ⌈br / (M − 2)⌉ × bs + br block transfers + 2 ⌈br / (M − 2)⌉ seeks
  ◦ If the equi-join attribute forms a key on the inner relation, stop the inner loop on the first match

Block Nested-Loop Join (Cont.)


• Example of student ⋈ takes
• In the worst case, we have to read each block of takes once for each block of student. Thus, in the worst case, a total of 100 × 400 + 100 = 40,100 block transfers plus 2 × 100 = 200 seeks are required.
• This cost is a significant improvement over the 5000 × 400 + 100 = 2,000,100 block transfers plus 5100 seeks needed in the worst case for the basic nested-loop join.
• The best-case cost remains the same, namely 100 + 400 = 500 block transfers and 2 seeks.
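
The cost formulas for the two nested-loop variants can be compared with a few lines of Python. The function names are hypothetical; `M=3` is used as the default to model the worst case, in which only one block of the outer relation is held at a time (M − 2 = 1).

```python
def ceil_div(a, b):
    return -(-a // b)

def nested_loop_cost(n_outer, b_outer, b_inner):
    """Worst case: (nr * bs + br block transfers, nr + br seeks)."""
    return n_outer * b_inner + b_outer, n_outer + b_outer

def block_nested_loop_cost(b_outer, b_inner, M=3):
    """(ceil(br / (M-2)) * bs + br block transfers, 2 * ceil(br / (M-2)) seeks)."""
    chunks = ceil_div(b_outer, M - 2)    # number of times the inner relation is scanned
    return chunks * b_inner + b_outer, 2 * chunks

student_outer = nested_loop_cost(5000, 100, 400)
takes_outer = nested_loop_cost(10000, 400, 100)
bnl_student_outer = block_nested_loop_cost(100, 400)
```

Passing a larger `M` shows how quickly the block nested-loop cost falls as more memory is given to the outer relation.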

Indexed Nested-Loop Join


• Index lookups can replace file scans if
  ◦ the join is an equi-join or natural join and
  ◦ an index is available on the inner relation's join attribute
  ◦ (an index can be constructed just to compute a join)
• For each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple tr.
• Worst case: buffer has space for only one page of r, and, for each tuple in r, we perform an index lookup on s.
• Cost of the join: br (tT + tS) + nr × c
  ◦ where c is the cost of traversing the index and fetching all matching s tuples for one tuple of r
  ◦ c can be estimated as the cost of a single selection on s using the join condition.
• If indices are available on join attributes of both r and s, use the relation with fewer tuples as the outer relation.

Example of Nested-Loop Join Costs


• Compute student ⋈ takes, with student as the outer relation.
• Let takes have a primary B+-tree index on the attribute ID, which contains 20 entries in each index node.
• Since takes has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data
• student has 5000 tuples
• Cost of block nested-loop join
  ◦ 400 × 100 + 100 = 40,100 block transfers + 2 × 100 = 200 seeks
  ◦ assuming worst-case memory
  ◦ may be significantly less with more memory
• Cost of indexed nested-loop join: br (tT + tS) + nr × c
  ◦ 100 + 5000 × 5 = 25,100 block transfers and seeks.
  ◦ c is computed from the cost of Algorithm A2, (hi + 1) × (tT + tS), i.e. hi + 1 = 4 + 1 = 5 block transfers and seeks per lookup
  ◦ CPU cost likely to be less than that for block nested-loop join
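
The idea of replacing the inner file scan by an index lookup, and the cost figure above, can be sketched in Python. Assumptions: a dictionary stands in for the B+-tree index on the inner relation, relations are lists of tuples, and all function names are hypothetical.

```python
from collections import defaultdict

def build_index(s, s_key):
    """Build an in-memory index on the inner relation's join attribute."""
    index = defaultdict(list)
    for ts in s:
        index[s_key(ts)].append(ts)
    return index

def indexed_nested_loop_join(r, s_index, r_key):
    """For each outer tuple, look up matching inner tuples through the index."""
    result = []
    for tr in r:
        for ts in s_index.get(r_key(tr), []):
            result.append(tr + ts)
    return result

def indexed_join_cost(b_r, n_r, c):
    """br + nr * c, counting one block transfer and one seek per access."""
    return b_r + n_r * c

student = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
takes = [(1, "CS101"), (2, "CS102"), (1, "CS103")]
index = build_index(takes, lambda t: t[0])
joined = indexed_nested_loop_join(student, index, lambda t: t[0])
```

The student with ID 3 has no entry in the index, so it contributes no output tuple and costs only the lookup.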

Merge-Join( sort-merge-join)
1. Sort both relations on their join attribute (if not already sorted on the join attributes).
2. Merge the sorted relations to join them
   1. Join step is similar to the merge stage of the sort-merge algorithm.
   2. Main difference is handling of duplicate values in join attribute: every pair with same value on join attribute must be matched
   3. Detailed algorithm in book
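
A compact Python sketch of the two steps is given below. It is not the book's detailed algorithm: relations are assumed to be in-memory lists of tuples, the key extractors `r_key` and `s_key` are hypothetical parameters, and the group of s tuples sharing the current join value is assumed to fit in memory.

```python
def merge_join(r, s, r_key, s_key):
    """Equi-join by sorting both inputs and merging, matching every duplicate pair."""
    r = sorted(r, key=r_key)
    s = sorted(s, key=s_key)
    result = []
    pr = ps = 0
    while pr < len(r) and ps < len(s):
        kr, ks = r_key(r[pr]), s_key(s[ps])
        if kr < ks:
            pr += 1
        elif kr > ks:
            ps += 1
        else:
            # Find the group of s tuples that share this join value.
            end = ps
            while end < len(s) and s_key(s[end]) == kr:
                end += 1
            # Pair every r tuple having this value with every tuple of the group.
            while pr < len(r) and r_key(r[pr]) == kr:
                for ts in s[ps:end]:
                    result.append(r[pr] + ts)
                pr += 1
            ps = end
    return result

employee = [(1002, "Pramod", "01"), (1001, "Jayanti", "01"), (1004, "Nilesh", "02")]
department = [("02", "Sales"), ("01", "Purchase"), ("03", "Production")]
joined = merge_join(employee, department, lambda t: t[2], lambda t: t[0])
```

Department 03 has no employee in this small input, so the merge simply skips over it.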

Merge-Join (Cont.) Example


r = Employee (scanned with pointer pr)
EmpId   Name       Dept ID
1001    Jayanti    01
1002    Pramod     01
1003    Neha       01
1004    Nilesh     02
1005    Mayur      02
1006    Vinayak    02
1007    Shree      03
1008    Akshata    03
1009    Jaya       03
1010    Abhishek   04
1011    Santosh    04

s = Department (scanned with pointer ps; ts is the current tuple)
Dept ID   Dname
01        Purchase
02        Sales
03        Production
04        Marketing

S (tuples of s with the current Dept ID value): 01 Purchase

Merge-Join (Cont.) Example


EmpId   Name       Dept ID   Dname
1001    Jayanti    01        Purchase
1002    Pramod     01        Purchase
1003    Neha       01        Purchase
1004    Nilesh     02        Sales
1005    Mayur      02        Sales
1006    Vinayak    02        Sales
1007    Shree      03        Production
1008    Akshata    03        Production
1009    Jaya       03        Production
1010    Abhishek   04        Marketing
1011    Santosh    04        Marketing

Merge-Join (Cont.)
• Can be used only for equi-joins and natural joins
• Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)
• Thus the cost of merge join is:
    br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
  ◦ + the cost of sorting if relations are unsorted.

Merge-Join (Cont.)
• Example of student ⋈ takes
  ◦ If the relations are already sorted on the join attribute ID, then merge join takes a total of 400 + 100 = 500 block transfers.
  ◦ If we assume that in the worst case only one buffer block is allocated to each input relation (that is, bb = 1), a total of 400 + 100 = 500 seeks are also required.
• Suppose the relations are not sorted, and the memory size is the worst case, only three blocks. The cost is as follows:
  1. Sorting relation takes requires ⌈log_{M−1}(br / M)⌉ = ⌈log_{3−1}(400/3)⌉ = 8 merge passes.
     ◦ Block transfers = br (2 ⌈log_{M−1}(br / M)⌉ + 1) = 400 × (2 × 8 + 1) = 6800, with 400 more transfers to write out the result
     ◦ Seeks = 2 ⌈br / M⌉ + br (2 ⌈log_{M−1}(br / M)⌉ − 1) = 2 × ⌈400/3⌉ + 400 × (2 × 8 − 1) = 6268 for sorting, and 400 seeks for writing the output, for a total of 6668 seeks
  2. Sorting relation student takes ⌈log_{3−1}(100/3)⌉ = 6 merge passes.
     ◦ Block transfers = 100 × (2 × 6 + 1) = 1300, with 100 more transfers to write it out
     ◦ Seeks = 2 × ⌈100/3⌉ + 100 × (2 × 6 − 1) = 68 + 1100 = 1168, and 100 seeks are required for writing the output, for a total of 1268 seeks
  3. Merging the two relations takes 400 + 100 = 500 block transfers and 500 seeks.
• Thus, the total cost is 6800 + 400 + 1300 + 100 + 500 = 9100 block transfers plus 6668 + 1268 + 500 = 8436 seeks if the relations are not sorted and the memory size is just 3 blocks.

Example
• Suppose that the relation R (with 150 pages) consists of one attribute a and S (with 90
pages) also consists of one attribute a. Determine the optimal join method for processing the
following query:

• select * from R, S where R.a > S.a

• Assume there are 10 buffer pages available to process the query and there are no indexes
available.

• Assume also that the DBMS only has available the following join methods: nested-loop,
block nested loop and sort-merge.

• Determine the number of page I/Os required by each method to work out which is the
cheapest.


Example Solution
• Simple Nested Loops: nr × bs + br (only page counts are given here, so the page-oriented form br × bs + br is used)
  ◦ We use relation S as the outer loop. Total Cost = 90 + (90 × 150) = 13590
• Block Nested Loops: ⌈br / (M−2)⌉ × bs + br block transfers + 2 ⌈br / (M−2)⌉ seeks
  ◦ If R is outer: Total Cost = 150 + (90 × ceil(150/(10−2))) = 1860
  ◦ If S is outer: Total Cost = 90 + (150 × ceil(90/(10−2))) = 1890
• Sort-Merge:
  ◦ Denote B as the number of buffer pages, where B = 10; denote M as the number of pages in the larger relation, where M = 150. Since B < M, the cost of sort-merge is:
  ◦ Sorting R: 150 × (2 × ceil(log_{10−1}(150/10)) + 1) = 150 × 5 = 750
  ◦ Sorting S: 90 × (2 × ceil(log_{10−1}(90/10)) + 1) = 90 × 3 = 270
  ◦ Merge: 150 + 90 = 240
  ◦ Total Cost = 750 + 270 + 240 = 1260
• Sort-merge join applies only to equi-joins and natural joins, so it cannot evaluate the condition R.a > S.a.
• Therefore, the cheapest applicable way to process the query is Block Nested Loop join with R as the outer relation (1860 page I/Os).

Hash-Join
• Applicable for equi-joins and natural joins.
• A hash function h is used to partition tuples of both relations
• h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
  ◦ r0, r1, ..., rn denote partitions of r tuples
    Each tuple tr ∈ r is put in partition ri, where i = h(tr[JoinAttrs]).
  ◦ s0, s1, ..., sn denote partitions of s tuples
    Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
• Note: In the book, ri is denoted as Hri, si is denoted as Hsi and n is denoted as nh.

Hash-Join (Cont.)

[Figure: hash partitioning of the probe input and the build input]

Hash-Join (Cont.)
• r tuples in ri need only be compared with s tuples in si. They need not be compared with s tuples in any other partition, since:
  ◦ an r tuple and an s tuple that satisfy the join condition will have the same value for the join attributes.
  ◦ If that value is hashed to some value i, the r tuple has to be in ri and the s tuple in si.

Hash-Join Algorithm
• The hash-join of r and s is computed as follows.
  1. Partition the relation s using hashing function h. When partitioning a relation, one block of memory is reserved as the output buffer for each partition.
  2. Partition r similarly.
  3. For each i:
     (a) Load si into memory and build an in-memory hash index on it using the join attribute. This hash index uses a different hash function than the earlier one h.
     (b) Read the tuples in ri from the disk one by one. For each tuple tr locate each matching tuple ts in si using the in-memory hash index. Output the concatenation of their attributes.
• Relation s is called the build input and r is called the probe input.
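
The three steps can be sketched in Python as follows. This is an illustration under assumptions: relations are in-memory lists of tuples, Python's built-in `hash` modulo `n` plays the role of h, and a dictionary plays the role of the in-memory hash index (its internal bucketing stands in for the second hash function). The function names are hypothetical.

```python
from collections import defaultdict

def partition(relation, key, n):
    """Split a relation into n partitions by hashing the join attribute."""
    parts = [[] for _ in range(n)]
    for t in relation:
        parts[hash(key(t)) % n].append(t)
    return parts

def hash_join(r, s, r_key, s_key, n=4):
    """Equi-join with s as the build input and r as the probe input."""
    result = []
    r_parts = partition(r, r_key, n)
    s_parts = partition(s, s_key, n)
    for ri, si in zip(r_parts, s_parts):
        index = defaultdict(list)          # in-memory hash index on partition si
        for ts in si:
            index[s_key(ts)].append(ts)
        for tr in ri:                      # probe with each tuple of ri
            for ts in index.get(r_key(tr), []):
                result.append(tr + ts)
    return result

takes = [(1, "CS101"), (2, "CS102"), (1, "CS103")]
student = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
joined = sorted(hash_join(takes, student, lambda t: t[0], lambda t: t[0]))
```

Only tuples that land in the same partition number are ever compared, which is the point of the partitioning step.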

Hash-Join algorithm (Cont.)


• The value n and the hash function h are chosen such that each si fits in memory.
  ◦ Typically n is chosen as ⌈bs / M⌉ × f, where f is a “fudge factor”, typically around 1.2 (i.e. about 20% more partitions than the minimum ⌈bs / M⌉)
  ◦ The probe relation partitions ri need not fit in memory
• Recursive partitioning is required if the number of partitions n is greater than the number of pages M of memory.
  ◦ Instead of partitioning n ways, use M − 1 partitions for s
  ◦ Further partition the M − 1 partitions using a different hash function
  ◦ Use the same partitioning method on r
  ◦ Rarely required

Handling of Overflows
• Partitioning is said to be skewed if some partitions have significantly more tuples than some others
• Hash-table overflow occurs in partition si if si does not fit in memory. Reasons could be
  ◦ Many tuples in s with the same value for the join attributes
  ◦ Bad hash function
• Overflow resolution can be done in the build phase
  ◦ Partition si is further partitioned using a different hash function.
  ◦ Partition ri must be similarly partitioned.
• Overflow avoidance performs partitioning carefully to avoid overflows during the build phase
  ◦ E.g. partition the build relation into many partitions, then combine them
• Both approaches fail with large numbers of duplicates
  ◦ Fallback option: use block nested-loop join on overflowed partitions

Cost of Hash-Join
• If recursive partitioning is not required: cost of hash join is
    3(br + bs) + 4 × nh block transfers +
    2(⌈br / bb⌉ + ⌈bs / bb⌉) seeks
  ◦ the 4 × nh term is the overhead for partially filled blocks, which can be ignored
• If recursive partitioning is required:
  ◦ number of passes required for partitioning build relation s to less than M blocks per partition is ⌈log_{⌊M/bb⌋−1}(bs / M)⌉
  ◦ best to choose the smaller relation as the build relation.
  ◦ Total cost estimate is:
    2(br + bs) ⌈log_{⌊M/bb⌋−1}(bs / M)⌉ + br + bs block transfers +
    2(⌈br / bb⌉ + ⌈bs / bb⌉) ⌈log_{⌊M/bb⌋−1}(bs / M)⌉ seeks
• If the entire build input can be kept in main memory, no partitioning is required
  ◦ Cost estimate goes down to br + bs.

Example of Cost of Hash-Join


• takes ⋈ student
• Assume that memory size is 20 blocks. Available buffer blocks (bb) = 3
• bstudent = 100 and btakes = 400.
• student is to be used as build input. Partition it into five partitions, each of size 20 blocks. This partitioning can be done in one pass.
• Similarly, partition takes into five partitions, each of size 80. This is also done in one pass.
• Therefore total cost, ignoring cost of writing partially filled blocks:
  ◦ 3(100 + 400) = 1500 block transfers +
    2(⌈100/3⌉ + ⌈400/3⌉) = 336 seeks
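
The arithmetic of this example can be checked with a short Python sketch of the cost formula for the case without recursive partitioning. The function name is hypothetical, and the 4 × nh term for partially filled blocks is ignored, as in the example.

```python
def hash_join_cost(b_r, b_s, b_b):
    """Return (block transfers, seeks) when no recursive partitioning is needed."""
    def ceil_div(a, b):
        return -(-a // b)
    transfers = 3 * (b_r + b_s)                                # partition (read + write), then read again
    seeks = 2 * (ceil_div(b_r, b_b) + ceil_div(b_s, b_b))
    return transfers, seeks

cost = hash_join_cost(b_r=400, b_s=100, b_b=3)
```

Here `b_r` is the probe input (takes) and `b_s` the build input (student); swapping them does not change this particular estimate because the formula is symmetric.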

The Join Operation


• Join algorithms which are applicable to any kind of join:
  ◦ Simple Nested Loops Join
  ◦ Block Nested Loops Join
• Join algorithms which are applicable to equi-joins and natural joins:
  ◦ Index Nested Loops Join
  ◦ Sort-Merge Join
  ◦ Hash Join

Other Operations
• Duplicate elimination can be implemented via hashing or sorting.
  ◦ On sorting, duplicates will come adjacent to each other, and all but one of a set of duplicates can be deleted.
  ◦ Optimization: duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge.
  ◦ Hashing is similar: duplicates will come into the same bucket.
• Projection:
  ◦ perform projection on each tuple followed by duplicate elimination.
  ◦ If the projection includes a key, no duplicates will exist

Hash Index


Other Operations : Aggregation


Example: Compute average salary in each university department
    select dept_name, avg(salary)
    from instructor
    group by dept_name;

Other Operations : Aggregation


• Aggregation can be implemented in a manner similar to duplicate elimination.
  ◦ Sorting or hashing can be used to bring tuples in the same group together, and then the aggregate functions can be applied on each group.
  ◦ Optimization: combine tuples in the same group during run generation and intermediate merges, by computing partial aggregate values
     - For count, min, max, sum: keep aggregate values on tuples found so far in the group.
     - When combining partial aggregates for count, add up the aggregates
     - For avg, keep sum and count, and divide sum by count at the end
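
The partial-aggregate optimization for avg can be sketched in Python. Assumptions: rows are (group, value) pairs, each "run" is a slice of `tuples_per_run` rows, and the sample salaries are made-up illustration data; the function name is hypothetical.

```python
def average_by_group(rows, tuples_per_run=2):
    """Compute avg per group by keeping (sum, count) partial aggregates."""
    # Run generation: one partial aggregate per group within each run.
    partials = []
    for i in range(0, len(rows), tuples_per_run):
        run = {}
        for group, value in rows[i:i + tuples_per_run]:
            total, count = run.get(group, (0, 0))
            run[group] = (total + value, count + 1)
        partials.append(run)
    # Merge step: add up the partial sums and the partial counts.
    combined = {}
    for run in partials:
        for group, (total, count) in run.items():
            t, c = combined.get(group, (0, 0))
            combined[group] = (t + total, c + count)
    # Divide sum by count only at the end.
    return {group: total / count for group, (total, count) in combined.items()}

instructor = [("Physics", 95000), ("Physics", 87000), ("Music", 40000), ("Physics", 91000)]
averages = average_by_group(instructor)
```

Averaging the per-run averages instead would give a wrong answer when runs hold different numbers of tuples per group, which is why sum and count are carried separately.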

Other Operations : Set Operations


• Set operations (∪, ∩ and −): can either use a variant of merge-join after sorting, or a variant of hash-join.
• E.g., set operations using hashing:
  1. Partition both relations using the same hash function
  2. Process each partition i as follows.
     1. Using a different hashing function, build an in-memory hash index on ri.
     2. Process si as follows
        r ∪ s:
          1. Add tuples in si to the hash index if they are not already in it.
          2. At end of si add the tuples in the hash index to the result.

Other Operations : Set Operations


• E.g., set operations using hashing:
  1. as before, partition r and s,
  2. as before, process each partition i as follows
     1. build a hash index on ri
     2. Process si as follows
        r ∩ s:
          1. output tuples in si to the result if they are already there in the hash index
        r − s:
          1. for each tuple in si, if it is there in the hash index, delete it from the index.
          2. At end of si add remaining tuples in the hash index to the result.
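
The three hash-based set operations can be sketched together in Python. Assumptions: both inputs are duplicate-free lists of tuples held in memory, Python's `hash` modulo `n` is the partitioning function, and a dictionary serves as the in-memory hash index on ri. The function name is hypothetical.

```python
def hash_set_operation(r, s, op, n=4):
    """Compute r union s, r intersection s or r minus s partition by partition."""
    result = []
    for i in range(n):
        ri = [t for t in r if hash(t) % n == i]     # partition i of r
        si = [t for t in s if hash(t) % n == i]     # partition i of s
        index = dict.fromkeys(ri)                   # hash index on ri
        if op == "union":
            for t in si:
                index.setdefault(t)                 # add only if not already present
            result.extend(index)
        elif op == "intersection":
            result.extend(t for t in si if t in index)
        elif op == "difference":
            for t in si:
                index.pop(t, None)                  # delete matching tuples from the index
            result.extend(index)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result

r = [(1,), (2,), (3,), (4,), (5,)]
s = [(4,), (5,), (6,), (7,)]
```

Equal tuples always hash to the same partition, so each partition can be processed independently of the others.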

Other Operations : Outer Join


• Outer join can be computed either as
  ◦ a join followed by addition of null-padded non-participating tuples, or
  ◦ by modifying the join algorithms.
1. Modifying merge join to compute r ⟕ s
  ◦ In r ⟕ s, non-participating tuples are those in r − Π_R(r ⋈ s)
  ◦ Modify merge-join to compute r ⟕ s:
     - During merging, for every tuple tr from r that does not match any tuple in s, output tr padded with nulls.
  ◦ Right outer-join and full outer-join can be computed similarly.

Example
• SELECT FNAME, DNAME
  FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT
  ON DNO = DNUMBER);
• Note: The result of this query is a table of employee names and their associated departments. It is similar to a regular join result, with the exception that if an employee does not have an associated department, the employee's name will still appear in the resulting table, although the department name would be indicated as null.
• Implementation of the left outer join example
  ◦ {Compute the JOIN of the EMPLOYEE and DEPARTMENT tables}
    TEMP1 ← Π FNAME,DNAME (EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT)
  ◦ {Find the EMPLOYEEs that do not appear in the JOIN}
    TEMP2 ← Π FNAME (EMPLOYEE) − Π FNAME (TEMP1)
  ◦ {Pad each tuple in TEMP2 with a null DNAME field}
    TEMP2 ← TEMP2 × 'null'
  ◦ {UNION the temporary tables to produce the LEFT OUTER JOIN}
    RESULT ← TEMP1 ∪ TEMP2
• The cost of the outer join, as computed above, would include the cost of the associated steps (i.e., join, projections and union).

Other Operations : Outer Join


2. Modifying join to compute r ⟕ s
• Modify the nested-loop join algorithm to compute the left outer join.
• Tuples in the outer relation that do not match any tuple in the inner relation are written to the output after being padded with null values.
• However, it is hard to extend the nested-loop join to compute the full outer join
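
The modification to nested-loop join can be sketched in Python as follows. Assumptions: relations are in-memory lists of tuples, `None` plays the role of null, `s_arity` (the number of attributes of the inner relation) is a hypothetical parameter, and the sample rows are made up for illustration.

```python
def left_outer_nested_loop_join(r, s, theta, s_arity):
    """Left outer join: unmatched outer tuples are padded with nulls."""
    result = []
    for tr in r:
        matched = False
        for ts in s:
            if theta(tr, ts):
                result.append(tr + ts)
                matched = True
        if not matched:
            # No inner tuple matched: output tr padded with null values.
            result.append(tr + (None,) * s_arity)
    return result

employee = [("John", 5), ("Alicia", None)]
department = [(5, "Research")]
rows = left_outer_nested_loop_join(
    employee, department, lambda tr, ts: tr[1] == ts[0], s_arity=2)
```

A full outer join would also need to know which inner tuples were never matched by any outer tuple, which this loop structure does not track naturally.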

Evaluation of Expressions
• So far: we have seen algorithms for individual operations
• Alternatives for evaluating an entire expression tree
  ◦ Materialization: generate results of an expression whose inputs are relations or are already computed, materialize (store) it on disk. Repeat.
  ◦ Pipelining: pass on tuples to parent operations even as an operation is being executed
• We study the above alternatives in more detail

Materialization
• Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations.
• E.g., in the figure below, compute and store
    σ building="Watson" (department)
  then compute and store its join with instructor, and finally compute the projection on name.

[Figure: expression tree for Π name (σ building="Watson" (department) ⋈ instructor)]

Materialization (Cont.)
• Materialized evaluation is always applicable.
• The cost of writing results to disk and reading them back can be quite high.
  ◦ Our cost formulas for operations ignore the cost of writing results to disk, so
      overall cost = sum of the costs of the individual operations + cost of writing intermediate results to disk
• Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is being filled.
  ◦ This allows disk writes to overlap with computation and reduces execution time.

Pipelining
• Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
• E.g., in the previous expression tree, don't store the result of
      σ_{building = "Watson"}(department)
  ◦ Instead, pass tuples directly to the join. Similarly, don't store the result of the join; pass tuples directly to the projection.
• Much cheaper than materialization: no need to store a temporary relation to disk.
• Pipelining may not always be possible, e.g., for sort and hash-join.
• For pipelining to be effective, use evaluation algorithms that generate output tuples even as tuples are received for the inputs to the operation.
• Pipelines can be executed in two ways: demand driven and producer driven.

Pipelining (Cont.)
• In demand-driven or lazy evaluation:
  ◦ The system repeatedly requests the next tuple from the top-level operation.
  ◦ Each operation requests the next tuple from its child operations as required, in order to output its next tuple.
  ◦ In between calls, an operation has to maintain "state" so it knows what to return next.
• Alternative name: pull model of pipelining.
• More commonly used because it is easier to implement.

Pipelining (Cont.)
• Implementation of demand-driven pipelining
  ◦ Each operation is implemented as an iterator that provides the following operations:
    ▪ open()
      - E.g., file scan: initialize the file scan; state: pointer to the beginning of the file.
      - E.g., merge join: sort the relations; state: pointers to the beginning of the sorted relations.
    ▪ next()
      - E.g., for file scan: output the next tuple, then advance and store the file pointer.
      - E.g., for merge join: continue with the merge from the earlier state until the next output tuple is found. Save the pointers as iterator state.
    ▪ close(): tells the iterator that no more tuples are required.
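The open()/next()/close() interface can be sketched as follows. This is only an illustration under simplifying assumptions: the "file" is an in-memory list, `None` signals the end of the input, and the class names `FileScan`, `Select` and `Project` are invented for this example.

```python
class FileScan:
    """Leaf iterator over an in-memory list standing in for a file."""

    def __init__(self, tuples):
        self.tuples = tuples
        self.pos = None

    def open(self):
        self.pos = 0  # state: pointer to the beginning of the "file"

    def next(self):
        if self.pos >= len(self.tuples):
            return None  # end of input
        t = self.tuples[self.pos]
        self.pos += 1  # advance and store the pointer
        return t

    def close(self):
        self.pos = None


class Select:
    """Selection iterator: pulls from its child until a tuple qualifies."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while True:
            t = self.child.next()
            if t is None or self.predicate(t):
                return t

    def close(self):
        self.child.close()


class Project:
    """Projection iterator: keeps only the requested attributes."""

    def __init__(self, child, attrs):
        self.child = child
        self.attrs = attrs

    def open(self):
        self.child.open()

    def next(self):
        t = self.child.next()
        if t is None:
            return None
        return {a: t[a] for a in self.attrs}

    def close(self):
        self.child.close()


department = [
    {"dept_name": "Physics", "building": "Watson"},
    {"dept_name": "Music", "building": "Packard"},
    {"dept_name": "Biology", "building": "Watson"},
]

plan = Project(
    Select(FileScan(department), lambda t: t["building"] == "Watson"),
    ["dept_name"],
)

# The "system" repeatedly requests the next tuple from the top-level operation.
plan.open()
names = []
t = plan.next()
while t is not None:
    names.append(t["dept_name"])
    t = plan.next()
plan.close()
print(names)  # ['Physics', 'Biology']
```

No intermediate relation is ever stored: each call to `plan.next()` pulls just enough tuples through the scan and the selection to produce one output tuple.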

Pipelining (Cont.)
• In producer-driven or eager pipelining:
  ◦ Operators produce tuples eagerly and pass them up to their parents.
    ▪ A buffer is maintained between operators; the child puts tuples in the buffer, and the parent removes tuples from it.
    ▪ If the buffer is full, the child waits until there is space in the buffer, and then generates more tuples.
  ◦ The system schedules operations that have space in their output buffer and can process more input tuples.
• Alternative name: push model of pipelining.
• Useful in parallel processing systems.
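One way to picture the push model is with one thread per operator and a bounded queue between each pair of operators. The sketch below assumes this thread-per-operator design and uses a `DONE` sentinel to mark the end of a stream; both are choices made for the illustration, not something the slides prescribe.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a tuple stream


def scan(tuples, out):
    """Producer: eagerly pushes every tuple into its output buffer."""
    for t in tuples:
        out.put(t)  # blocks while the buffer is full
    out.put(DONE)


def select(predicate, inp, out):
    """Consumes tuples from inp and pushes the qualifying ones to out."""
    while True:
        t = inp.get()  # blocks while the buffer is empty
        if t is DONE:
            out.put(DONE)
            return
        if predicate(t):
            out.put(t)


department = [
    {"dept_name": "Physics", "building": "Watson"},
    {"dept_name": "Music", "building": "Packard"},
    {"dept_name": "Biology", "building": "Watson"},
]

buf1 = queue.Queue(maxsize=2)  # buffer between scan and select
buf2 = queue.Queue(maxsize=2)  # buffer between select and the consumer

workers = [
    threading.Thread(target=scan, args=(department, buf1)),
    threading.Thread(
        target=select,
        args=(lambda t: t["building"] == "Watson", buf1, buf2),
    ),
]
for w in workers:
    w.start()

results = []
while True:
    t = buf2.get()
    if t is DONE:
        break
    results.append(t["dept_name"])

for w in workers:
    w.join()
print(results)  # ['Physics', 'Biology']
```

The bounded `queue.Queue` gives the behaviour described on the slide: a child that finds its output buffer full waits until the parent has removed a tuple.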

Query Optimization

• DBMS Structure
[Figure: DBMS structure]

Introduction
• Alternative ways of evaluating a given query:
  ◦ Equivalent expressions
  ◦ Different algorithms for each operation

Introduction (Cont.)
• An evaluation plan defines exactly what algorithm is used for each operation, and how the execution of the operations is coordinated.

Introduction (Cont.)
• The cost difference between evaluation plans for a query can be enormous.
  ◦ E.g., seconds vs. days in some cases.
• Steps in cost-based query optimization:
  1. Generate logically equivalent expressions using equivalence rules.
  2. Annotate the resultant expressions to get alternative query plans.
  3. Choose the cheapest plan based on estimated cost.

Introduction (Cont.)
• Estimation of plan cost is based on:
  ◦ Statistical information about relations. Examples: number of tuples, number of distinct values for an attribute.
  ◦ Statistics estimation for intermediate results, to compute the cost of complex expressions.
  ◦ Cost formulae for algorithms, computed using statistics.

Transformation of Relational Expressions
• Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance.
  ◦ Note: the order of tuples is irrelevant.
  ◦ We don't care if they generate different results on databases that violate integrity constraints.

Transformation of Relational Expressions
• In SQL, inputs and outputs are multisets of tuples.
  ◦ Two expressions in the multiset version of the relational algebra are said to be equivalent if the two expressions generate the same multiset of tuples on every legal database instance.
• An equivalence rule says that expressions of two forms are equivalent.
  ◦ We can replace an expression of the first form by the second, or vice versa.

Equivalence Rules
[Figures: list of equivalence rules and their pictorial depiction]

Example
• Consider the following University example with relation schemas:
  ◦ instructor(ID, name, dept_name, salary)
  ◦ teaches(ID, course_id, sec_id, semester, year)
  ◦ course(course_id, title, dept_name, credit)

Transformation Example: Pushing Selections
• Query: Find the names of all instructors in the Music department, along with the titles of the courses that they teach.
      Π_{name, title}(σ_{dept_name = "Music"}(instructor ⋈ (teaches ⋈ Π_{course_id, title}(course))))
• Transformation using Rule 7a:
      Π_{name, title}((σ_{dept_name = "Music"}(instructor)) ⋈ (teaches ⋈ Π_{course_id, title}(course)))
• Performing the selection as early as possible reduces the size of the relation to be joined.

Transformation Example: Pushing Selections
• Query: Find the names of all instructors in the Music department who have taught a course in 2009, along with the titles of the courses that they taught.
      Π_{name, title}(σ_{dept_name = "Music" ∧ year = 2009}(instructor ⋈ (teaches ⋈ Π_{course_id, title}(course))))
• Transformation using join associativity (Rule 6a):
      Π_{name, title}(σ_{dept_name = "Music" ∧ year = 2009}((instructor ⋈ teaches) ⋈ Π_{course_id, title}(course)))

Transformation Example: Pushing Selections
• Using Rule 7a, we can rewrite the query as
      Π_{name, title}((σ_{dept_name = "Music" ∧ year = 2009}(instructor ⋈ teaches)) ⋈ Π_{course_id, title}(course))
• This second form provides an opportunity to apply the "perform selections early" rule (Rule 1), resulting in the subexpression
      σ_{dept_name = "Music"}(instructor) ⋈ σ_{year = 2009}(teaches)
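The effect of pushing selections can be checked on a toy instance. The sketch below assumes relations are lists of Python dictionaries; the helper names (`natural_join`, `select`, `as_set`) and the sample tuples are invented for illustration. It compares the selection applied after instructor ⋈ teaches with the two selections applied before the join.

```python
def natural_join(r, s):
    """Natural join of two lists of dicts on their common attribute names."""
    out = []
    for t_r in r:
        for t_s in s:
            common = set(t_r) & set(t_s)
            if all(t_r[a] == t_s[a] for a in common):
                out.append({**t_r, **t_s})
    return out


def select(pred, r):
    """Selection: keep the tuples of r that satisfy pred."""
    return [t for t in r if pred(t)]


def as_set(r):
    """Order-insensitive view of a relation, used to compare results."""
    return {tuple(sorted(t.items())) for t in r}


def is_music(t):
    return t["dept_name"] == "Music"


def in_2009(t):
    return t["year"] == 2009


instructor = [
    {"ID": 1, "name": "Mozart", "dept_name": "Music"},
    {"ID": 2, "name": "Einstein", "dept_name": "Physics"},
    {"ID": 3, "name": "Brahms", "dept_name": "Music"},
]
teaches = [
    {"ID": 1, "course_id": "MU-199", "year": 2009},
    {"ID": 1, "course_id": "MU-101", "year": 2010},
    {"ID": 2, "course_id": "PHY-101", "year": 2009},
    {"ID": 3, "course_id": "MU-201", "year": 2008},
]

# Selection applied after the join.
late = select(lambda t: is_music(t) and in_2009(t),
              natural_join(instructor, teaches))

# Selections pushed below the join.
small_instructor = select(is_music, instructor)
small_teaches = select(in_2009, teaches)
early = natural_join(small_instructor, small_teaches)

print(as_set(late) == as_set(early))  # True
print(len(instructor) * len(teaches), "vs",
      len(small_instructor) * len(small_teaches), "tuple pairs examined")
```

Both forms return the same single tuple, but the pushed-down form feeds the join 2 × 2 tuple pairs instead of 3 × 4.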


Example
An unoptimized relational algebra expression:
      Π_{Name}(σ_{GPA ≥ 3.5 ∧ Title = 'Ada Programming Language' ∧ Students.SSN = Enrollment.SSN ∧ Enrollment.Course_no = Courses.Course_no}(Students × Enrollment × Courses))

Example contd.
• Initial query tree:
      Π_{Name}
      └─ σ_{GPA ≥ 3.5 ∧ Title = 'Ada Programming Language' ∧ Students.SSN = Enrollment.SSN ∧ Enrollment.Course_no = Courses.Course_no}
         └─ ×
            ├─ ×
            │  ├─ Students
            │  └─ Enrollment
            └─ Courses

Example contd.
• Perform selections as early as possible:
      Π_{Name}
      └─ ⋈_{Enrollment.Course_no = Courses.Course_no}
         ├─ ⋈_{Students.SSN = Enrollment.SSN}
         │  ├─ σ_{GPA ≥ 3.5}
         │  │  └─ Students
         │  └─ Enrollment
         └─ σ_{Title = 'Ada Programming Language'}
            └─ Courses

Transformation Example: Pushing Projections
• Consider:
      Π_{name, title}((σ_{dept_name = "Music"}(instructor) ⋈ teaches) ⋈ Π_{course_id, title}(course))
• When we compute
      σ_{dept_name = "Music"}(instructor) ⋈ teaches
  we obtain a relation whose schema is:
      (ID, name, dept_name, salary, course_id, sec_id, semester, year)
• Push projections using equivalence rules 8a and 8b; eliminate unneeded attributes from intermediate results to get:
      Π_{name, title}(Π_{name, course_id}(σ_{dept_name = "Music"}(instructor) ⋈ teaches) ⋈ Π_{course_id, title}(course))
• Performing the projection as early as possible reduces the size of the relation to be joined.

Join Ordering Example
• For all relations r1, r2, and r3,
      (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
  (join associativity, Rule 6a)
• If r2 ⋈ r3 is quite large and r1 ⋈ r2 is small, we choose
      (r1 ⋈ r2) ⋈ r3
  so that we compute and store a smaller temporary relation.

Join Ordering Example (Cont.)
• Consider the expression
      Π_{name, title}((σ_{dept_name = "Music"}(instructor) ⋈ teaches) ⋈ Π_{course_id, title}(course))
• We could compute teaches ⋈ Π_{course_id, title}(course) first, and join the result with
      σ_{dept_name = "Music"}(instructor)
  but the result of the first join is likely to be a large relation.
• Only a small fraction of the university's instructors are likely to be from the Music department,
  ◦ so it is better to compute
      σ_{dept_name = "Music"}(instructor) ⋈ teaches
    first.
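The size argument can be made concrete on a small, made-up instance. The sketch below assumes list-of-dictionary relations, a hypothetical `natural_join` helper, and a `course` relation that has already been projected onto (course_id, title); it compares the intermediate result sizes of the two join orders.

```python
def natural_join(r, s):
    """Natural join of two lists of dicts on their common attribute names."""
    out = []
    for t_r in r:
        for t_s in s:
            common = set(t_r) & set(t_s)
            if all(t_r[a] == t_s[a] for a in common):
                out.append({**t_r, **t_s})
    return out


def as_set(r):
    """Order-insensitive view of a relation, used to compare results."""
    return {tuple(sorted(t.items())) for t in r}


instructor = [
    {"ID": 1, "name": "Mozart", "dept_name": "Music"},
    {"ID": 2, "name": "Einstein", "dept_name": "Physics"},
    {"ID": 3, "name": "Curie", "dept_name": "Physics"},
]
teaches = [
    {"ID": 1, "course_id": "MU-199"},
    {"ID": 2, "course_id": "PHY-101"},
    {"ID": 2, "course_id": "PHY-102"},
    {"ID": 3, "course_id": "PHY-103"},
]
# course, already projected onto (course_id, title)
course = [
    {"course_id": "MU-199", "title": "Music Video Production"},
    {"course_id": "PHY-101", "title": "Physical Principles"},
    {"course_id": "PHY-102", "title": "Optics"},
    {"course_id": "PHY-103", "title": "Radioactivity"},
]

music = [t for t in instructor if t["dept_name"] == "Music"]

# Order 1: (teaches ⋈ course) first, then join with the Music instructors.
inter1 = natural_join(teaches, course)
result1 = natural_join(music, inter1)

# Order 2: (Music instructors ⋈ teaches) first, then join with course.
inter2 = natural_join(music, teaches)
result2 = natural_join(inter2, course)

print(len(inter1), len(inter2))            # 4 1
print(as_set(result1) == as_set(result2))  # True
```

Both orders produce the same final answer, but the second keeps only one tuple as its temporary result instead of four.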

Enumeration of Equivalent Expressions
• Query optimizers use equivalence rules to systematically generate expressions equivalent to the given expression.
• All equivalent expressions can be generated as follows:
      Repeat
          apply all applicable equivalence rules on every subexpression of every equivalent expression found so far
          add newly generated expressions to the set of equivalent expressions
      Until no new equivalent expressions are generated above
• The above approach is very expensive in space and time.
  ◦ Two approaches:
    ▪ Optimized plan generation based on transformation rules
    ▪ Special-case approach for queries with only selections, projections and joins
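The repeat-until loop above can be sketched for the special case of join-only expressions. The code below is an illustration under stated assumptions: a relation is a string, a join is a 2-tuple `(left, right)`, and only two rules are applied (join commutativity and join associativity). The function names are invented for this example.

```python
def rewrites(expr):
    """Yield every expression obtained by applying one rule at one position."""
    if isinstance(expr, str):
        return  # a base relation has nothing to rewrite
    left, right = expr
    # Commutativity: E1 ⋈ E2 = E2 ⋈ E1
    yield (right, left)
    # Associativity: (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3), in both directions
    if not isinstance(left, str):
        a, b = left
        yield (a, (b, right))
    if not isinstance(right, str):
        b, c = right
        yield ((left, b), c)
    # Apply the rules inside the subexpressions as well.
    for new_left in rewrites(left):
        yield (new_left, right)
    for new_right in rewrites(right):
        yield (left, new_right)


def equivalent_expressions(expr):
    """Closure of expr under the rewrite rules."""
    found = {expr}
    frontier = [expr]
    while frontier:  # repeat until no new expressions are generated
        current = frontier.pop()
        for new in rewrites(current):
            if new not in found:
                found.add(new)
                frontier.append(new)
    return found


exprs = equivalent_expressions((("r1", "r2"), "r3"))
print(len(exprs))  # 12
```

For three relations the closure contains 12 join orders. The set grows very quickly with the number of relations, which is the space and time cost the slide refers to.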

Estimating Statistics of Expression Results
• Statistical information for cost estimation using catalog information: the database system catalog stores information about database relations, including:
  ◦ n_r: number of tuples in a relation r.
  ◦ b_r: number of blocks containing tuples of r.
  ◦ l_r: size of a tuple of r.
  ◦ f_r: blocking factor of r, i.e., the number of tuples of r that fit into one block.
  ◦ V(A, r): number of distinct values that appear in r for attribute A; this is the same as the size of Π_A(r).
• If the tuples of r are stored together physically in a file, then:
      b_r = ⌈n_r / f_r⌉
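A small worked example of the block-count formula. The block size, tuple size and tuple counts are made-up numbers, and deriving the blocking factor as `block_size // tuple_size` assumes fixed-length tuples that do not span blocks; neither is stated on the slide.

```python
def blocking_factor(block_size, tuple_size):
    """f_r: how many fixed-size tuples fit in one block (no spanning)."""
    return block_size // tuple_size


def num_blocks(n_r, f_r):
    """b_r = ceil(n_r / f_r), computed with integer arithmetic."""
    return (n_r + f_r - 1) // f_r


f_r = blocking_factor(4096, 100)
print(f_r)                       # 40
print(num_blocks(10_000, f_r))   # 250
print(num_blocks(10_001, f_r))   # 251
```

One extra tuple beyond a full block forces a whole additional block, which is why the formula uses the ceiling.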

Histograms
• Statistical information for cost estimation using catalog information: most real-world optimizers in database systems store the distribution of values for each attribute as a histogram.
[Figure: histogram on attribute age of relation person; x-axis: value (ranges 1–5, 6–10, 11–15, 16–20, 21–25), y-axis: frequency (10 to 50)]
• Equi-width histograms
• Equi-depth histograms
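How an optimizer might use an equi-width histogram can be sketched as follows. The bucket ranges match the axis labels of the figure, but the frequencies are hypothetical (the bar heights are not recoverable from the slide), and the estimate assumes integer values spread uniformly within each bucket.

```python
# ((low, high), frequency) per bucket; the frequencies are made up.
buckets = [
    ((1, 5), 50),
    ((6, 10), 40),
    ((11, 15), 20),
    ((16, 20), 30),
    ((21, 25), 10),
]


def estimate_leq(buckets, v):
    """Estimated number of tuples with attribute value <= v."""
    total = 0.0
    for (lo, hi), freq in buckets:
        if v >= hi:
            total += freq  # whole bucket qualifies
        elif v >= lo:
            width = hi - lo + 1
            total += freq * (v - lo + 1) / width  # part of the bucket
    return total


def estimate_eq(buckets, v):
    """Estimated number of tuples with attribute value == v."""
    for (lo, hi), freq in buckets:
        if lo <= v <= hi:
            return freq / (hi - lo + 1)
    return 0.0


print(estimate_leq(buckets, 8))   # 74.0
print(estimate_eq(buckets, 12))   # 4.0
```

An equi-depth histogram would instead choose the bucket boundaries so that every bucket holds roughly the same number of tuples; the same interpolation idea applies within a bucket.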

Selection Size Estimation
• σ_{A = v}(r)
  ◦ n_r / V(A, r): number of records that will satisfy the selection.
  ◦ Equality condition on a key attribute: size estimate = 1.
• σ_{A ≤ v}(r) (the case of σ_{A ≥ v}(r) is symmetric)
  ◦ Let c denote the estimated number of tuples satisfying the condition.
  ◦ If min(A, r) and max(A, r) are available in the catalog:
    ▪ c = 0 if v < min(A, r)
    ▪ c = n_r if v ≥ max(A, r)
    ▪ c = n_r · (v − min(A, r)) / (max(A, r) − min(A, r)) otherwise
  ◦ If histograms are available, the above estimate can be refined.
  ◦ In the absence of statistical information, c is assumed to be n_r / 2.
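The estimates above translate directly into code. In this sketch the function and parameter names are invented, the sample numbers are made up, and the `v >= max_a` branch (returning n_r) is added so that the estimate never exceeds the size of the relation.

```python
def estimate_equality(n_r, distinct_values, is_key=False):
    """Estimated size of sigma_{A = v}(r)."""
    if is_key:
        return 1
    return n_r / distinct_values


def estimate_leq(n_r, v, min_a=None, max_a=None):
    """Estimated size of sigma_{A <= v}(r)."""
    if min_a is None or max_a is None:
        return n_r / 2  # no statistics available
    if v < min_a:
        return 0
    if v >= max_a:
        return n_r  # assumption of this sketch: cap at the relation size
    return n_r * (v - min_a) / (max_a - min_a)


# 10,000 tuples, 50 distinct values of A
print(estimate_equality(10_000, 50))                  # 200.0
# A ranges from 20,000 to 100,000 and the query asks for A <= 30,000
print(estimate_leq(10_000, 30_000, 20_000, 100_000))  # 1250.0
# same query without min/max statistics
print(estimate_leq(10_000, 30_000))                   # 5000.0
```

The range estimate assumes that values are uniformly distributed between min(A, r) and max(A, r); a histogram replaces that assumption with per-bucket counts.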

Choice of Evaluation Plans
• We must consider the interaction of evaluation techniques when choosing evaluation plans:
  ◦ Choosing the cheapest algorithm for each operation independently may not yield the best overall algorithm.
    ▪ E.g., merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer-level aggregation.
    ▪ Nested-loop join may provide an opportunity for pipelining.
• Practical query optimizers incorporate elements of the following two broad approaches:
  1. Search all the plans and choose the best plan in a cost-based fashion.
  2. Use heuristics to choose a plan.

Cost-based join order selection
• Consider finding the best join order for r1 ⋈ r2 ⋈ . . . ⋈ rn.
• For n = 3, there are 12 different join orderings.
• There are (2(n - 1))! / (n - 1)! different join orders for the above expression. With n = 7, the number is 665280; with n = 10, the number is greater than 17.6 billion!
• There is no need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of {r1, r2, . . ., rn} is computed only once and stored for future use.
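The counts quoted above, and the idea of computing the best plan for each subset only once, can be sketched as follows. The cost model here (sum of the sizes of all join results, with sizes derived from pairwise selectivities) is a deliberately crude assumption for illustration, as are the function names and the sample statistics.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def num_join_orders(n):
    """(2(n - 1))! / (n - 1)! join orders for n relations."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


def best_join_order(sizes, selectivity):
    """Dynamic programming over subsets of relations.

    sizes       : dict mapping relation name -> number of tuples
    selectivity : dict mapping frozenset({a, b}) -> join selectivity;
                  unlisted pairs are treated as cross products (selectivity 1)
    Returns (cost, plan), where cost is the total size of all join results.
    """
    names = sorted(sizes)

    def result_size(subset):
        size = Fraction(1)
        for r in subset:
            size *= sizes[r]
        for a, b in combinations(sorted(subset), 2):
            size *= selectivity.get(frozenset((a, b)), 1)
        return size

    # best[S] = (cost, plan) of the cheapest plan joining exactly the set S
    best = {frozenset([r]): (0, r) for r in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            s = frozenset(combo)
            candidates = []
            # Try every split of S into two non-empty parts; the best plan
            # for each part was computed earlier and is simply looked up.
            for i in range(1, k):
                for left in combinations(combo, i):
                    left_set = frozenset(left)
                    right_set = s - left_set
                    cost = (best[left_set][0] + best[right_set][0]
                            + result_size(s))
                    plan = (best[left_set][1], best[right_set][1])
                    candidates.append((cost, plan))
            best[s] = min(candidates, key=lambda c: c[0])
    return best[frozenset(names)]


print(num_join_orders(3), num_join_orders(7), num_join_orders(10))
# 12 665280 17643225600

sizes = {"music_instructor": 5, "teaches": 1000, "course": 200}
selectivity = {
    frozenset(("music_instructor", "teaches")): Fraction(1, 50),
    frozenset(("teaches", "course")): Fraction(1, 200),
}
cost, plan = best_join_order(sizes, selectivity)
print(cost, plan)
```

With these made-up statistics the cheapest plan joins `music_instructor` with `teaches` first, matching the intuition from the join ordering example.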

Heuristic Optimization
• Cost-based optimization is expensive.
• Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
• Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
  ◦ Perform selection early (reduces the number of tuples).
  ◦ Perform projection early (reduces the number of attributes).
  ◦ Perform the most restrictive selection and join operations (i.e., those with the smallest result size) before other similar operations.
  ◦ Some systems use only heuristics; others combine heuristics with partial cost-based optimization.

Heuristics-based optimisation
• In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.
• Many optimizers consider only left-deep join orders, e.g., the System R optimizer.
  ◦ Plus heuristics to push selections and projections down the query tree.
  ◦ This reduces optimization complexity and generates plans amenable to pipelined evaluation.

Questions from MU papers
• Explain sort-merge join and hash join (Dec 2019) … 10M … Ans: Chp 12, pg 553, pg 557
• Explain brute force nested loop join algorithm (Dec 2018, 2020) … 10M … Ans: Chp 12, pg 550
• Write short note on:
  ◦ Query optimization (Dec 2018) … 5M … Ans: Chp 12, pg 539; Chp 13, pg 580
  ◦ Query evaluation plan (May 2019) … 5M … Ans: Chp 12, pg 539
  ◦ Measures of query cost (May 2019) … 5M … Ans: Chp 12, pg 540
• Why is BCNF called stricter than 3NF? Justify your answer. (Dec 2019) … 5M … out of syllabus … Ans: Chp 8, pg 333–336; watch video https://www.youtube.com/watch?time_continue=18&v=NNjUhvvwOrk&feature=emb_logo
• What is a materialized view? What is its utility? (Dec 2019) … 5M … out of syllabus … Ans: Chp 13, pg 607; watch video https://www.youtube.com/watch?v=06HlvmB8mDk
Note: Chapter and page numbers are from the book: Korth, Silberschatz, Sudarshan, "Database System Concepts", 6th Edition, McGraw-Hill.

Questions from MU papers
• MCQ from Dec 2020
• Which algorithm uses an equality comparison on a key attribute with a primary index to retrieve a single record that satisfies the corresponding equality condition?
  1. Primary index, equality on non-key attribute
  2. Primary index, equality on key attribute
  3. Secondary index, equality on non-key attribute
  4. Secondary index, equality on key attribute
• One of the main heuristic rules in query optimization is:
  1. Apply SELECT operation at the earliest.
  2. Apply PROJECT operation at the earliest.
  3. Apply SELECT and PROJECT operations at the earliest.
  4. Apply SELECT and PROJECT operations before applying the JOIN operation at the earliest.