
Distributed Query Optimization

Chapter 9
Query Processing and Optimization

Query processing is the process of translating
a query expressed in a high-level language
such as SQL into low-level data manipulation
operations.
Query Optimization refers to the process by
which the best execution strategy for a given
query is found from a set of alternatives.
Query Optimization
The input to the third step is an algebraic query on
fragments.
By permuting the ordering of operations within one
fragment query, many equivalent query execution
plans may be found.
The optimizer decides which indices should be used and how to
order the operations of a query (e.g., joins, selects,
and projects).
The goal of query optimization is to find an execution
strategy for the query that is close to optimal.
Distributed Query Optimization
The distributed query optimizer has three components:

Search space.
Search strategy. The search strategy explores the search space
and selects the best plan.
Cost model.

Distributed Query Optimization Issues

Linear query trees are not necessarily a good choice
Bushy query trees are not necessarily a bad choice
What and where to ship the relations
How to ship relations (ship as a whole, ship as
needed)
When to use semi-joins instead of joins

Search Space.
The set of alternative query execution plans (QEP)
Typically very large.
The main issue is to optimize the joins.
Equivalent query trees (join trees) of the joins in the
following query
SELECT ENAME,RESP FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO AND ASG.PNO=PROJ.PNO
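To make the size of the search space concrete, the following minimal sketch enumerates every linear (left-deep) join order for the three relations of the query above and flags the orders that would need a Cartesian product because no join predicate connects the next relation to those already joined. The function names are illustrative only and are not part of any system discussed here.

```python
from itertools import permutations

# Join predicates of the example query:
# EMP.ENO = ASG.ENO and ASG.PNO = PROJ.PNO.
JOIN_EDGES = {frozenset({"EMP", "ASG"}), frozenset({"ASG", "PROJ"})}


def needs_cartesian_product(order):
    """True if some relation in the order joins nothing that precedes it."""
    joined = {order[0]}
    for rel in order[1:]:
        if not any(frozenset({rel, prev}) in JOIN_EDGES for prev in joined):
            return True
        joined.add(rel)
    return False


def linear_join_orders(relations):
    """Return (order, uses_cartesian_product) for every left-deep order."""
    return [(order, needs_cartesian_product(order))
            for order in permutations(relations)]


if __name__ == "__main__":
    for order, cross in linear_join_orders(["EMP", "ASG", "PROJ"]):
        note = "needs a Cartesian product" if cross else "joins only"
        print(" -> ".join(order), ":", note)
```

With three relations there are 3! = 6 linear orders, and the two that place EMP and PROJ next to each other at the start need a Cartesian product.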


Basic Concepts
Reduction of the search space
-Restrict by means of heuristics
Perform unary operations before binary operations,
Restrict the shape of the join tree
Consider the type of trees (linear trees, vs. bushy
ones)

Search Strategy
There are two main strategies to scan the search
space
Deterministic
Randomized

Deterministic scan of the search space
Start from base relations and build plans by adding
one relation at each step.
Breadth-first strategy: build all possible plans
before choosing the best plan.
(dynamic programming approach)
Depth-first strategy: build only one plan
(greedy approach)
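The two deterministic strategies can be sketched as follows. This is a toy illustration, not the algorithm of any particular system: the cardinalities and join selectivities are hypothetical, the cost of a plan is assumed to be the sum of the estimated sizes of its intermediate results, and only linear plans are built.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics, chosen only for illustration
# (PROJ is assumed to be already reduced by a selection).
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 10}
SELECTIVITY = {frozenset({"EMP", "ASG"}): Fraction(1, 400),
               frozenset({"ASG", "PROJ"}): Fraction(1, 100)}


def result_size(relations):
    """Estimated cardinality of joining a collection of relations."""
    size = Fraction(1)
    for rel in relations:
        size *= CARD[rel]
    for a, b in combinations(sorted(relations), 2):
        size *= SELECTIVITY.get(frozenset({a, b}), 1)
    return size


def dynamic_programming(relations):
    """Breadth-first: keep the cheapest plan for every subset of relations."""
    best = {frozenset({r}): (0, (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            candidates = []
            for last in subset:
                cost, order = best[subset - {last}]
                candidates.append((cost + result_size(subset), order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(relations)]


def greedy(relations):
    """Depth-first: build one plan, always adding the cheapest next join."""
    order = (min(relations, key=CARD.get),)
    remaining = set(relations) - set(order)
    cost = 0
    while remaining:
        nxt = min(sorted(remaining),
                  key=lambda r: result_size(set(order) | {r}))
        order += (nxt,)
        cost += result_size(order)
        remaining.remove(nxt)
    return cost, order


if __name__ == "__main__":
    print(dynamic_programming(["EMP", "ASG", "PROJ"]))
    print(greedy(["EMP", "ASG", "PROJ"]))
```

On this tiny input both strategies reach a plan of the same estimated cost. In general the greedy plan can be worse, because it never revisits an earlier choice, while dynamic programming pays for its guarantee by examining every subset.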

Randomized scan of the search space
Search for optimal solutions around a particular
starting point e.g., iterative improvement
Trades optimization time for execution time
Does not guarantee that the best solution is obtained,
but avoids the high cost of optimization
The strategy is better when more than 5-6 relations are
involved.
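Iterative improvement can be sketched as below. Everything specific here is an assumption made for illustration: the six relations, their cardinalities, the chain-shaped join graph with one selectivity, the cost function (sum of estimated intermediate result sizes), the "swap two relations" move, and the stopping rule that treats a plan as a local minimum after a fixed number of consecutive failed moves.

```python
import random
from fractions import Fraction
from itertools import combinations

# Hypothetical chain query R1 - R2 - ... - R6.
RELATIONS = ["R1", "R2", "R3", "R4", "R5", "R6"]
CARD = {"R1": 1000, "R2": 200, "R3": 5000, "R4": 50, "R5": 800, "R6": 300}
SELECTIVITY = {frozenset({a, b}): Fraction(1, 100)
               for a, b in zip(RELATIONS, RELATIONS[1:])}


def result_size(relations):
    """Estimated cardinality of joining a collection of relations."""
    size = Fraction(1)
    for rel in relations:
        size *= CARD[rel]
    for a, b in combinations(sorted(relations), 2):
        size *= SELECTIVITY.get(frozenset({a, b}), 1)
    return size


def plan_cost(order):
    """Sum of the estimated sizes of all intermediate results of a linear plan."""
    return sum(result_size(order[:k]) for k in range(2, len(order) + 1))


def iterative_improvement(relations, starts=5, tries=50, seed=0):
    """Repeated local search from random starting plans; keeps the best one found."""
    rng = random.Random(seed)
    best_cost, best_order = None, None
    for _ in range(starts):
        order = list(relations)
        rng.shuffle(order)                      # random starting point
        cost = plan_cost(order)
        failures = 0
        while failures < tries:                 # stop at an apparent local minimum
            i, j = rng.sample(range(len(order)), 2)
            neighbor = order[:]
            neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
            neighbor_cost = plan_cost(neighbor)
            if neighbor_cost < cost:            # accept only improving moves
                order, cost, failures = neighbor, neighbor_cost, 0
            else:
                failures += 1
        if best_cost is None or cost < best_cost:
            best_cost, best_order = cost, tuple(order)
    return best_cost, best_order


if __name__ == "__main__":
    cost, order = iterative_improvement(RELATIONS)
    print(order, float(cost))
```

The search evaluates only a bounded number of plans regardless of how many orders exist, which is where the saving in optimization time comes from. The returned plan is the best one seen, not necessarily the optimum.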

Distributed Cost Model

Two different types of cost functions can be
used
Reduce total time
Reduce response time

Distributed Cost Model . . .

Total time: Sum of the time of all individual
components.
Local processing time: CPU time + I/O time
Communication time: fixed time to initiate a message
+ time to transmit the data

The individual components of the total cost have
different weights:

Wide area network
Local area networks

Distributed Cost Model . . .
Response time: Elapsed time between the initiation
and the completion of a query

Example
Site 1 sends x units to Site 3 and Site 2 sends y units to Site 3
Assume that only the communication cost is considered
Total time = 2 * message initialization time + unit transmission time * (x + y)
Response time = max {time to send x from 1 to 3, time to send y from 2 to 3}
time to send x from 1 to 3 = message initialization time + unit transmission time * x
time to send y from 2 to 3 = message initialization time + unit transmission time * y
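The difference between the two cost functions in this example can be checked with a few lines of code. This is a minimal sketch that considers communication cost only; the parameter names and the numeric values in the usage part are hypothetical.

```python
def transfer_time(units, init_time, unit_time):
    """Time to send `units` of data in one message between two sites."""
    return init_time + unit_time * units


def total_time(x, y, init_time, unit_time):
    """Sum of both transfers: two message initializations plus all data sent."""
    return (transfer_time(x, init_time, unit_time)
            + transfer_time(y, init_time, unit_time))


def response_time(x, y, init_time, unit_time):
    """Both transfers run in parallel, so the slower one sets the elapsed time."""
    return max(transfer_time(x, init_time, unit_time),
               transfer_time(y, init_time, unit_time))


if __name__ == "__main__":
    # Hypothetical values: 10 time units per message start, 1 per data unit.
    print(total_time(100, 300, init_time=10, unit_time=1))     # 420
    print(response_time(100, 300, init_time=10, unit_time=1))  # 310
```

Minimizing total time reduces overall resource use, while minimizing response time rewards parallel transfers even when they do not reduce the amount of work.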
Database Statistics

The primary cost factor is the size of intermediate
relations
They must be transmitted over the network if a subsequent
operation is located on a different site.
It is costly to compute the size of the intermediate relations
precisely.
Instead, global statistics of relations and fragments are
computed.
Let R(A1,A2, . . . ,Ak) be a relation fragmented into
R1,R2, . . . ,Rr.
Relation statistics
min and max values of each attribute: min{Ai}, max{Ai}.
length of each attribute: length(Ai)
number of distinct values in each fragment (cardinality):
card(Ai),(card(dom(Ai)))
Fragment statistics
cardinality of the fragment: card(Ri)
cardinality of each attribute of each fragment:
card(ΠAi(Rj))
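As an illustration of how such statistics could be gathered, the sketch below computes the cardinality of a fragment and, for each attribute, its min, max, and number of distinct values. The representation of a fragment as a list of dictionaries, the function names, and the sample ASG fragments are all assumptions made for the example. The relation-level function further assumes horizontal, disjoint fragments.

```python
def fragment_statistics(fragment, attributes):
    """Statistics for one non-empty fragment, given as a list of dict tuples."""
    stats = {"card": len(fragment), "attributes": {}}
    for attr in attributes:
        values = [tup[attr] for tup in fragment]
        stats["attributes"][attr] = {
            "min": min(values),
            "max": max(values),
            "distinct": len(set(values)),   # cardinality of the projection on attr
        }
    return stats


def relation_statistics(fragments, attributes):
    """Statistics for the whole relation, assuming horizontal, disjoint fragments."""
    all_tuples = [tup for fragment in fragments for tup in fragment]
    return fragment_statistics(all_tuples, attributes)


if __name__ == "__main__":
    # Hypothetical horizontal fragments of ASG(ENO, PNO, DUR).
    ASG1 = [{"ENO": "E1", "PNO": "P1", "DUR": 12},
            {"ENO": "E2", "PNO": "P1", "DUR": 24}]
    ASG2 = [{"ENO": "E2", "PNO": "P2", "DUR": 6}]
    print(fragment_statistics(ASG1, ["PNO", "DUR"]))
    print(relation_statistics([ASG1, ASG2], ["PNO", "DUR"]))
```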

Query Optimization Process
Input Query
-> Search Space Generation (uses Transformation Rules)
-> Equivalent QEP
-> Search Strategy (uses Cost Model)
-> Best QEP
Centralized Query Optimization
INGRES
dynamic
interpretive
System R
static
exhaustive search
INGRES Algorithm
Decompose each multi-variable query into a
sequence of mono-variable queries with a
common variable
Process each by a one variable query processor
Choose an initial execution plan (heuristics)
Order the rest by considering intermediate relation
sizes

No statistical information is maintained
INGRES Language: QUEL
QUEL Language - a tuple calculus language
Example:
range of e is EMP
range of g is ASG
range of j is PROJ
retrieve e.ENAME
where e.ENO=g.ENO and j.PNO=g.PNO
and j.PNAME="CAD/CAM"

Note: e, g, and j are called variables
INGRES Algorithm: Decomposition
Replace an n variable query q by a series of
queries
q1 -> q2 -> ... -> qn
where qi uses the result of q(i-1).
Detachment
Query q decomposed into q' -> q" where q' and q"
have a common variable which is the result of q'
Tuple substitution
Replace the value of each tuple with actual values and
simplify the query
q(V1, V2, ..., Vn) -> (q'(t1, V2, ..., Vn), t1 ∈ R)
Detachment
q: SELECT V2.A2, V3.A3, ..., Vn.An
FROM R1 V1, ..., Rn Vn
WHERE P1(V1.A1) AND P2(V1.A1, V2.A2, ..., Vn.An)

q': SELECT V1.A1 INTO R1'
FROM R1 V1
WHERE P1(V1.A1)

q": SELECT V2.A2, ..., Vn.An
FROM R1' V1, R2 V2, ..., Rn Vn
WHERE P2(V1.A1, V2.A2, ..., Vn.An)
Detachment Example
Names of employees working on CAD/CAM project
q1: SELECT EMP.ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=PROJ.PNO
AND PROJ.PNAME="CAD/CAM"

q11: SELECT PROJ.PNO INTO JVAR
FROM PROJ
WHERE PROJ.PNAME="CAD/CAM"

q': SELECT EMP.ENAME
FROM EMP,ASG,JVAR
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=JVAR.PNO
Detachment Example (contd)
q': SELECT EMP.ENAME
FROM EMP,ASG,JVAR
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=JVAR.PNO

q12: SELECT ASG.ENO INTO GVAR
FROM ASG,JVAR
WHERE ASG.PNO=JVAR.PNO

q13: SELECT EMP.ENAME
FROM EMP,GVAR
WHERE EMP.ENO=GVAR.ENO
Tuple Substitution
q11 is a mono-variable query
q12 and q13 are subject to tuple substitution
Assume GVAR has two tuples only: <E1> and <E2>
Then q13 becomes
q131: SELECT EMP.ENAME
FROM EMP
WHERE EMP.ENO="E1"

q132: SELECT EMP.ENAME
FROM EMP
WHERE EMP.ENO="E2"
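The decomposition of the CAD/CAM query can be mimicked on toy data. The sketch below is only an illustration of the idea, not INGRES code: the relation contents are hypothetical, relations are plain lists of dictionaries, and the helper names are invented for the example. The selection on PROJ is detached first, and each remaining two-variable query is reduced to a series of one-variable queries by substituting the tuples of the common variable one at a time.

```python
# Hypothetical toy instances of the relations used in the example.
EMP = [{"ENO": "E1", "ENAME": "J. Doe"},
       {"ENO": "E2", "ENAME": "M. Smith"},
       {"ENO": "E3", "ENAME": "A. Lee"}]
ASG = [{"ENO": "E1", "PNO": "P3"},
       {"ENO": "E2", "PNO": "P3"},
       {"ENO": "E3", "PNO": "P1"}]
PROJ = [{"PNO": "P1", "PNAME": "Instrumentation"},
        {"PNO": "P3", "PNAME": "CAD/CAM"}]


def one_variable_query(relation, attribute, predicate):
    """Mono-variable query: project `attribute` from tuples satisfying `predicate`."""
    return [tup[attribute] for tup in relation if predicate(tup)]


def tuple_substitution(values, relation, match_attr, out_attr):
    """Substitute each value of the common variable in turn and run the
    resulting one-variable queries, collecting distinct results in order."""
    result = []
    for value in values:
        matches = one_variable_query(
            relation, out_attr, lambda t, v=value: t[match_attr] == v)
        for out in matches:
            if out not in result:
                result.append(out)
    return result


def employees_on_project(pname):
    # q11: detach the selection on PROJ; its result plays the role of JVAR.
    jvar = one_variable_query(PROJ, "PNO", lambda t: t["PNAME"] == pname)
    # q12: ASG joined with JVAR, processed by substituting each JVAR tuple.
    gvar = tuple_substitution(jvar, ASG, "PNO", "ENO")
    # q13: EMP joined with GVAR, again processed by tuple substitution.
    return tuple_substitution(gvar, EMP, "ENO", "ENAME")


if __name__ == "__main__":
    print(employees_on_project("CAD/CAM"))  # ['J. Doe', 'M. Smith']
```

With this data GVAR holds E1 and E2, so the last step runs exactly two one-variable queries on EMP.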
Distributed INGRES Algorithm
Same as the centralized version except
Movement of relations (and fragments) needs
to be considered
Optimization with respect to communication
cost or response time is possible

Join Ordering in Fragment Queries
Ordering joins
Distributed INGRES
System R*
Join Ordering
Consider two relations only
R -> S if size (R) < size (S)
R <- S if size (R) > size (S)
(the smaller relation is shipped to the site of the larger one)
Multiple relations more difficult because too many
alternatives.
Compute the cost of all alternatives and select the best
one.
Necessary to compute the size of intermediate relations
which is difficult.
Use heuristics
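Join Ordering Example
Consider PROJ ⋈(PNO) ASG ⋈(ENO) EMP
[Figure: join graph of PROJ, ASG, and EMP distributed over Sites 1, 2, and 3, with PROJ and ASG connected on PNO and ASG and EMP connected on ENO]
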

System R Algorithm
1. Simple (i.e., mono-relation) queries are
executed according to the best access path
2. Execute joins
2.1 Determine the possible ordering of joins
2.2 Determine the cost of each ordering
2.3 Choose the join ordering with minimal cost
System R Algorithm
For joins, two alternative algorithms:
Nested loops
for each tuple of external relation (cardinality n1)
for each tuple of internal relation (cardinality n2)
join two tuples if the join predicate is true
end
end
Complexity: n1 * n2

Merge join
sort relations
merge relations
Complexity: n1 + n2 if relations are previously sorted and equijoin
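The two join methods can be written out as a runnable sketch for an equijoin. Relations are assumed to be lists of dictionaries and the function names are illustrative. The merge join here sorts its inputs itself, so the n1 + n2 behaviour applies only to the merge phase or to inputs that are already sorted.

```python
def nested_loop_join(outer, inner, key_outer, key_inner):
    """Compare every outer tuple with every inner tuple: n1 * n2 comparisons."""
    result = []
    for o in outer:
        for i in inner:
            if o[key_outer] == i[key_inner]:
                result.append({**o, **i})
    return result


def merge_join(left, right, key_left, key_right):
    """Sort both inputs on the join attribute, then merge them in one pass."""
    left = sorted(left, key=lambda t: t[key_left])
    right = sorted(right, key=lambda t: t[key_right])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key_left], right[j][key_right]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Pair this left tuple with the whole run of equal keys on the right.
            k = j
            while k < len(right) and right[k][key_right] == lk:
                result.append({**left[i], **right[k]})
                k += 1
            i += 1
    return result


if __name__ == "__main__":
    # Hypothetical sample data.
    EMP = [{"ENO": "E2", "ENAME": "M. Smith"}, {"ENO": "E1", "ENAME": "J. Doe"}]
    ASG = [{"ENO": "E2", "PNO": "P1"}, {"ENO": "E1", "PNO": "P1"},
           {"ENO": "E2", "PNO": "P2"}, {"ENO": "E4", "PNO": "P3"}]
    print(nested_loop_join(EMP, ASG, "ENO", "ENO"))
    print(merge_join(EMP, ASG, "ENO", "ENO"))
```

Both functions return the same set of joined tuples, possibly in a different order.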
System R Algorithm Example
Names of employees working on the CAD/CAM project
Assume
EMP has an index on ENO,
ASG has an index on PNO,
PROJ has an index on PNO and an index on PNAME
Join graph: EMP and ASG are joined on ENO, ASG and PROJ are joined on PNO
System R Example (contd)
Choose the best access paths to each
relation
EMP: sequential scan (no selection on EMP)
ASG: sequential scan (no selection on ASG)
PROJ: index on PNAME (there is a selection on
PROJ based on PNAME)
Determine the best join ordering
EMP ⋈ ASG ⋈ PROJ
ASG ⋈ PROJ ⋈ EMP
PROJ ⋈ ASG ⋈ EMP
ASG ⋈ EMP ⋈ PROJ
EMP × PROJ ⋈ ASG
PROJ × EMP ⋈ ASG
Select the best ordering based on the join
costs evaluated according to the two methods
System R Algorithm
Alternatives
Level 1: EMP, ASG, PROJ
Level 2:
EMP ⋈ ASG (pruned)
EMP × PROJ (pruned)
ASG ⋈ EMP
ASG ⋈ PROJ (pruned)
PROJ ⋈ ASG
PROJ × EMP (pruned)
Level 3:
(ASG ⋈ EMP) ⋈ PROJ
(PROJ ⋈ ASG) ⋈ EMP

System R Algorithm
Best total join order is one of
((ASG ⋈ EMP) ⋈ PROJ)
((PROJ ⋈ ASG) ⋈ EMP)
((PROJ ⋈ ASG) ⋈ EMP) has a useful index on
the select attribute and direct access to the
join attributes of ASG and EMP
Therefore, choose it with the following access
methods:
select PROJ using index on PNAME
then join with ASG using index on PNO
then join with EMP using index on ENO
Distributed Query Optimization Problems
Cost model
multiple query optimization.
heuristics to cut down on alternatives.
Larger set of queries
optimization only on select-project-join queries.
also need to handle complex queries (e.g., unions,
disjunctions, aggregations and sorting).
Optimization cost vs execution cost tradeoff
heuristics to cut down on alternatives.
controllable search strategies.
Optimization/reoptimization interval
extent of changes in database profile before reoptimization is
necessary.
Summary
Distributed query optimization is more complex than
centralized query processing, since
bushy query trees are not necessarily a bad choice
one needs to decide what, where, and how to ship the
relations between the sites
Query optimization searches for the optimal query plan (tree)
For N relations, there are O(N!) equivalent join trees. There
are two main strategies in query optimization: randomized
and deterministic.
(Few) semi-joins can be used to implement a join. The
semi-joins require more operations to perform; however,
the amount of data transferred is reduced
INGRES, System R, Hill Climbing, and SDD-1 are distributed
query optimization algorithms
