
Chapter 1: Query Processing and Query Optimization

Objectives
At the end of this chapter you will be able to:
• Define query processing and query optimization.

• Explain how a query is decomposed and semantically analyzed.

• Create a relational algebra tree to represent a query.

• Apply transformation rules, heuristics, and cost estimation to improve the efficiency of a query.

• Estimate the cost and size of relational algebra operations.

• Describe the approaches for finding the optimal execution strategy.

12/23/2023 2
1.1. Overview of Query Processing

• What is query processing?

The activities involved in parsing, validating, optimizing, and executing a query.

• Aims

• To transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy
expressed in a low-level language (implementing the relational algebra), and

• To execute the strategy to retrieve the required data.

Example: select * from customer where custid = 101;

select * from customer where custid > 101 and custid < 300;

select * from customer where LastName < 'D';

The last query returns the customers whose last name sorts before 'D', that is, names beginning with A, B, or C.

Figure 1.1 Typical steps when processing a high-level query

Query Processing steps

Three Steps of Query Processing

1. Parsing, scanning, validating, and translation
2. Optimization
3. Evaluation

[Figure: Query → Parser & Translator → Relational Algebra Expression tree → Optimizer (uses Statistics About Data) → Execution Plan → Evaluation Engine (reads Data) → Query Output]
Three Steps of Query Processing
1) Parsing and translation

The system first translates the query into its internal form, then translates it into relational algebra and
verifies the relations it references.

The parser checks the syntax of a query such as select * from customer having salary > 1000; the translator
checks schema elements such as attributes and relations, and also converts the SQL into an RA expression.

2) Optimization

The optimizer finds the most efficient evaluation plan for a query, because
there can be more than one way to execute it.

3) Evaluation

The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
1.2. Translating SQL Queries into Relational Algebra and Other Operators
• SQL
• Query language used in most RDBMSs
• Query decomposed into query blocks
• Basic units that can be translated into the algebraic operators
• Contains single SELECT-FROM-WHERE expression
• May contain GROUP BY and HAVING clauses

Translating SQL Queries (cont’d...)
• Example:

• Inner block

• Outer block

Translating SQL Queries (cont’d.)
• Example (cont’d.)
• Inner block translated into:

• Outer block translated into:

• The query optimizer chooses an execution plan for each query block.
Phases of query processing

Query Processing has four main phases.


1. Decomposition.
a. Analysis.
b. Normalization.
c. Semantic Analysis.
d. Simplification.
e. Restructuring.
2. Optimization.
a. Heuristics.
b. Cost estimation (comparing the costs of alternative plans).
3. Code Generation.
4. Execution.

1.3. Query Decomposition

• Aim
• Transform a high-level query into a relational algebra (RA) query.
• Check that the query is syntactically and semantically correct.
Typical stages are:
a. Analysis,
b. Normalization,
c. Semantic analysis,
d. Simplification,
e. Query restructuring.

1.a. Analysis
Analyze the query lexically and syntactically using compiler techniques.
Verify that the relations and attributes exist.
Verify that the operations are appropriate for the object type.
• Example
SELECT staff_no FROM Staff WHERE position > 10;
• This query would be rejected on two grounds:
• staff_no is not defined for the Staff relation (it should be staffNo).
• The comparison '> 10' is incompatible with the type of position, which is a variable-length character string.

1.a. Analysis (cont'd...)
• Finally, the query is transformed into a query tree constructed as follows:
A leaf node is created for each base relation.
A non-leaf node is created for each intermediate relation produced by an RA operation.
The root of the tree represents the query result.
• The sequence of operations is directed from the leaves to the root.

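The construction rules for a query tree can be sketched in Python. This is only an illustration: the `Node` class and the sample query (a Selection over Staff followed by a Projection) are assumptions made for the example, not part of any DBMS API. A post-order walk visits the leaves first and the root last, which matches the leaves-to-root direction described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a query tree (hypothetical helper class)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def evaluation_order(node: Node) -> List[str]:
    """Post-order walk: the inputs of an operation come before the operation."""
    order: List[str] = []
    for child in node.children:
        order.extend(evaluation_order(child))
    order.append(node.label)
    return order


# Leaf = base relation, inner nodes = RA operations, root = query result.
tree = Node("project staffNo", [
    Node("select position = 'Manager'", [
        Node("Staff"),
    ]),
])

print(evaluation_order(tree))
```

The last label in the returned list is always the root, that is, the operation that produces the query result.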
1.b. Normalization
Converts the query into a normalized form for easier manipulation.
The predicate can be converted into one of two forms:
Conjunctive normal form:
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')
Disjunctive normal form:
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')

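The two normal forms of this predicate are logically equivalent, which a truth table can confirm. The sketch below is a small check written for this example; the function names and the three boolean parameters (one per atomic condition) are assumptions.

```python
from itertools import product


def cnf(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    # (position = 'Manager' OR salary > 20000) AND branchNo = 'B003'
    return (is_manager or high_salary) and in_b003


def dnf(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    # (position = 'Manager' AND branchNo = 'B003')
    #   OR (salary > 20000 AND branchNo = 'B003')
    return (is_manager and in_b003) or (high_salary and in_b003)


def equivalent(f, g, n_vars: int) -> bool:
    """Compare two predicates on every combination of truth values."""
    return all(f(*values) == g(*values)
               for values in product([False, True], repeat=n_vars))


print(equivalent(cnf, dnf, 3))
```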
1.c. Semantic Analysis
Rejects normalized queries that are incorrectly formulated or contradictory.

A query is incorrectly formulated if some of its components do not contribute to the generation of the result.

A query is contradictory if its predicate cannot be satisfied by any tuple.

Algorithms to determine correctness exist only for queries that do not contain disjunction and
negation.
For these queries (no disjunction and no negation), two graphs can be constructed:
1. A relation connection graph.
2. A normalized attribute connection graph.

1. Relation connection graph

a. Create a node for each relation and a node for the result.
b. Create an edge between two nodes that represent a join.
c. Create an edge between each relation node that is the source of a projection and the result node.
• If the graph is not connected, the query is incorrectly formulated.
Example 1.2
Checking Semantic Correctness

SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p
WHERE c.clientNo = v.clientNo AND c.maxRent >= 500 AND c.prefType = 'Flat'
AND p.ownerNo = 'CO93';

Relation connection graph

• The relation connection graph is not fully connected, so the query is not correctly formulated.
• The join condition (v.propertyNo = p.propertyNo) has been omitted.

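The connectivity test for Example 1.2 can be sketched as a simple graph traversal. The function name and the edge lists below are assumptions built from the query text: one edge for the join between Client and Viewing, and one edge from PropertyForRent to the result because the projected attributes come from p.

```python
from collections import defaultdict


def is_connected(nodes, edges):
    """Return True if every node is reachable from the first node."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == set(nodes)


nodes = ["Client", "Viewing", "PropertyForRent", "result"]

# Edges implied by the query as written.
edges_as_written = [("Client", "Viewing"), ("PropertyForRent", "result")]

# Adding the omitted join v.propertyNo = p.propertyNo connects the graph.
edges_fixed = edges_as_written + [("Viewing", "PropertyForRent")]

print(is_connected(nodes, edges_as_written))
print(is_connected(nodes, edges_fixed))
```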
1.d. Simplification
• Aims:
1. Detect redundant qualifications,
2. eliminate common sub-expressions,
3. transform the query into a semantically equivalent but more easily and efficiently
computed form.

1.4 Optimization Process

Query optimization:
The activity of choosing an efficient execution strategy for processing a query.
• Aim
To choose the strategy that minimizes resource usage.

Generally, we try to reduce the total execution time of the query, which is the sum of
the execution times of all individual operations that make up the query.
Disk access tends to be the dominant cost in query processing for a centralized DBMS.

The optimizer reorganizes the expression tree for a query using RA transformation rules.

Query Optimization (QO)

• Generally

• A DBMS has algorithms to implement relational algebra expressions.

• SQL is a declarative high-level language: it specifies what is wanted, not how it is obtained.

• Optimization does not necessarily find the "optimal" plan, only a reasonably efficient one.

• It is conducted by the query optimizer of the DBMS.

• Goal: select the best available strategy for executing the query.

Approaches or Techniques of Query Optimization

a. Heuristic rules

b. Cost estimation (comparing the costs of different plans)

1.5 Approaches/Techniques of Query Optimization
a. Heuristic Processing Strategies
► Perform Selection operations as early as possible.
• Keep predicates on the same relation together.

► Combine a Cartesian product with a subsequent Selection whose predicate represents a join condition into a Join operation.

► Use associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations
are executed first.

► Perform Projection as early as possible.

► Keep projection attributes on the same relation together.

► Compute common expressions once.

► If a common expression appears more than once, and the result is not too large, store the result and reuse it when
required.
► This is useful when querying views, as the same expression is used to construct the view each time.
Example Query operation of Heuristics
• At this point we have generated many different query plans for the same
query.
• The example query below shows the effect of the different heuristic operations.
• Adding strategies/algorithms for implementing individual operations would
add even more potential plans.

Example Query
• select e.lname, e.fname, w.pno, w.hours
from employee e, works_on w
where e.ssn = w.essn and w.hours > 20;
What is the relational algebra query tree for the SQL query above?

π e.lname, e.fname, w.pno, w.hours
  σ e.ssn=w.essn AND w.hours > 20
    ×
      Employee e
      works_on w
Cont...
• Heuristic: convert the conjunctive select into a cascade of selects.
Note: the selects could commute.

π e.lname, e.fname, w.pno, w.hours
  σ w.hours > 20
    σ e.ssn=w.essn
      ×
        Employee e
        works_on w
Cont...
• Heuristic: combine select and cross product into a join.

π e.lname, e.fname, w.pno, w.hours
  σ w.hours > 20
    ⋈ e.ssn=w.essn
      Employee e
      works_on w
Cont...
• Heuristic: push projects as early as possible.
First cascade the projects to get the attributes needed to push through the join.

π e.lname, e.fname, w.pno, w.hours
  π e.ssn, e.lname, e.fname, w.essn, w.pno, w.hours
    ⋈ e.ssn=w.essn
      Employee e
      σ w.hours > 20
        works_on w

Using Heuristics in Query Optimization
1. The main heuristic is to apply first the operations that reduce the
size of intermediate results.
2. Perform select operations as early as possible to reduce the
number of tuples and perform project operations as early as
possible to reduce the number of attributes. (This is done by
moving select and project operations as far down the tree as
possible.)
3. The select and join operations that are most restrictive should be
executed before other similar operations. (This is done by
reordering the leaf nodes of the tree among themselves and
adjusting the rest of the tree appropriately.)

2.b. Cost Estimation for Relational Algebra Operations
There are many different ways of implementing RA operations.

The aim of cost estimation in query optimization is:

► To choose the most efficient implementation.
► To use formulae that estimate the costs of a number of options, and select the one with the lowest
cost.
► To consider only the cost of disk access, which is usually the dominant cost in query processing.
► Many estimates are based on the cardinality of the relation, so we need to be able to estimate
this.

An Example
• Query:
Select B, D From R, S
Where R.A = 'c' and S.E = 2 and R.C = S.C;
What is the answer (B, D)?

R                  S
A   B   C          C   D   E
a   1   10         15  x   2
b   1   20         25  y   2
c   2   25         32  y   3
d   2   10         10  z   1
e   3   26

Answer
B   D
2   y

An Example (cont.)
Plan 1
• Cross product of R & S
• Select tuples using WHERE conditions
• Project on B & D

Algebra expression:
π_{B,D}(σ_{R.A='c' ∧ S.E=2 ∧ R.C=S.C}(R × S))

Query tree:
π B,D
  σ R.A='c' ∧ S.E=2 ∧ R.C=S.C
    ×
      R
      S
An Example (cont.)
Plan 2
• Select R tuples with R.A = 'c'
• Select S tuples with S.E = 2
• Natural join
• Project B & D

Algebra expression:
π_{B,D}(σ_{R.A='c'}(R) ⋈ σ_{S.E=2}(S))

Query tree:
π B,D
  ⋈
    σ R.A='c'
      R
    σ S.E=2
      S

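The difference between the two plans can be made concrete by running both on the example data. The sketch below is only an illustration: the rows of R and S are taken from the example as best they can be read from the slide, and counting the (r, s) pairs each plan forms is an assumed stand-in for intermediate-result size, not a real cost model.

```python
# Tuples of R are (A, B, C); tuples of S are (C, D, E).
R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 25), ("d", 2, 10), ("e", 3, 26)]
S = [(15, "x", 2), (25, "y", 2), (32, "y", 3), (10, "z", 1)]


def plan1():
    """Cross product first, then select, then project."""
    cross = [(r, s) for r in R for s in S]
    selected = [(r, s) for r, s in cross
                if r[0] == "c" and s[2] == 2 and r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in selected]
    return answer, len(cross)


def plan2():
    """Select on each relation first, then join, then project."""
    r_selected = [r for r in R if r[0] == "c"]
    s_selected = [s for s in S if s[2] == 2]
    pairs = [(r, s) for r in r_selected for s in s_selected]
    joined = [(r, s) for r, s in pairs if r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in joined]
    return answer, len(pairs)


print(plan1())  # same answer, 20 pairs formed
print(plan2())  # same answer, 2 pairs formed
```

Both plans return the same answer, but Plan 2 forms far fewer intermediate pairs, which is the point of performing selections early.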
Measures of Query Cost
• Disk access
• CPU cycles
• Transit time in the network (distributed systems)
• CPU cost is difficult to calculate and is insignificant compared with disk access cost.
• Consider only disk access.
Disk Access Cost:
• No. of seeks (N); random I/O costs more than sequential I/O because it needs more seeks
• Cost = N * avg seek time
• No. of blocks read (N)
• Cost = N * avg block read cost
• No. of blocks written (N)
• Cost = N * avg block write cost
• Cost of writing > cost of reading, because once a block is written it has to be read back to check that it was written
correctly.

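The three disk-cost terms above can be combined into a single estimate. The sketch below assumes a hypothetical helper `disk_access_cost` and made-up timings in microseconds; real values depend on the storage device.

```python
def disk_access_cost(seeks, blocks_read, blocks_written,
                     avg_seek_us, avg_read_us, avg_write_us):
    """Sum of the seek, block-read and block-write cost terms (microseconds)."""
    return (seeks * avg_seek_us
            + blocks_read * avg_read_us
            + blocks_written * avg_write_us)


# Hypothetical timings: 4000 us per seek, 100 us per block read,
# 200 us per block write.
TIMINGS = dict(avg_seek_us=4000, avg_read_us=100, avg_write_us=200)

# Reading 100 blocks scattered over the disk: one seek per block.
random_read = disk_access_cost(100, 100, 0, **TIMINGS)

# Reading 100 contiguous blocks: a single seek, then sequential transfer.
sequential_read = disk_access_cost(1, 100, 0, **TIMINGS)

print(random_read, sequential_read)
```

With these assumed timings the scattered read is dominated by seek time, which is why the number of seeks is listed as a separate cost measure.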
Query Evaluation Process
[Figure: Query → Scanner → Parser → Internal representation → Optimizer → Execution strategies → Code Generator → Execution plan → Runtime Database Processor → Answer; the DBMS reads the Data during execution.]
Query Evaluation

• How is each individual relational operation evaluated?

• Selection: find a subset of rows in a table

• Join: connect tuples from two tables

• Other operations: union, projection, …

Cost of Operations
• Cost = I/O cost + CPU cost
• I/O cost: # pages (reads & writes) or # operations (multiple pages)
• CPU cost: # comparisons or # tuples processed
• I/O cost dominates (for large databases)
• Cost depends on
• Types of query conditions
• Availability of fast access paths
• DBMSs keep statistics for cost estimation.

Notations
• Used to describe the cost of operations.
• Relations: R, S

• nR: # tuples in R

• nS: # tuples in S

• bR: # pages in R

• bfR: # tuples of R per page, so bR = nR / bfR

• dist(R.A): # distinct values in R.A

• min(R.A): smallest value in R.A

• max(R.A): largest value in R.A

• HI: # index pages accessed (B+ tree height)

Simple Selection
• Simple selection: σ_{A op a}(R)
• A is a single attribute, a is a constant, op is one of =, ≠, <, ≤, >, ≥.
• We do not discuss ≠ further because it requires a sequential scan of the table.
• E.g. Select * from R where A = a;
How many tuples will be selected?
• Selectivity factor SF_{A op a}(R): the fraction of tuples of R satisfying "A op a"
• 0 ≤ SF_{A op a}(R) ≤ 1
• # tuples selected: NS = nR × SF_{A op a}(R)

Options of Simple Selection
Sequential (linear) Scan
• General condition: cost = bR
• Equality on key: average cost = bR / 2
Binary Search
• Records are stored in sorted order
• Equality on key: cost = log2(bR)
• Equality on non-key (duplicates allowed):
cost = log2(bR) + NS/bfR - 1
= sorted search time + pages holding the selected tuples - the first page (already read by the search)
where bfR = # tuples per page

Example: Cost of Selection

Relation: R(A, B, C)
nR = 10,000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
B+ tree clustering index on A with order 25 (p=25)
B+ tree secondary index on B with order 25
Query:
select * from R where A = a and B = b1
What is the relational algebra expression?
Relational algebra expression: σ_{A=a ∧ B=b1}(R)
Example: Cost of Selection (cont'd...)
• Option 1: Sequential (linear) scan
• Have to go through the entire relation
• Cost = bR = nR/bfR = 10,000/20 = 500
• Option 2: Binary search using A = a
• R is sorted on A
• NS = nR/dist(A) = 10,000/50 = 200
• assuming a uniform distribution of values
• Cost = log2(bR) + NS/bfR - 1
= log2(500) + 200/20 - 1 ≈ 9 + 10 - 1 = 18

Example: Cost of Selection (cont.)
• Option 3: Use the clustering index on R.A:
• Average order of the B+ tree
= (P + 0.5P)/2 = 18.75 ≈ 19
• Leaf nodes have 18 entries, internal nodes have 19 pointers
• # leaf nodes = ⌈50/18⌉ = 3
• # nodes next level = 1
• HI = 2
• Clustering index: Cost = HI + NS/bfR
= 2 + 200/20 = 12

Example: Cost of Selection (cont.)
• Option 4: Use the secondary index on R.B
• Average order = 19
• NS = 10,000/500 = 20
• Use Option I (allow duplicate keys)
• # nodes 1st level = ⌈10,000/18⌉ = 556 (leaf)
• # nodes 2nd level = ⌈556/19⌉ = 30 (internal)
• # nodes 3rd level = ⌈30/19⌉ = 2 (internal)
• # nodes 4th level = 1
• HI = 4
• Secondary index: Cost = HI + NS
= 4 + 20 = 24

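The four options can be recomputed with a short script. This is a sketch under the example's own assumptions (uniform distribution, 18 entries per leaf node, 19 pointers per internal node); rounding every division up with `ceil` and the helper `index_height` are choices made here to reproduce the slide's figures.

```python
import math

n_R, bf_R = 10_000, 20         # tuples in R, tuples per page
dist_A, dist_B = 50, 500       # distinct values of A and B
LEAF_ENTRIES, FANOUT = 18, 19  # average B+ tree node capacities from the example

b_R = math.ceil(n_R / bf_R)    # pages in R


def index_height(n_entries: int) -> int:
    """Number of B+ tree levels needed to index n_entries leaf entries."""
    nodes = math.ceil(n_entries / LEAF_ENTRIES)
    levels = 1
    while nodes > 1:
        nodes = math.ceil(nodes / FANOUT)
        levels += 1
    return levels


ns_A = n_R // dist_A           # tuples matching A = a
ns_B = n_R // dist_B           # tuples matching B = b1

costs = {
    "sequential scan": b_R,
    "binary search on A": math.ceil(math.log2(b_R)) + math.ceil(ns_A / bf_R) - 1,
    # The clustering index has one entry per distinct value of A.
    "clustering index on A": index_height(dist_A) + math.ceil(ns_A / bf_R),
    # The secondary index has one entry per tuple (duplicate keys allowed).
    "secondary index on B": index_height(n_R) + ns_B,
}

for option, cost in costs.items():
    print(f"{option}: {cost} page accesses")
```

Under these assumptions the clustering index on A is the cheapest access path for this query.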
Estimate Size of Join Result
• How many tuples in join result?
• Cross product (special case of join)
NJ = nR × nS
• R.A is a foreign key referencing S.B
NJ = nR (assume no null value)
• S.B is a foreign key referencing R.A
NJ = nS (assume no null value)
• Both R.A & S.B are non-key
NJ ≈ min( (nR × nS) / dist(R.A), (nR × nS) / dist(S.B) )

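The non-key case can be written as a small function. This is a sketch of the estimate only: the function name and the sample statistics are hypothetical, and integer division is used to keep the result a whole number of tuples.

```python
def estimate_join_size(n_r: int, n_s: int, dist_ra: int, dist_sb: int) -> int:
    """Estimated tuples in R join S on R.A = S.B when neither attribute is a key."""
    cross_product = n_r * n_s
    return min(cross_product // dist_ra, cross_product // dist_sb)


# Hypothetical statistics: 10,000 tuples in R, 2,000 in S,
# 50 distinct values in R.A and 100 in S.B.
print(estimate_join_size(10_000, 2_000, 50, 100))
```

The smaller of the two quotients comes from the attribute with more distinct values, so that attribute determines the estimate.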
Basic Algorithms for Executing Query Operations
Consider only single table queries
•Three categories:
1. simple SELECT: one condition, no AND or OR
2. conjunctive SELECT : multiple conditions, connected by AND
3. disjunctive SELECT : multiple conditions, connected by OR

1. Simple SELECT or methods for simple selection
1.1. Linear search (brute force algorithm):
• algorithm:
• Retrieve every record in the file
• test whether its attribute values satisfy the selection condition
• works when:
• always works
• best on small files
• only choice when no indexes or ordering
• cost
• average case: b/2, where b = # blocks in file
• worst case: b

1.2. Binary search:


• algorithm:
• use binary search to find record(s)
• works when:
• selection condition is an equality test on an ordering attribute
• cost:
log2(b) + ⌈s/bfr⌉ - 1
where b = # blocks in file, s = # selected records, bfr = # records per block
Simple SELECT(cont’d...)
1.3. Primary index to retrieve a single record:
• algorithm:
• look up record using primary index
• works when:
• selection condition is equality test on key attribute with primary index
• cost:
x + 1 , where x = # index levels or height of the tree
1.4. Clustering index to retrieve multiple records:
• algorithm:
• use clustering (secondary) index to retrieve all the records satisfying the selection condition
• works when:
• selection condition is equality comparison on a non-key attribute with clustering index
• cost:

where x = # index levels, s = # selected records

• Blocks are not in consecutive locations because the data is not sorted on the non-key
attribute.
2.Conjunctive SELECT or methods for conjunctive (logical AND) selection
2.1. Conjunctive selection Using an individual index
• algorithm:
• use one of the simple SELECT algorithms to find records matching one condition
• check those records for remaining conditions
• works when:
• conjunctive selection in which one of the simple SELECT algorithms can be applied to one condition
• cost:
• same as simple SELECT cost
• example:
• select * from EMPLOYEE
where IDNO = 6 AND salary > 7000; and an index exists on <IDNO>
2.2. Conjunctive selection using a composite index
• algorithm:
• use composite index directly
• works when:
• selection condition is equality tests on two or more attributes for which a composite index exists
• cost:
• x+1
• example:
• select * from EMPLOYEE
where LNAME = 'Jones' and FNAME = 'Sami'; and an index exists on <LNAME, FNAME>
Conjunctive SELECT(cont’d...)
2.3. Conjunctive selection by intersection of record pointers:
• algorithm:
• use indexes to find record pointers for each equality condition
• compute intersection of record pointer sets
• retrieve records using records pointers
• works when:
• secondary indexes are available on all (or some of) the fields involved in equality
comparison conditions
• indexes include record pointers (rather than block pointers)
• cost:
• sum of index operations plus s, where s = #records selected
• example:
Select * from WORKS_ON
where PNO = 10 and LNAME = 'Wolfe';
and indexes exist for both PNO and LNAME

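Selection by intersection of record pointers can be sketched with in-memory dictionaries standing in for the secondary indexes. Everything here is illustrative: the rows, the integer record pointers, and the helpers `build_index` and `conjunctive_select` are assumptions, and the function expects at least one equality condition.

```python
from collections import defaultdict

# Record pointer -> record (hypothetical rows following the slide's example).
records = {
    0: {"PNO": 10, "LNAME": "Wolfe"},
    1: {"PNO": 10, "LNAME": "Smith"},
    2: {"PNO": 20, "LNAME": "Wolfe"},
    3: {"PNO": 10, "LNAME": "Wolfe"},
}


def build_index(rows, attribute):
    """Secondary index: attribute value -> set of record pointers."""
    index = defaultdict(set)
    for pointer, row in rows.items():
        index[row[attribute]].add(pointer)
    return index


def conjunctive_select(rows, indexes, conditions):
    """Intersect the pointer sets of each equality condition, then fetch."""
    pointer_sets = [indexes[attr].get(value, set())
                    for attr, value in conditions.items()]
    matching = set.intersection(*pointer_sets)
    # Only the records in the intersection are actually retrieved.
    return [rows[pointer] for pointer in sorted(matching)]


indexes = {attr: build_index(records, attr) for attr in ("PNO", "LNAME")}
result = conjunctive_select(records, indexes, {"PNO": 10, "LNAME": "Wolfe"})
print(result)
```

Only the records whose pointers survive the intersection are fetched, which is where the "plus s" term in the cost comes from.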
3. Disjunctive SELECT (logical OR) selection
• Disjunctive selects are much harder to optimize
• no single condition can be used to 'pre-filter' the results
• the result is the union of the results of each condition
• the best you can do is to try to optimize each individual condition, then compute the union
• example:
select * from EMPLOYEE
where DNO = 3 or SALARY > 80000 or SEX = 'F';

Join Algorithms
• We'll consider joins such as R ⋈_{A=B} S.
• This extends to joins like R ⋈_{A=B and C=D} S
by considering <A,C> and <B,D> as single attributes.

Semantic Query Optimization
Uses constraints specified on the database schema to transform one query into another query that is
more efficient to execute.
Consider the following SQL query:
select e.lname, m.lname from employee e, employee m
where e.superssn = m.ssn and e.salary > m.salary
Explanation:
Suppose that we had a constraint on the database schema stating that no employee can earn more than
his or her direct supervisor. If the semantic query optimizer checks for the existence of this constraint, it
need not execute the query at all, because it knows that the result of the query will be empty. Techniques
known as theorem proving can be used for this purpose.

1.6 Transformation Rules

Transformation Rules for RA Operations


Apply well-known transformation rules of Boolean algebra.
1. Conjunctive Selection operations can cascade into individual Selection operations (and
vice versa).
σ_{p ∧ q ∧ r}(R) = σ_p(σ_q(σ_r(R)))
Sometimes referred to as cascade of Selection.
σ_{branchNo='B003' ∧ salary>15000}(Staff) = σ_{branchNo='B003'}(σ_{salary>15000}(Staff)); these selects can be
commuted.
2. Commutativity of Selection:
σ_p(σ_q(R)) = σ_q(σ_p(R))
• For example:
σ_{branchNo='B003'}(σ_{salary>15000}(Staff)) = σ_{salary>15000}(σ_{branchNo='B003'}(Staff))
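Rules 1 and 2 can be checked on a small relation. The sketch below uses made-up Staff rows and a hypothetical `select` helper; it only demonstrates that the conjunctive selection and both orders of the cascade return the same tuples.

```python
# Hypothetical Staff tuples for illustration.
staff = [
    {"staffNo": "S1", "branchNo": "B003", "salary": 18000},
    {"staffNo": "S2", "branchNo": "B003", "salary": 12000},
    {"staffNo": "S3", "branchNo": "B005", "salary": 30000},
    {"staffNo": "S4", "branchNo": "B003", "salary": 24000},
]


def select(predicate, relation):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def in_b003(t):
    return t["branchNo"] == "B003"


def well_paid(t):
    return t["salary"] > 15000


# Rule 1: one conjunctive selection equals a cascade of selections.
conjunctive = select(lambda t: in_b003(t) and well_paid(t), staff)
cascade = select(in_b003, select(well_paid, staff))

# Rule 2: the order of the cascaded selections does not matter.
commuted = select(well_paid, select(in_b003, staff))

print([t["staffNo"] for t in conjunctive])
print(conjunctive == cascade == commuted)
```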
Transformation Rules for RA Operations(cont’d...)
3. Commutativity of Theta join (and Cartesian product).
R ⋈_p S = S ⋈_p R
R × S = S × R

• The rule also applies to Equijoin and Natural join.

For example:

4. Commutativity of Projection and Union.

π_L(R ∪ S) = π_L(S) ∪ π_L(R)
5. Associativity of Union and Intersection (but not Set difference).

Transformation Rules for RA Operations(cont’d...)

6. Associativity of Theta join (and Cartesian product).


Cartesian product and Natural join are always associative:

If join condition q involves attributes only from S and T, then Theta
join is associative:

Example 1.3
Use of Transformation Rules

For prospective renters of flats, find the properties that match their requirements and are owned
by CO93.
SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p
WHERE c.prefType = 'Flat' AND c.clientNo = v.clientNo AND
v.propertyNo = p.propertyNo AND c.maxRent >= p.rent AND
c.prefType = p.type AND p.ownerNo = 'CO93';
What is the RA tree for the SQL query above?

Algorithms for PROJECT and Set Operations
• Set operations
• UNION
• INTERSECTION
• SET DIFFERENCE
• CARTESIAN PRODUCT
• Set operations sometimes expensive to implement
• Sort-merge technique
• Hashing
• Use of anti-join for SET DIFFERENCE
• EXCEPT or MINUS in SQL
• Example: Find which departments have no employees becomes

Implementing Aggregate Operations and Different Types of JOINs
• Aggregate operators
• MIN, MAX, COUNT, AVERAGE, SUM
• Can be computed by a table scan or using an appropriate index
• Example:

• If an (ascending) B+ -tree index on Salary exists:


• Optimizer can use the Salary index to search for the largest Salary value
• Follow the rightmost pointer in each index node from the root to the rightmost leaf.
• AVERAGE or SUM
• Index can be used if it is a dense index
• Computation applied to the values in the index
• Nondense index can be used if actual number of records associated with each index value
is stored in each index entry
• COUNT
• Number of values can be computed from the index
Ch1-Assignment
1. Discuss the difference between query processing and query optimization, with examples.
2. Discuss the reasons for converting SQL queries into relational algebra queries before optimization is
done. Give examples.
3. Discuss the cost components used to estimate the cost of query execution.
4. Apply the heuristic optimizations listed below to the following SQL query.
select fname from Emp, works_on, Project
where Pname = 'abay_dam' AND Pnumber = Pno AND Essn = Ssn AND Bdate = '2001-07-24';
a. Draw the initial query tree and the query graph for the SQL query.
b. Move the SELECT operations down the query tree.
c. Apply the more restrictive SELECT operation first.
d. Replace the cross product and SELECT with a Join operation.
e. Move the PROJECTION operations down the query tree.