Chapter 1: Query Processing and Query Optimization

At the end of this chapter you will be able to:
 Define query processing and query optimization.

 Explain how a query is decomposed and semantically analyzed.

 Create a relational algebra tree to represent a query.

 Apply transformation rules, heuristics, and cost estimation to improve the efficiency of a query.

 Evaluate the cost and size of the relational algebra operations.

 Describe the approaches for finding the optimal execution strategy.

1.1. Overview of Query Processing

 What is Query processing?

The process of activities involved in parsing, validating, optimizing, and executing a query.

 Aims

 To transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy
expressed in a low-level language (implementing the relational algebra), and

 To execute the strategy to retrieve the required data.

Example: select * from customer where custid = 101;

select * from customer where custid > 101 and custid < 300;

select * from customer where LastName < 'D';

The last query returns the customers whose last name sorts before 'D', such as names beginning with A, B, or C.

Figure 1.1 Typical steps when processing a high-level query

Query Processing steps

Three Steps of Query Processing

1. Parsing, scanning, validating, and translation
2. Optimization
3. Evaluation

[Figure 1.1 components: parser, relational algebra expression tree, statistics about data, execution plan, evaluation engine, query output]

Three Steps of Query Processing
1) Parsing and translation

The query is first translated into an internal form and then into relational algebra, and the relations it names are verified.

The parser checks syntax (for example, of a statement such as select * from customer having salary > 1000;), the translator checks schema elements such as attributes and relations, and the SQL is then converted to a relational algebra (RA) expression.

2) Optimization

The optimizer finds the most efficient evaluation plan for a query, because there can be more than one way to execute it.

3) Evaluation

The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

1.2. Translating SQL Queries into Relational Algebra and Other
• SQL is the query language used in most RDBMSs
• A query is decomposed into query blocks
• Query blocks are the basic units that can be translated into the algebraic operators
• Each block contains a single SELECT-FROM-WHERE expression
• A block may also contain GROUP BY and HAVING clauses

Translating SQL Queries (cont’d...)
• Example:

• Inner block

• Outer block

Translating SQL Queries (cont’d.)
• Example (cont’d.)
• Inner block translated into:

• Outer block translated into:

• The query optimizer chooses an execution plan for each query block.

Phases of query processing

Query Processing has four main phases.

1. Decomposition.
a. Analysis.
b. Normalization.
c. Semantic Analysis.
d. Simplification.
e. Restructuring.
2. Optimization.
a. Heuristics.
b. Cost estimation (comparing the costs of different plans)
3. Code Generation.
4. Execution.

1.3. Query Decomposition

 Aim
• Transform a high-level query into an RA query.
• Check that the query is syntactically and semantically correct.
Typical stages are:
a. Analysis,
b. Normalization,
c. Semantic analysis,
d. Simplification,
e. Query restructuring.

1.a. Analysis
Analyze the query lexically and syntactically using compiler techniques.
Verify that the relations and attributes exist.
Verify that the operations are appropriate for the object type.
• Example
SELECT staff_no FROM Staff WHERE position > 10;
• This query would be rejected on two grounds:
• staff_no is not defined for the Staff relation (should be staffNo).
• The comparison '>10' is incompatible with the type of position, which is a variable-length character string.

1.a. Analysis(cont’d...)
 Finally, query transformed into a query tree constructed as follows:
Leaf node for each base relation.
Non-leaf node for each intermediate relation produced by RA operation.
Root of tree represents query result.
• Sequence is directed from leaves to root.

1.b. Normalization
Converts the query into a normalized form for easier manipulation.
A predicate can be converted into one of two forms:
Conjunctive normal form (a conjunction of disjunctions):
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')
Disjunctive normal form (a disjunction of conjunctions):
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
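
The two normal forms above describe the same predicate. A minimal sketch of how that can be checked is to compare both forms over every combination of truth values; the function and parameter names here are illustrative assumptions, not part of the original notes.

```python
from itertools import product

def cnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' OR salary > 20000) AND (branchNo = 'B003')
    return (is_manager or high_salary) and in_b003

def dnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' AND branchNo = 'B003')
    #   OR (salary > 20000 AND branchNo = 'B003')
    return (is_manager and in_b003) or (high_salary and in_b003)

def equivalent(f, g, arity):
    # Two predicates are equivalent if they agree on every truth assignment.
    return all(f(*values) == g(*values)
               for values in product([False, True], repeat=arity))

print(equivalent(cnf, dnf, 3))
```

Because the check enumerates all eight assignments, it only scales to small predicates, but it shows why normalization does not change the meaning of a query.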

1.c. Semantic Analysis
Rejects normalized queries that are incorrectly formulated or contradictory.

Query is incorrectly formulated if components do not contribute to generation of result.

Query is contradictory if its predicate cannot be satisfied by any tuple.

Algorithms to determine correctness exist only for queries that do not contain disjunction and negation.
For these queries (no disjunction and no negation) we can construct two graphs:
1. A relation connection graph.
2. A normalized attribute connection graph.

1. Relation connection graph

a. Create a node for each relation and a node for the result.
b. Create edges between two nodes that represent a join.
c. Create edges between nodes that represent projection.
If the graph is not connected, the query is incorrectly formulated.
Example 1.2
Checking Semantic Correctness

SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p
WHERE c.clientNo = v.clientNo AND c.maxRent >= 500 AND c.prefType = 'Flat'
AND p.ownerNo = 'CO93';

Relation Connection graph

 The relation connection graph is not fully connected, so the query is not correctly formulated.
 The join condition (v.propertyNo = p.propertyNo) has been omitted.
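
The connectivity test in Example 1.2 can be sketched as a graph traversal. This is an illustrative sketch only: the node and edge lists are read off the query by hand (one edge for the Client-Viewing join, one for the projection from PropertyForRent to the result), and the helper name is an assumption.

```python
from collections import defaultdict

def is_connected(nodes, edges):
    # Build an undirected adjacency list from the join/projection edges.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first traversal from an arbitrary node.
    start = nodes[0]
    seen = {start}
    stack = [start]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == set(nodes)

nodes = ["Client", "Viewing", "PropertyForRent", "result"]
# Edges implied by the query as written: one join and one projection.
edges = [("Client", "Viewing"), ("PropertyForRent", "result")]
# Adding the omitted join condition v.propertyNo = p.propertyNo.
fixed_edges = edges + [("Viewing", "PropertyForRent")]

print(is_connected(nodes, edges))
print(is_connected(nodes, fixed_edges))
```

With the original edges the graph splits into two components, which is exactly the condition under which the query is rejected as incorrectly formulated.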
1.d. Simplification
 Aims:
1. Detect redundant qualifications,
2. eliminate common sub-expressions,
3. transform the query into a semantically equivalent form that is computed more easily and efficiently.

1.4 Optimization Process

Query optimization:
The activity of choosing an efficient execution strategy for processing a query.
 Aim
To choose the one that minimizes resource usage.

Generally, we try to reduce the total execution time of the query, which is the sum of
the execution times of all individual operations that make up the query.
Disk access tends to be dominant cost in query processing for centralized DBMS.

It reorganizes the expression tree for a query using RA transformation rules.

Query Optimization(QO)

 Generally

 DBMS has algorithms to implement relational algebra expressions

 SQL is a different kind of high level language; specify what is wanted, not how it is obtained

 Optimization – not necessarily “optimal”, but reasonably efficient query.

 Conducted by a query optimizer in a DBMS

 Goal: select best available strategy for executing query

Approaches or Techniques of Query Optimization

a. Heuristic rules

b. Cost estimation(Comparing costs of different plans)

1.5 Approaches/ Techniques of Query Optimization
a. Heuristic rules (processing strategies)
► Perform Selection operations as early as possible.
 Keep predicates on same relation together.

► Combine Cartesian product with subsequent Selection whose predicate represents join condition into a Join operation.

► Use associativity of binary operations to rearrange leaf nodes so leaf nodes with most restrictive Selection operations
executed first.

► Perform Projection as early as possible.

► Keep projection attributes on same relation together.

► Compute common expressions once.

► If a common expression appears more than once, and its result is not too large, store the result and reuse it when required.
► This is useful when querying views, as the same expression is used to construct the view each time.
Example Query operation of Heuristics
• At this point many different query plans can be generated for the same query.
• The example query below shows the different heuristic operations.
• Adding strategies/algorithms for implementing individual operations would add even more potential plans.

Example Query
• select e.lname, e.fname, w.pno, w.hours
from employee e, works_on w
where e.ssn = w.essn and w.hours > 20;

Initial relational algebra query tree for the above SQL query (root at the top, children indented):

π e.lname, e.fname, w.pno, w.hours
  σ e.ssn=w.essn AND w.hours>20
    ×
      Employee e
      works_on w

• Heuristic: replace the conjunctive select with a cascade of selects. Note: the selects in the cascade could commute.

π e.lname, e.fname, w.pno, w.hours
  σ w.hours>20
    σ e.ssn=w.essn
      ×
        Employee e
        works_on w

• Heuristic: combine the select and the cross product into a join.

π e.lname, e.fname, w.pno, w.hours
  σ w.hours>20
    ⋈ e.ssn=w.essn
      Employee e
      works_on w

• Heuristic: push projects as early as possible. First cascade the projects to get the attributes needed to push through the join. (In this tree the select on w.hours has also been moved below the join.)

π e.lname, e.fname, w.pno, w.hours
  π e.ssn, e.lname, e.fname, w.essn, w.pno, w.hours
    ⋈ e.ssn=w.essn
      Employee e
      σ w.hours>20
        works_on w
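
The effect of these heuristics can be illustrated with a small in-memory sketch. The rows below are hypothetical sample data, not taken from the notes, and the two functions are simplified stand-ins for the initial tree (cross product, then select) and the optimized tree (select on works_on first, then join).

```python
# Hypothetical sample rows, invented for illustration only.
employee = [
    {"ssn": 1, "lname": "Smith", "fname": "John"},
    {"ssn": 2, "lname": "Wong", "fname": "Franklin"},
    {"ssn": 3, "lname": "Zelaya", "fname": "Alicia"},
]
works_on = [
    {"essn": 1, "pno": 10, "hours": 32},
    {"essn": 1, "pno": 20, "hours": 8},
    {"essn": 2, "pno": 10, "hours": 40},
    {"essn": 3, "pno": 30, "hours": 10},
]

def project(rows):
    # Final projection on lname, fname, pno, hours.
    return [(r["lname"], r["fname"], r["pno"], r["hours"]) for r in rows]

def initial_plan():
    # Cross product first, then one conjunctive select.
    cross = [{**e, **w} for e in employee for w in works_on]
    selected = [r for r in cross
                if r["ssn"] == r["essn"] and r["hours"] > 20]
    return project(selected), len(cross)

def heuristic_plan():
    # Select on works_on first, then join on ssn = essn.
    filtered = [w for w in works_on if w["hours"] > 20]
    joined = [{**e, **w} for e in employee for w in filtered
              if e["ssn"] == w["essn"]]
    return project(joined), len(joined)

result_a, intermediate_a = initial_plan()
result_b, intermediate_b = heuristic_plan()
print(sorted(result_a) == sorted(result_b))
print(intermediate_a, intermediate_b)
```

Both plans return the same tuples, but the largest intermediate result shrinks from 12 rows to 2 in this sample, which is the point of applying selections early.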

Using Heuristics in Query Optimization
1. The main heuristic is to apply first the operations that reduce the size of intermediate results.
2. Perform select operations as early as possible to reduce the number of tuples and perform project operations as early as possible to reduce the number of attributes. (This is done by moving select and project operations as far down the tree as possible.)
3. The select and join operations that are most restrictive should be executed before other similar operations. (This is done by reordering the leaf nodes of the tree among themselves and adjusting the rest of the tree appropriately.)

2.b. Cost Estimation for Relational Algebra Operations
There are many different ways of implementing RA operations.

The aim of query optimization in cost estimation is:

► To choose the most efficient one.
► Use formulae that estimate costs for a number of options, and select the one with the lowest cost.
► Consider only the cost of disk access, which is usually the dominant cost in query processing.
► Many estimates are based on the cardinality of the relation, so we need to be able to estimate it.

An Example
 Query:
Select B, D From R, S
Where R.A = "c" and S.E = 2 and R.C = S.C;
Find the answer (B, D).

R:
A  B  C
a  1  10
b  1  20
c  2  25
d  2  10
e  3  26

S:
C   D  E
15  x  2
25  y  2
32  y  3
10  z  1

Answer:
B  D
2  y

An Example (cont.)
Plan 1
• Cross product of R & S
• Select tuples using WHERE conditions
• Project on B & D
Algebra expression (query tree):
π B,D (σ R.A="c" ∧ S.E=2 ∧ R.C=S.C (R × S))


An Example (cont.)
Plan 2
• Select R tuples with R.A=“c”
• Select S tuples with S.E=2
• Natural join
• Project B & D
Algebra expression (query tree):
π B,D (σ R.A="c" (R) ⋈ σ S.E=2 (S))
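
Plan 1 and Plan 2 can be compared directly on the R and S tuples of this example. The sketch below is illustrative: relations are plain Python lists of tuples (R as (A, B, C), S as (C, D, E)), and the second value returned by each function is simply the number of tuple pairs the plan has to examine.

```python
# Tuples from the example: R(A, B, C) and S(C, D, E).
R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 25), ("d", 2, 10), ("e", 3, 26)]
S = [(15, "x", 2), (25, "y", 2), (32, "y", 3), (10, "z", 1)]

def plan1():
    # Cross product, then select with all WHERE conditions, then project.
    cross = [(r, s) for r in R for s in S]
    selected = [(r, s) for r, s in cross
                if r[0] == "c" and s[2] == 2 and r[2] == s[0]]
    return [(r[1], s[1]) for r, s in selected], len(cross)

def plan2():
    # Select on each relation first, then join on C, then project.
    r_selected = [r for r in R if r[0] == "c"]
    s_selected = [s for s in S if s[2] == 2]
    joined = [(r, s) for r in r_selected for s in s_selected
              if r[2] == s[0]]
    pairs_examined = len(r_selected) * len(s_selected)
    return [(r[1], s[1]) for r, s in joined], pairs_examined

print(plan1())
print(plan2())
```

Both plans produce the answer (2, y), but Plan 1 examines 20 tuple pairs while Plan 2 examines only 2.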


Measures of Query Cost
 Disk access
 CPU cycles
• Transit time in the network (distributed systems)
• CPU cost is difficult to calculate and is insignificant compared with disk access cost.
• Consider only disk access
Disk Access Cost:
• No. of seeks (N); random I/O costs more than sequential I/O because each random access needs a seek
• Cost = N * avg seek time
• No. of blocks read
• Cost = N * avg block read cost
• No. of blocks written
• Cost = N * avg block write cost
• The cost of writing a block is greater than the cost of reading it, because a block that has been written is read back to check that the write succeeded.
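
The three disk cost components above add up to a single estimate. A minimal sketch, assuming hypothetical device timings in microseconds (the numbers are invented for illustration and are not measurements):

```python
def disk_access_cost(seeks, blocks_read, blocks_written,
                     avg_seek, avg_read, avg_write):
    # Total cost = seek cost + block read cost + block write cost.
    return (seeks * avg_seek
            + blocks_read * avg_read
            + blocks_written * avg_write)

# Hypothetical timings in microseconds.
AVG_SEEK, AVG_READ, AVG_WRITE = 4000, 100, 200

# Reading 100 blocks sequentially needs one seek.
sequential = disk_access_cost(1, 100, 0, AVG_SEEK, AVG_READ, AVG_WRITE)
# Reading the same 100 blocks at random needs one seek per block.
random_io = disk_access_cost(100, 100, 0, AVG_SEEK, AVG_READ, AVG_WRITE)

print(sequential, random_io)
```

Under these assumed timings the random read pattern is dominated by seek time, which is why the number of seeks is counted separately from the number of blocks transferred.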

Query Evaluation Process
[Figure: query evaluation process, showing the query, scanner, parser, execution plan, code generator, data, and answer]
Query Evaluation

 How to evaluate individual relational operation?

• Selection: find a subset of rows in a table

• Join: connecting tuples from two tables

• Other operations: union, projection, …

Cost of Operations
 Cost = I/O cost + CPU cost
• I/O cost: # pages (reads & writes) or # operations (multiple pages)
• CPU cost: # comparisons or # tuples processed
• I/O cost dominates (for large databases)
 Cost depends on
• Types of query conditions
• Availability of fast access paths
 DBMSs keep statistics for cost estimation.

 Notation used to describe the cost of operations:
 Relations: R, S

 nR: # tuples in R,

 nS: # tuples in S

 bR: # pages in R

 dist(R.A) : # distinct values in R.A

 min(R.A) : smallest value in R.A

 max(R.A) : largest value in R.A

 HI: # index pages accessed (B+ tree height)

Simple Selection
 Simple selection: σ A op a (R)
 A is a single attribute, a is a constant, op is one of =, ≠, <, ≤, >, ≥.
• ≠ is not discussed further because it requires a sequential scan of the table.
• E.g. select * from R where A = a;
How many tuples will be selected?
• Selectivity factor SF A op a (R): the fraction of tuples of R satisfying "A op a"
• 0 ≤ SF A op a (R) ≤ 1.
• # tuples selected: NS = nR × SF A op a (R)
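
A sketch of how the selectivity factor might be estimated from the statistics listed earlier. The equality estimate 1/dist(R.A) matches the uniform-distribution assumption used later in these notes; the range estimate based on min(R.A) and max(R.A) is a common textbook approximation that is added here as an assumption, and all function names are illustrative.

```python
def sf_equality(dist_a):
    # A = a under a uniform distribution: one value out of dist(R.A).
    return 1 / dist_a

def sf_greater_than(a, min_a, max_a):
    # A > a, assuming values are spread evenly between min and max.
    if a >= max_a:
        return 0.0
    if a < min_a:
        return 1.0
    return (max_a - a) / (max_a - min_a)

def tuples_selected(n_r, selectivity):
    # NS = nR x SF, rounded to a whole number of tuples.
    return round(n_r * selectivity)

print(tuples_selected(10_000, sf_equality(50)))
print(tuples_selected(10_000, sf_greater_than(75, 0, 100)))
```

Real optimizers usually refine these estimates with histograms, since actual data is rarely uniform.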

Options of Simple Selection
Sequential (linear) Scan
• General condition: cost = bR
• Equality on key: average cost = bR / 2
Binary Search
• Records are stored in sorted order
• Equality on key: cost = log2(bR)
• Equality on non-key (duplicates allowed)
cost = log2(bR) + NS/bfR - 1
(binary search for the first match, plus the blocks holding the selected tuples, minus the first block, which was already read)

Example: Cost of Selection

Relation: R(A, B, C)
nR = 10,000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
B+ tree clustering index on A with order 25 (p=25)
B+ tree secondary index on B with order 25
select * from R where A = a and B = b1
Relational algebra expression: σ A=a ∧ B=b1 (R)
Example: Cost of Selection (cont’d...)
• Option 1: Sequential (linear) scan
• Have to go through the entire relation
• Cost = bR = nR/bfR = 10,000/20 = 500
• Option 2: Binary search using A = a
• R is sorted on A
• NS = nR/dist(A) = 10,000/50 = 200
• assuming a uniform distribution
• Cost = log2(bR) + NS/bfR - 1
= log2(500) + 200/20 - 1 ≈ 9 + 10 - 1 = 18

Example: Cost of Selection (cont.)
• Option 3: Use index on R.A:
• Average order of the B+ tree
= (P + 0.5P)/2 = (25 + 12.5)/2 = 18.75 ≈ 19
• Leaf nodes have 18 entries, internal nodes have 19 pointers
• # leaf nodes = ⌈50/18⌉ = 3
• # nodes next level = 1
• HI = 2
• Clustering index: Cost = HI + NS/bfR
= 2 + 200/20 = 12

Example: Cost of Selection (cont.)
• Option 4: Use index on R.B
• Average order = 19
• NS = 10000/500 = 20
• Use Option I (allow duplicate keys)
• # nodes 1st level = ⌈10000/18⌉ = 556 (leaf)
• # nodes 2nd level = ⌈556/19⌉ = 30 (internal)
• # nodes 3rd level = ⌈30/19⌉ = 2 (internal)
• # nodes 4th level = 1
• HI = 4
• Secondary index: Cost = HI + NS
= 4 + 20 = 24
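
The four options in this example can be recomputed with a short script. This is a sketch under the example's own assumptions (18 entries per leaf, 19 pointers per internal node, uniform distribution of values); the helper name `btree_height` is illustrative, and every division is rounded up to whole nodes or blocks.

```python
import math

def btree_height(num_entries, leaf_capacity, fanout):
    # Count levels from the leaves up to a single root node.
    nodes = math.ceil(num_entries / leaf_capacity)
    height = 1
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)
        height += 1
    return height

n_r, bf_r = 10_000, 20          # tuples in R, tuples per page
dist_a, dist_b = 50, 500        # distinct values of A and B
b_r = math.ceil(n_r / bf_r)     # pages in R

ns_a = n_r // dist_a            # tuples matching A = a
ns_b = n_r // dist_b            # tuples matching B = b1

linear_cost = b_r
binary_cost = math.ceil(math.log2(b_r)) + math.ceil(ns_a / bf_r) - 1
clustering_cost = btree_height(dist_a, 18, 19) + math.ceil(ns_a / bf_r)
secondary_cost = btree_height(n_r, 18, 19) + ns_b

print(linear_cost, binary_cost, clustering_cost, secondary_cost)
```

The clustering index on A is the cheapest option here; the remaining condition on B would then be checked on the tuples that the index retrieves.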

Estimate Size of Join Result
• How many tuples are in the join result?
• Cross product (special case of join)
NJ = nR × nS
• R.A is a foreign key referencing S.B
NJ = nR (assume no null values)
• S.B is a foreign key referencing R.A
NJ = nS (assume no null values)
• Both R.A & S.B are non-key
NJ ≈ min( (nR × nS) / dist(R.A), (nR × nS) / dist(S.B) )
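
The join size estimates above translate into a few one-line functions. This is a sketch; the relation sizes in the usage lines are hypothetical and the function names are illustrative.

```python
def cross_product_size(n_r, n_s):
    # Every tuple of R pairs with every tuple of S.
    return n_r * n_s

def foreign_key_join_size(n_referencing):
    # Each referencing tuple matches exactly one referenced tuple
    # (assuming no null values in the foreign key).
    return n_referencing

def non_key_join_size(n_r, n_s, dist_ra, dist_sb):
    # NJ is approximately the smaller of the two per-attribute estimates.
    return min(n_r * n_s // dist_ra, n_r * n_s // dist_sb)

# Hypothetical statistics for illustration.
n_r, n_s = 10_000, 2_000
print(cross_product_size(n_r, n_s))
print(foreign_key_join_size(n_r))
print(non_key_join_size(n_r, n_s, dist_ra=50, dist_sb=100))
```

The non-key estimate uses integer division, so it rounds down to a whole number of tuples.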

Basic Algorithms for Executing Query Operations
Consider only single table queries
•Three categories:
1. simple SELECT: one condition, no AND or OR
2. conjunctive SELECT : multiple conditions, connected by AND
3. disjunctive SELECT : multiple conditions, connected by OR

1. Simple SELECT or methods for simple selection
1.1. Linear search (brute force algorithm):
• algorithm:
• Retrieve every record in the file
• test whether its attribute values satisfy the selection condition
• works when:
• always works
• best on small files
• only choice when no indexes or ordering
• cost
• average case: b/2, where b = # blocks in file
• worst case: b

1.2. Binary search:

• algorithm:
• use binary search to find record(s)
• works when:
• selection condition is an equality test on an ordering attribute
• cost:
• log2(b), where b = # blocks in file
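
Linear search and binary search can be compared by counting block reads. The sketch below models a file as a list of blocks, each block a sorted list of keys; this in-memory layout and the function names are assumptions made for illustration, and each function returns only the number of blocks it had to read.

```python
def linear_search_reads(blocks, key):
    # Read blocks in order until the key is found (or the file ends).
    reads = 0
    for block in blocks:
        reads += 1
        if key in block:
            break
    return reads

def binary_search_reads(blocks, key):
    # Binary search over blocks of a file sorted on the search key.
    reads = 0
    low, high = 0, len(blocks) - 1
    while low <= high:
        mid = (low + high) // 2
        reads += 1
        block = blocks[mid]
        if key < block[0]:
            high = mid - 1
        elif key > block[-1]:
            low = mid + 1
        else:
            break  # the key, if present, is in this block
    return reads

# 500 blocks of 20 sorted keys each (10,000 records).
blocks = [list(range(start, start + 20)) for start in range(0, 10_000, 20)]

print(linear_search_reads(blocks, 9_999))
print(binary_search_reads(blocks, 9_999))
```

For a key in the last block, the linear scan reads all 500 blocks, while the binary search stays within about log2(500), that is 9, block reads.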
Simple SELECT(cont’d...)
1.3. Primary index to retrieve a single record:
• algorithm:
• look up record using primary index
• works when:
• selection condition is equality test on key attribute with primary index
• cost:
x + 1 , where x = # index levels or height of the tree
1.4.Clustering index to retrieve multiple records:
• algorithm:
• use clustering(secondary) index to retrieve all the records satisfying the selection condition
• works when:
• selection condition is equality comparison on a non-key attribute with clustering index
• cost:

where x = # index levels, s = # selected records

• Blocks are not in consecutive locations because the data is not sorted on the non-key attribute
2.Conjunctive SELECT or methods for conjunctive (logical AND) selection
2.1. Conjunctive selection Using an individual index
• algorithm:
• use one of the simple SELECT algorithms to find records matching one condition
• check those records for remaining conditions
• works when:
• conjunctive selection in which one of the simple SELECT algorithms can be applied to one condition
• cost:
• same as simple SELECT cost
• example:
• select * from EMPLOYEE
where IDNO=6 AND salary > 7000; (an index exists on IDNO)
2.2. Conjunctive selection using a composite index
• algorithm:
• use composite index directly
• works when:
• selection condition is equality tests on two or more attributes for which a composite index exists
• cost:
• x+1
• example:
• select
Conjunctive SELECT(cont’d...)
2.3. Conjunctive selection by intersection of record pointers:
• algorithm:
• use indexes to find record pointers for each equality condition
• compute intersection of record pointer sets
• retrieve records using records pointers
• works when:
• secondary indexes are available on all (or some of) the fields involved in equality
comparison conditions
• indexes include record pointers (rather than block pointers)
• cost:
• sum of index operations plus s, where s = # records selected
• example:
• select * from WORKS_ON
where PNO=10 and LNAME='Wolfe';
and indexes exist for both PNO and LNAME
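
Conjunctive selection by intersection of record pointers can be sketched with dictionaries standing in for secondary indexes. The records, the integer record pointers, and the helper names below are hypothetical; a real DBMS would use B+ tree or hash indexes that store record pointers.

```python
# Hypothetical file: record pointer -> record.
records = {
    0: {"PNO": 10, "LNAME": "Wolfe"},
    1: {"PNO": 10, "LNAME": "Smith"},
    2: {"PNO": 20, "LNAME": "Wolfe"},
    3: {"PNO": 10, "LNAME": "Wolfe"},
}

def build_index(records, attribute):
    # Secondary index: attribute value -> set of record pointers.
    index = {}
    for pointer, record in records.items():
        index.setdefault(record[attribute], set()).add(pointer)
    return index

pno_index = build_index(records, "PNO")
lname_index = build_index(records, "LNAME")

def conjunctive_select(conditions):
    # conditions: list of (index, value) pairs, one per equality test.
    pointer_sets = [index.get(value, set()) for index, value in conditions]
    matching = set.intersection(*pointer_sets)
    # Only the records in the intersection are fetched from the file.
    return [records[pointer] for pointer in sorted(matching)]

result = conjunctive_select([(pno_index, 10), (lname_index, "Wolfe")])
print(result)
```

Only two records are fetched here, which corresponds to the "plus s" term in the cost: the index lookups are paid first, then one access per selected record.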

3.Disjunctive SELECT ( logical OR) selection
• Disjunctive selects are much harder to optimize
• no single condition can be used to 'pre-filter' the results
• the result is the union of the results of each condition
• the best you can do is to optimize each individual condition, then compute the union
• example:
select * from EMPLOYEE
where DNO=3 or SALARY > 80000 or SEX='F';
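
A disjunctive select can be sketched as the union of the record sets found for each condition. The EMPLOYEE rows below are hypothetical, and each condition is evaluated by a simple scan; in practice each condition would use its own best access path.

```python
# Hypothetical EMPLOYEE rows for illustration.
employees = [
    {"NAME": "Abebe", "DNO": 3, "SALARY": 50_000, "SEX": "M"},
    {"NAME": "Sara", "DNO": 5, "SALARY": 90_000, "SEX": "F"},
    {"NAME": "Kebede", "DNO": 4, "SALARY": 40_000, "SEX": "M"},
    {"NAME": "Hana", "DNO": 4, "SALARY": 30_000, "SEX": "F"},
]

def matching_positions(rows, predicate):
    # Positions of the rows that satisfy one condition.
    return {i for i, row in enumerate(rows) if predicate(row)}

def disjunctive_select(rows, predicates):
    # Union of the per-condition results; sets remove duplicates.
    positions = set()
    for predicate in predicates:
        positions |= matching_positions(rows, predicate)
    return [rows[i] for i in sorted(positions)]

selected = disjunctive_select(employees, [
    lambda r: r["DNO"] == 3,
    lambda r: r["SALARY"] > 80_000,
    lambda r: r["SEX"] == "F",
])
print([r["NAME"] for r in selected])
```

A row that satisfies several conditions (Sara here) appears only once, because the union is taken over row positions rather than over result lists.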

Join Algorithms
•We’ll consider joins such as R ⋈A=B S.
•Extends to joins like R ⋈A=B and C=D S
by considering <A,C> and <B,D> as single attributes.

Semantic Query Optimization
Semantic query optimization uses constraints specified on the database schema to modify one query into another query that is more efficient to execute.
Consider the following SQL query:
select e.lname, m.lname from employee e, employee m
where e.superssn = m.ssn and e.salary > m.salary
Suppose that we had a constraint on the database schema that stated that no employee can earn more than
his or her direct supervisor. If the semantic query optimizer checks for the existence of this constraint, it
need not execute the query at all because it knows that the result of the query will be empty. Techniques
known as theorem proving can be used for this purpose.

1.6 Transformation Rules

Transformation Rules for RA Operations

Apply well-known transformation rules of Boolean algebra.
1. Conjunctive Selection operations can cascade into individual Selection operations (and vice versa).
σ p∧q∧r (R) = σ p (σ q (σ r (R)))
Sometimes referred to as cascade of Selection.
• For example:
σ branchNo='B003' ∧ salary>15000 (Staff) = σ branchNo='B003' (σ salary>15000 (Staff))
2. Commutativity of Selection:
σ p (σ q (R)) = σ q (σ p (R))
• For example:
σ branchNo='B003' (σ salary>15000 (Staff)) = σ salary>15000 (σ branchNo='B003' (Staff))
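
Rules 1 and 2 can be checked on a small relation. The Staff rows below are hypothetical sample data and `select` is an illustrative helper that stands in for the Selection operation.

```python
# Hypothetical Staff rows for illustration.
staff = [
    {"staffNo": "S1", "branchNo": "B003", "salary": 24_000},
    {"staffNo": "S2", "branchNo": "B003", "salary": 12_000},
    {"staffNo": "S3", "branchNo": "B005", "salary": 30_000},
    {"staffNo": "S4", "branchNo": "B003", "salary": 18_000},
]

def select(predicate, relation):
    # Selection: keep the tuples that satisfy the predicate.
    return [row for row in relation if predicate(row)]

def in_b003(row):
    return row["branchNo"] == "B003"

def well_paid(row):
    return row["salary"] > 15_000

# Rule 1: one conjunctive selection equals a cascade of selections.
conjunctive = select(lambda row: in_b003(row) and well_paid(row), staff)
cascade = select(in_b003, select(well_paid, staff))
# Rule 2: the selections in the cascade can be applied in either order.
commuted = select(well_paid, select(in_b003, staff))

print(conjunctive == cascade == commuted)
print([row["staffNo"] for row in conjunctive])
```

All three forms return the same tuples, so the optimizer is free to pick the order that filters out the most tuples first.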
Transformation Rules for RA Operations(cont’d...)
3. Commutativity of Theta join (and Cartesian product).
R ⋈p S = S ⋈p R
R × S = S × R
 The rule also applies to Equijoin and Natural join.

For example:

4. Commutativity of Projection and Union.

π L (R ∪ S) = π L (S) ∪ π L (R)
5. Associativity of Union and Intersection (but not Set difference).
(R ∪ S) ∪ T = S ∪ (R ∪ T)
(R ∩ S) ∩ T = S ∩ (R ∩ T)
Transformation Rules for RA Operations(cont’d...)

6. Associativity of Theta join (and Cartesian product).

Cartesian product and Natural join are always associative:
(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
(R × S) × T = R × (S × T)

If join condition q involves attributes only from S and T, then Theta join is associative:
(R ⋈p S) ⋈q∧r T = R ⋈p∧r (S ⋈q T)

Example 1.3
Use of Transformation Rules

For prospective renters of flats, find properties that match requirements and owned
by CO93.
SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p
WHERE c.prefType = 'Flat' AND c.clientNo = v.clientNo AND
v.propertyNo = p.propertyNo AND c.maxRent >= AND
c.prefType = p.type AND p.ownerNo = 'CO93';
Find the RA tree for the above SQL query.

Algorithms for PROJECT and Set Operations
• Set operations
• Set operations sometimes expensive to implement
• Sort-merge technique
• Hashing
• Use of anti-join for SET DIFFERENCE
• Example: Find which departments have no employees becomes

Implementing Aggregate Operations and Different Types of JOINs
• Aggregate operators
• Can be computed by a table scan or using an appropriate index
• Example:

• If an (ascending) B+ -tree index on Salary exists:

• Optimizer can use the Salary index to search for the largest Salary value
• Follow the rightmost pointer in each index node from the root to the rightmost leaf.
• Index can be used if it is a dense index
• Computation applied to the values in the index
• Nondense index can be used if actual number of records associated with each index value
is stored in each index entry
• Number of values can be computed from the index
1. Discuss the difference between query processing and query optimization, with examples.
2. Discuss the reasons for converting SQL queries into relational algebra queries before optimization is done. Give examples.
3. Discuss the cost components used to estimate the cost of query execution.
4. Apply heuristic optimization to the following SQL query:
select fname from Emp, works_on, Project
where Pname='abay_dam' AND Pnumber=Pno AND Essn=Ssn AND Bdate='2001-07-24';
a. Draw the initial query tree and the query graph for the SQL query.
b. Move the SELECT operations down the query tree.
c. Apply the more restrictive SELECT operation first.
d. Replace the cross product and SELECT with a Join operation.
e. Move the PROJECTION operations down the query tree.

