
CHAPTER TWO

Query processing and Optimization


Translating SQL Queries into Relational Algebra
Relational algebra is a widely used procedural query language that takes relations as input
and produces a relation as output.

Relational algebra operations can be applied recursively: the output of each operation is a new
relation, formed from one or more input relations, which can in turn be the input to another
operation. Queries are built from operators, and an operator can be unary (one input relation)
or binary (two input relations).

Relational Algebra operations

Basic Operations:

1. Selection (σ): Selects a subset of rows (records) from a relation.
2. Projection (π): Deletes unwanted columns from a relation.
3. Cross product (X): Allows us to combine two relations.
4. Set difference (-): Tuples in relation-1 but not in relation-2.
5. Union (∪): Tuples in relation-1 or in relation-2.

Additional Operations:

Intersection, join, division, and renaming are not essential, but (very!) useful.
1. Selection
• The selection operator is a unary operator that acts as a filter on a relation, returning
only those rows that satisfy the selection condition.
• The resulting relation has the same degree (schema) as the original (input) relation.
• The resulting relation may have fewer rows than the original relation.
• It is denoted by sigma (σ):
σ Condition (R) (returns only those tuples in R that satisfy the condition C)
• The tuples (rows) to be returned depend on a condition that is part of the
selection operator.
• A condition C can be made up of any combination of comparison or logical operators
that operate on the attributes of R.

• Comparison operators: =, ≠, <, ≤, >, ≥

• Logical operators: ∧ (AND), ∨ (OR), ¬ (NOT)

Examples:

Assume the following relation is EMP.

Name   Office  Dept  Rank
Smith  400     CS    Assistant
Jones  220     Econ  Adjunct
Green  160     Econ  Assistant
Brown  420     CS    Associate
Smith  500     Fin   Associate

• Select only those employees in the CS department:
  SQL: SELECT * FROM EMP WHERE Dept = 'CS';
  Relational algebra: σ Dept = 'CS' (EMP)

The Resulting relation will be:

Name Office Dept Rank


Smith 400 CS Assistant
Brown 420 CS Associate

• Select only those employees with last name Smith who are assistant professors:
  SQL: SELECT * FROM EMP WHERE Name = 'Smith' AND Rank = 'Assistant';
  Relational algebra: σ Name = 'Smith' ∧ Rank = 'Assistant' (EMP)

The resulting relation will be:

Name Office Dept Rank


Smith 400 CS Assistant

• Select only those employees who are either assistant professors or in the Economics
department:
  SQL: SELECT * FROM EMP WHERE Rank = 'Assistant' OR Dept = 'Econ';
  Relational algebra: σ Rank = 'Assistant' ∨ Dept = 'Econ' (EMP)

The resulting relation will be:

Name Office Dept Rank


Smith 400 CS Assistant
Jones 220 Econ Adjunct
Green 160 Econ Assistant

• Select only those employees who are neither in the CS department nor adjuncts:
  SQL: SELECT * FROM EMP WHERE NOT (Rank = 'Adjunct' OR Dept = 'CS');
  Relational algebra: σ ¬(Rank = 'Adjunct' ∨ Dept = 'CS') (EMP)

The resulting Relation will be:

Name Office Dept Rank


Green 160 Econ Assistant
Smith 500 Fin Associate
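
The selection examples above can be sketched in Python. This is only an illustration, not how a DBMS implements selection: the relation is modelled as a list of dictionaries, and the helper name select is an assumption made for this sketch.

```python
# EMP relation from the examples above, one dictionary per tuple.
EMP = [
    {"Name": "Smith", "Office": 400, "Dept": "CS", "Rank": "Assistant"},
    {"Name": "Jones", "Office": 220, "Dept": "Econ", "Rank": "Adjunct"},
    {"Name": "Green", "Office": 160, "Dept": "Econ", "Rank": "Assistant"},
    {"Name": "Brown", "Office": 420, "Dept": "CS", "Rank": "Associate"},
    {"Name": "Smith", "Office": 500, "Dept": "Fin", "Rank": "Associate"},
]


def select(relation, condition):
    """Selection: keep only the tuples for which condition(t) is true."""
    return [t for t in relation if condition(t)]


# sigma Dept = 'CS' (EMP)
cs_staff = select(EMP, lambda t: t["Dept"] == "CS")

# sigma NOT (Rank = 'Adjunct' OR Dept = 'CS') (EMP)
neither = select(EMP, lambda t: not (t["Rank"] == "Adjunct" or t["Dept"] == "CS"))

print([t["Name"] for t in cs_staff])
print([(t["Name"], t["Dept"]) for t in neither])
```

Every result tuple keeps all four attributes, which mirrors the point that selection does not change the schema.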

2. Projection
• Projection is also a unary operator. It limits the attributes that are returned
from the original relation.
• The projection operator is pi (π).
• The general syntax is:
π attributes (R)
(where attributes is the list of attributes to be displayed and R is the relation.)
• Projection defines a relation that contains a vertical subset of the input relation.
• The resulting relation has the same number of tuples as the original relation,
unless the projection produces duplicate tuples.
• The degree (schema) of the resulting relation may be equal to or less than that of the
original relation because it contains exactly the fields in the projection list.
• Because a relation is a set of tuples, the projection operator has to eliminate duplicates.

Examples
Assume the same EMP relation above is used.

• Project only the names and departments of the employees:

π Name, Dept (EMP)

The resulting relation will be:

Name Dept
Smith CS
Jones Econ
Green Econ
Brown CS
Smith Fin

Example 2: The following relation is called Sailor

Sid Sname Rating Age


28 Smith 9 35
31 lubber 8 55
44 Guppy 5 35
58 Rusty 10 35

• SQL: SELECT Sname, Rating FROM Sailor;

  Relational algebra: π Sname, Rating (Sailor)

Sname Rating
Smith 9
lubber 8
Guppy 5
Rusty 10
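
Duplicate elimination is the part of projection that is easiest to overlook. The following Python sketch, which assumes the EMP relation is held as a list of dictionaries and uses a hypothetical helper named project, shows that projecting on Name and Dept keeps all five tuples while projecting on Dept alone collapses them to three.

```python
EMP = [
    {"Name": "Smith", "Office": 400, "Dept": "CS", "Rank": "Assistant"},
    {"Name": "Jones", "Office": 220, "Dept": "Econ", "Rank": "Adjunct"},
    {"Name": "Green", "Office": 160, "Dept": "Econ", "Rank": "Assistant"},
    {"Name": "Brown", "Office": 420, "Dept": "CS", "Rank": "Associate"},
    {"Name": "Smith", "Office": 500, "Dept": "Fin", "Rank": "Associate"},
]


def project(relation, attributes):
    """Projection: keep the listed attributes and drop duplicate tuples."""
    seen = set()
    result = []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:  # a relation is a set, so repeats are discarded
            seen.add(row)
            result.append(row)
    return result


names_and_depts = project(EMP, ["Name", "Dept"])  # no duplicates arise here
depts = project(EMP, ["Dept"])                    # duplicates are eliminated

print(names_and_depts)
print(depts)
```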

Combining Selection and Projection

• The selection and projection operators can be combined to perform both
operations. The examples below use the same EMP relation:

Name   Office  Dept  Rank
Smith  400     CS    Assistant
Jones  220     Econ  Adjunct
Green  160     Econ  Assistant
Brown  420     CS    Associate
Smith  500     Fin   Associate

Example: Show the names of all employees working in the CS department:

π Name (σ Dept = 'CS' (EMP))
The resulting relation will be:

Name
Smith
Brown
Example: Show the name and rank of those employees who are neither in the CS department nor
adjuncts:
π Name, Rank (σ ¬(Rank = 'Adjunct' ∨ Dept = 'CS') (EMP))
The resulting relation will be:

Name Rank
Green Assistant
Smith Associate
3. Cartesian product
The Cartesian product, also referred to as a cross join, returns every combination of rows from
the tables listed in the query: each row in the first table is paired with every row in the second
table. This is what a query produces when no join condition relates the two tables.

R X S: The Cartesian product operation defines a relation that is the concatenation of every tuple
of relation R with every tuple of relation S.

Example: The following two tables are named R and S

R                        S
First  Last   Age        Dinner   Dessert
Bill   Smith  22         Steak    Ice Cream
Mary   Keen   23         Lobster  Cheesecake
Tony   Jones  32

RXS

First Last Age Dinner Dessert


Bill Smith 22 Steak Ice Cream
Bill Smith 22 Lobster Cheesecake
Mary Keen 23 Steak Ice Cream
Mary Keen 23 Lobster Cheesecake
Tony Jones 32 Steak Ice Cream
Tony Jones 32 Lobster Cheesecake
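
As a rough check of the R X S example, the product can be computed with itertools.product from the Python standard library. Representing tuples of R and S as Python tuples is an assumption of this sketch.

```python
from itertools import product

R = [("Bill", "Smith", 22), ("Mary", "Keen", 23), ("Tony", "Jones", 32)]
S = [("Steak", "Ice Cream"), ("Lobster", "Cheesecake")]


def cartesian_product(r, s):
    """Concatenate every tuple of r with every tuple of s."""
    return [a + b for a, b in product(r, s)]


RxS = cartesian_product(R, S)
for row in RxS:
    print(row)
```

The result has 3 * 2 = 6 tuples, each with 3 + 2 = 5 attributes, as in the table above.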
4. Union, Intersection and Set Difference
All of these operations take two input relations, which must be union-compatible:

• Same number of fields.
• Corresponding fields have the same type.

Union
R ∪ S: The union of two relations R and S defines a relation that contains all the tuples that are
in R, in S, or in both. Duplicate tuples are eliminated. R and S must be union-compatible.

Example: - Consider the following relations R and S

Relation R               Relation S
First  Last   Age        First    Last     Age
Bill   Smith  22         Forrest  Gump     36
Sally  Green  28         Sally    Green    28
Mary   Keen   23         DonJuan  DeMarco  27
Tony   Jones  32

R ∪ S

First    Last     Age
Bill     Smith    22
Sally    Green    28
Mary     Keen     23
Tony     Jones    32
Forrest  Gump     36
DonJuan  DeMarco  27

Intersection
R ∩ S: The intersection operation defines a relation consisting of the set of all tuples that are in
both R and S. R and S must be union-compatible.

Example: - Consider the following relations R and S


Relation R

First Last Age


Bill Smith 22
Sally Green 28
Mary Keen 23
Tony Jones 32

Relation S

First Last Age
Forrest Gump 36
Sally Green 28
DonJuan DeMarco 27

R ∩ S

First  Last   Age
Sally  Green  28

Set Difference
R - S: The set difference operation defines a relation consisting of the tuples that are in the
first (left) relation R but not in the second (right) relation S. R and S must be
union-compatible.

Relation R               Relation S
First  Last   Age        First    Last     Age
Bill   Smith  22         Forrest  Gump     36
Sally  Green  28         Sally    Green    28
Mary   Keen   23         DonJuan  DeMarco  27
Tony   Jones  32

R - S

First  Last   Age
Bill   Smith  22
Mary   Keen   23
Tony   Jones  32
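
Because relations are sets of tuples, the three set operations map directly onto Python's built-in set type. The sketch below replays the R and S examples; holding each relation as a set of tuples is an assumption of the sketch.

```python
R = {
    ("Bill", "Smith", 22),
    ("Sally", "Green", 28),
    ("Mary", "Keen", 23),
    ("Tony", "Jones", 32),
}
S = {
    ("Forrest", "Gump", 36),
    ("Sally", "Green", 28),
    ("DonJuan", "DeMarco", 27),
}

union = R | S          # tuples in R, in S, or in both (duplicates collapse)
intersection = R & S   # tuples in both R and S
difference = R - S     # tuples in R but not in S

print(len(union), sorted(intersection), sorted(difference))
```

Sally Green appears in both inputs but only once in the union, which is why the union has six tuples rather than seven.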

5. Join Operations
The join operation combines rows from two or more tables based on a related column between
them. Typically we want only those combinations of the Cartesian product that satisfy certain
conditions, so we normally use a join operation instead of the Cartesian product. Join is a
derivative of the Cartesian product: it is equivalent to performing a selection, using the join
predicate as the selection formula, over the Cartesian product of the operand relations. It is one
of the most difficult operations to implement efficiently in an RDBMS and is one of the reasons why
relational systems have intrinsic performance problems.

There are various forms of Join operations:

1. (INNER) JOIN: It returns records that have matching values in both tables.

2. LEFT (OUTER) JOIN: It returns all records from the left table and the matched records
from the right table. Tuples from R that do not have matching values in the common
attributes of S are also included in the result relation. Missing values in the second relation
are set to null.

3. RIGHT (OUTER) JOIN: It returns all records from the right table and the matched
records from the left table.

4. FULL (OUTER) JOIN: It returns all records when there is a match in either left or right
table.

Condition join: R ⋈C S = σC (R X S)

Example: S1 ⋈ S1.sid < R1.sid R1

S1                               R1
Sid  Sname     Rating  Age       Sid  Bid  Day
22   Peter     7       45        22   101  10/10/96
31   Mengistu  8       55        58   103  11/12/96
58   Solomon   10      35

The resulting relation will be:

Sid Sname Rating Age Sid Bid Day

22 Peter 7 45 58 103 11/12/96

31 Mengistu 8 55 58 103 11/12/96

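5. Equijoin and natural join: An equijoin is a special case of condition join where the
condition C contains only equalities. A natural join is an equijoin on the common attributes in
which one of each pair of duplicate attributes is removed from the result.

Example:

π sid, .., age, bid, .. (S1 ⋈ sid R1)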

S1                               R1
Sid  Sname     Rating  Age       Sid  Bid  Day
22   Peter     7       45        22   101  10/10/96
31   Mengistu  8       55        58   103  11/12/96
58   Solomon   10      35

The resulting relation will be:

Sid Sname Rating Age Bid Day

22 Peter 7 45 101 10/10/96

58 Solomon 10 35 103 11/12/96
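
The two join examples can be reproduced with a naive nested-loop sketch in Python. It follows the definition of a join as a selection over the Cartesian product; the function names theta_join and natural_join are hypothetical, and a real DBMS would use far more efficient join algorithms.

```python
S1 = [
    {"sid": 22, "sname": "Peter", "rating": 7, "age": 45},
    {"sid": 31, "sname": "Mengistu", "rating": 8, "age": 55},
    {"sid": 58, "sname": "Solomon", "rating": 10, "age": 35},
]
R1 = [
    {"sid": 22, "bid": 101, "day": "10/10/96"},
    {"sid": 58, "bid": 103, "day": "11/12/96"},
]


def theta_join(r, s, condition):
    """Condition join: pairs from the Cartesian product that satisfy condition."""
    return [(a, b) for a in r for b in s if condition(a, b)]


def natural_join(r, s):
    """Natural join: equality on the common attributes, each kept only once."""
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [
        {**a, **b}
        for a in r
        for b in s
        if all(a[k] == b[k] for k in common)
    ]


condition_result = theta_join(S1, R1, lambda a, b: a["sid"] < b["sid"])
natural_result = natural_join(S1, R1)

print([(a["sid"], b["sid"]) for a, b in condition_result])
print([(t["sid"], t["sname"], t["bid"]) for t in natural_result])
```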

6. Division Operation
• The division operation is not supported as a primitive operator, but it is useful for
expressing queries like:
Find the sailors who have reserved all boats
• Precondition: in A/B, the attributes of B must be included in the schema of A. The
result has the attributes of A that are not in B (A - B).
• Example schemas: SALES(SupId, ProdId) and PRODUCTS(ProdId).
• Relations SALES and PRODUCTS must be built using projections.
• SALES/PRODUCTS: the ids of the suppliers supplying all products.

Example:

A

Sno  Pno
S1   P1
S1   P2
S1   P3
S1   P4
S2   P1
S2   P2
S3   P2
S4   P2
S4   P4

B1        A/B1
Pno       Sno
P2        S1
          S2
          S3
          S4

B2        A/B2
Pno       Sno
P2        S1
P4        S4

B3        A/B3
Pno       Sno
P1        S1
P2
P4
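
Division can be sketched in Python as "keep every Sno that is paired in A with all Pno values in B". The function name divide and the representation of A as a set of (Sno, Pno) pairs are assumptions of this sketch.

```python
A = {
    ("S1", "P1"), ("S1", "P2"), ("S1", "P3"), ("S1", "P4"),
    ("S2", "P1"), ("S2", "P2"),
    ("S3", "P2"),
    ("S4", "P2"), ("S4", "P4"),
}


def divide(a, b):
    """Return each sno that appears in a together with every pno in b."""
    suppliers = {sno for sno, _ in a}
    return {sno for sno in suppliers if all((sno, pno) in a for pno in b)}


B1 = {"P2"}
B2 = {"P2", "P4"}
B3 = {"P1", "P2", "P4"}

print(sorted(divide(A, B1)))
print(sorted(divide(A, B2)))
print(sorted(divide(A, B3)))
```

As the divisor grows, fewer suppliers qualify: only S1 is paired with all of P1, P2, and P4.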

OVERVIEW OF QUERY PROCESSING

There are many ways in which a complex query can be performed, and one of the aims of query
processing is to determine which one is the most cost-effective. In declarative languages such as
SQL, the user specifies what data is required rather than how it is to be retrieved.

Query processing

Query processing comprises the activities involved in retrieving data from the database. Giving the
DBMS the responsibility for selecting the best strategy prevents users from choosing strategies that
are known to be inefficient and gives the DBMS more control over system performance.

The aims of query processing are to transform a query written in a high-level language, typically
SQL, into a correct and efficient execution strategy expressed in a low-level language (implementing
the relational algebra), and to execute the strategy to retrieve the required data.

Query processing can be divided into four main phases:

1. Decomposition (consisting of parsing and validation)

2. Optimization

3. Code generation

4. Execution

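[Figure: Phases of query processing. At compile time, a query in a high-level language (typically
SQL) passes through query decomposition, which consults the system catalog, to give a relational
algebra expression; through query optimization, which uses database statistics, to give an
execution plan; and through code generation to give generated code. At runtime, query execution
runs the generated code against the main database and produces the query output.]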

Query Decomposition

The aim of query decomposition is to transform a high-level query into a relational algebra query,
and to check that the query is syntactically and semantically correct.

The typical stages of query decomposition are:

• Analysis

• Normalization

• Semantic analysis

• Simplification

• Query restructuring

1. Analysis
In this stage, the query is lexically and syntactically analyzed using the techniques of programming
language compilers. In addition, this stage verifies that the relations and attributes specified in
the query are defined in the system catalog. It also verifies that any operations applied to
database objects are appropriate for the object type.

Example:

SELECT StaffNumber
FROM staff
WHERE position > 10;     -- rejected: data type mismatch (position holds character strings)

On completion of this stage, the high-level query has been transformed into some internal
representation that is more suitable for processing.

The internal form that is typically chosen is some kind of query tree, which is constructed as
follows:

• A leaf node is created for each base relation in the query.
• A non-leaf node is created for each intermediate relation produced by a relational algebra
operation.
• The root of the tree represents the result of the query.
• The sequence of operations is directed from the leaves to the root.

Example: Find all managers who work at a Wolkite branch

             ⋈ s.branchNo = b.branchNo                  (root)
             /                        \
σ s.position = 'manager'     σ b.city = 'Wolkite'       (intermediate operations)
            |                          |
          Staff                     Branch              (leaves)
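
The leaves-to-root ordering can be made concrete with a small Python sketch. The tree is encoded as nested tuples of the form (label, child, ...), which is an assumption of this illustration rather than the internal form of any particular DBMS, and execution_order is a hypothetical helper.

```python
# Query tree for "managers who work at a Wolkite branch".
tree = (
    "join s.branchNo = b.branchNo",
    ("select s.position = 'manager'", ("Staff",)),
    ("select b.city = 'Wolkite'", ("Branch",)),
)


def execution_order(node):
    """Post-order walk: visit the inputs (children) before the operation."""
    label, *children = node
    order = []
    for child in children:
        order.extend(execution_order(child))
    order.append(label)
    return order


for step in execution_order(tree):
    print(step)
```

The base relations are listed before the selections that read them, and the join at the root comes last.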

2. Normalization
This stage converts the query into a normalized form that can be more easily manipulated. The
predicate in the WHERE clause is converted to conjunctive normal form or disjunctive normal form.

Conjunctive normal form: A sequence of conjuncts that are connected with the ∧ (AND) operator.
Each conjunct contains one or more terms connected by the ∨ (OR) operator. A conjunctive selection
contains only those tuples that satisfy all conjuncts.

Example:

(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')

Disjunctive normal form: A sequence of disjuncts that are connected with the ∨ (OR) operator.
Each disjunct contains one or more terms connected by the ∧ (AND) operator. A disjunctive selection
contains those tuples formed by the union of all tuples that satisfy the disjuncts.

Example:

(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
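
The two example predicates are the same condition written in the two normal forms. A brute-force truth-table check in Python makes this visible; the Boolean parameters is_manager, high_salary, and in_b003 are stand-ins for the three comparisons and are assumptions of this sketch.

```python
from itertools import product


def cnf(is_manager, high_salary, in_b003):
    """(position = 'Manager' OR salary > 20000) AND branchNo = 'B003'"""
    return (is_manager or high_salary) and in_b003


def dnf(is_manager, high_salary, in_b003):
    """(position = 'Manager' AND B003) OR (salary > 20000 AND B003)"""
    return (is_manager and in_b003) or (high_salary and in_b003)


# Compare the two forms on all 2**3 combinations of truth values.
equivalent = all(
    cnf(*values) == dnf(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)
```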

3. Semantic Analysis
The objective of semantic analysis is to reject normalized queries that are incorrectly formulated
or contradictory.

Incorrectly formulated: A query is incorrectly formulated if some of its components do not
contribute to the generation of the result, which may happen if some join specifications are
missing.

Contradictory: A query is contradictory if its predicate cannot be satisfied by any tuple.

Example:

(position = 'Manager' ∧ position = 'Assistant') ∨ salary > 20000

No tuple can satisfy the first conjunction, because position cannot hold both values at once, so
that part of the predicate is always false (F):

F ∨ salary > 20000

The query can therefore be simplified to (salary > 20000).
4. Simplifications
The objectives of the simplification stage are:

• To detect redundant qualifications,
• To eliminate common subexpressions, and
• To transform the query to a semantically equivalent but more easily and efficiently
computed form.

Typically, access restrictions, view definitions, and integrity constraints are considered at this
stage, some of which may also introduce redundancy.

If the user does not have the appropriate access to all the components of the query, the query
must be rejected.

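View definition

Consider the following view and a query on it:

CREATE VIEW Staff3 AS
SELECT staffNo, fName, lName, salary, branchNo
FROM Staff
WHERE branchNo = 'B003';

SELECT *
FROM Staff3
WHERE (branchNo = 'B003' AND salary > 20000);

Resolving the view against its definition gives:

SELECT staffNo, fName, lName, salary, branchNo
FROM Staff
WHERE (branchNo = 'B003' AND salary > 20000) AND branchNo = 'B003';

The repeated condition is redundant, so the WHERE clause reduces to:

WHERE (branchNo = 'B003' AND salary > 20000)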

Assuming that the user has the appropriate access privileges, an initial optimization is to apply the
well-known idempotency rules of Boolean algebra, such as:

• p ∧ p = p
• p ∧ false = false
• p ∧ true = p
• p ∧ ¬p = false
• p ∧ (p ∨ q) = p
• p ∨ p = p
• p ∨ false = p
• p ∨ true = true
• p ∨ ¬p = true
• p ∨ (p ∧ q) = p
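
Each of these rules can be confirmed by enumerating the truth values of p and q. The Python sketch below does so for all ten; the dictionary rules and its labels are illustrative names chosen for this example.

```python
from itertools import product

# Each entry maps a rule label to a check of left side == right side.
rules = {
    "p AND p = p": lambda p, q: (p and p) == p,
    "p AND false = false": lambda p, q: (p and False) == False,
    "p AND true = p": lambda p, q: (p and True) == p,
    "p AND NOT p = false": lambda p, q: (p and not p) == False,
    "p AND (p OR q) = p": lambda p, q: (p and (p or q)) == p,
    "p OR p = p": lambda p, q: (p or p) == p,
    "p OR false = p": lambda p, q: (p or False) == p,
    "p OR true = true": lambda p, q: (p or True) == True,
    "p OR NOT p = true": lambda p, q: (p or not p) == True,
    "p OR (p AND q) = p": lambda p, q: (p or (p and q)) == p,
}

# A rule fails if it is violated for any combination of truth values.
failures = [
    name
    for name, holds in rules.items()
    if not all(holds(p, q) for p, q in product([False, True], repeat=2))
]
print(failures)
```

An empty list of failures means every rule holds for all four combinations of p and q.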

Integrity constraints

The following assertion ensures that only managers have a salary greater than 20000:

CREATE ASSERTION OnlyManagerSalaryHigh
CHECK ((position <> 'Manager' AND salary < 20000)
OR (position = 'Manager' AND salary > 20000));

5. Query restructuring
The query is restructured to provide a more efficient implementation.

A heuristic approach uses transformation rules to convert one relational algebra expression into
an equivalent form that is known to be more efficient.

Example: Perform selections before joins, as in the query for managers working at a Wolkite branch.

TRANSFORMATION RULES

In listing these rules, we use three relations R, S, and T, where:

• R is defined over the attributes A = {A1, A2, …, An},
• S is defined over the attributes B = {B1, B2, …, Bn},
• p, q, and r denote predicates, and
• L, L1, L2, M, M1, M2, and N denote sets of attributes.
1. Conjunctive selection operations can cascade into individual selection operations (and vice
versa)
• σ p∧q∧r (R) = σp(σq(σr(R)))
• σ branchNo='B0003' ∧ salary>15000 (Staff) = σ branchNo='B0003' (σ salary>15000 (Staff))
2. Commutativity of selection operations
• σp(σq(R)) = σq(σp(R))
• σ branchNo='B0003' (σ salary>15000 (Staff)) = σ salary>15000 (σ branchNo='B0003' (Staff))
• For example, selecting the male students whose grade is above 3.2 gives the same result
whichever of the two conditions is applied first.
3. In a sequence of projection operations, only the last in the sequence is required
• ∏ lName (∏ branchNo,lName (Staff)) = ∏ lName (Staff)
4. Commutativity of selection and projection

If the predicate p involves only the attributes in the projection list, then the selection and
projection operations commute:

∏ A1,…,Am (σp(R)) = σp(∏ A1,…,Am (R)), where p involves only attributes in {A1, A2, …, Am}

∏ fName,lName (σ lName='Beech' (Staff)) = σ lName='Beech' (∏ fName,lName (Staff))
5. Commutativity of theta join (and Cartesian product)
• R ⋈p S = S ⋈p R
• R X S = S X R

Example: Staff ⋈ Staff.branchNo=Branch.branchNo Branch = Branch ⋈ Staff.branchNo=Branch.branchNo Staff

6. Commutativity of union and intersection (but not set difference)
• R ∪ S = S ∪ R
• R ∩ S = S ∩ R
7. Commutativity of selection and set operations (union, intersection, and set difference)
• σp(R ∪ S) = σp(R) ∪ σp(S)
• σp(R ∩ S) = σp(R) ∩ σp(S)
• σp(R - S) = σp(R) - σp(S)
8. Commutativity of projection and union
• ∏L(R ∪ S) = ∏L(S) ∪ ∏L(R)
9. Associativity of theta join (and Cartesian product)
• (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
• (R X S) X T = R X (S X T)
10. Associativity of union and intersection (but not set difference)
• (R ∪ S) ∪ T = S ∪ (R ∪ T)
• (R ∩ S) ∩ T = S ∩ (R ∩ T)

Many DBMSs use heuristics to determine strategies for query processing.
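
A few of the rules can be spot-checked on sample data. The Python sketch below tests rules 1, 2, and 7 on small relations held as sets of tuples. The Staff tuples, the second relation other, and the helper select are invented for illustration; a check on one data set illustrates a rule but does not prove it.

```python
def select(relation, predicate):
    """Selection over a relation held as a set of tuples."""
    return {t for t in relation if predicate(t)}


# Hypothetical tuples of the form (staffNo, branchNo, salary).
staff = {
    ("SG1", "B003", 12000),
    ("SG2", "B003", 18000),
    ("SG3", "B005", 24000),
    ("SG4", "B007", 9000),
}
other = {("SG5", "B003", 30000), ("SG3", "B005", 24000)}


def in_b003(t):
    return t[1] == "B003"


def high_paid(t):
    return t[2] > 15000


# Rule 1: a conjunctive selection cascades into individual selections.
rule1 = select(staff, lambda t: in_b003(t) and high_paid(t)) == select(
    select(staff, high_paid), in_b003
)

# Rule 2: selections commute.
rule2 = select(select(staff, high_paid), in_b003) == select(
    select(staff, in_b003), high_paid
)

# Rule 7: selection distributes over union and set difference.
rule7_union = select(staff | other, high_paid) == (
    select(staff, high_paid) | select(other, high_paid)
)
rule7_diff = select(staff - other, high_paid) == (
    select(staff, high_paid) - select(other, high_paid)
)

print(rule1, rule2, rule7_union, rule7_diff)
```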

QUERY OPTIMIZATION

Query optimization is the activity of choosing an efficient execution strategy for processing a
query. The aim of query optimization is to choose, from the equivalent strategies available, the
one that minimizes resource usage.

Generally, we try to reduce the total execution time of the query, which is the sum of the
execution times of all the individual operations that make up the query.

However, resource usage may also be viewed as the response time of the query, in which case we
concentrate on maximizing the number of parallel operations (pipelining).

Both methods of query optimization depend on database statistics to evaluate properly the
different options that are available. The accuracy and currency of these statistics have a
significant bearing on the efficiency of the execution strategy chosen.

The statistics cover information about relations, attributes, and indexes. The system catalog may
store statistics giving:

• the cardinality of relations,

• the number of distinct values for each attribute, and

• the number of levels in a multilevel index.

Example: Find all managers who work at a Wolkite branch

SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
      (s.position = 'manager' AND b.city = 'Wolkite');

Three equivalent relational algebra queries corresponding to this SQL statement are:

1. σ (position='manager') ∧ (city='Wolkite') ∧ (Staff.branchNo=Branch.branchNo) (Staff X Branch)

2. σ (position='manager') ∧ (city='Wolkite') (Staff ⋈ Staff.branchNo=Branch.branchNo Branch)

3. (σ position='manager' (Staff)) ⋈ Staff.branchNo=Branch.branchNo (σ city='Wolkite' (Branch))

• Assume there are 1000 tuples in Staff, 50 tuples in Branch, 50 managers (one for each
branch), and 5 Wolkite branches.

• We compare these queries based on the number of disk accesses required.

• Assume there are no indexes or sort keys on either relation and that the result of any
intermediate operation is stored on disk.

• Assume tuples are accessed one at a time.

• Assume main memory is large enough to process entire relations for each relational algebra
operation.

The first query calculates the Cartesian product of Staff and Branch, which requires (1000 + 50)
disk accesses to read the relations and creates a relation with (1000 * 50) tuples, which is
written to disk. We then have to read each of these tuples again to test them against the
selection predicate at a cost of another (1000 * 50) disk accesses, giving a total cost of:

(1000 + 50) + 2 * (1000 * 50) = 101050 disk accesses

The second query joins Staff and Branch on the branch number branchNo, which again requires
(1000 + 50) disk accesses to read each of the relations. We know that the join of the two relations
has 1000 tuples, one for each member of staff (a member of staff can only work at one branch),
and these are written to disk. Consequently, the selection operation requires another 1000 disk
accesses to read the result of the join, giving a total cost of:

2 * 1000 + (1000 + 50) = 3050 disk accesses

The final query first reads each Staff tuple to determine the manager tuples, which requires 1000
disk accesses and produces a relation with 50 tuples. The second selection operation reads each
Branch tuple to determine the Wolkite branches, which requires 50 disk accesses and produces a
relation with 5 tuples. The final operation is the join of the reduced Staff and Branch relations,
which requires (50 + 5) disk accesses, giving a total cost of:

1000 + 2 * 50 + 5 + (50 + 5) = 1160 disk accesses
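
The three totals can be recomputed from the stated assumptions with a few lines of Python. The variable names are illustrative, and the cost model is the simple one used above: one disk access per tuple read or written, with every intermediate result written to disk and read back.

```python
staff_tuples = 1000
branch_tuples = 50
managers = 50          # tuples that survive position = 'manager'
wolkite_branches = 5   # tuples that survive city = 'Wolkite'

read_both = staff_tuples + branch_tuples

# Query 1: read both relations, write the product, read it back for selection.
product_size = staff_tuples * branch_tuples
cost_1 = read_both + 2 * product_size

# Query 2: read both relations, write the join result, read it back.
join_size = staff_tuples  # one joined tuple per member of staff
cost_2 = read_both + 2 * join_size

# Query 3: select on each relation first, then join the reduced relations.
cost_3 = (
    (staff_tuples + managers)              # read Staff, write the managers
    + (branch_tuples + wolkite_branches)   # read Branch, write the Wolkite rows
    + (managers + wolkite_branches)        # read both reduced relations
)

print(cost_1, cost_2, cost_3)
print(round(cost_1 / cost_3))  # ratio between the worst and the best strategy
```

Under these assumptions the third strategy is roughly 87 times cheaper than the first, which is why performing selections before joins is such a common heuristic.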

COST ESTIMATION AND STATISTICS

The dominant cost in query processing is usually that of disk accesses, which are slow compared
with memory accesses. Many of the cost estimates are based on the cardinality of the relation.

The success of estimating the size and cost of intermediate relational algebra operations depends
on the amount and currency of the statistical information that the DBMS holds.
