Activity 9 Advance DBM Query Execution Research 5 DARWIN G RARALIO

Republic of the Philippines
Cagayan State University

CARIG CAMPUS
Carig Sur, Tuguegarao City
MSIT 202-ADVANCED DATABASE MANAGEMENT
TOPIC/s Quiz 9: Query Execution(Research 5)

Title of topic/s
 Introduction to Physical-Query-Plan Operators
 Nested-Loop Joins
 One-Pass Algorithms for Database
 Two-Pass Algorithms Based on Sorting
 Two-Pass Algorithms Based on Hashing
 Index-Based Algorithms
 Parallel Algorithms for Relational Operations
 Basic Algorithms for Executing Query Operations

KEYWORDS (main words used in the discussion)
Tuple, Hash, Algorithm, Index, Joins, Iterator, Buffer

GUIDE QUESTIONS (get it from the discussions)
 To become familiar with physical query plans (query execution).
 To be able to merge different tables using joins.
 To be able to use different algorithms in executing queries.

DEFINITION OF TERMS (define each term based on how it was used in the discussion)
Tuple - one record (row) of a relation.
Hash - passing some data through a formula (a hash function) that produces a result, the
hash value.
Algorithm - a step-by-step procedure, which defines a set of instructions to be executed in a
certain order to get the desired output.
Index - a copy of selected columns of data from a table, designed to enable very efficient
search.
Iterator - an object that enables a programmer to traverse a container, particularly lists.
Buffer - a main-memory area used to cache database pages.
Parsing - separating the pieces of a SQL statement into a data structure that other routines
can process.
Processor - the module responsible for executing database queries.
Nested-Loop - reads rows from the first table in a loop one at a time, passing each row to a
nested loop that processes the next table in the join.
SUBMITTED BY: DARWIN G. RARALIO
DISCUSSIONS:
INTRODUCTION TO PHYSICAL-QUERY-PLAN OPERATORS
MSIT-GRADUATE SCHOOL

Query Processor
The group of DBMS components that turns user queries and data-modification commands into a
sequence of database operations and executes those operations.
Query Compilation
Three parts:
1. Parsing: constructs a parse tree for the query.

2. Query rewrite: converts the parse tree into an initial relational-algebra expression, then
rewrites it into a logical query plan that is expected to run faster.
3. Physical plan generation: converts the logical query plan into a physical query plan by
selecting appropriate algorithms and an order of execution.
Physical operators are often implementations of relational-algebra operators.

Examples of physical operators that do not correspond to a relational-algebra operator:
 Scan: brings into main memory each tuple of some relation.
 Iterator: the method by which the operators comprising a physical query plan pass requests
for tuples, and the answers, among themselves.
Reading the contents of a relation R

Table-scan:
Relation R is stored in secondary memory. The blocks containing the tuples of R are known to
the system, and it is possible to get the blocks one by one.
Index-scan:
If there is an index on any attribute of R, we may be able to use this index to get all the
tuples of R.
Sort-scan: there are several reasons to sort a relation as we read its tuples.

Examples:
 An ORDER BY clause in the query
 Operations that require their arguments to be sorted
The physical-query-plan operator sort-scan can be implemented in many ways. One example: if
there is a B-tree index on the sort attribute a, scanning the index yields R in sorted order.

A query is made up of several relational-algebra operations, and a query plan is composed of
several physical operators.
We estimate the cost of each operator by the number of disk I/Os it performs.
To compare algorithms, we assume that the arguments of any operator are found on disk, but
the result of the operator is left in main memory. The size of the result does not depend on
the algorithm, so the cost of a final write belongs to the query, not to the algorithm.
M: number of main-memory buffers (each the size of one block) available to the operator. M
could be smaller than total main memory if several operators share memory.
B or B(R): size of relation R, the number of blocks needed to hold all tuples of R.
T or T(R): number of tuples in R.
T/B: tuples per block.
V(R, [a1, a2, ..., an]): number of distinct values in a column, or in a combination of columns
for multiple attributes.
Table-scan:
 If R is clustered, the scan needs B disk I/Os.
 If R is not clustered, the scan could need up to T disk I/Os, one block read per tuple.
Index-scan:
 If the column data needed by the query is contained in the index:
o SELECT category_id FROM tbl WHERE category_id BETWEEN 10 AND 100;
 There is no need to access the table itself.
 The index often occupies fewer than B blocks.
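As a rough illustration of the table-scan estimates, the sketch below returns the number of disk I/Os from B and T. The function name `table_scan_cost` and the sample numbers are assumptions made for this example, not figures from the discussion.

```python
def table_scan_cost(B, T, clustered):
    """Estimated disk I/Os to scan relation R.

    B: number of blocks holding R, T: number of tuples in R.
    """
    # Clustered: every block of R is read exactly once.
    # Not clustered: in the worst case each tuple sits on its own block.
    return B if clustered else T


# Hypothetical relation: 1,000 blocks holding 20,000 tuples.
print(table_scan_cost(1000, 20000, clustered=True))   # 1000
print(table_scan_cost(1000, 20000, clustered=False))  # 20000
```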
Design pattern to implement physical operators

Three methods:
1. Open(): initializes the data structures needed by the operator.

2. GetNext(): returns the next tuple in the result and adjusts the data structures as necessary.
If there are no more tuples, it returns "not found".
3. Close(): ends the iteration after all tuples have been obtained and calls Close() on any
arguments of the operator.
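The three methods can be sketched in Python as follows. This is a minimal illustration, not a real DBMS interface: the class names `TableScan` and `Select`, the helper `run`, and the use of `None` for "not found" are all assumptions made for the example, and the relation is a plain in-memory list.

```python
class TableScan:
    """Iterator over an in-memory list standing in for relation R."""

    def __init__(self, tuples):
        self.tuples = tuples
        self.pos = None

    def open(self):
        # Initialize the data structures: position before the first tuple.
        self.pos = 0

    def get_next(self):
        # Return the next tuple, or None to signal "not found".
        if self.pos >= len(self.tuples):
            return None
        t = self.tuples[self.pos]
        self.pos += 1
        return t

    def close(self):
        self.pos = None


class Select:
    """Selection operator that pulls tuples from a child iterator."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def get_next(self):
        # Keep asking the child until a tuple satisfies the predicate.
        while (t := self.child.get_next()) is not None:
            if self.predicate(t):
                return t
        return None

    def close(self):
        # Close also calls close on the operator's argument.
        self.child.close()


def run(plan):
    """Drain a plan: open, call get_next until None, then close."""
    plan.open()
    out = []
    while (t := plan.get_next()) is not None:
        out.append(t)
    plan.close()
    return out


rows = [(1, "a"), (2, "b"), (3, "c")]
result = run(Select(TableScan(rows), lambda t: t[0] >= 2))
print(result)  # [(2, 'b'), (3, 'c')]
```

Because `Select` only talks to its child through the same three methods, operators can be stacked without materializing intermediate results.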
Nested-Loop Joins
Introduction to Nested Loop Joins in SQL Server
A relational database system uses SQL as the language for querying and maintaining
databases. To see the data of two or more tables together, we need to join the tables. Joins
are categorized as INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and
CROSS JOIN; which one we use depends on the requirements of the user.
SQL is a declarative language: we write a query according to the SQL standard and ask the
database to fulfill the request. It is then the responsibility of the database to fulfill the
request optimally. In SQL Server, this is the job of the Query Optimizer.
We need not worry about how things actually happen inside SQL Server, but it is good to know
what is happening behind the curtain so that we can figure out why a query is running slowly.
An execution plan contains many iterators for different operations. This section covers only
one of them, the Nested Loop Join, which is a physical join iterator. Whenever you join one
table to another logically, the Query Optimizer chooses one of three physical join iterators
based on a cost-based decision: Hash Match, Nested Loop Join, or Merge Join.
Basics of Nested Loop Join

In relational databases, a JOIN is the mechanism we use to combine the data of two or more
tables, and a Nested Loop Join is the simplest physical implementation of joining two tables.
For example, suppose you are given two tables, StudentInfo and Attendance. How would you
match their data manually?
The StudentInfo table holds student information; it has roll number, name, and address
columns. The Attendance table contains the daily attendance of the students; it has the
student's roll number, present, and attendance date columns.
Suppose you want to see each student's name, address, present, and attendance date in a new
spreadsheet, and you can only use one row of a table at a time. How would you do that?
In all probability, most of us would start with roll number 1 of the StudentInfo table:
 Copy the student's name and address from the StudentInfo table and paste them into a new
spreadsheet.
 Then copy every Present and date value for roll number 1 from the Attendance table and
paste them into the spreadsheet.
After the data for roll number 1 is filled in, the spreadsheet holds one row for each
attendance record of that student.
We then repeat the same process for roll number 2 and roll number 3. When that is complete,
the final result set holds the combined rows for all three students.
If we convert what we did above into pseudocode, it looks like this:
For each row s in the StudentInfo table
    For each row a in the Attendance table
        If s.rollnumber = a.rollnumber
            Return (s.name, s.Address, a.Present, a.AttendanceDate)
Congratulations, you now know how a Nested Loop Join works.
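The manual procedure can be sketched as a runnable Python function. The sample rows below are made up for illustration; only the table and column names follow the discussion.

```python
def nested_loop_join(student_info, attendance):
    """Join two lists of dicts on rollnumber, one outer row at a time."""
    result = []
    for s in student_info:        # outer loop: StudentInfo
        for a in attendance:      # inner loop: Attendance
            if s["rollnumber"] == a["rollnumber"]:
                result.append(
                    (s["name"], s["address"], a["present"], a["attendance_date"])
                )
    return result


# Made-up sample rows, for illustration only.
student_info = [
    {"rollnumber": 1, "name": "Ana", "address": "Tuguegarao"},
    {"rollnumber": 2, "name": "Ben", "address": "Carig Sur"},
]
attendance = [
    {"rollnumber": 1, "present": True, "attendance_date": "2021-01-04"},
    {"rollnumber": 2, "present": False, "attendance_date": "2021-01-04"},
    {"rollnumber": 1, "present": True, "attendance_date": "2021-01-05"},
]
joined = nested_loop_join(student_info, attendance)
for row in joined:
    print(row)
```

Every outer row is compared with every inner row, so the number of comparisons is the product of the two table sizes. That is why this join suits small inputs or an inner table that can be probed cheaply.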

ONE PASS ALGORITHMS FOR DATABASE
One-Pass for Unary Operations

 Consider the unary, tuple-at-a-time operations: selection and projection on relation R.
 Read all blocks of R into the input buffer, one at a time.
 Perform the operation on each tuple and move the selected or projected tuple to the output
buffer.
 The output buffer may be the input buffer of another operator, so it is not counted.
 Thus, the algorithm requires only M = 1 buffer block.
 The I/O cost is B(R).
 If some index is applicable for a selection, we have to read only the blocks that contain
qualifying tuples.
One Pass Algorithms for Binary Operations

 Binary operations: union, intersection, difference, Cartesian product, and join.
 Use subscripts B and S to distinguish between the bag and set versions, for example ∪B and ∪S.
 The bag union R ∪B S can be computed with a very simple one-pass algorithm: copy each
tuple of R and then each tuple of S to the output. (This is the sum model of bag union, in
which the multiplicities of a tuple add.)
 I/O cost is B(R) + B(S), with M = 1.
 The other binary operations require reading the smaller of the two input relations into
main memory.
 One buffer is used to read the blocks of the larger relation, and M-1 buffers hold the entire
smaller relation.
 I/O cost is B(R) + B(S).
 In main memory, a data structure is built that efficiently supports insertions and searches,
e.g. a hash table or a balanced binary tree. Its space overhead can be neglected.
 Memory requirement: M > min(B(R), B(S)).
 For set union, read the smaller relation (S) into M-1 buffers, representing it in a data
structure whose search key consists of all attributes.
 All these tuples are also copied to the output.
 Read all blocks of R into the M-th buffer, one at a time.
 For each tuple t of R, check whether t is in S. If not, copy t to the output.
 For set intersection, copy t to the output only if it is also in S.
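A minimal sketch of one-pass set union and set intersection follows. It assumes S is the smaller relation and fits in memory, uses a Python `set` as the in-memory search structure, and treats lists as relations; the function names are hypothetical.

```python
def one_pass_set_union(R, S):
    """Set union where the smaller relation S fits in main memory.

    R and S are lists of hashable tuples, each without internal duplicates.
    """
    in_memory = set(S)      # search structure keyed on all attributes
    output = list(S)        # every tuple of S is copied to the output
    for t in R:             # stream R, conceptually one block at a time
        if t not in in_memory:
            output.append(t)
    return output


def one_pass_set_intersection(R, S):
    """Set intersection: output t from R only if it is also in S."""
    in_memory = set(S)
    return [t for t in R if t in in_memory]


R = [(1, "x"), (2, "y"), (3, "z")]
S = [(2, "y"), (4, "w")]
print(one_pass_set_union(R, S))         # [(2, 'y'), (4, 'w'), (1, 'x'), (3, 'z')]
print(one_pass_set_intersection(R, S))  # [(2, 'y')]
```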
Two-Pass Algorithms Based on Sorting
Two Pass Algorithms
Data from the operand relation is read into main memory, processed, written out to disk again,
and then reread from disk to complete the operation.
Two-Phase, Multiway Merge Sort

 Used to sort very large relations in two passes.
Phase 1: Repeatedly fill the M buffers with new tuples from R and sort them, using any
main-memory sorting algorithm. Write out each sorted sublist to secondary storage.
Phase 2: Merge the sorted sublists. For this phase to work, there can be at most M-1 sorted
sublists, which limits the size of R to about B(R) <= M(M-1) blocks. We allocate one input
block to each sorted sublist and one block to the output.
Merging
 Find the smallest key among the first remaining elements of all the sublists.
 Move the smallest element to the first available position of the output block.
 If the output block is full, write it to disk and reinitialize the same buffer in main memory
to hold the next output block.
 If the block from which the smallest element was taken is now exhausted of records, read the
next block from the same sorted sublist into the same buffer.
 If no blocks remain in that sublist, leave its buffer empty. Stop when all sublists are
exhausted.
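The two phases can be sketched as below. This is only a simulation under stated assumptions: the sorted sublists are kept in Python lists instead of being written to disk, `block_size` (records per block) is a hypothetical parameter, and the standard-library `heapq.merge` plays the role of the merging loop.

```python
import heapq


def two_phase_merge_sort(records, block_size, M):
    """Sort `records` using at most M buffers of `block_size` records each."""
    # Phase 1: fill the M buffers, sort them in memory, emit a sorted sublist.
    chunk = M * block_size
    sublists = [sorted(records[i:i + chunk])
                for i in range(0, len(records), chunk)]

    # Phase 2 needs one input buffer per sublist plus one output buffer.
    if len(sublists) > M - 1:
        raise ValueError("relation too large for two passes with M buffers")

    # heapq.merge repeatedly yields the smallest head among the sublists.
    return list(heapq.merge(*sublists))


data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(two_phase_merge_sort(data, block_size=2, M=3))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With `block_size=2` and `M=3`, phase 1 produces two sublists of up to six records, which is exactly the M-1 = 2 sublists that phase 2 can merge.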
Duplicate Elimination Using Sorting

 Phase 1 is the same as in the two-phase, multiway merge sort: create the sorted sublists.
 In the second pass, instead of merging to sort, repeatedly select the first unconsidered
tuple t (in sorted order) among all the sorted sublists.
 Write one copy of t to the output and eliminate all occurrences of t from the input blocks.
 The output contains exactly one copy of each tuple in R.
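The second pass can be sketched as follows, assuming the sorted sublists from the first pass are already available as Python lists. The function name `distinct_by_sorting` is hypothetical.

```python
import heapq


def distinct_by_sorting(sublists):
    """Merge sorted sublists and output each distinct tuple exactly once."""
    output = []
    for t in heapq.merge(*sublists):
        # All copies of t arrive consecutively, so comparing with the last
        # output tuple is enough to drop the duplicates.
        if not output or output[-1] != t:
            output.append(t)
    return output


print(distinct_by_sorting([[1, 2, 2, 5], [2, 3, 5], [1, 4]]))
# [1, 2, 3, 4, 5]
```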
Grouping and Aggregation Using Sorting

 Read the tuples of R into memory, M blocks at a time. Sort the tuples in each set of M blocks,
using the grouping attributes of L as the sort key. Write each sorted sublist to disk.
 Use one main-memory buffer for each sublist, and initially load the first block of each sublist
into its buffer.
 Repeatedly find the least value of the sort key present among the first available tuples in the
buffers. This value identifies the next group: accumulate the aggregates of L over all tuples
with that key, then output one tuple for the group.

A Sort-Based Union Algorithm
 In the first phase, create sorted sublists from both R and S.
 Use one main-memory buffer for each sublist of R and S. Initialize each with the first block
from the corresponding sublist.
 Repeatedly find the first remaining tuple t among all the buffers. Copy t to the output and
remove all copies of t from the buffers.
Two-Pass Algorithms Based on Hashing
 Partitioning relations by hashing

 A hash-based algorithm for duplicate elimination
 Hash-based union, intersection, and difference
 The hash-join algorithm
The essential idea behind all these hash-based algorithms is this: if the data is too big to
store in main memory, hash all the tuples of the argument or arguments using an appropriate
hash key, so that tuples that must be considered together fall into the same bucket.
Partitioning relations by hashing:
 Partition R into M-1 buckets of roughly equal size.

 Associate one buffer with each bucket; the remaining buffer reads R one block at a time.
 Each tuple t in the block is hashed to bucket h(t) and copied to the appropriate buffer.
 This assumes that tuples are never too large to fit in an empty buffer.
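The partitioning step can be sketched as below. The buckets are Python lists standing in for bucket files on disk, Python's built-in `hash` stands in for h, and the `key` parameter is a hypothetical way to choose the hash key.

```python
def partition_by_hash(tuples, M, key=lambda t: t):
    """Distribute tuples into M - 1 buckets using a hash of key(t)."""
    num_buckets = M - 1     # one buffer per bucket; the M-th buffer reads input
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(key(t)) % num_buckets].append(t)
    return buckets


buckets = partition_by_hash([10, 11, 12, 13, 14, 15], M=4)
print(buckets)  # [[12, 15], [10, 13], [11, 14]]
```

The printed result relies on CPython hashing a small integer to itself; with other key types the bucket assignment differs, but equal keys always land in the same bucket.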
A Hash-Based Algorithm for Duplicate Elimination
Two copies of the same tuple t will hash to the same bucket.
We can therefore examine one bucket at a time, perform duplicate elimination (δ) on that
bucket in isolation, and take as the answer the union of δ(Ri), where Ri is the portion of R
that hashes to the i-th bucket.
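A minimal sketch of this two-pass duplicate elimination, with in-memory lists standing in for the bucket files and the hypothetical name `distinct_by_hashing`:

```python
def distinct_by_hashing(tuples, M):
    """Two-pass duplicate elimination: partition, then dedupe each bucket."""
    num_buckets = M - 1
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:                    # pass 1: partition R into the R_i
        buckets[hash(t) % num_buckets].append(t)

    output = []
    for bucket in buckets:              # pass 2: one bucket at a time
        seen = set()
        for t in bucket:
            if t not in seen:
                seen.add(t)
                output.append(t)
    return output


result = distinct_by_hashing([3, 1, 3, 2, 1, 4, 4], M=3)
print(sorted(result))  # [1, 2, 3, 4]
```

No tuple needs to be compared across buckets, because two copies of the same tuple can never be sent to different buckets.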
Hash-Based Grouping and Aggregation

To make sure that all tuples of the same group wind up in the same bucket, we must
choose a hash function that depends only on the grouping attributes of the list L.
If there are few groups, then we may actually be able to handle much larger relations R than
is indicated by the B(R) <= M^2 rule.
Hash-Based Union, Intersection, and Difference

When the operation is binary, we must make sure that we use the same hash function to hash
the tuples of both arguments.
Each pair of corresponding buckets is then processed with a one-pass algorithm. The one-pass
algorithms for union, intersection, and difference require that the smaller operand occupies
at most M-1 blocks.
The Hash-Join Algorithm
The only difference between the join and the other binary operations is that we must use as
the hash key just the join attributes. Then we can be sure that if tuples of R and S join, they
will wind up in corresponding buckets Ri and Si for some i.
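The idea can be sketched for relations R(X, Y) and S(Y, Z) joined on Y. The schemas, the function name `hash_join`, and the in-memory buckets are assumptions for illustration; a real implementation would write the buckets to disk between the two passes.

```python
from collections import defaultdict


def hash_join(R, S, num_buckets):
    """Join R(X, Y) with S(Y, Z) on Y by partitioning both on hash(Y)."""

    def partition(rel, key_pos):
        buckets = [[] for _ in range(num_buckets)]
        for t in rel:
            # The hash key is the join attribute only.
            buckets[hash(t[key_pos]) % num_buckets].append(t)
        return buckets

    r_buckets = partition(R, 1)     # Y is the second component of R
    s_buckets = partition(S, 0)     # Y is the first component of S

    output = []
    for r_i, s_i in zip(r_buckets, s_buckets):
        # One-pass join of the pair (R_i, S_i): build on S_i, probe with R_i.
        table = defaultdict(list)
        for y, z in s_i:
            table[y].append(z)
        for x, y in r_i:
            for z in table.get(y, []):
                output.append((x, y, z))
    return output


R = [("a", 1), ("b", 2), ("c", 3)]
S = [(1, "p"), (2, "q"), (2, "r"), (4, "s")]
print(sorted(hash_join(R, S, num_buckets=3)))
# [('a', 1, 'p'), ('b', 2, 'q'), ('b', 2, 'r')]
```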
Saving some Disk I/O

If there is more memory available on the first pass than we need to hold one block per bucket,
then we have some opportunities to save disk I/O.
Hybrid hash-join: when we hash S, we can choose to keep m of the k buckets entirely in main
memory, while keeping only one block for each of the other k-m buckets. This works if the
in-memory buckets plus one block for each remaining bucket fit in the M buffers, that is, if
m*B(S)/k + (k-m) <= M.
Index-Based Algorithms
Clustering and Nonclustering Indexes

A relation is clustered if its tuples are packed into roughly as few blocks as can possibly hold
those tuples.
Clustering indexes are indexes on an attribute or attributes such that all the tuples with
a fixed value for the search key of the index appear on roughly as few blocks as can hold
them. Note that a relation that isn't clustered cannot have a clustering index, but even a
clustered relation can have nonclustering indexes.
A clustering index has all tuples with a fixed value packed into the minimum possible
number of blocks.
Index-Based Selection
 Selection on equality: a = v
 Clustering index on a: cost B(R)/V(R,a)
o If the index on R.a is clustering, then the number of disk I/Os to retrieve the set
will average B(R)/V(R,a). The actual number may be somewhat higher.
 Nonclustering index on a: cost T(R)/V(R,a)
o If the index on R.a is nonclustering, each tuple we retrieve may be on a different
block, so T(R)/V(R,a) is an estimate of the number of disk I/Os we need.
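The two estimates can be checked with a small calculation. The function name and the sample statistics (B(R) = 1000, T(R) = 20000, V(R,a) = 100) are illustrative assumptions, not values from the discussion.

```python
def index_selection_cost(B, T, V, clustering):
    """Estimated disk I/Os for an equality selection using an index on R.a."""
    # Clustering index: matching tuples are packed into about B/V blocks.
    # Nonclustering index: each of the T/V matching tuples may need its own I/O.
    return B / V if clustering else T / V


print(index_selection_cost(1000, 20000, 100, clustering=True))   # 10.0
print(index_selection_cost(1000, 20000, 100, clustering=False))  # 200.0
```

Both figures ignore the I/Os needed to read the index itself.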
Joining by Using an Index

 R ⋈ S: this is a natural join.
 Assume S has an index on the join attribute a.
 Iterate over R; for each tuple of R, fetch the corresponding tuple(s) from S.
 Assume R is clustered. Cost:
o If the index is clustering: B(R) + T(R)B(S)/V(S,a)
o If the index is nonclustering: B(R) + T(R)T(S)/V(S,a)
 Assume both R and S have a sorted index (B+ tree) on the join attribute.
 Then perform a merge join (called a zig-zag join).
 Cost: B(R) + B(S).
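A sketch of the index-based join and its cost estimate follows. A Python dictionary stands in for the index on S, the schemas R(X, Y) and S(Y, Z) and all sample statistics are assumptions for illustration.

```python
from collections import defaultdict


def index_join(R, S):
    """Join R(X, Y) with S(Y, Z) by probing an index on S.Y."""
    index = defaultdict(list)       # stands in for a real index on S.Y
    for y, z in S:
        index[y].append(z)
    output = []
    for x, y in R:                  # iterate over R
        for z in index.get(y, []):  # fetch the matching tuples of S
            output.append((x, y, z))
    return output


def index_join_cost(B_R, T_R, B_S, T_S, V_S, clustering):
    """Estimated disk I/Os, assuming R is clustered."""
    per_probe = B_S / V_S if clustering else T_S / V_S
    return B_R + T_R * per_probe


print(index_join([("a", 1), ("b", 2)], [(2, "q"), (2, "r"), (3, "s")]))
# [('b', 2, 'q'), ('b', 2, 'r')]
print(index_join_cost(1000, 10000, 500, 5000, 100, clustering=True))   # 51000.0
print(index_join_cost(1000, 10000, 500, 5000, 100, clustering=False))  # 501000.0
```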
Joins Using a Sorted Index

 Still consider R(X,Y) ⋈ S(Y,Z).
 Assume there is a sorted index on both R.Y and S.Y
o B-tree or a sorted sequential index
 Scan both indexes in increasing order of Y
o Like merge-join, without the need to sort first
o If the index is dense, we can skip nonmatching tuples without loading them
o Very efficient
When the index is a B-tree, or any other structure from which we can easily extract
the tuples of a relation in sorted order, we have a number of other opportunities to use the
index. Perhaps the simplest is when we want to compute R(X,Y) ⋈ S(Y,Z), and we have such an
index on Y for either R or S. We can then perform an ordinary sort-join, but we do not have
to perform the intermediate step of sorting one of the relations on Y.
As an extreme case, if we have sorting indexes on Y for both R and S, then we need
to perform only the final step of the simple sort-based join. This method is sometimes called
zig-zag join, because we jump back and forth between the indexes finding Y-values that they
share in common. Notice that tuples from R with a Y-value that does not appear in S need
never be retrieved, and similarly, tuples of S whose Y-value does not appear in R need not be
retrieved.
[Figure: A zig-zag join using two indexes]
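The back-and-forth movement between the two indexes can be sketched on two sorted lists of Y-values, which stand in for the leaf levels of the indexes. The function name is hypothetical, the lists are assumed to contain no duplicate keys, and `bisect_left` from the standard library plays the role of an index lookup that jumps ahead.

```python
from bisect import bisect_left


def zigzag_join_keys(r_keys, s_keys):
    """Return the Y-values present in both sorted, duplicate-free key lists."""
    i = j = 0
    common = []
    while i < len(r_keys) and j < len(s_keys):
        if r_keys[i] == s_keys[j]:
            common.append(r_keys[i])    # tuples with this Y-value join
            i += 1
            j += 1
        elif r_keys[i] < s_keys[j]:
            # Jump forward in R's index to the first key >= S's current key.
            i = bisect_left(r_keys, s_keys[j], i)
        else:
            # Jump forward in S's index to the first key >= R's current key.
            j = bisect_left(s_keys, r_keys[i], j)
    return common


print(zigzag_join_keys([1, 3, 4, 6, 7], [2, 3, 6, 9]))  # [3, 6]
```

Only the Y-values returned here require fetching the actual tuples of R and S, which is where the saving in disk I/O comes from.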
Parallel Algorithms for Relational Operations
Models of Parallelism
There is a collection of processors. Often the number of processors p is large, in the
hundreds or thousands.
Each processor has its own local cache.
Of great importance to database processing is the fact that along with these processors are
many disks, perhaps one or more per processor.

Shared-Nothing Architecture
Shared-nothing machines are relatively inexpensive to build, but when we design
algorithms for these machines we must be aware that it is costly to send data from one
processor to another.
Typically, the cost of a message can be broken into a large fixed overhead plus a small
amount of time per byte transmitted.
There is therefore a significant advantage to designing a parallel algorithm so that
communications between processors involve large amounts of data sent at once.
For instance, we might buffer several blocks of data at processor P, all bound for
processor Q.
If Q does not need the data immediately, it may be much more efficient to wait until we have
a long message at P and then send it to Q.
Tuple-at-a-Time Operations in Parallel
First we have to decide how data is best stored. It is useful to distribute our data
across as many disks as possible.
Assume there is one disk per processor. Then if there are p processors, divide any
relation R's tuples evenly among the p processors' disks.
Suppose we want to perform a selection on R. We use each processor to examine the tuples of
R present on its own disk. To avoid communication among processors, we store each output
tuple t at the same processor that has t on its disk.
Thus, the result relation is divided among the processors, just like R is.
We would like the result to be divided evenly among the processors. However, a selection
could radically change the distribution of tuples in the result, compared to the distribution
of R.
Selection
Suppose the selection is a = 10, that is, find all the tuples of R whose value in the attribute
a is 10. Suppose also that we have divided R according to the value of the attribute a. Then
all tuples of R with a = 10 are at one of the processors, and the entire result relation is at
one processor.
To avoid this problem, we need to think carefully about the policy for partitioning our stored
relations among the processors. The best we can do is to use a hash function h that involves
all the components of a tuple. The number of buckets is the number of processors. We can
associate each processor with a bucket and give that processor the contents of its bucket.
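The partitioning policy and the local selection can be sketched as below. This is a sequential simulation: each "processor" is just a list, Python's built-in `hash` over the whole tuple stands in for h, and the function names and sample relation are assumptions for illustration.

```python
def distribute(tuples, p):
    """Assign each tuple to one of p processors by hashing all its components."""
    disks = [[] for _ in range(p)]
    for t in tuples:
        disks[hash(t) % p].append(t)
    return disks


def parallel_select(disks, predicate):
    """Each processor filters its own disk; results stay where they are."""
    return [[t for t in disk if predicate(t)] for disk in disks]


# Hypothetical relation R(a, b) with two distinct values of a.
R = [(a, b) for a in (10, 20) for b in range(6)]
disks = distribute(R, p=3)
selected = parallel_select(disks, lambda t: t[0] == 10)
print([len(part) for part in selected])
```

Because the hash involves every component, tuples with a = 10 are generally spread over several processors instead of piling up on one, which is the point of the policy.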
Basic Algorithms for Executing Query Operations
Catalog Information for Cost Estimates

 Keep statistics in the catalog to estimate the size of the result and the cost of various
operations.
 If we wish to maintain accurate statistics, then every time a relation is modified we must
also update the statistics.
 Expensive!
 Instead, find a compromise between the accuracy of the statistics and query response time.
 Updates are done during periods of light load.
 Note: real-world optimizers maintain many more variables.
Simple Catalog
 n_r: number of tuples in relation r
 b_r: number of blocks containing tuples of r
 s_r: size of a tuple of relation r (bytes)
 f_r: blocking factor of r
 V(A,r): number of distinct values in r for attribute A
 SC(A,r): selection cardinality, the average number of records satisfying a condition on A
o SC(A,r) = n_r / V(A,r), where n_r is the total number of records in r
o e.g., SC(A,r) = 1 if A is a key of r and the condition is equality
 HT_i: number of levels in index i
Measures of Query Cost

 Cost could be measured in terms of disk accesses, CPU time to execute the query, and cost of
communication (e.g., in distributed or parallel systems).
 Disk access is usually the most important cost.
 Measuring CPU time is hard.
 Disk-access cost is considered a reasonable measure of the cost of a query evaluation plan.
 Assume all transfers of blocks have the same cost.
 Use the number of block transfers from disk as a measure of the actual query cost.

Simple SELECT Operation
(E is the estimated cost in block transfers.)
Linear search: retrieve every record in the file and test the condition.
E = b_r. For a selection on a key attribute, assume E = b_r/2.
Binary search: if the selection involves an equality comparison on the attribute on which the
file is ordered, use binary search.
E = ⌈log2(b_r)⌉ + ⌈SC(A,r)/f_r⌉ - 1, or E = ⌈log2(b_r)⌉ if the attribute is a key.
Using a primary index: if the selection involves equality on a key attribute with a primary
index, use the primary index to retrieve (at most one) record.
E = HT_i + 1
Using a primary index to retrieve multiple records: if the selection condition involves a range
on a key field with a primary index, use the index to find the record satisfying the
corresponding equality condition, then retrieve the subsequent records.
E = HT_i + b_r/2, assuming half of the tuples satisfy the condition; E = HT_i + ⌈c/f_r⌉ if the
actual value used in the comparison is known, where c is the estimated number of records
satisfying the condition.
Using a clustering index to retrieve multiple records: if the selection condition involves an
equality comparison on a non-key attribute with a clustering index, use the index to retrieve
all records satisfying the condition.
E = HT_i + ⌈SC(A,r)/f_r⌉
Using a secondary index: if the selection condition involves equality or inequality on a key or
non-key field with a secondary index (non-ordering field), use the index to retrieve the records.
E = HT_i + SC(A,r)
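The estimates can be tried out with a few small functions. This sketch assumes the formulas read E = b_r (linear search), E = ⌈log2(b_r)⌉ + ⌈SC(A,r)/f_r⌉ - 1 (binary search), E = HT_i + 1 (primary index on a key), E = HT_i + ⌈SC(A,r)/f_r⌉ (clustering index), and E = HT_i + SC(A,r) (secondary index). The function names and the sample statistics are made up for illustration.

```python
from math import ceil, log2


def linear_search_cost(b_r, key_equality=False):
    # On average half the file is read when searching for a key value.
    return b_r / 2 if key_equality else b_r


def binary_search_cost(b_r, sc, f_r, key=False):
    base = ceil(log2(b_r))
    # Non-key: also read the further blocks holding the matching records.
    return base if key else base + ceil(sc / f_r) - 1


def primary_index_key_cost(ht_i):
    return ht_i + 1


def clustering_index_cost(ht_i, sc, f_r):
    return ht_i + ceil(sc / f_r)


def secondary_index_cost(ht_i, sc):
    return ht_i + sc


# Hypothetical statistics: n_r = 10000, b_r = 500, f_r = 20, V(A,r) = 50.
n_r, b_r, f_r, V, ht_i = 10000, 500, 20, 50, 2
sc = n_r // V                                    # selection cardinality: 200
print(linear_search_cost(b_r))                   # 500
print(binary_search_cost(b_r, sc, f_r))          # 18
print(primary_index_key_cost(ht_i))              # 3
print(clustering_index_cost(ht_i, sc, f_r))      # 12
print(secondary_index_cost(ht_i, sc))            # 202
```

In this made-up case the secondary index costs far more than the clustering index, because each of the 200 matching records may require its own block access.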
References:
https://slideplayer.com/slide/13028733/
https://www.sqlshack.com/introduction-to-nested-loop-joins-in-sql-server/
https://www2.cs.sfu.ca/CourseCentral/454/bzhou/documents/s22.pdf
https://www.powershow.com/view/f21b5-OTA5M/Basic_Algorithms_for_Executing_Query_Operations_powerpoint_ppt_presentation
