RDBMS Unit-V
UNIT - V
Query Processing
Query processing is the entire activity of translating a query into low-level instructions, optimizing it to save resources, estimating or evaluating its cost, and extracting the requested data from the database.
Goal: to find an efficient query execution plan for a given SQL query that minimizes the cost considerably, especially the time taken.
Cost Factors: disk accesses (which typically consume time) and read/write operations (which typically need resources such as memory/RAM).
SELECT
emp_name
FROM
employee
WHERE
salary>10000;
The problem is that the underlying DBMS engine cannot execute this statement directly. SQL (Structured Query Language), being a high-level language, makes it easy for users to express what data they need and bridges the communication gap with the DBMS, which does not understand human language. However, the internal machinery of the DBMS does not understand SQL either. Before a query can be executed, it must first be converted into a low-level representation.
SQL queries therefore pass through a processing unit that converts them into a low-level form via relational algebra. Since relational algebra expressions are harder to write than SQL queries, the DBMS expects the user to write only SQL; it then processes the query before evaluating it.
Query processing can be divided into a compile-time phase and a run-time phase. The compile-time phase includes parsing and translation, query optimization, and code generation.
In the run-time phase, the database engine is primarily responsible for interpreting and executing the generated query with physical operators and delivering the query output.
Note that as soon as any of these stages encounters an error, it simply throws the error and returns without going any further. Warnings, on the other hand, are not fatal, so they do not stop the processing.
Parsing and Translation
The first step in query processing is parsing and translation. Incoming queries undergo lexical, syntactic, and semantic analysis.
First, the query is broken down into tokens, and white space is removed along with any comments (lexical analysis).
Next, the query is checked for correctness, both syntactically and semantically. The query processor first checks whether the rules of SQL grammar have been followed (syntactic analysis).
Finally, the query processor checks whether the meaning of the query is valid: are the table(s) mentioned in the query present in the database? Are the column(s) referred to actually present in those tables? (semantic analysis)
Once these checks pass, all the tokens are converted into relational expressions, graphs, and trees, which makes further processing of the query easier for the subsequent phases.
Let's consider the same query (mentioned below as well) as an example and see how the flow
works.
Query:
SELECT
emp_name
FROM
employee
WHERE
salary>10000;
The name of the queried table (employee) is looked up in the data dictionary.
The names of the columns mentioned in the tokens (emp_name and salary) are validated for existence.
The types of the operands being compared must match (salary and the value 10000 should have the same data type).
The next step is to translate the generated set of tokens into a relational algebra query. These
are easy to handle for the optimizer in further processes.
Relational graphs and trees can also be generated but for the sake of simplicity, let's keep
them out of the scope for now.
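The lexical and semantic checks described above can be sketched in a few lines. This is a toy illustration, not a real DBMS parser; the catalog dictionary stands in for a hypothetical data dictionary, and the regex tokenizer is deliberately minimal.

```python
# Toy sketch of lexical and semantic analysis for the sample query.
import re

# Hypothetical data dictionary: table -> {column: type}
CATALOG = {"employee": {"emp_name": "varchar", "salary": "int"}}

def tokenize(sql):
    # Lexical analysis: break the query into tokens, dropping whitespace.
    return re.findall(r"[A-Za-z_]\w*|\d+|[<>=*,;]", sql)

def semantic_check(table, columns, catalog):
    # Semantic analysis: does the table exist, and do its columns exist in it?
    if table not in catalog:
        return False
    return all(col in catalog[table] for col in columns)

tokens = tokenize("SELECT emp_name FROM employee WHERE salary > 10000;")
print(tokens)
# ['SELECT', 'emp_name', 'FROM', 'employee', 'WHERE', 'salary', '>', '10000', ';']
print(semantic_check("employee", ["emp_name", "salary"], CATALOG))  # True
```

A real parser would also build a parse tree and verify the SQL grammar; here only the token stream and catalog lookups are shown.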
Query Evaluation
Once the query processor has the above-mentioned relational forms with it, the next step is to
apply certain rules and algorithms to generate a few other powerful and efficient data
structures. These data structures help in constructing the query evaluation plans. For example,
if the relational graph was constructed, there could be multiple paths from source to
destination. A query execution plan will be generated for each of the paths.
For example, one plan could perform projection first, followed by selection; another could perform selection first, followed by projection. The sample query here is kept simple and straightforward for better comprehension, but with joins and views, many more such paths (evaluation plans) open up. The evaluation plans may also include annotations specifying the algorithm(s) to be used for each operation. Relational algebra annotated in this way is known as evaluation primitives. These evaluation primitives are essential, as they define the exact sequence of operations to be performed for a given plan.
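The two alternative plans can be sketched over an in-memory table (the rows below are made-up sample data): selecting first and projecting first are logically equivalent, even though they differ in intermediate result sizes.

```python
# Two equivalent evaluation plans for the sample query, over made-up rows.
employee = [
    {"emp_name": "A", "salary": 9000},
    {"emp_name": "B", "salary": 12000},
    {"emp_name": "C", "salary": 15000},
]

def select(rows, pred):      # relational selection (sigma)
    return [r for r in rows if pred(r)]

def project(rows, cols):     # relational projection (pi)
    return [{c: r[c] for c in cols} for r in rows]

# Plan 1: selection first, then projection.
plan1 = project(select(employee, lambda r: r["salary"] > 10000), ["emp_name"])
# Plan 2: projection first (salary must be kept so the selection can still run).
plan2 = project(select(project(employee, ["emp_name", "salary"]),
                       lambda r: r["salary"] > 10000), ["emp_name"])
print(plan1 == plan2)  # True: both yield [{'emp_name': 'B'}, {'emp_name': 'C'}]
```

Pushing selection down (Plan 1) usually wins in practice because it shrinks the intermediate result before the projection runs.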
Query Optimization
In the next step, the DBMS picks the most efficient evaluation plan based on the cost of each plan. The aim is to minimize query evaluation time. The optimizer also considers the use of any indexes present on the table and the columns involved, and works out the best order in which to execute subqueries, so that only the best plan gets executed.
Simply put, for any query there are multiple evaluation plans, and choosing the one that costs the least is called query optimization. Some of the factors weighed by the optimizer to calculate the cost of a query evaluation plan are:
CPU time
Number of tuples to be scanned
Disk access time
Number of operations
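A back-of-the-envelope cost model can combine these factors. All the weights and plan figures below are made-up assumptions purely for illustration; real optimizers use calibrated statistics.

```python
# Toy cost model: weigh disk accesses, tuples scanned, and CPU operations.
def plan_cost(disk_accesses, tuples_scanned, ops,
              disk_ms=10.0, scan_ms_per_tuple=0.01, cpu_ms_per_op=0.001):
    return (disk_accesses * disk_ms
            + tuples_scanned * scan_ms_per_tuple
            + ops * cpu_ms_per_op)

# Full table scan: many blocks read and every tuple touched.
full_scan = plan_cost(disk_accesses=100, tuples_scanned=10000, ops=10000)
# Hypothetical index lookup on salary: few blocks, few matching tuples.
index_scan = plan_cost(disk_accesses=5, tuples_scanned=200, ops=200)
print(full_scan > index_scan)  # True -> the optimizer would pick the index plan
```

Under these assumed weights the full scan costs 1110 ms against 52.2 ms for the index plan, which is why relevant indexes matter so much.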
Conclusion
Once the query evaluation plan is selected, the system evaluates the generated low-level
query and delivers the output.
Even though the query goes through several stages before it is finally executed, these stages take very little time compared with how long an un-optimized and un-validated query would take to execute. Read more on indexes to see how your queries perform when relevant indexes are missing. To summarise, the flow of query processing involves two phases:
Compile time
o Parsing and Translation: break the query into tokens and check the correctness of the query.
o Query Optimisation: evaluate multiple query execution plans and pick the best of them.
o Query Generation: generate low-level, DB-executable code.
Runtime (execute/evaluate the generated query)
Transaction Concept
A transaction is a single logical unit of work that accesses and possibly modifies the contents of a database. To maintain consistency, every transaction must satisfy the ACID properties: Atomicity, Consistency, Isolation, and Durability.
Atomicity:
By this, we mean that either the entire transaction takes place at once or
doesn’t happen at all. There is no midway i.e. transactions do not occur partially.
Each transaction is considered as one unit and either runs to completion or is not
executed at all.
It involves the following two operations:
Abort: if a transaction aborts, changes made to the database are not visible.
Commit: if a transaction commits, changes made are visible.
Atomicity is also known as the 'All or nothing rule'.
Consider a transaction T consisting of two operations, T1 and T2: transfer of 100 from
account X to account Y.
If the transaction fails after completion of T1 but before completion of T2 (say, after write(X) but before write(Y)), then the amount has been deducted from X but not added to Y. This results in an inconsistent database state. Therefore, the transaction must be executed in its entirety to ensure the correctness of the database state.
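The all-or-nothing transfer above can be sketched as follows. This is an in-memory illustration only (a real DBMS would use logs and locks); the balances match the consistency example that follows.

```python
# All-or-nothing transfer: if a failure hits between write(X) and write(Y),
# the whole transaction is rolled back to the snapshot taken at the start.
accounts = {"X": 500, "Y": 200}

def transfer(accounts, src, dst, amount, fail_midway=False):
    snapshot = dict(accounts)          # remember state for rollback (abort)
    try:
        accounts[src] -= amount        # T1: read(X); X = X - amount; write(X)
        if fail_midway:
            raise RuntimeError("crash between write(X) and write(Y)")
        accounts[dst] += amount        # T2: read(Y); Y = Y + amount; write(Y)
    except RuntimeError:
        accounts.clear()
        accounts.update(snapshot)      # abort: undo the partial change

transfer(accounts, "X", "Y", 100, fail_midway=True)
print(accounts)  # {'X': 500, 'Y': 200} -- rollback kept the state consistent
transfer(accounts, "X", "Y", 100)
print(accounts)  # {'X': 400, 'Y': 300} -- commit; the total is still 700
```

Note that the total (700) is preserved in both outcomes, which is exactly the consistency condition discussed next.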
Consistency:
This means that integrity constraints must be maintained so that the database
is consistent before and after the transaction. It refers to the correctness of a
database. Referring to the example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent. Inconsistency occurs in case T1 completes
but T2 fails. As a result, T is incomplete.
Isolation:
This property ensures that multiple transactions can occur concurrently without leaving the database in an inconsistent state. A change made in one transaction will not be visible to any other transaction until that change is written to memory or has been committed. Executing transactions concurrently must therefore result in a state equivalent to one obtained by executing them serially in some order.
Let X = 50,000 and Y = 500.
Suppose T has executed up to Read(Y) and then T'' starts. As a result of the interleaving of operations, T'' reads the correct value of X but the incorrect (not yet updated) value of Y, and the sum computed by
T'': (X + Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of the transaction:
T: (X + Y = 50,000 + 450 = 50,450).
Durability:
This property ensures that once the transaction has completed execution, the updates
and modifications to the database are stored in and written to disk and they persist
even if a system failure occurs. These updates now become permanent and are stored
in non-volatile memory. The effects of the transaction, thus, are never lost.
Concurrency Control
Concurrency Control is the management procedure that is required for controlling concurrent
execution of the operations that take place on a database.
But before knowing about concurrency control, we should know about concurrent execution.
Concurrent Execution
o In a multi-user system, multiple users can access and use the same database at one
time, which is known as the concurrent execution of the database. It means that the
same database is executed simultaneously on a multi-user system by different users.
o While working on the database transactions, there occurs the requirement of using the
database by multiple users for performing different operations, and in that case,
concurrent execution of the database is performed.
o The simultaneous execution should be performed in an interleaved manner, and no operation should affect the other executing operations, thus maintaining the consistency of the database. However, when transaction operations are executed concurrently, several challenging problems occur that need to be solved.
In a database transaction, the two main operations are READ and WRITE. These operations need to be managed carefully during concurrent execution of transactions, because if they are performed in an uncontrolled interleaved manner, the data may become inconsistent. The following problems occur with concurrent execution of operations:
Lost Update Problem (Write-Write Conflict)
This problem occurs when two different database transactions perform read/write operations on the same database items in an interleaved manner (i.e., concurrent execution) in a way that makes the values of the items incorrect, leaving the database inconsistent.
For example:
Consider the below diagram where two transactions TX and TY, are performed on the
same account A where the balance of account A is $300.
o At time t1, transaction TX reads the value of account A, i.e., $300 (only read).
o At time t2, transaction TX deducts $50 from account A that becomes $250 (only
deducted and not updated/write).
o Alternately, at time t3, transaction TY reads the value of account A that will be $300
only because TX didn't update the value yet.
o At time t4, transaction TY adds $100 to account A that becomes $400 (only added but
not updated/write).
o At time t6, transaction TX writes the value of account A that will be updated as $250
only, as TY didn't update the value yet.
o Similarly, at time t7, transaction TY writes the values of account A, so it will write as
done at time t4 that will be $400. It means the value written by TX is lost, i.e., $250 is
lost.
Dirty Read Problem (Write-Read Conflict)
The dirty read problem occurs when one transaction updates an item of the database, the transaction then fails, and before the data is rolled back, the updated database item is accessed by another transaction. This is a write-read conflict between the two transactions.
For example:
Consider two transactions, TX and TY, performed on account A, which has a balance of $300.
o Transaction TX reads the value of account A, i.e., $300, adds $50 to it, and writes the updated value, $350.
o Transaction TY then reads the value of account A, i.e., $350.
o Transaction TX fails and rolls back, restoring account A to $300.
o But the value of account A remains $350 for transaction TY, which has read uncommitted data; this is the dirty read and is therefore known as the Dirty Read Problem.
Unrepeatable Read Problem (Read-Write Conflict)
Also known as the Inconsistent Retrievals Problem, this occurs when, within a single transaction, two different values are read for the same database item.
For example:
Consider two transactions, TX and TY, performing the read/write operations on account
A, having an available balance = $300. The diagram is shown below:
o At time t1, transaction TX reads the value from account A, i.e., $300.
o At time t2, transaction TY reads the value from account A, i.e., $300.
o At time t3, transaction TY updates the value of account A by adding $100 to the
available balance, and then it becomes $400.
o At time t4, transaction TY writes the updated value, i.e., $400.
o After that, at time t5, transaction TX reads the available value of account A, and that
will be read as $400.
o This means that within the same transaction TX, two different values of account A are read: $300 initially, and $400 after the update made by transaction TY. This is an unrepeatable read and is therefore known as the Unrepeatable Read Problem.
Thus, in order to maintain consistency in the database and avoid such problems in concurrent execution, management is needed, and that is where the concept of concurrency control comes into play.
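The lost update schedule described earlier can be replayed step by step. This is a straight-line sketch of the interleaving (no real threads), using the same timestamps t1-t7 as the example:

```python
# Replaying the lost-update schedule: TX's deduction of $50 is overwritten
# by TY's later write, so TX's update is lost.
A = 300                     # shared account balance

tx_local = A                # t1: TX reads A -> 300
tx_local -= 50              # t2: TX deducts 50 locally -> 250 (not yet written)
ty_local = A                # t3: TY reads A -> still 300 (TX hasn't written)
ty_local += 100             # t4: TY adds 100 locally -> 400 (not yet written)
A = tx_local                # t6: TX writes -> A = 250
A = ty_local                # t7: TY writes -> A = 400; TX's $250 is lost
print(A)                    # 400, whereas a serial schedule would give 350
```

Any serial order (TX then TY, or TY then TX) would end with A = 350, which is why the interleaved result of 400 shows a genuine anomaly.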
Concurrency Control
Concurrency Control is the working concept that is required for controlling and managing the
concurrent execution of database operations and thus avoiding the inconsistencies in the
database. Thus, for maintaining the concurrency of the database, we have the concurrency
control protocols.
Lock-Based Protocol
In this type of protocol, any transaction cannot read or write data until it acquires an
appropriate lock on it. There are two types of lock:
1. Shared lock:
o It is also known as a read-only lock. Under a shared lock, the data item can only be read by the transaction.
o It can be shared between transactions, because while a transaction holds a shared lock, it cannot update the data item.
2. Exclusive lock:
o Under an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: multiple transactions cannot modify the same data item simultaneously.
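The shared/exclusive compatibility rules can be sketched with a minimal lock manager. This is a simplification (no wait queues, no lock upgrades) intended only to show which requests are granted:

```python
# Minimal lock-manager sketch: only S locks are compatible with each other.
class LockManager:
    def __init__(self):
        self.locks = {}  # item -> (mode, set of holding transactions)

    def acquire(self, txn, item, mode):
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})   # no lock yet: grant it
            return True
        held_mode, holders = held
        if held_mode == "S" and mode == "S":   # shared is compatible with shared
            holders.add(txn)
            return True
        return False                           # any pairing with "X" conflicts

lm = LockManager()
print(lm.acquire("T1", "A", "S"))  # True: first lock granted
print(lm.acquire("T2", "A", "S"))  # True: S is compatible with S
print(lm.acquire("T3", "A", "X"))  # False: X conflicts with the held S locks
```

In a real DBMS a denied request would block the transaction (or enqueue it) rather than simply returning False.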
Simplistic lock protocol
It is the simplest way of locking data during a transaction. Simplistic lock-based protocols require every transaction to obtain a lock on the data before performing an insert, delete, or update on it, and the data item is unlocked once the transaction completes.
Pre-claiming lock protocol
o Before initiating execution, the transaction requests the DBMS for locks on all the data items it will use.
o If all the locks are granted, this protocol allows the transaction to begin; when the transaction completes, it releases all the locks.
o If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.
Two-phase locking (2-PL)
This protocol divides the execution of a transaction into two phases:
Growing phase: in the growing phase, a new lock on a data item may be acquired by the transaction, but none can be released.
Shrinking phase: in the shrinking phase, existing locks held by the transaction may be released, but no new locks can be acquired.
The point at which a transaction acquires its last lock is called its lock point; after the lock point, the transaction enters the shrinking phase and can only unlock. For example, a transaction T1 that takes its final lock at step 3 has its lock point at step 3. If lock conversion is allowed, upgrades (shared to exclusive) must happen in the growing phase and downgrades (exclusive to shared) in the shrinking phase.
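The growing/shrinking discipline can be sketched as a small guard class. This is an illustrative sketch, not a full scheduler: it only enforces that no lock is acquired after the first unlock.

```python
# A transaction object that enforces two-phase locking: once any lock is
# released (the lock point has passed), acquiring a new lock is an error.
class TwoPhaseTxn:
    def __init__(self):
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock after first unlock")
        self.held.add(item)           # growing phase: acquire freely

    def unlock(self, item):
        self.shrinking = True         # first unlock starts the shrinking phase
        self.held.discard(item)

t = TwoPhaseTxn()
t.lock("A"); t.lock("B")   # growing phase
t.unlock("A")              # lock point has passed; shrinking begins
try:
    t.lock("C")            # illegal under 2-PL
except RuntimeError as e:
    print(e)               # 2PL violation: lock after first unlock
```

Enforcing this discipline across all transactions guarantees conflict-serializable schedules, though it does not by itself prevent deadlocks.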
Deadlock in DBMS
A deadlock is a condition where two or more transactions wait indefinitely for one another to give up locks. Deadlock is said to be one of the most feared complications in a DBMS, as no task ever finishes and every involved transaction waits forever.
For example: In the student table, transaction T1 holds a lock on some rows and needs to
update some rows in the grade table. Simultaneously, transaction T2 holds locks on some
rows in the grade table and needs to update the rows in the Student table held by Transaction
T1.
Now the main problem arises: transaction T1 waits for T2 to release its lock, and similarly transaction T2 waits for T1 to release its lock. All activity comes to a standstill and remains there until the DBMS detects the deadlock and aborts one of the transactions.
Deadlock Avoidance
o When a database is at risk of reaching a deadlock state, it is better to avoid the deadlock than to abort and restart transactions after the fact, since aborting and restarting wastes time and resources.
Deadlock Detection
In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should detect whether the transaction is involved in a deadlock. The lock manager maintains a wait-for graph to detect deadlock cycles in the database. In the scenario above, the wait-for graph has an edge from T1 to T2 and an edge from T2 to T1, forming a cycle that indicates deadlock.
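Cycle detection on a wait-for graph can be sketched with a depth-first search. An edge "T1" -> "T2" means "T1 waits for a lock held by T2"; a cycle means deadlock:

```python
# Deadlock detection: DFS with a recursion stack over the wait-for graph.
def has_cycle(graph):
    visited, stack = set(), set()
    def dfs(node):
        visited.add(node)
        stack.add(node)               # nodes on the current DFS path
        for nxt in graph.get(node, []):
            if nxt in stack or (nxt not in visited and dfs(nxt)):
                return True           # back edge found: a cycle exists
        stack.discard(node)
        return False
    return any(dfs(n) for n in graph if n not in visited)

# The scenario above: T1 waits for T2 and T2 waits for T1.
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # True  -> deadlock
print(has_cycle({"T1": ["T2"], "T2": []}))      # False -> no deadlock
```

When a cycle is found, the DBMS picks a victim transaction in the cycle and aborts it to break the deadlock.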
Deadlock Prevention
o The deadlock prevention method is suitable for a large database. If resources are allocated in such a way that a deadlock can never occur, the deadlock is prevented.
o The database management system analyzes the operations of a transaction to decide whether they can create a deadlock situation. If they can, the DBMS never allows that transaction to be executed.
Wait-Die scheme
In this scheme, if a transaction requests a resource that is already held with a conflicting lock by another transaction, the DBMS simply checks the timestamps of both transactions and allows only the older transaction to wait until the resource becomes available.
Let's assume there are two transactions, Ti and Tj, and let TS(T) be the timestamp of any transaction T. If Tj holds a lock on some resource and Ti requests that resource, the DBMS performs the following checks:
1. If TS(Ti) < TS(Tj), i.e., Ti is the older transaction and Tj holds the resource, then Ti is allowed to wait until the data item is available. That is, if an older transaction requests a resource locked by a younger transaction, the older transaction is allowed to wait.
2. If TS(Ti) < TS(Tj), i.e., Ti is the older transaction and holds the resource, and Tj is waiting for it, then Tj is killed and restarted later with a random delay but with the same timestamp.
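The wait-die decision reduces to a single timestamp comparison, viewed from the requester's side. A small sketch (lower timestamp = older transaction):

```python
# Wait-die decision: transaction with ts_requester asks for an item held by
# the transaction with ts_holder.
def wait_die(ts_requester, ts_holder):
    if ts_requester < ts_holder:
        return "wait"   # older requester waits for the younger holder
    return "die"        # younger requester is rolled back (keeps its timestamp)

print(wait_die(ts_requester=5, ts_holder=9))   # wait: the older Ti may wait
print(wait_die(ts_requester=9, ts_holder=5))   # die: the younger Tj is killed
```

Because a killed transaction restarts with its original timestamp, it eventually becomes the oldest requester and is guaranteed to proceed, so no transaction starves.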
RECOVERY SYSTEM
DATABASE RECOVERY: As in any computer system, failures can happen in a database system, yet the data stored in the database should be available whenever it is needed. Database recovery means recovering the data when it gets deleted, corrupted, or damaged accidentally. Atomicity must also be preserved: whether or not a transaction completes, its effects should either be reflected in the database permanently or not affect the database at all. Database recovery and database recovery techniques are therefore essential in a DBMS. The main database recovery techniques in DBMS are given below.
Recovery is the process of restoring a database to the correct state in the event of a
failure.
It ensures that the database is reliable and remains in a consistent state in case of a
failure.
Database recovery can be classified into two parts:
1. Rolling forward applies redo records to the corresponding data blocks.
2. Rolling back applies rollback (undo) segments to the datafiles; this undo information is stored in transaction tables.
Crash recovery:
A DBMS can be an extremely complicated system, with many transactions being executed each second. Its durability and robustness depend on its complex architecture and its underlying hardware and system software. If it fails or crashes in the middle of transactions, the system is expected to follow some kind of algorithm or technique to recover the lost data.
Classification of failure:
To see where the problem has occurred, we generalize failures into the following classes:
Transaction failure
System crash
Disk failure
Storage structure:
Classification of Storage
1. Volatile storage: As the name suggests, volatile storage cannot survive system
crashes. Volatile storage devices are placed very near the CPU; usually, they are
embedded on the chipset itself. For instance, main memory and cache memory are
examples of volatile storage. They are fast but can store only a small amount of data.
2. Non-volatile storage: These memories are designed to survive system crashes. They
are huge in storage capacity but slower to access. Examples include hard disks,
magnetic tapes, flash memory, and non-volatile (battery-backed) RAM.
Recovery and Atomicity:
When a system crashes, it may have many transactions being executed and numerous files opened for them to modify data items. Transactions are made up of numerous operations that are atomic in nature. But according to the ACID properties of a database, the atomicity of each transaction as a whole must be maintained: either all of its operations are executed or none.
When a database management system recovers from a crash, it should do the following:
It should check the states of all the transactions that were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of the transaction in this case.
There are two types of techniques that can help a DBMS recover while maintaining the atomicity of transactions:
Maintaining a log of every transaction, and writing the log onto stable storage before actually modifying the database.
Maintaining shadow paging, where the changes are made on volatile memory, and later the actual database is updated.
Log-based recovery can work in one of two modes:
Deferred database modification − all logs are written to stable storage, and the database is updated only when a transaction commits.
Immediate database modification − each log record is followed by the actual database modification; that is, the database is modified immediately after every operation.
When more than one transaction are being executed in parallel, the logs are interleaved. At
the time of recovery, it would become hard for the recovery system to backtrack all logs, and
then start recovering. To ease this situation, most modern DBMS use the concept of
'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the
memory space available in the system. As time passes, the log file may grow too big to be
handled at all. Checkpoint is a mechanism where all the previous logs are removed from the
system and stored permanently in a storage disk. Checkpoint declares a point before which
the DBMS was in consistent state, and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following
manner −
The recovery system reads the logs backwards from the end to the last checkpoint.
It maintains two lists, an undo-list and a redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn,
Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it
puts the transaction in undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the transactions in the redo-list are redone and their logs are saved.
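The backward log scan described above can be sketched as follows. The log format here is a simplified assumption (pairs of transaction id and action) standing in for the <Tn, Start>/<Tn, Commit> records:

```python
# Recovery pass: scan the log backwards, building a redo-list (transactions
# with both start and commit/abort) and an undo-list (start but no
# commit/abort).
def recover(log):
    redo, undo = [], []
    finished = set()                  # transactions seen to commit or abort
    for txn, action in reversed(log):
        if action in ("commit", "abort"):
            finished.add(txn)
        elif action == "start":
            (redo if txn in finished else undo).append(txn)
    return redo, undo

log = [("T1", "start"), ("T2", "start"), ("T1", "commit"), ("T3", "start")]
redo, undo = recover(log)
print(redo)  # ['T1'] -- has both start and commit, so it is redone
print(undo)  # ['T3', 'T2'] -- no commit/abort record, so they are undone
```

A full implementation would stop the scan at the last checkpoint rather than reading the entire log, which is exactly the saving that checkpoints provide.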