
UNIT V

Transaction Concept: Transaction State, Implementation of Atomicity and Durability, Concurrent


Executions, Serializability, Recoverability, Implementation of Isolation, Testing for Serializability, Failure
Classification, Storage, Recovery and Atomicity, Recovery algorithm. Indexing Techniques: B+ Trees:
Search, Insert, Delete algorithms, File Organization and Indexing, Cluster Indexes, Primary and Secondary
Indexes, Index Data Structures: Hash Based Indexing, Tree Based Indexing, Comparison of File
Organizations, Indexes and Performance Tuning

Part-1 Transaction Concept:

Introduction

• A transaction is a unit of program execution that accesses and possibly updates various data items.
• The transaction consists of all operations executed between the statements begin and end of the
transaction
• Transaction operations: Access to the database is accomplished in a transaction by the following two
operations:
• read(X): performs the read operation of data item X from the database
• write(X): performs the write operation of data item X to the database
• A transaction must see a consistent database. During transaction execution the database may be
temporarily inconsistent. When the transaction is committed, the database must be consistent.
• Two main issues to deal with:
1. Failures, e.g. hardware failures and system crashes
2. Concurrency, for simultaneous execution of multiple transactions

PROPERTIES OF TRANSACTION (ACID PROPERTIES):

To ensure consistency and completeness of the database in scenarios of concurrent access and system failure,
the following ACID properties are enforced on the database.

1. Atomicity,

2. Consistency,

3. Isolation and

4. Durability

Atomicity:

• This property states that either all of the instructions within a transaction are executed or none of
them is; every transaction must execute atomically.
• It involves the following two operations.
o Abort: if a transaction aborts, the changes it made to the database are not visible.
o Commit: if a transaction commits, the changes it made are visible.
• Atomicity is also known as the ‘all or nothing rule’.

Example:

• Consider the following transaction T consisting of T1 and T2: Transfer of 100 from account X to
account Y.

If the transaction fails after completion of T1 but before completion of T2 (say, after write(X) but
before write(Y)), then the amount has been deducted from X but not added to Y. This results in an
inconsistent database state. Therefore, the transaction must be executed in its entirety in order to ensure
correctness of the database state, as the sketch below illustrates.
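A minimal sketch of this all-or-nothing behaviour, using SQLite from Python (the table, account names and
amounts are illustrative, chosen to match the consistency example further below): either both updates of the
transfer persist, or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('X', 500), ('Y', 200)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'X'")
        # a crash or exception here would roll back the whole transfer
        conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'Y'")
except sqlite3.Error:
    pass  # the partial transfer has already been rolled back

print(dict(conn.execute("SELECT name, balance FROM account")))
# {'X': 400, 'Y': 300} -- either both updates happen or neither does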

Consistency
• The database must remain in a consistent state even after performing any kind of transaction, ensuring
correctness of the database. This means that integrity constraints must be maintained so that the
database is consistent before and after the transaction. Consistency refers to the correctness of a database.
• Each transaction, run by itself with no concurrent execution of other transactions, must preserve the
consistency of the database. The DBMS assumes that this property holds for each transaction; ensuring
it is the responsibility of the user.
Example:

• Consider the following transaction T consisting of T1 and T2: Transfer of 100 from account X to
account Y.

• Referring to the example above,


The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent. Inconsistency would occur if T1 completed but T2 failed, leaving
T incomplete.
Isolation
• This property ensures that multiple transactions can occur concurrently without leading to the
inconsistency of database state.
• Transactions occur independently, without interference. Changes made by a particular transaction
are not visible to any other transaction until the changes of that transaction have been written to
main memory or committed.
• This property ensures that executing transactions concurrently results in a state that is
equivalent to the state obtained had they been executed serially in some order.

Example:
• Let X = 50,000 and Y = 500, and let T transfer 50 from X to Y.
Consider two transactions T and T’’.

Suppose T has executed up to Read(Y) (it has already written X = 49,950 but not yet written Y) and
then T’’ starts. As a result, interleaving of operations takes place, due to which T’’ reads the updated
value of X but the old value of Y, and the sum computed by
T’’: (X + Y = 49,950 + 500 = 50,450)
is thus not consistent with the correct total
(X + Y = 50,000 + 500 = 50,500).
• This results in database inconsistency: 50 units appear to be lost. Hence, transactions must take
place in isolation, and changes should become visible only after they have been committed. The
small simulation below illustrates the inconsistent read.
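A tiny straight-line simulation of the interleaving described above, with plain Python variables standing in
for the database items (the values are the assumed ones from the example):

x, y = 50_000, 500

# T begins: transfer 50 from X to Y
x = x - 50                  # T has performed write(X)
# ... T has read Y but not yet written it back ...

# T'' now runs to completion and computes the total
total_seen_by_t2 = x + y    # 49,950 + 500 = 50,450  (inconsistent)

y = y + 50                  # T finally performs write(Y)
total_after_t = x + y       # 49,950 + 550 = 50,500  (consistent)

print(total_seen_by_t2, total_after_t)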

Durability:
• This property ensures that once the transaction has completed execution, the updates and
modifications to the database are stored in and written to disk and they persist even if a system failure
occurs.
• These updates now become permanent and are stored in non-volatile memory. The effects of the
transaction, thus, are never lost.

-------------------------------------------------------x----------------------------------------------------------------
5.1 Transaction State
A transaction may appear to the user to be a single atomic operation, but in fact it goes through a number of
states during its lifetime, as follows:

The different transaction states are:

1. Active State –
While the instructions of the transaction are running, the transaction is in the active state. If all the read
and write operations are performed without any error, it goes to the “partially committed state”; if any
instruction fails, it goes to the “failed state”.
2. Partially Committed –
After completion of all the read and write operations, the changes exist only in main memory or the local
buffer. If the changes are then made permanent on the database, the state changes to the “committed state”;
in case of failure it goes to the “failed state”.

3. Failed State –
The transaction goes to the “failed state” when any of its instructions fails, or when a failure occurs while
making the changes permanent on the database.

4. Aborted State –
After any type of failure the transaction moves from the “failed state” to the “aborted state”. Since in the
previous states the changes were made only to the local buffer or main memory, these changes are
deleted or rolled back.

5. Committed State –
This is the state in which the changes have been made permanent on the database and the transaction is
complete. A small state-machine sketch of these transitions is shown below.
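A compact sketch of the five states and the allowed transitions between them, modeled as an explicit state
machine in Python (illustrative only; a real DBMS of course tracks much more than this):

from enum import Enum, auto

class TxnState(Enum):
    ACTIVE = auto()
    PARTIALLY_COMMITTED = auto()
    COMMITTED = auto()
    FAILED = auto()
    ABORTED = auto()

# Allowed transitions between the five states
TRANSITIONS = {
    TxnState.ACTIVE: {TxnState.PARTIALLY_COMMITTED, TxnState.FAILED},
    TxnState.PARTIALLY_COMMITTED: {TxnState.COMMITTED, TxnState.FAILED},
    TxnState.FAILED: {TxnState.ABORTED},
    TxnState.COMMITTED: set(),
    TxnState.ABORTED: set(),
}

def move(state: TxnState, target: TxnState) -> TxnState:
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {target.name}")
    return target

# e.g. a successful transaction: ACTIVE -> PARTIALLY COMMITTED -> COMMITTED
s = TxnState.ACTIVE
s = move(s, TxnState.PARTIALLY_COMMITTED)
s = move(s, TxnState.COMMITTED)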

----------------------------------------------------------x-----------------------------------------------------------

5.2 Implementation of Atomicity and Durability

Durability and atomicity can be ensured by using the recovery manager, which is available by default in every
DBMS. Atomicity can be implemented by using
1. the shadow-copy technique, or
2. the recovery manager available by default in the DBMS.
Shadow copying technique:
• A shadow copy of the original database is maintained, and the changes made by a transaction are reflected
in the database only after the transaction commits.
• The scheme assumes that the database is simply a file on disk.
• A pointer called db-pointer is maintained on disk; it points to the current copy of the database.
• In the shadow-copy scheme, a transaction that wants to update the database first creates a complete copy of the
database. All updates are done on the new database copy, leaving the original copy, the shadow copy,
untouched. If at any point the transaction has to be aborted, the system merely deletes the new copy. The
old copy of the database is not affected.
• If the transaction completes, it is committed as follows.
• First, the operating system is asked to make sure that all pages of the new copy of the database have
been written out to disk. (Unix systems use the fsync system call for this purpose.)
• After the operating system has written all the pages to disk, the database system updates the pointer db-
pointer to point to the new copy of the database; the new copy then becomes the current copy of the
database. The old copy of the database is then deleted
• We now consider how the technique handles transaction and system failures.
• First, consider transaction failure. If the transaction fails at any time before db-pointer is updated, the
old contents of the database are not affected. We can abort the transaction by just deleting the new
copy of the database. Once the transaction has been committed, all the updates that it performed are in
the database pointed to by db-pointer. Thus, either all updates of the transaction are reflected, or none
of the effects are reflected, regardless of transaction failure.

• Now consider the issue of system failure. Suppose that the system fails at any time before the updated
db-pointer is written to disk. Then, when the system restarts, it will read db-pointer and will thus see the
original contents of the database, and none of the effects of the transaction will be visible on the
database. Next, suppose that the system fails after db-pointer has been updated on disk. Before the
pointer is updated, all updated pages of the new copy of the database were written to disk. Again, we
assume that, once a file is written to disk, its contents will not be damaged even if there is a system
failure. Therefore, when the system restarts, it will read db-pointer and will thus see the contents of the
database after all the updates performed by the transaction.
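A minimal sketch of this shadow-copy commit sequence in Python, assuming the database is a single file on
disk and db-pointer is a small file holding the name of the current copy (the file names and the update
callback are illustrative assumptions):

import os, shutil

DB_POINTER = "db-pointer"            # small file holding the name of the current copy

def current_db() -> str:
    with open(DB_POINTER) as f:
        return f.read().strip()

def run_transaction(update):
    old = current_db()
    new = old + ".new"
    shutil.copyfile(old, new)        # complete copy of the database; the shadow stays untouched
    try:
        update(new)                  # all updates go to the new copy only
        with open(new, "rb") as f:   # make sure every page of the new copy is on disk
            os.fsync(f.fileno())
        tmp = DB_POINTER + ".tmp"
        with open(tmp, "w") as f:    # then switch db-pointer: this is the commit point
            f.write(new)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, DB_POINTER)  # atomic pointer update
        os.remove(old)               # the old copy is no longer needed
    except Exception:
        if os.path.exists(new):
            os.remove(new)           # abort: just delete the new copy
        raise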
-----------------------------------------------------------x--------------------------------------------------------
5.3 Concurrent Executions

• Concurrent execution means executing a set of transactions simultaneously, in a preemptive, time-shared
manner. In a DBMS, concurrent execution of transactions is implemented through interleaved execution.

Transaction Schedules:
• Schedule: a schedule is the list of actions executed by a set of transactions.
• A schedule groups the actions of the transactions into one sequence and executes them in a defined order.
• Schedules can be classified into 2 types:
1. Serial schedule.
2. Concurrent schedule.
1. Serial schedule:
In a serial schedule the transactions execute one after the other, ensuring correctness of
the data. A schedule is called a serial schedule if the transactions in the schedule are
defined to execute one after the other.
T1              T2
R(A)
W(A)
R(B)
W(B)
                R(A)
                R(B)

Where R(A) denotes that a read operation is performed on some data item ‘A’. This is a serial
schedule since the transactions perform serially in the order T1 → T2.

2. Concurrent schedule: A concurrent schedule allows the transactions to be executed in an interleaved
manner of execution.
Complete schedule: a complete schedule is one in which each transaction commits (or aborts) before
terminating, e.g. a schedule in which both T1 and T2 terminate after committing their operations.
--------------------------------------------------------------X-----------------------------------------------------------------

5.4 Serializability

• A schedule is said to be serializable if it is equivalent to a serial schedule.


• Serializability aspects are:
1. Conflict serializability.
2. View serializability.

1. Conflict Serializable:
A schedule is called conflict serializable if it can be transformed into a serial schedule by swapping
non-conflicting operations.
Conflicting operations: Two operations are said to be conflicting if all of the following conditions hold:
1. They belong to different transactions
2. They operate on the same data item
3. At least one of them is a write operation
Example: –
• The pair (R1(A), W2(A)) is conflicting because the two operations belong to different transactions, act on
the same data item A, and one of them is a write operation.
• Similarly, (W1(A), W2(A)) and (W1(A), R2(A)) are also conflicting pairs.
• On the other hand, (R1(A), W2(B)) is non-conflicting because the operations act on different data
items.
• Similarly, (W1(A), W2(B)) is non-conflicting. A small sketch of this test is given below.
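A short Python sketch of the conflict test, with operations modeled as (transaction, action, item) tuples,
e.g. (1, 'R', 'A') for R1(A); this encoding is an assumption made purely for illustration.

def conflicts(op1, op2) -> bool:
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return (t1 != t2                 # different transactions
            and x1 == x2             # same data item
            and 'W' in (a1, a2))     # at least one write operation

print(conflicts((1, 'R', 'A'), (2, 'W', 'A')))   # True
print(conflicts((1, 'R', 'A'), (2, 'W', 'B')))   # False (different data items)
print(conflicts((1, 'W', 'A'), (1, 'R', 'A')))   # False (same transaction)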
Example1:
Consider the following schedule:
S1: R1(A), W1(A), R2(A), W2(A), R1(B), W1(B), R2(B), W2(B)

If Oi and Oj are two operations of the same transaction and Oi < Oj (Oi is executed before Oj), the same
order must be followed in the schedule as well. Using this property, we can get the two transactions of schedule S1 as:

T1: R1(A), W1(A), R1(B), W1(B)


T2: R2(A), W2(A), R2(B), W2(B)
Possible Serial Schedules are: T1->T2 or T2->T1
-> Swapping the non-conflicting pairs (W2(A), R1(B)) and then (R2(A), R1(B)) in S1, the schedule becomes,
S11: R1(A), W1(A), R1(B), R2(A), W2(A), W1(B), R2(B), W2(B)
-> Similarly, swapping the non-conflicting pairs (W2(A), W1(B)) and then (R2(A), W1(B)) in S11, the schedule becomes,
S12: R1(A), W1(A), R1(B), W1(B), R2(A), W2(A), R2(B), W2(B)

S12 is a serial schedule in which all operations of T1 are performed before starting any operation of T2.
Since S1 has been transformed into the serial schedule S12 by swapping non-conflicting operations of S1,
S1 is conflict serializable.
Example2:
Let us take another Schedule:
S2: R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)

Two transactions will be:


T1: R1(A), W1(A), R1(B), W1(B)
T2: R2(A), W2(A), R2(B), W2(B)

Possible Serial Schedules are: T1->T2 or T2->T1


Original Schedule is:
S2: R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)

In S2, the pair (W2(A), R1(A)) is conflicting, so W2(A) must remain before R1(A); this forces T2 to precede
T1 in any equivalent serial schedule. At the same time, the pair (W1(B), R2(B)) is conflicting, so W1(B) must
remain before R2(B); this forces T1 to precede T2. The two requirements contradict each other, so no
sequence of swaps of non-conflicting operations can transform S2 into a serial schedule (the operations of
one transaction would have to be reordered, which is not allowed). So S2 is not conflict serializable.

Conflict Equivalent: Two schedules are said to be conflict equivalent when one can be transformed to
another by swapping non-conflicting operations. In the example discussed above, S11 is conflict
equivalent to S1 (S1 can be converted to S11 by swapping non-conflicting operations). Similarly, S11 is
conflict equivalent to S12 and so on.

Note 1: Swaps are permitted only between non-conflicting operations, and the order of operations within
each transaction must never change. In S2, the conflicting pairs impose contradictory orderings on T1 and
T2, which is exactly why it cannot be made serial by such swaps.

Note 2: A schedule that is conflict serializable is always conflict equivalent to at least one serial
schedule. Schedule S1 discussed above (which is conflict serializable) is conflict equivalent to the serial
schedule (T1->T2).

2. View serializability.

o View serializability is used to decide whether a schedule is view serializable
or not. A schedule is said to be view serializable if it is view equivalent to a serial
schedule (where no interleaving of transactions is possible).
o If a schedule is conflict serializable, then it is also view serializable.
o A schedule that is view serializable but not conflict serializable contains blind writes.

Conditions for schedules to be view-equivalent:


Two schedules S1 and S2 are said to be view-equivalent if the conditions below are satisfied:
1) Initial Read
If a transaction T1 reads data item A from the database (its initial value) in S1, then in S2 also T1 should
read A from the database.

T1 T2 T3
------------------------
R(A)
W(A)
R(A)
R(B)
Transaction T2 is reading A from database.
2)Updated Read
If Ti is reading A which is updated by Tj in S1 then in S2 also Ti should read A which is updated by Tj.

S1:                              S2:
T1      T2      T3               T1      T2      T3
-----------------------          ----------------------
W(A)                             W(A)
        W(A)                                     R(A)
                R(A)                     W(A)
The two schedules above are not view-equivalent: in S1, T3 reads A updated by T2, while in S2, T3 reads
A updated by T1.
3)Final Write operation
If a transaction T1 updated A at last in S1, then in S2 also T1 should perform final write operations.

S1:                      S2:
T1      T2               T1      T2
----------------         ------------------
R(A)                     R(A)
        W(A)             W(A)
W(A)                             W(A)
The two schedules above are not view-equivalent, since the final write operation on A in S1 is done by T1
while in S2 it is done by T2.
View Serializability: A schedule is called view serializable if it is view equivalent to a serial schedule (no
overlapping transactions).
Example:
Schedule S
With 3 transactions, the total number of possible serial schedules = 3! = 6
1. S1 = <T1 T2 T3>
2. S2 = <T1 T3 T2>
3. S3 = <T2 T3 T1>
4. S4 = <T2 T1 T3>
5. S5 = <T3 T1 T2>
6. S6 = <T3 T2 T1>
Taking first schedule S1:

Schedule S1

Step 1: Initial Read


The initial read operation in S is done by T1 and in S1, it is also done by T1.

Step 2: Updated Read


In both schedules S and S1, there is no read other than the initial read, so we do not need to check this
condition.

Step 3: Final Write


The final write operation in S is done by T3 and in S1, it is also done by T3. So, S and S1 are view
equivalent.
The first serial schedule S1 satisfies all three conditions, so we do not need to check the other schedules.
Hence, view equivalent serial schedule is:
T1 → T2 → T3
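The three view-equivalence conditions can also be checked mechanically. A rough Python sketch follows
(the (transaction, action, item) encoding of operations is again an assumption): the reads-from information
covers conditions 1 and 2, and the final writers cover condition 3.

def view_summary(schedule):
    reads_from = {}          # txn -> list of (item, writer txn or None for initial read), in order
    last_writer = {}         # item -> transaction that wrote it most recently
    for txn, action, item in schedule:
        if action == 'R':
            reads_from.setdefault(txn, []).append((item, last_writer.get(item)))
        elif action == 'W':
            last_writer[item] = txn
    return reads_from, last_writer   # reads (conditions 1 and 2) and final writes (condition 3)

def view_equivalent(s1, s2) -> bool:
    return view_summary(s1) == view_summary(s2)

# The "Updated Read" example above: in S1, T3 reads A written by T2; in S2 it reads A written by T1.
s1 = [(1, 'W', 'A'), (2, 'W', 'A'), (3, 'R', 'A')]
s2 = [(1, 'W', 'A'), (3, 'R', 'A'), (2, 'W', 'A')]
print(view_equivalent(s1, s2))   # False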
------------------------------------------------------------x-----------------------------------------------------------
5.5 Recoverability

Recoverable Schedules:
• Schedules in which transactions commit only after all transactions whose changes they read commit
are called recoverable schedules. In other words, if some transaction Tj is reading a value updated or
written by some other transaction Ti, then the commit of Tj must occur after the commit of Ti.
Example1:

The following schedule is not recoverable if T9 commits immediately after the read.
If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent state. Hence, the
database must ensure that schedules are recoverable.

Example 2: Consider the following schedule involving two transactions T1 and T2.

T1              T2
R(A)
W(A)
                W(A)
                R(A)
commit
                commit
This is a recoverable schedule since T1 commits before T2, which makes the value read by T2 correct.

Recoverable with Cascading Rollback:

A single transaction failure can lead to a series of transaction rollbacks. Consider the following schedule
where none of the transactions has yet committed (so the schedule is recoverable).

If T10 fails, T11 and T12 must also be rolled back.

This can lead to the undoing of a significant amount of work.

Cascadeless schedules :

o Cascading rollbacks cannot occur; for each pair of transactions Ti and Tj such that Tj reads a data
item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
o Every cascadeless schedule is also recoverable
o It is desirable to restrict the schedules to those that are cascadeless
Example: Consider the following schedule.
o T2 reads data item A which is previously written by T1, and the commit operation of T1 appears
before the read operation of T2.
o In the same way, T3 reads data item A which is previously written by T2, and the commit
operation of T2 appears before the read operation of T3.
o This type of schedule is called as cascadeless schedule.

NOTE-
o Cascadeless schedule allows only committed read operations.
o However, it allows uncommitted write operations.
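A rough sketch that classifies a schedule as recoverable and/or cascadeless, given explicit commit actions;
the (transaction, action, item) encoding and the example schedule are assumptions for illustration.

def classify(schedule):
    last_writer = {}      # item -> transaction whose write is currently the latest
    committed_at = {}     # transaction -> position of its commit in the schedule
    dirty_reads = []      # (position, reader, writer) whenever Tj reads what Ti wrote
    for pos, (txn, action, item) in enumerate(schedule):
        if action == 'W':
            last_writer[item] = txn
        elif action == 'R':
            writer = last_writer.get(item)
            if writer is not None and writer != txn:
                dirty_reads.append((pos, txn, writer))
        elif action == 'C':
            committed_at[txn] = pos

    def commits_before(ti, tj):      # Ti must commit before Tj commits (vacuous if Tj never commits)
        if tj not in committed_at:
            return True
        return ti in committed_at and committed_at[ti] < committed_at[tj]

    recoverable = all(commits_before(w, r) for _, r, w in dirty_reads)
    cascadeless = all(w in committed_at and committed_at[w] < pos
                      for pos, _, w in dirty_reads)
    return recoverable, cascadeless

# T2 reads A written by T1; T1 commits before T2 commits, but after T2's read.
s = [(1, 'W', 'A'), (2, 'R', 'A'), (1, 'C', None), (2, 'C', None)]
print(classify(s))   # (True, False): recoverable but not cascadeless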

--------------------------------------------------------------X------------------------------------------------------------

5.6 Implementation of Isolation

o A database must provide a mechanism that will ensure that all possible schedules are either conflict
or view serializable, and are recoverable and preferably cascadeless.
o A policy (e.g.: using a lock) in which only one transaction can execute at a time generates serial
schedules, but provides a poor degree of concurrency
o Goal – to develop concurrency control protocols that will assure serializability.
o These protocols will impose a discipline that avoids nonserializable schedules.
o A common concurrency control protocol uses locks.
o While one transaction is accessing a data item, no other transaction can modify it.
o Require a transaction to lock the item before accessing it.

Lock-Based Protocols

o A lock is a mechanism to control concurrent access to a data item. Lock requests are made to the
concurrency-control manager, and a transaction can proceed only after its request is granted.
o Data items can be locked in two modes:
1. exclusive (X) mode. Data item can be both read and written. X-lock is requested using the
lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is requested using lock-S.
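A minimal sketch of the two lock modes and their compatibility; a real lock manager also queues waiting
requests, supports lock upgrades and detects deadlocks, all of which is omitted here.

class LockManager:
    def __init__(self):
        self.locks = {}   # item -> (mode 'S' or 'X', set of holding transactions)

    def lock_S(self, txn, item) -> bool:
        mode, holders = self.locks.get(item, ('S', set()))
        if mode == 'X' and holders and txn not in holders:
            return False                  # conflicts with another transaction's X-lock
        self.locks[item] = ('X' if mode == 'X' else 'S', holders | {txn})
        return True

    def lock_X(self, txn, item) -> bool:
        mode, holders = self.locks.get(item, ('S', set()))
        if holders and holders != {txn}:
            return False                  # any other holder blocks an exclusive lock
        self.locks[item] = ('X', {txn})
        return True

    def unlock_all(self, txn):
        for item, (mode, holders) in list(self.locks.items()):
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lm = LockManager()
print(lm.lock_S(1, 'A'))   # True  - shared locks are compatible with each other
print(lm.lock_S(2, 'A'))   # True
print(lm.lock_X(3, 'A'))   # False - the exclusive request must wait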

--------------------------------------------------------------X------------------------------------------------------------
5.7 Testing for Serializability

o The Serializability of a schedule is tested using a Precedence graph or Serialization graph.


o Consider some schedule of a set of transactions T1, T2, ..., Tn.
o Precedence graph: A directed graph where the vertices are transaction names. We draw an arc
from Ti to Tj if the two transactions conflict, and Ti accessed the data item before Tj.
o We may label the arc by the item that was accessed.

Example:

Schedule S Precedence Graph

• If the precedence graph contains an edge Ti → Tj, then in any equivalent serial schedule all the instructions
of Ti must appear before the first instruction of Tj.
• If the precedence graph for schedule S contains a cycle, then S is non-serializable. If the precedence
graph has no cycle, then S is serializable.
Example1:

Consider the following schedule S1.

Explanation:

Read(A): In T1, no subsequent writes to A, so no new edges


Read(B): In T2, no subsequent writes to B, so no new edges
Read(C): In T3, no subsequent writes to C, so no new edges
Write(B): B is subsequently read by T3, so add edge T2 → T3
Write(C): C is subsequently read by T1, so add edge T3 → T1
Write(A): A is subsequently read by T2, so add edge T1 → T2
Write(A): In T2, no subsequent reads to A, so no new edges
Write(C): In T1, no subsequent reads to C, so no new edges
Write(B): In T3, no subsequent reads to B, so no new edges

Precedence graph for schedule S1:

The precedence graph for schedule S1 contains a cycle, which is why schedule S1 is non-serializable.

Example2: Consider the following schedule S2.

Explanation:

Read(A): In T4,no subsequent writes to A, so no new edges


Read(C): In T4, no subsequent writes to C, so no new edges
Write(A): A is subsequently read by T5, so add edge T4 → T5
Read(B): In T5,no subsequent writes to B, so no new edges
Write(C): C is subsequently read by T6, so add edge T4 → T6
Write(B): B is subsequently read by T6, so add edge T5 → T6
Write(C): In T6, no subsequent reads to C, so no new edges
Write(A): In T5, no subsequent reads to A, so no new edges
Write(B): In T6, no subsequent reads to B, so no new edges

Precedence graph for schedule S2:

The precedence graph for schedule S2 contains no cycle, which is why schedule S2 is serializable.
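The whole test can be automated: build the precedence graph from the conflicting pairs and then look for a
cycle. A Python sketch follows, with operations again encoded as (transaction, action, item) tuples (an
assumption made for illustration).

def precedence_graph(schedule):
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and 'W' in (ai, aj):
                edges.add((ti, tj))      # Ti's conflicting operation comes first
    return edges

def has_cycle(edges) -> bool:
    nodes = {n for e in edges for n in e}
    graph = {n: {b for a, b in edges if a == n} for n in nodes}
    state = {}                           # node -> 'visiting' or 'done'

    def visit(n):
        if state.get(n) == 'visiting':
            return True                  # back edge found: cycle
        if state.get(n) == 'done':
            return False
        state[n] = 'visiting'
        if any(visit(m) for m in graph[n]):
            return True
        state[n] = 'done'
        return False

    return any(visit(n) for n in nodes)

def conflict_serializable(schedule) -> bool:
    return not has_cycle(precedence_graph(schedule))

# S1 from Example 1 of section 5.4: conflict serializable (only edge T1 -> T2)
s1 = [(1, 'R', 'A'), (1, 'W', 'A'), (2, 'R', 'A'), (2, 'W', 'A'),
      (1, 'R', 'B'), (1, 'W', 'B'), (2, 'R', 'B'), (2, 'W', 'B')]
print(conflict_serializable(s1))   # True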

--------------------------------------------------------------X------------------------------------------------------------

5.8 Failure Classification

To find that where the problem has occurred, we generalize a failure into the following categories:
1. Transaction failure
2. System crash
3. Disk failure
1. Transaction failure
A transaction failure occurs when a transaction fails to execute or reaches a point from which it cannot
go any further. If a single transaction or process is affected, this is called a transaction failure.
Reasons for a transaction failure could be -
1. Logical errors: if a transaction cannot complete due to some code error or an internal error
condition, a logical error occurs.
2. System errors: the DBMS itself terminates an active transaction because the database
system is not able to execute it. For example, the system aborts an active
transaction in case of deadlock or resource unavailability.
2. System Crash
o System failure can occur due to power failure or other hardware or software
failure. Example: Operating system error.
Fail-stop assumption: In the system crash, non-volatile storage is assumed not to be
corrupted.
3. Disk Failure
o It occurs when hard-disk drives or storage drives fail. This was a common
problem in the early days of technology evolution.
o Disk failure occurs due to the formation of bad sectors, a disk head crash, unreachability of
the disk, or any other failure that destroys all or part of disk storage.
--------------------------------------------------------------X------------------------------------------------------------
5.9 Storage, Recovery and Atomicity

Storage Structure

1. Volatile storage: does not survive system crashes


examples : main memory, cache memory
2. Nonvolatile storage: survives system crashes
examples: disk, tape, flash memory, non-volatile (battery backed up) RAM
3. Stable storage: a mythical form of storage that survives all failures; it is approximated by maintaining
multiple copies on distinct nonvolatile media.

Recovery and Atomicity:

• Modifying the database without ensuring that the transaction will commit may leave the database in
an inconsistent state
• Consider transaction Ti that transfers $50 from account A to account B; the goal is either to perform all
database modifications made by Ti or none at all.
• Several output operations may be required for Ti (to output A and B). A failure may occur after
one of these modifications has been made but before all of them are made.
• To ensure atomicity despite failures, we first output information describing the modifications to
stable storage without modifying the database itself.
• We study two approaches: log-based recovery , and shadow-paging
• We assume (initially) that transactions run serially, that is, one after the other.

--------------------------------------------------------------X------------------------------------------------------------

5.10 Recovery algorithm.

• Consider transaction Ti that transfers $50 from account A to account B.


o Two updates: subtract 50 from A and add 50 to B
• Transaction Ti requires updates to A and B to be output to the database.
• A failure may occur after one of these modifications has been made but before both of them are made.
• Modifying the database without ensuring that the transaction will commit may leave the database in an
inconsistent state
• Not modifying the database may result in lost updates if failure occurs just after transaction commits
Recovery algorithms have two parts
1. Actions taken during normal transaction processing to ensure enough information exists to recover
from failures.
2. Actions taken after a failure to recover the database contents to a state that ensures atomicity,
consistency and durability. A simplified sketch of both parts is given below.
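A highly simplified Python sketch of both parts, in the spirit of log-based recovery: every update is appended
to a log before it is applied (part 1), and after a failure committed transactions are redone and uncommitted
ones are undone (part 2). The record format, item names and values are illustrative assumptions.

log = []                 # ('UPDATE', txn, item, old_value, new_value) or ('COMMIT', txn)
db = {'A': 1000, 'B': 2000}

def write(txn, item, new_value):
    log.append(('UPDATE', txn, item, db[item], new_value))   # part 1: log before updating
    db[item] = new_value

def commit(txn):
    log.append(('COMMIT', txn))

def recover():                                               # part 2: run after a failure
    committed = {rec[1] for rec in log if rec[0] == 'COMMIT'}
    for rec in log:                                          # redo every logged update in order
        if rec[0] == 'UPDATE':
            db[rec[2]] = rec[4]
    for rec in reversed(log):                                # undo uncommitted updates, newest first
        if rec[0] == 'UPDATE' and rec[1] not in committed:
            db[rec[2]] = rec[3]

# T1 transfers 50 from A to B and commits; T2 updates A but never commits (crash).
write(1, 'A', 950); write(1, 'B', 2050); commit(1)
write(2, 'A', 900)
recover()
print(db)                # {'A': 950, 'B': 2050}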

--------------------------------------------------------------X------------------------------------------------------------

Part-2 Indexing Techniques:

5.11 B+ Trees: Search, Insert, Delete algorithms


Refer to the separate PDF for the full search, insert and delete algorithms.
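As a minimal stand-in, the sketch below shows only the search step on a simplified B+ tree; the node layout
(separator keys plus child pointers in internal nodes, keys plus data entries in leaves) and the example tree
are assumptions made for illustration.

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys            # separator keys (internal) or entry keys (leaf)
        self.children = children    # child pointers, len(keys) + 1 of them (internal nodes only)
        self.records = records      # data entries aligned with keys (leaf nodes only)

def search(node, key):
    while node.children is not None:                 # descend from the root to the leaf level
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1                                   # pointer right of keys[i] covers values >= keys[i]
        node = node.children[i]
    for k, rec in zip(node.keys, node.records):      # scan the leaf page
        if k == key:
            return rec
    return None

leaf1 = Node(keys=[10, 20], records=['r10', 'r20'])
leaf2 = Node(keys=[30, 40], records=['r30', 'r40'])
root  = Node(keys=[30], children=[leaf1, leaf2])
print(search(root, 30))   # 'r30'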

--------------------------------------------------------------X------------------------------------------------------------

5.12File Organization and Indexing


• DBMS stores vast quantities of data. Data is stored on external storage devices and fetched into main
memory as needed for processing. Page is unit of information read from or written to disk. (in DBMS, a
page may have size 8KB or more).
• Data on external storage devices :
o Disks: can retrieve a random page at fixed cost, but reading several consecutive pages is
much cheaper than reading them in random order.
o Tapes: can only read pages in sequence; cheaper than disks, used for archival storage.
• Cost of page I/O dominates the cost of typical database operations.
• File organization:
o A method of arranging a file of records on external storage. A record id (rid) is sufficient to
physically locate a record.
• Indexes :
o Indexes are data structures that allow to find record ids of records with given values in index
search key fields
• Heap (random order) files: Suitable when typical access is a file scan retrieving all records.
• Sorted Files: Best if records must be retrieved in some order, or only a `range’ of records is needed.
• Indexes: Data structures to organize records to optimize certain kinds of retrieval operations.
o Speed up searches for a subset of records, based on values in certain (“search key”) fields
o Updates are much faster than in sorted files.
Alternatives for Data Entry k* in an Index
• Data entry: a record stored in the index file.
• Given a search key value k, the index provides for efficient retrieval of all data entries k* with value k.
• In a data entry k*, alternatives include that we can store:
o alternative 1: Full data record with key value k, or
o alternative 2: <k, rid of data record with search key value k>, or
o alternative 3: <k, list of rids of data records with search key k>
• Choice of above 3 alternative data entries is orthogonal to indexing technique used to locate data entries.
• Example indexing techniques: B+ trees, hash-based structures, etc.

--------------------------------------------------------------X------------------------------------------------------------

5.13 Cluster Indexes, Primary and Secondary Indexes

Index Classification
• Primary vs. secondary index:
o If search key contains primary key, then called primary index.
• Clustered vs. unclustered index :
o If order of data records is the same as, or `close to’, order of data entries, then called
clustered index.

Primary indexing:
• Primary indexing is defined mainly on the primary key of the data-file, in which the data-file
is already ordered based on the primary key.
• Primary Index is an ordered file whose records are of fixed length with two fields. The first
field of the index replicates the primary key of the data file in an ordered manner, and the
second field of the ordered file contains a pointer that points to the data-block where a record
containing the key is available.
• The first record of each block is called the Anchor record or Block anchor. There exists a
record in the primary index file for every block of the data-file.
• The average number of block accesses using the primary index is log2(B) + 1, where B is the number
of index blocks.
• Example:
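As a small illustrative example (the anchor keys and block numbers are assumed values), the sparse primary
index can be searched with a binary search over its block anchors followed by one access to the data block
itself, which matches the log2(B) + 1 estimate above.

from bisect import bisect_right

# (anchor key of the first record in the block, block number) -- illustrative values
index = [(100, 0), (200, 1), (300, 2), (400, 3)]

def find_block(key):
    anchors = [k for k, _ in index]
    pos = bisect_right(anchors, key) - 1       # last block anchor <= key
    return index[pos][1] if pos >= 0 else None

print(find_block(250))   # block 1, which holds keys 200..299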

Secondary Indexing:
• A secondary index is an index that is not a primary index and may have duplicates.
• Example:
• Consider a database containing a list of students at a college, each of whom has a unique student ID
number. A typical database would use the student ID number as the key; however, one might also
reasonably want to be able to look up students by last name. We can then construct a secondary
index in which the secondary key is the last name.
• Any number of secondary indexes may be associated with a given primary database, up to
limitations on available memory and the number of open file descriptors.

Clustered indexing:

A clustered index is the type of index that establishes the physical sort order of rows. Suppose
you have a table Student_info which contains ROLL_NO as its primary key; then the clustered index,
which is created automatically on that primary key, will sort the Student_info table by ROLL_NO.
A clustered index is like a dictionary: the sorting order is alphabetical and there is no separate
index page.
Examples:
Input:
CREATE TABLE Student_info
(
ROLL_NO int(10) primary key,
NAME varchar(20),
DEPARTMENT varchar(20)
);
insert into Student_info values(1410110405, 'H Agarwal', 'CSE');
insert into Student_info values(1410110404, 'S Samadder', 'CSE');
insert into Student_info values(1410110403, 'MD Irfan', 'CSE');

SELECT * FROM Student_info


Output:
ROLL_NO NAME DEPARTMENT
1410110403 MD Irfan CSE
1410110404 S Samadder CSE
1410110405 H Agarwal CSE

Unclustered index:

A non-clustered index is an index structure that is kept separate from the data stored in a table and that orders
one or more selected columns. A non-clustered index is created to improve the performance of
frequently used queries that are not covered by the clustered index. It is like a textbook: the index pages are
created separately at the beginning of the book.
Examples:
Input:
CREATE TABLE Student_info
(
ROLL_NO int(10),
NAME varchar(20),
DEPARTMENT varchar(20)
);
insert into Student_info values(1410110405, 'H Agarwal', 'CSE');
insert into Student_info values(1410110404, 'S Samadder', 'CSE');
insert into Student_info values(1410110403, 'MD Irfan', 'CSE');

SELECT * FROM Student_info


Output:
ROLL_NO NAME DEPARTMENT
1410110405 H Agarwal CSE
1410110404 S Samadder CSE
1410110403 MD Irfan CSE
Note: We can create one or more non-clustered indexes on a table.
Syntax:
//Create Non-Clustered index
create NonClustered index IX_table_name_column_name
on table_name (column_name ASC)
Table: Student_info
ROLL_NO NAME DEPARTMENT
1410110405 H Agarwal CSE
1410110404 S Samadder CSE
1410110403 MD Irfan CSE
Input:
create NonClustered index IX_Student_info_NAME on Student_info (NAME ASC)
Output:
Index
NAME ROW_ADDRESS
H Agarwal 1
MD Irfan 3
S Samadder 2
Clustered vs Non-Clustered index:
• A table can have only one clustered index, but one or more non-clustered indexes.
• A clustered index needs no separate index storage, whereas a non-clustered index is stored separately
from the data.
• Lookups through a non-clustered index are generally slower than through the clustered index, because the
non-clustered index requires an extra lookup to reach the actual row.

--------------------------------------------------------------X------------------------------------------------------------

5.14 Index Data Structures: Hash Based Indexing, Tree Based Indexing

1. Hash based indexing

2. Tree based indexing

Hash based indexing:

• Organizes the records using a technique called hashing to quickly find records that have a given
search key value.
• In this approach, the records in a file are grouped in buckets, where a bucket consists of a primary
page and, possibly, additional pages linked in a chain. The bucket to which a record belongs can be
determined by applying a special function, called a hash function, to the search key. Given a bucket
number, a hash-based index structure allows us to retrieve the primary page for the bucket in one or
two disk I/Os.
• On inserts, the record is inserted into the appropriate bucket, with 'overflow' pages allocated as
necessary.
• To search for a record with a given search key value, we apply the hash function to identify the
bucket to which such records belong and look at all pages in that bucket.
Example:
• The data is stored in a file that is hashed on age;
• The data entries in this first index file are the actual data records.
• Applying the hash function to the age field identifies the page that the record belongs to.
• The hash function h converts the search key value to its binary representation and uses the two
least significant bits as the bucket identifier.
• The figure also shows an index with search key sal that contains (sal, rid) pairs as data entries.
• The rid component of a data entry in this second index is a pointer to a record with search key value
sal.
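A toy Python sketch of this idea: the hash function keeps the two least significant bits of the key, giving one
of four buckets. The records, field names and bucket layout are illustrative assumptions, and overflow pages
are ignored.

def h(age: int) -> int:
    return age & 0b11              # two least significant bits -> bucket 0..3

buckets = {0: [], 1: [], 2: [], 3: []}

def insert(record):
    buckets[h(record['age'])].append(record)   # overflow chains omitted in this sketch

def search(age):
    return [r for r in buckets[h(age)] if r['age'] == age]

insert({'name': 'Basu', 'age': 33, 'sal': 4003})
insert({'name': 'Jones', 'age': 40, 'sal': 6003})
print(search(33))    # looks only at bucket h(33) = 1 instead of scanning the whole file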

Tree based indexing:

• The data entries are arranged in sorted order by search key value, and a hierarchical search data
structure is maintained that directs searches to the correct page of data entries.

Example: Employee records organized in a tree-structured index with search key


• Each node (e.g., nodes labeled A, B, L1, L2) is a physical page, and retrieving a node involves a
disk I/O.
• The lowest level of the tree, called the leaf level, contains the data entries
• Example, these are employee records.
• Additional records with age less than 22 would appear in leaf pages to the left of page L1, and records
with age greater than 50 would appear in leaf pages to the right of page L3.
• All searches begin at the topmost node, called the root, and the contents of pages in non-leaf levels
direct searches to the correct leaf page.
• Non-leaf pages contain node pointers separated by search key values.
• The node pointer to the left of a key value k points to a subtree that contains only data entries less
than k.
• The node pointer to the right of a key value k points to a subtree that contains only data entries
greater than or equal to k.
• All leaf pages are maintained in a doubly-linked list
• Thus, the number of disk I/Os incurred during a search is equal to the length of a path from the root
to a leaf, plus the number of leaf pages with qualifying data entries.
• The B+ tree is an index structure that ensures that all paths from the root to a leaf in a given tree are
of the same length, that is, the structure is always balanced in height

--------------------------------------------------------------X------------------------------------------------------------

5.15 Comparison of File Organizations

• Assume that the files and indexes are organized according to the composite search key (age, sal)

1. Heap files (random order; insert at eof)


2. Sorted files, sorted on <age, sal>
3. Clustered B+ tree file, search key <age, sal>
4. Heap file with unclustered B+ tree index on search key <age, sal>
5. Heap file with unclustered hash index on search key <age, sal>
Operations:
• Scan:
– Fetch all records in the file.
– The pages in the file must be fetched from disk into the buffer pool. There is also a CPU
overhead per record for locating the record on the page.
• Equality search (e.g., “age = 30”)
– Fetch all records that satisfy an equality selection;
– Pages that contain qualifying records must be fetched from disk, and qualifying records
must be located within retrieved pages.
• Range selection (e.g., “age > 30”)
– Fetch all records that satisfy a range selection.
• Insert a record
– Insert a given record into the file.
– Need to identify the page in the file into which the new record must be inserted, fetch that
page from disk, modify it to include the new record, and then write back the modified page.
• Delete a record
– Delete a record that is specified using its rid.
– Need to identify the page that contains the record, fetch it from disk, modify it, and write it
back
Cost Model:
– B: the number of data pages
– R: the number of records per page
– D: (average) time to read or write a disk page
– C: average time to process a record
– H: time to apply the hash function to a record
– F: fan-out of the tree index
1. Heap Files
• Scan
– cost is B(D + RC)
– Because need to retrieve each of B pages taking time D per page, and for each page, process
R records taking time C per record.
• Equality search
– cost is 0.5B(D + RC) – on average we need to scan half the file.
– If no matching record is found, the entire file must be scanned, i.e. B(D + RC).
• Range selection
– cost is B(D + RC).
– The entire file must be scanned because qualifying records could appear anywhere in the file
• Insert
– 2D + C.
– Records are always inserted at the end of the file
– i.e First fetch the last page in the file, add the record, and write the page back.
• Delete
– cost of searching plus C + D.
– First find the record, remove the record from the page, and write the modified page back.
2. Sorted Files
• Scan
– cost is B(D + RC)
– All pages must be examined.
• Equality search
– Assume the selection matches the composite key
– the cost of locating the first such record (D log2(B) + C log2(R)) plus the cost of reading all the
qualifying records in sequential order
• Range selection
– Assume the range selection matches the composite key
– D log2(B) + the number of matching pages
• Insert
– first find the correct position in the file, add the record, and then fetch and rewrite all
subsequent pages
– search cost plus B(D + RC)
– Assume that the inserted record belongs in the middle of the file. Therefore, we must read
the latter half of the file and then write it back after adding the new record.
• Delete
– search cost plus B(D + RC).
3. Clustered Files
• the number of physical data pages is about 1.5B
• Scan
– The cost of a scan is 1.5B(D + RC)
– All pages must be examined.
• Equality search
– D logF(1.5B) + C log2(R) plus the cost of reading all the qualifying records in sequential order
• Range selection
– Assume the range selection matches the composite key
– D logF(1.5B) + the number of matching pages
• Insert
– first find the correct leaf page in the index, reading every page from root to leaf, and then add
the new record.
– Usually the leaf page has sufficient space for the new record, and all we need to do is write out
the modified leaf page.
– Occasionally the leaf is full and we need to retrieve and modify other pages as well.
– Cost: D logF(1.5B) + C log2(R) + D.
• Delete
– search for the record, remove the record from the page, and write the modified page back;
the cost is the same as for insert.
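A small Python sketch that plugs illustrative parameter values into the cost formulas above, for scan and
equality search on the three main organizations (all parameter values, including the fan-out F, are
assumptions):

from math import log2, log

B, R = 10_000, 100          # data pages and records per page (assumed)
D, C = 15e-3, 1e-7          # seconds per page I/O and per record (assumed)
F = 100                     # assumed fan-out of the clustered tree file

costs = {
    "heap":      {"scan": B * (D + R * C),
                  "equality": 0.5 * B * (D + R * C)},
    "sorted":    {"scan": B * (D + R * C),
                  "equality": D * log2(B) + C * log2(R)},
    "clustered": {"scan": 1.5 * B * (D + R * C),
                  "equality": D * log(1.5 * B, F) + C * log2(R)},
}

for org, ops in costs.items():
    print(org, {name: round(value, 4) for name, value in ops.items()})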

--------------------------------------------------------------X------------------------------------------------------------

5.16 Indexes and Performance Tuning

Effective indexes are one of the best ways to improve performance in a database application. Without an
index, the SQL Server engine is like a reader trying to find a word in a book by examining each page. By
using the index in the back of a book, a reader can complete the task in a much shorter time. In database
terms, a table scan happens when there is no index available to help a query. In a table scan SQL
Server examines every row in the table to satisfy the query results. Table scans are sometimes unavoidable,
but on large tables, scans have a severe impact on performance.
One of the most important jobs for the database is finding the best index to use when generating an
execution plan. Most major databases ship with tools to show you execution plans for a query and help in
optimizing and tuning indexes. This article outlines several good rules of thumb to apply when creating and
modifying indexes for your database. First, let’s cover the scenarios where indexes help performance, and
when indexes can hurt performance.
Useful Index Queries
Just like the reader searching for a word in a book, an index helps when you are looking for a specific
record or set of records with a WHERE clause. This includes queries looking for a range of values, queries
designed to match a specific value, and queries performing a join on two tables. For example, both of the
queries against the Northwind database below will benefit from an index on the UnitPrice column.
DELETE FROM Products WHERE UnitPrice = 1

SELECT * FROM PRODUCTS


WHERE UnitPrice BETWEEN 14 AND 16
Since index entries are stored in sorted order, indexes also help when processing ORDER BY clauses.
Without an index the database has to load the records and sort them during execution. An index on
UnitPrice will allow the database to process the following query by simply scanning the index and fetching
rows as they are referenced. To order the records in descending order, the database can simply scan the
index in reverse.
SELECT * FROM Products ORDER BY UnitPrice ASC
Grouping records with a GROUP BY clause will often require sorting, so a UnitPrice index will also help
the following query to count the number of products at each price.
SELECT Count(*), UnitPrice FROM Products
GROUP BY UnitPrice
By retrieving the records in sorted order through the UnitPrice index, the database sees matching prices
appear in consecutive index entries, and can easily keep a count of products at each price. Indexes are also
useful for maintaining unique values in a column, since the database can easily search the index to see if an
incoming value already exists. Primary keys are always indexed for this reason.

Index Drawbacks
Indexes are a performance drag when the time comes to modify records. Any time a query modifies the
data in a table the indexes on the data must change also. Achieving the right number of indexes will require
testing and monitoring of your database to see where the best balance lies. Static systems, where databases
are used heavily for reporting, can afford more indexes to support the read only queries. A database with a
heavy number of transactions to modify data will need fewer indexes to allow for higher throughput.
Indexes also use disk space. The exact size will depend on the number of records in the table as well as the
number and size of the columns in the index. Generally this is not a major concern as disk space is easy to
trade for better performance.

Building The Best Index


There are a number of guidelines to building the most effective indexes for your application. From the
columns you select to the data values inside them, consider the following points when selecting the indexes
for your tables.
Short Keys
Having a short index key is beneficial for two reasons. First, database work is inherently disk intensive. Larger
index keys will cause the database to perform more disk reads, which limits throughput. Secondly, since
index entries are often involved in comparisons, smaller entries are easier to compare. A single integer
column makes the absolute best index key because an integer is small and easy for the database to
compare. Character strings, on the other hand, require a character by character comparison and attention to
collation settings.

Distinct Keys
The most effective indexes are the indexes with a small percentage of duplicated values. As an analogy,
think of a phone book for a town where almost everyone has the last name of Smith. A phone book in this
town is not very useful if sorted in order of last name, because you can only discount a small number of
records when you are looking for a Smith.
An index with a high percentage of unique values is a selective index. Obviously, a unique index is highly
selective since there are no duplicate entries. Many databases will track statistics about each index so they
know how selective each index is. The database uses these statistics when generating an execution plan for
a query.

Covering Queries
Indexes generally contain only the data values for the columns they index and a pointer back to the row
with the rest of the data. This is similar to the index in a book: the index contains only the key word and
then a page reference you can turn to for the rest of the information. Generally the database will have to
follow pointers from an index back to a row to gather all the information required for a query. However, if
the index contains all of the columns needed for a query, the database can save a disk read by not returning
to the table for more information.
Take the index on UnitPrice we discussed earlier. The database could use just the index entries to satisfy
the following query.
SELECT Count(*), UnitPrice FROM Products
GROUP BY UnitPrice
We call these types of queries covered queries, because all of the columns requested in the output are
covered by a single index. For your most crucial queries, you might consider creating a covering index to
give the query the best performance possible. Such an index would probably be a composite index (using
more than one column), which appears to go against our first guideline of keeping index entries as short as
possible. Obviously this is another tradeoff you can only evaluate with performance testing and monitoring.

Clustered Indexes
Many databases have one special index per table where all of the data from a row exists in the index. SQL
Server calls this index a clustered index. Instead of an index at the back of a book, a clustered index is
closer in similarity to a phone book because each index entry contains all the information you need, there
are no references to follow to pick up additional data values.
As a general rule of thumb, every non-trivial table should have a clustered index. If you only create one
index for a table, make the index a clustered index. In SQL Server, creating a primary key will
automatically create a clustered index (if none exists) using the primary key column as the index key.
Clustered indexes are the most effective indexes (when used, they always cover a query), and in many
database systems they help the database efficiently manage the space required to store the table.
When choosing the column or columns for a clustered index, be careful to choose a column with static
data. If you modify a record and change the value of a column in a clustered index, the database might need
to move the index entry (to keep the entries in sorted order). Remember, index entries for a clustered index
contain all of the column values, so moving an entry is comparable to executing a DELETE statement
followed by an INSERT, which can obviously cause performance problems if done often. For this reason,
clustered indexes are often found on primary or foreign key columns. Key values will rarely, if ever,
change.
