RDBMS - Unit Iv


RELATIONAL DATABASE MANAGEMENT SYSTEM UNIT-IV

UNIT- IV: STORAGE AND FILE ORGANIZATION:


Disks - RAID -Tertiary storage - Storage Access -File Organization – organization
of files - Data Dictionary storage.
------------------------------------------------------------------------------------------------------------------------------------------

STORAGE AND FILE ORGANIZATION

4.1 Storage System in DBMS

A database system presents users with a high-level view of the stored data. Underneath, however,
the data is held as bits and bytes on different storage devices.

In this section, we will take an overview of various types of storage devices that are used for
accessing and storing data.

Types of Data Storage

Several types of storage are available for holding data. They differ from one another in speed
and accessibility. The following types of storage devices are used for storing data:

o Primary Storage
o Secondary Storage
o Tertiary Storage

RASC- DEPARTMENT COMPUTER SCIENCE AND APPLICATION-N.MANIMOZHI/AP 1



1) Primary Storage

Primary storage is the area that offers the quickest access to stored data. It is also known as
volatile storage, because this type of memory does not store data permanently: as soon as the
system suffers a power cut or a crash, the data is lost. Main memory and cache are the types of
primary storage.

 o Main Memory: It holds the data that is currently being operated on. Every instruction
 of the machine works on data in main memory. This type of memory can store gigabytes
 of data on a system, but it is generally too small to hold an entire database. Main
 memory loses its whole content if the system shuts down because of a power failure or
 for other reasons.
 o Cache: It is the fastest storage medium, and also one of the costliest. A cache is small
 and is usually managed by the computer hardware. When designing algorithms, data
 structures and query processors, designers take cache effects into account.

2) Secondary Storage

Secondary storage is also called online storage. It is the storage area that allows the user to
store data permanently. This type of memory does not lose data because of a power failure or
system crash, which is why it is also called non-volatile storage.

Some commonly used secondary storage media, available in almost every type of computer
system, are:

o Flash Memory: Flash memory is the storage used in devices such as USB (Universal Serial
Bus) keys, which plug into the USB slots of a computer system. USB keys help transfer
data between computer systems and come in a range of capacities. Unlike main memory,
flash memory keeps its stored data through a power cut or other failure. Flash memory is
also commonly used in server systems for caching frequently used data. This improves
performance, and flash can hold larger amounts of data than main memory.

o Magnetic Disk Storage:

 o Primary medium for long-term storage.
 o Typically, the entire database is stored on disk.
 o Data must be moved from disk to main memory in order for the data to be operated
 on.
 o After operations are performed, data must be copied back to disk if any changes
 were made.
 o Disk storage is called direct-access storage, as it is possible to read data on the disk
 in any order (unlike sequential access).
 o Disk storage usually survives power failures and system crashes.

Figure: Structure of magnetic disk

Access time: the time it takes from when a read or write request is issued to when data
transfer begins.

Data-transfer rate: the rate at which data can be retrieved from or stored to the disk.

Mean time to failure (MTTF): the average time the disk is expected to run continuously
without any failure.

3) Tertiary Storage

Tertiary storage is external to the computer system. It is the slowest type of storage, but it
can hold a large amount of data. It is also known as offline storage and is generally used for
data backup. The following tertiary storage devices are available:

o Optical Storage: Optical media can store megabytes or gigabytes of data. A Compact
Disk (CD) can store 700 megabytes of data, with a playtime of around 80 minutes. A
Digital Video Disk (DVD) can store 4.7 or 8.5 gigabytes of data on each side of the
disk.


o Tape Storage: Tape is a cheaper storage medium than disk. Generally, tapes are used for
archiving or backing up data. Access is slow because data must be read sequentially from
the start of the tape. Thus, tape storage is known as sequential-access storage, whereas
disk storage is known as direct-access storage because data can be read directly from any
location on the disk.

Storage Access

o A database file is partitioned into fixed-length storage units called blocks (or pages).
Blocks/pages are units of both storage allocation and data transfer.
o The database system seeks to minimize the number of block transfers between disk and
main memory. Transfers can be reduced by keeping as many blocks as possible in
main memory.
o Buffer Pool: Portion of main memory available to store copies of disk blocks.
o Buffer Manager: System component responsible for allocating and managing buffer
space in main memory.

Buffer Manager

A program calls on the buffer manager when it needs a block from disk:

• The requesting program is given the address of the block in main memory, if it is already
present in the buffer.

• If the block is not in the buffer, the buffer manager allocates space in the buffer for the
block, replacing (throwing out) some other blocks, if necessary to make space for new blocks.

• The block that is thrown out is written back to the disk only if it was modified since the most
recent time that it was written to/fetched from the disk.

• Once space is allocated in the buffer, the buffer manager reads in the block from the disk to the
buffer, and passes the address of the block in the main memory to the requesting program.
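
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular DBMS implements its buffer manager: the class name BufferManager, the dictionary standing in for the disk, and the mark_dirty call are all assumptions made for the example, and LRU is used as the replacement policy.

```python
from collections import OrderedDict


class BufferManager:
    """Toy buffer manager (hypothetical): LRU replacement, write-back on eviction."""

    def __init__(self, disk, capacity):
        self.disk = disk            # assumed stand-in for the disk: block_id -> bytes
        self.capacity = capacity    # number of blocks the buffer pool can hold
        self.pool = OrderedDict()   # block_id -> bytearray, least recently used first
        self.dirty = set()          # blocks modified since they were fetched
        self.disk_reads = 0
        self.disk_writes = 0

    def get_block(self, block_id):
        if block_id in self.pool:                  # already in the buffer
            self.pool.move_to_end(block_id)        # now the most recently used
            return self.pool[block_id]
        if len(self.pool) >= self.capacity:        # throw out some other block
            victim, data = self.pool.popitem(last=False)
            if victim in self.dirty:               # write back only if modified
                self.disk[victim] = bytes(data)
                self.disk_writes += 1
                self.dirty.discard(victim)
        self.pool[block_id] = bytearray(self.disk[block_id])   # read from disk
        self.disk_reads += 1
        return self.pool[block_id]

    def mark_dirty(self, block_id):
        self.dirty.add(block_id)


disk = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
bm = BufferManager(disk, capacity=2)
block = bm.get_block(0)
block[0:1] = b"X"        # modify block 0 in the buffer
bm.mark_dirty(0)
bm.get_block(1)
bm.get_block(2)          # buffer full: block 0 is evicted and written back
bm.get_block(0)          # block 1 is evicted without a write, since it is clean
```

In this sketch the returned bytearray plays the role of the address of the block in main memory.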

Buffer Replacement Policies

• Most operating systems replace the block least recently used (LRU strategy).

• LRU: use the past pattern of block references as a predictor of future references.

• Queries have well-defined access patterns (such as sequential scans), and a database system
can use the information in a user's query to predict future references.

• LRU can be a bad strategy for certain access patterns involving repeated sequential scans of
data files.

• A mixed strategy, with hints on replacement provided by the query optimizer, is preferable
(based on the query processing algorithm(s) used).

• Pinned block: memory block that is not allowed to be written back to disk

• Toss immediate strategy: frees the space occupied by a block as soon as the final record (tuple)
of that block has been processed.

• Most recently used strategy (MRU): system must pin the block currently being processed.
After the final tuple of that block has been processed, the block is unpinned, and it becomes the
most recently used block.

• The buffer manager can use statistical information regarding the probability that a request
will reference a particular relation. For example, the data dictionary is frequently accessed,
so its blocks are kept in the main memory buffer.
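
To see why LRU can perform badly on repeated sequential scans, the following sketch counts buffer misses for LRU and MRU replacement. It is a simplified simulation under stated assumptions: the function name count_misses is hypothetical, pinning is not modelled, and the workload is a file of 4 blocks scanned three times with a buffer of 3 blocks.

```python
def count_misses(references, capacity, policy):
    """Count buffer misses for a reference string under 'LRU' or 'MRU' replacement."""
    buffer = []          # ordered from least to most recently used
    misses = 0
    for block in references:
        if block in buffer:
            buffer.remove(block)          # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(buffer) == capacity:
                if policy == "LRU":
                    buffer.pop(0)         # evict the least recently used block
                else:
                    buffer.pop()          # MRU: evict the most recently used block
        buffer.append(block)
    return misses


scan = list(range(4)) * 3                 # blocks 0..3 scanned three times
lru_misses = count_misses(scan, capacity=3, policy="LRU")
mru_misses = count_misses(scan, capacity=3, policy="MRU")
```

With this workload every reference misses under LRU (12 misses), because each block is evicted just before the scan comes back to it, while MRU incurs 6 misses.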

Storage Hierarchy

Besides the media described above, various other storage devices reside in a computer system.
Storage media are classified by the speed of data access, the cost per unit of data to buy the
medium, and the medium's reliability.

Arranging the above-described storage media according to speed and cost gives a hierarchy of
storage media.


In this hierarchy, the higher levels are expensive but fast. Moving down the hierarchy, the cost
per bit decreases and the access time increases. The storage media from main memory upward are
volatile, while all the media below main memory are non-volatile.

4.2 RAID (Redundant Array of Independent Disks)

RAID stands for Redundant Array of Independent Disks. It is a technology that connects multiple
secondary storage devices for increased performance, data redundancy, or both. Depending on the
RAID level used, it gives the ability to survive one or more drive failures.

RAID combines multiple small, inexpensive disk drives into an array of disk drives that yields
better performance than a Single Large Expensive Drive (SLED). RAID is also called Redundant
Array of Inexpensive Disks.

Storing the same data on different disks increases fault tolerance.

The Mean Time Between Failure (MTBF) of the array = MTBF of an individual drive divided by the
number of drives in the array. For this reason, the MTBF of an array of drives without
redundancy is too low for many application requirements.
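
As a worked example of the formula above, the following sketch computes the MTBF of an array. The figures (a drive MTBF of 100,000 hours and an array of 100 drives) are assumed purely for illustration, and array_mtbf_hours is a hypothetical helper name.

```python
def array_mtbf_hours(drive_mtbf_hours, num_drives):
    """MTBF of the array = MTBF of one drive / number of drives."""
    return drive_mtbf_hours / num_drives


mtbf_hours = array_mtbf_hours(100_000, 100)   # assumed figures
mtbf_days = mtbf_hours / 24
```

Under these assumptions the array is expected to see some drive fail every 1000 hours, roughly every 42 days, even though a single drive would be expected to run for more than 11 years.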

4.2.1 Types of RAID

RAID 0
In this level, a striped array of disks is implemented. The data is broken down into blocks and
the blocks are distributed among the disks. Each disk receives a block of data to write or read
in parallel, which enhances the speed and performance of the storage device. There is no parity
and no backup in level 0.
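
A minimal sketch of block-level striping is shown below. It assumes a simple round-robin placement, which is one common way to distribute blocks; the function name stripe_location and the choice of 4 disks are assumptions for the example.

```python
def stripe_location(logical_block, num_disks):
    """Return (disk index, block offset on that disk) under round-robin striping."""
    return logical_block % num_disks, logical_block // num_disks


# Where the first 8 logical blocks land on a 4-disk array.
layout = [stripe_location(b, 4) for b in range(8)]
```

Logical blocks 0 to 3 land on disks 0 to 3 at offset 0, and blocks 4 to 7 on the same disks at offset 1, so four consecutive blocks can be read or written in parallel.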


RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it sends a copy of data
to all the disks in the array. RAID level 1 is also called mirroring and provides 100% redundancy
in case of a failure.

RAID 2
RAID 2 records an Error Correction Code (ECC), using a Hamming code, for its data, which is
striped across different disks. Each data bit in a word is recorded on a separate disk, and the
ECC codes of the data words are stored on a different set of disks. Due to its complex structure
and high cost, RAID 2 is not commercially available.

RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each data word is
stored on a different disk. This technique allows the array to survive a single disk failure.
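
The way parity allows recovery from a single disk failure can be illustrated with XOR parity, the usual choice for parity-based RAID levels. The sketch below is an illustration only: the byte values are made up, and xor_blocks is a hypothetical helper.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]   # contents of three data disks
parity = xor_blocks(data)                        # stored on the parity disk

# Suppose the second data disk fails: XOR the survivors with the parity.
recovered = xor_blocks([data[0], data[2], parity])
```

Because XOR-ing a value with itself cancels out, combining the surviving data blocks with the parity block yields exactly the lost block.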

RAID 4
In this level, an entire block of data is written onto the data disks, and then the parity is
generated and stored on a different disk. Note that level 3 uses byte-level striping, whereas
level 4 uses block-level striping. Both level 3 and level 4 require at least three disks to
implement RAID.

RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity generated for each stripe
of data blocks is distributed among all the disks rather than stored on a dedicated parity
disk.

RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are generated and stored
in distributed fashion among multiple disks. Two parities provide additional fault tolerance. This
level requires at least four disk drives to implement RAID.


Levels Summary

RAID-0: The fastest and most efficient array type, but it offers no fault tolerance.

RAID-1: The array of choice for a critical, fault-tolerant environment.

RAID-2: Not used today, because ECC is embedded in almost all modern disk drives.

RAID-3: Used in single-user environments that access long sequential records, to speed up
data transfer.

RAID-4: Offers no advantages over RAID-5 and does not support multiple simultaneous
write operations.

RAID-5: The best choice in a multi-user environment. However, at least three drives are
required for a RAID-5 array.

4.3 File Organizations:


File: A file is a named collection of related information that is recorded on secondary storage
such as magnetic disks, magnetic tapes and optical disks.

What is File Organization?

File organization refers to the logical relationships among the various records that constitute
a file, particularly with respect to the means of identification and access to any specific
record. In simple terms, storing the records of a file in a certain order is called file
organization.

The database is stored as a collection of files:

o Each file is a sequence of records.

o A record is a sequence of fields.

o Records are stored in units of blocks.

Fixed-length records

o Assume the record size is fixed.

o Each file has records of one particular type only.

o Different files are used for different relations.

As an example, let us consider a file of instructor records for our university database. Each
record of this file is defined (in pseudocode) as:

type instructor = record
    ID varchar(5);
    name varchar(20);
    dept_name varchar(20);
    salary numeric(8,2);
end


Assume that each character occupies 1 byte and that numeric(8,2) occupies 8 bytes.
Suppose that instead of allocating a variable number of bytes for the attributes ID, name,
and dept_name, we allocate the maximum number of bytes that each attribute can hold.
Then the instructor record is 53 bytes long. A simple approach is to use the first 53 bytes
for the first record, the next 53 bytes for the second record, and so on. However, there are
two problems with this simple approach:

1. Unless the block size happens to be a multiple of 53 (which is unlikely), some records
will cross block boundaries. That is, part of the record will be stored in one block and part
in another. It would thus require two block accesses to read or write such a record.

2. It is difficult to delete a record from this structure. The space occupied by the record to
be deleted must be filled with some other record of the file, or we must have a way of
marking deleted records so that they can be ignored.
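
The record size and the first problem can be checked with a short calculation. The sketch below uses the field sizes stated above; the block size of 4096 bytes and the helper name crosses_block_boundary are assumptions made for the example.

```python
# Maximum bytes per attribute, as assumed in the text.
FIELD_BYTES = {"ID": 5, "name": 20, "dept_name": 20, "salary": 8}
RECORD_SIZE = sum(FIELD_BYTES.values())

BLOCK_SIZE = 4096   # assumed block size; not a multiple of the record size


def crosses_block_boundary(record_no, block_size=BLOCK_SIZE, record_size=RECORD_SIZE):
    """True if record number record_no (counting from 0) spans two blocks."""
    first_byte = record_no * record_size
    last_byte = first_byte + record_size - 1
    return first_byte // block_size != last_byte // block_size


# Number of the first record that is split across two blocks.
first_split = next(i for i in range(1000) if crosses_block_boundary(i))
```

The fields add up to 53 bytes. With 4096-byte blocks, records 0 to 76 fit in the first block, and record 77 starts at byte 4081 and spills into the second block, so reading it needs two block accesses.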

When a record is deleted, we could move the record that came after it into the space
formerly occupied by the deleted record, and so on, until every record following the
deleted record has been moved ahead. Such an approach requires moving a large number
of records. It might be easier simply to move the final record of the file into the space
occupied by the deleted record.

It is undesirable to move records to occupy the space freed by the deleted record, since
doing so requires additional block accesses. Since insertions tend to be more frequent than
deletions, it is acceptable to leave open the space occupied by the deleted record, and to
wait for a subsequent insertion before reusing the space. A simple marker on the deleted
record is not sufficient, since it is hard to find this available space when an insertion is
being done. Thus we need to introduce an additional structure.

At the beginning of the file, we allocate a certain number of bytes as a file header. The
header will contain a variety of information about the file.

For now, all we need to store there is the address of the first record whose contents are
deleted. We use this first record to store the address of the second available record, and so
on. Intuitively we can think of these stored addresses as pointers, since they point to the
location of a record. The deleted records thus form a linked list, which is often referred to
as a free list.

On insertion of a new record, we use the record pointed to by the header. We change the
header pointer to point to the next available record. If no space is available, we add the
new record to the end of the file.
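
The free-list scheme described above can be sketched as follows. This is an in-memory illustration under stated assumptions: the class names FixedLengthFile and FreeSlot are hypothetical, slot numbers stand in for record addresses, and a newly deleted slot is placed at the head of the free list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreeSlot:
    """A deleted record slot; it stores the address of the next free slot."""
    next_free: Optional[int]


class FixedLengthFile:
    def __init__(self):
        self.free_head = None   # file header: address of the first deleted record
        self.slots = []         # each slot holds a record or a FreeSlot

    def delete(self, slot):
        # The deleted slot becomes the new head of the free list.
        self.slots[slot] = FreeSlot(self.free_head)
        self.free_head = slot

    def insert(self, record):
        if self.free_head is None:           # no free space: add at end of file
            self.slots.append(record)
            return len(self.slots) - 1
        slot = self.free_head                # reuse the slot the header points to
        self.free_head = self.slots[slot].next_free
        self.slots[slot] = record
        return slot


f = FixedLengthFile()
for name in ["r0", "r1", "r2", "r3", "r4"]:
    f.insert(name)
f.delete(1)
f.delete(3)                 # free list is now: header -> 3 -> 1
a = f.insert("new_a")       # reuses slot 3
b = f.insert("new_b")       # reuses slot 1
c = f.insert("new_c")       # free list empty: appended as slot 5
```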


Insertion and deletion for files of fixed length records are simple to implement, because the
space made available by a deleted record is exactly the space needed to insert a record. If
we allow records of variable length in a file, this match no longer holds. An inserted record
may not fit in the space left free by a deleted record, or it may fill only part of that space.

Variable-length records

o Storage of multiple record types in a file

o Record types that allow variable lengths for one or more fields

o Record types that allow repeating fields

o Byte string representation, pointer-based methods, …organization:

The slotted page structure is commonly used for organizing records within a block. There is a
header at the beginning of each block, containing the following information.

1) The number of record entries in the header.

2) The end of free space in the block

3) An array whose entries contain the location and size of each record.


The actual records are allocated contiguously in the block, starting from the end of the block. The
free space in the block is contiguous, between the final entry in the header array, and the first
record. If a record is inserted, space is allocated for it at the end of free space, and an entry
containing its size and location is added to the header.

If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set
to -1, for example). Further the records in the block before the deleted records are moved, so that
the free space created by the deletion gets occupied, and all free space is again between the final
entry in the header array and the first record. The end of free space pointer in the header is
appropriately updated as well. Records can be grown or shrunk by similar techniques, as long as
there is space in the block. The cost of moving the records is not too high, since the size of a block
is limited: a typical value is 4 kilobytes.

The slotted page structure requires that there be no pointers that point directly to records.
Instead, pointers must point to the entry in the header that contains the actual location of
the record. This level of indirection allows records to be moved to prevent fragmentation of
space inside a block, while supporting indirect pointers to the record.
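
A minimal sketch of a slotted page is given below. It is a simplification under stated assumptions: the class name SlottedPage is hypothetical, the header array is kept as a Python list rather than inside the block itself (so header space is not accounted for), and slots of deleted records are not reused.

```python
class SlottedPage:
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []          # header array of [offset, size]; size -1 = deleted
        self.free_end = size     # end of free space; records grow down from the end

    def insert(self, record):
        if len(record) > self.free_end:
            raise ValueError("not enough free space in the block")
        self.free_end -= len(record)                     # allocate at end of free space
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append([self.free_end, len(record)])
        return len(self.slots) - 1                       # slot number, used as pointer

    def read(self, slot):
        offset, size = self.slots[slot]
        if size == -1:
            raise KeyError("record was deleted")
        return bytes(self.data[offset:offset + size])

    def delete(self, slot):
        offset, size = self.slots[slot]
        if size == -1:
            raise KeyError("record was deleted")
        # Move the records stored before the deleted one so the hole is filled.
        self.data[self.free_end + size:offset + size] = self.data[self.free_end:offset]
        for entry in self.slots:
            if entry[1] != -1 and entry[0] < offset:
                entry[0] += size                         # update moved records' locations
        self.slots[slot] = [0, -1]                       # mark the entry as deleted
        self.free_end += size


page = SlottedPage(32)
s0 = page.insert(b"alpha")
s1 = page.insert(b"be")
s2 = page.insert(b"gamma!")
page.delete(s1)            # "gamma!" is moved; its slot number stays valid
```

Because callers hold slot numbers rather than byte offsets, the record stored under s2 can still be read after it has been moved.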

4.4 Organization of Records in Files

There are various methods of file organization. Each method has advantages and disadvantages
with respect to access or selection. The programmer chooses the file organization method best
suited to the requirement.

Some of the types of file organization are:

1. Sequential File Organization
2. Heap File Organization
3. Hash/Direct File Organization
4. Indexed Sequential Access Method
5. B+ Tree File Organization
6. Cluster File Organization

1. Sequential File Organization:

In sequential file organization, records are placed in the file in some sequential order based
on a unique key field or search key.

The sequential method is the easiest method of file organization. In this method the records are
stored one after another in a sequential manner. There are two ways to implement this method:

1. Pile File Method

2. Sorted File Method

1. Pile File Method – This method is quite simple: the records are stored in a sequence,
i.e. one after another, in the order in which they are inserted into the tables.

Insertion of new record –

Let R1, R3, and so on up to R5 and R4 be the records in the sequence. Here, a record is
simply a row in a table. Suppose a new record R2 has to be inserted in the sequence; it
is simply placed at the end of the file.


2. Sorted File Method – In this method, as the name suggests, whenever a new record has
to be inserted, it is always inserted in a sorted (ascending or descending) manner.
Sorting of records may be based on a primary key or on any other key.

Insertion of new record –

Let us assume that there is a preexisting sorted sequence of records R1, R3, and so on up
to R7 and R8. Suppose a new record R2 has to be inserted in the sequence; it is inserted
at the end of the file, and the sequence is then sorted.
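
The two insertion methods can be contrasted with a short sketch. The integers below stand for the record keys (1 for R1, and so on); the particular file contents and the function names are assumptions for illustration only.

```python
def insert_pile(records, key):
    """Pile file: the new record simply goes at the end of the file."""
    records.append(key)


def insert_sorted(records, key):
    """Sorted file, as described above: place at the end, then sort the sequence."""
    records.append(key)
    records.sort()


pile = [1, 3, 5, 4]
insert_pile(pile, 2)

sorted_file = [1, 3, 7, 8]
insert_sorted(sorted_file, 2)
```

After the insertions the pile file keeps arrival order, with R2 last, while the sorted file has R2 between R1 and R3. The extra sorting step is the cost paid on every insertion into a sorted file.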

Pros and Cons of Sequential File Organization –

Pros –

 o Fast and efficient method for huge amounts of data.
 o Simple design.
 o Files can easily be stored on magnetic tapes, i.e. a cheaper storage mechanism.

Cons –

 o Time is wasted because we cannot jump directly to a particular required record; we
 have to move through the file in a sequential manner.
 o The sorted file method is inefficient, as it takes time and space to sort the records.

2. Heap File Organization:

When a file is created using heap file organization, the operating system allocates a
memory area to that file without any further accounting details. File records can be placed
anywhere in that memory area.

Heap file organization works with data blocks. In this method, records are inserted at the
end of the file, into the data blocks. No sorting or ordering is required. If a data block is
full, the new record is stored in some other block. This other data block need not be the
very next data block; it can be any block in the memory area. It is the responsibility of the
DBMS to store and manage the new records.

Insertion of new record –


Suppose we have five records in the heap, R1, R5, R6, R4 and R3, and a new record R2 has
to be inserted in the heap. Since the last data block, i.e. data block 3, is full, R2 will be
inserted in any of the data blocks selected by the DBMS, let's say data block 1.


If we want to search, delete or update data in heap file organization, we traverse the data
from the beginning of the file until we reach the requested record. Thus, if the database is
very large, searching, deleting or updating a record takes a lot of time.
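
The insertion and search behaviour described above can be sketched as follows. The class name HeapFile, the block capacity of 2 records, and the exact placement of records in the three blocks are assumptions chosen for the example; a real DBMS may pick the target block differently.

```python
class HeapFile:
    def __init__(self, block_capacity):
        self.block_capacity = block_capacity
        self.blocks = []                     # each data block is a list of records

    def insert(self, record):
        # Try the last block first, then any other block with free space.
        candidates = self.blocks[-1:] + self.blocks[:-1]
        for block in candidates:
            if len(block) < self.block_capacity:
                block.append(record)
                return
        self.blocks.append([record])         # every block is full: add a new one

    def search(self, record):
        """Scan from the beginning; return how many blocks were read, or None."""
        for blocks_read, block in enumerate(self.blocks, start=1):
            if record in block:
                return blocks_read
        return None


heap = HeapFile(block_capacity=2)
heap.blocks = [["R1"], ["R5", "R6"], ["R4", "R3"]]   # data block 3 is full
heap.insert("R2")                                    # goes into data block 1
blocks_read_for_r3 = heap.search("R3")
```

Finding R3 requires reading all three blocks, which shows why searching a large heap file is slow.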

Pros and Cons of Heap File Organization –

Pros –

 o Fetching and retrieving records is faster than in sequential file organization, but
 only for small databases.
 o When a huge amount of data needs to be loaded into the database at one time, this
 method of file organization is best suited.

Cons –

 o Problem of unused memory blocks.
 o Inefficient for larger databases.

3. Hash File Organization:

Hash file organization computes a hash function on some fields of the records. The output of
the hash function determines the address of the disk block where the record is to be placed.

The hash function can be any simple or complex mathematical function.


The hash function is applied to some columns/attributes, either key or non-key columns, to
get the block address.

Hence each record is stored at a location that is independent of the order in which the
records arrive. For this reason the method is also known as direct or random file
organization.

If the hash function is computed on a key column, that column is called the hash key; if it is
computed on a non-key column, the column is called the hash column.

When a record has to be retrieved, the address is generated from the hash key column, and the
whole record is retrieved directly from that address. There is no need to traverse the whole
file. Similarly, when a new record has to be inserted, the address is generated from the hash
key and the record is inserted directly. The same is the case with update and delete.
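
A minimal sketch of hash file organization follows. The hash function used here (key modulo the number of buckets) is just one simple choice, and the class name HashFile, the bucket count and the sample records are assumptions for the example.

```python
class HashFile:
    def __init__(self, num_buckets):
        # Each bucket stands for one disk block holding (key, record) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def address(self, key):
        return key % len(self.buckets)       # assumed simple hash function

    def insert(self, key, record):
        self.buckets[self.address(key)].append((key, record))

    def lookup(self, key):
        # Only the one bucket given by the hash function is examined.
        for stored_key, record in self.buckets[self.address(key)]:
            if stored_key == key:
                return record
        return None

    def delete(self, key):
        bucket = self.buckets[self.address(key)]
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:
                del bucket[i]
                return True
        return False


hash_file = HashFile(num_buckets=5)
hash_file.insert(104, "record for key 104")
hash_file.insert(7, "record for key 7")
hash_file.insert(12, "record for key 12")    # 12 and 7 hash to the same bucket
found = hash_file.lookup(12)
```

Insert, lookup and delete each touch a single bucket, regardless of how many records the file holds.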

Advantages of Hash File Organization

 o Records need not be sorted after any transaction, so the effort of sorting is avoided
 in this method.
 o Since the block address is given by the hash function, accessing any record is very
 fast. Similarly, updating or deleting a record is also very quick.
 o This method can handle multiple transactions, as each record is independent of the
 others: since there is no dependency on storage location for each record, multiple
 records can be accessed at the same time.
 o It is suitable for online transaction systems such as online banking and ticket
 booking systems.

4. Clustered File Organization:

Clustered file organization is not considered good for large databases. In this mechanism,
related records from one or more relations are kept in the same disk block; that is, the
ordering of records is not based on a primary key or search key.

In this method, two or more tables that are frequently joined to get results are stored in the
same file, called a cluster. These files have two or more tables in the same data block, and
the key columns that map these tables to each other are stored only once. This method therefore
reduces the cost of searching for related records in different files. All the records are found
in one place, which makes the search efficient.

4.5 Data Dictionary in DBMS

A relational database system maintains all information about each relation or table, from its
schema to the constraints applied to it. All of this metadata is stored. In general, metadata
refers to data about data. The structure that stores the relational schemas and other metadata
about the relations is known as the data dictionary or system catalog.

A data dictionary is like an A-Z dictionary of the relational database system, holding all
information about each relation in the database.

The types of information a system must store are:

o Names of the relations
o Names of the attributes of each relation
o Lengths and domains of attributes
o Names and definitions of the views defined on the database
o Various integrity constraints

In addition, the system keeps the following data about the users of the system:

o Accounting and authorization information about users.
o The authentication information for users, such as passwords or other related information.

In addition to this, the system may also store some statistical and descriptive data about the
relations, such as:

o Number of tuples in each relation
o Method of storage for each relation, such as clustered or non-clustered.

A system may also store the storage organization of each relation, whether sequential, hash, or
heap. It also notes the location where each relation is stored:

o If relations are stored in files of the operating system, the data dictionary stores the
names of the files.
o If the database stores all the relations in a single file, the data dictionary stores the
blocks containing the records of each relation in a data structure similar to a linked list.

Finally, it also stores information about each index on each of the relations:

o Name of the index.
o Name of the relation being indexed.
o Attributes on which the index is defined.
o The type of index formed.

All the above information, or metadata, is stored in the data dictionary. The data dictionary is
also updated whenever changes occur in the relations. Such metadata constitutes a miniature
database. Some systems store the metadata in the form of relations in the database itself. The
system designers decide how the data dictionary is represented. Also, a data dictionary often
stores the data in a non-normalized form: it does not use any normal form, so that the data
stored in the dictionary can be accessed quickly.


For example, in the schema representation of the data dictionary, an underline below an
attribute name indicates that the attribute is part of the primary key.

So, whenever the database system needs to fetch records from a relation, it first looks up the
location and storage organization of the relation in the data dictionary relations. After
confirming these details, it retrieves the required records from the database.
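
The lookup described above can be sketched with a small in-memory catalog. Everything here is hypothetical and for illustration only: the layout of the catalog dictionary, the file name instructor.dat, the index name and the locate helper are assumptions, and real systems store this metadata in their own catalog relations.

```python
# A toy data dictionary: one entry per relation (all values are made up).
catalog = {
    "instructor": {
        "attributes": [
            ("ID", "varchar(5)"),
            ("name", "varchar(20)"),
            ("dept_name", "varchar(20)"),
            ("salary", "numeric(8,2)"),
        ],
        "primary_key": ["ID"],
        "organization": "heap",            # sequential, hash, or heap
        "file": "instructor.dat",          # hypothetical operating system file
        "num_tuples": 0,
        "indexes": [
            {"name": "instructor_id_idx", "attributes": ["ID"], "type": "hash"},
        ],
    }
}


def locate(relation):
    """Return (file name, storage organization) for a relation from the catalog."""
    entry = catalog[relation]
    return entry["file"], entry["organization"]


file_name, organization = locate("instructor")
```

The system would consult the catalog in this way first, and only then open the file and apply the access method that matches the recorded storage organization.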
