21cs1401-Unit 5 - Dbms
A database system presents a unified, logical view of the stored data. Physically, however, the data is stored as bits and bytes on different storage devices.
For storing the data, there are different types of storage options available. These storage types differ from one another in speed, cost, and accessibility. The following types of storage devices are used for storing data:
○ Primary Storage
○ Secondary Storage
○ Tertiary Storage
Primary Storage
Primary storage is the storage area that offers the quickest access to the stored data. It is also known as volatile storage, because this type of memory does not store data permanently: if the system suffers a power failure or a crash, the data is lost. Main memory and cache are the types of primary storage.
○ Main Memory: It is the storage medium used for data that is available to be operated on directly; the main memory holds the instructions that the machine executes. It can store gigabytes of data, but it is usually too small to hold an entire database. Its entire contents are lost if the system shuts down because of a power failure or other causes.
21CS1401 – Database Management Systems Unit –V
○ Cache: It is the costliest storage medium, but also the fastest. A cache is a tiny storage medium that is usually managed by the computer hardware. Designers of query processors and data structures take cache effects into account when designing their algorithms.
Secondary Storage
Secondary storage is also called online storage. It is the storage area that allows the user to save and store data permanently. This type of memory does not lose data on a power failure or system crash; that is why it is also called non-volatile storage.
Some commonly used secondary storage media are available in almost every type of computer system:
○ Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus) keys, which are plugged into the USB slots of a computer system. USB keys help transfer data between computer systems and come in a range of sizes. Unlike main memory, flash memory retains its stored data across power failures. This type of storage is commonly used in server systems for caching frequently used data, which yields high performance, and it can store larger amounts of data than main memory.
○ Magnetic Disk Storage: This type of storage medium is also known as online storage. A magnetic disk is used for storing data for a long time and is capable of holding an entire database. The computer system is responsible for bringing data from disk into main memory for access; if an operation modifies the data, the modified data must be written back to the disk. A key strength of a magnetic disk is that its data survives system crashes and power failures, but a failure of the disk itself can easily destroy the stored data.
Tertiary Storage
Tertiary storage is external to the computer system. It has the slowest speed, but it can hold very large amounts of data; it is also known as offline storage. Tertiary storage is generally used for data backup. The following tertiary storage devices are available:
○ Optical Storage: Optical storage can hold megabytes or gigabytes of data. A Compact Disk (CD) can store about 700 megabytes of data, with a playtime of around 80 minutes. A Digital Video Disk (DVD) can store 4.7 gigabytes per single-layer side or 8.5 gigabytes per dual-layer side.
○ Tape Storage: Tape is a cheaper storage medium than disks. Tapes are generally used for archiving or backing up data. Access is slow because data is read sequentially from the beginning; hence tape storage is also known as sequential-access storage. Disk storage, by contrast, is known as direct-access storage, since data can be read from any location on the disk.
failed disk and to restore the data on it. Power failures and natural disasters such as earthquakes and fires may result in damage to both disks at the same time.
Improvement in Performance via Parallelism
Consider the benefit of parallel access to multiple disks. With disk mirroring, the rate at which read requests can be handled is doubled, since read requests can be sent to either disk. The transfer rate of each read is the same as in a single-disk system, but the number of reads per unit time is doubled.
With multiple disks, the transfer rate can be improved as well by striping data across multiple disks.
In its simplest form, data striping consists of splitting the bits of each byte across multiple disks. Such
striping is called bit-level striping. For example, if there is an array of eight disks, then bit i of each byte is
written to disk i. The array of eight disks can be treated as a single disk with sectors that are eight times the
normal size, and, more important, that has eight times the transfer rate.
In such an organization, every disk participates in every access (read or write), so the number of accesses that can be processed per second is about the same as on a single disk, but each access can read eight times as much data in the same time as a single disk.
Block-level striping stripes blocks across multiple disks. It treats the array of disks as a single large
disk, and it gives blocks logical numbers. It is assumed that the block numbers start from 0.
With an array of n disks, block-level striping assigns logical block i to the ⌊i/n⌋-th physical block of disk (i mod n) + 1. For example, with 8 disks, logical block 0 is stored in physical block 0 of disk 1, while logical block 11 is stored in physical block 1 of disk 4.
While reading a large file, block-level striping fetches n blocks at a time in parallel from the n disks, giving a high data-transfer rate for large reads. When a single block is read, the data-transfer rate is the same as on one disk, but the remaining n - 1 disks are free to perform other actions.
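The block-to-disk mapping above can be made concrete with a small Python sketch (the function name is ours; the disk/block numbering follows the text's convention of disks starting at 1 and blocks at 0):

```python
def locate_block(i, n):
    """Map logical block i to (disk, physical block) under block-level
    striping across n disks: disk (i mod n) + 1, physical block i // n."""
    disk = (i % n) + 1
    physical_block = i // n
    return disk, physical_block

# With 8 disks, as in the text's example:
print(locate_block(0, 8))   # logical block 0  -> disk 1, physical block 0
print(locate_block(11, 8))  # logical block 11 -> disk 4, physical block 1
```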
RAID Levels
Mirroring provides high reliability, but it is expensive. Striping provides high data transfer
rates, but does not improve reliability. Various alternative schemes aim to provide redundancy at lower cost
by combining disk striping with "parity" bits. These schemes have different cost-performance trade-offs. The
schemes are classified into RAID levels. RAID level 0 refers to disk arrays with striping at the level of
blocks, but without any redundancy. The below Figure: 5.3(a) shows an array of size 4.
RAID level 3, bit-interleaved parity organization, improves on level 2 by exploiting the fact that disk controllers can detect whether a sector has been read correctly, so a single parity bit can be used for error correction as well as for detection. The idea is as follows. If one of the sectors gets damaged, the disk
controller knows exactly which sector it is, and, for each bit in the sector, the system can figure out whether
it is a 1 or a 0 by computing the parity of the corresponding bits from sectors in the other disks. If the
parity of the remaining bits is equal to the stored parity, the missing bit is 0. Otherwise, it is 1. The Figure:
5.4(d) shows the RAID level 3.
RAID level 3 is as good as level 2, but is less expensive in the number of extra disks (it has only a
one-disk overhead), so level 2 is not used in practice.
RAID level 3 has two benefits over level 1. It needs only one parity disk for several regular disks, whereas level 1 needs one mirror disk for every disk, and it thus reduces the storage overhead.
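The parity-based recovery described above amounts to XOR-ing the surviving sectors together. A minimal Python sketch (the 4-byte "sectors" and the choice of failed disk are toy values, not from the text):

```python
from functools import reduce

def parity(blocks):
    """Bitwise XOR parity over a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Four toy data sectors and their stored parity sector.
data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]),
        bytes([9, 10, 11, 12]), bytes([13, 14, 15, 16])]
p = parity(data)

# Suppose disk 2 fails: XOR of the survivors plus parity recovers its sector,
# because every value XOR'ed with itself cancels out.
survivors = [b for i, b in enumerate(data) if i != 2]
recovered = parity(survivors + [p])
assert recovered == data[2]
```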
RAID level 4, block-interleaved parity organization, uses block level striping, like RAID 0, and in
addition keeps a parity block on a separate disk for corresponding blocks from N other disks. This scheme
is shown pictorially in the Figure: 5.5(e). If one of the disks fails, the parity block can be used with the corresponding blocks from the other disks to restore the blocks of the failed disk.
A block read accesses only one disk, allowing other requests to be processed by the other disks. Thus,
the data-transfer rate for each access is slower, but multiple read accesses can proceed in parallel, leading to a
higher overall I/O rate. The transfer rate for large reads is high, since all the disks can be read in parallel;
large writes also have high transfer rates, since the data and parity can be written in parallel.
Small independent writes, on the other hand, cannot be performed in parallel. A write of a block has
to access the disk on which the block is stored, as well as the parity disk, since the parity block has to be
updated. Moreover, both the old value of the parity block and the old value of the block being written have to
be read for the new parity to be computed. Thus, a single write requires four disk accesses: two to read the
two old blocks, and two to write the two blocks.
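The parity update behind those four accesses follows the standard rule for parity RAID: new parity = old parity XOR old block XOR new block. A small sketch (the byte values are illustrative only):

```python
def small_write(old_block, old_parity, new_block):
    """RAID 4/5 small-write parity update. This is why a single block
    write costs four disk accesses: read old block, read old parity,
    then write new block and write new parity."""
    new_parity = bytes(p ^ o ^ n
                       for p, o, n in zip(old_parity, old_block, new_block))
    return new_block, new_parity

# Verify on a toy stripe of three data blocks: after updating block 1,
# the new parity still equals the XOR of all current data blocks.
d = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6])]
stripe_parity = bytes(a ^ b ^ c for a, b, c in zip(*d))
new_d1 = bytes([9, 9])
_, new_parity = small_write(d[1], stripe_parity, new_d1)
assert new_parity == bytes(a ^ b ^ c for a, b, c in zip(d[0], new_d1, d[2]))
```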
⮚ It is difficult to delete a record from this structure. The space occupied by the record to be
deleted must be filled with some other record of the file, or we must have a way of marking
deleted records so that they can be ignored.
⮚ Unless the block size happens to be a multiple of 40 bytes (the record size), some records will cross block boundaries. That is, part of the record will be stored in one block and part in another. It would then require two block accesses to read or write such a record.
The deletion can be performed in several ways:
The first approach: when a record is deleted, all the records after it are moved up into the freed position. Instead of this, it may be easier simply to move the final record of the file into the space occupied by the deleted record. Another approach is to reuse the space of the deleted record by inserting a new record in its place; this avoids any movement of records. Since it is hard to find the available space, it is desirable to use some additional structure. At the beginning of the file, a certain number of bytes are allocated as a file header. The header contains a variety of information about the file; in addition, it maintains the address of the first record whose contents have been deleted. That first record stores the address of the second available record, and so on. These stored addresses are called pointers, since they point to the location of a record. The deleted records thus form a linked list, which is often referred to as a free list. The Figure: 5.12 shows the file with the free list after records 1, 4, and 6 have been deleted.
Figure: 5.12 - File of Figure: 5.9, with free list after deletion of records 1, 4, and 6.
On insertion of a new record, we use the record pointed to by the header. The header pointer is changed to point to the next available record after the insertion. If no space is available, the insertion is done at the end of the file.
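The free-list scheme above can be sketched in Python. This is a toy in-memory model (class and record names are ours), with the "pointer" stored inside each freed slot just as the text describes:

```python
class FixedLengthFile:
    """Toy fixed-length record file with a free list: the header points
    to the first deleted slot, and each deleted slot points to the next."""
    def __init__(self, records):
        self.slots = list(records)
        self.free_head = None        # header pointer to first free slot

    def delete(self, i):
        # Store the old head "inside" the freed slot, then point the
        # header at this slot: the deleted records form a linked list.
        self.slots[i] = ("FREE", self.free_head)
        self.free_head = i

    def insert(self, record):
        if self.free_head is not None:          # reuse a freed slot
            i = self.free_head
            self.free_head = self.slots[i][1]   # header skips past it
            self.slots[i] = record
            return i
        self.slots.append(record)               # else append at end of file
        return len(self.slots) - 1

f = FixedLengthFile(["r0", "r1", "r2", "r3", "r4", "r5", "r6"])
for i in (1, 4, 6):                 # delete records 1, 4, 6, as in Figure 5.12
    f.delete(i)
assert f.insert("new-a") == 6       # freed slots are reused, most recent first
assert f.insert("new-b") == 4
```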
Variable Length Records:
Variable-length records arise in database systems in several ways:
The Figure: 5.13 shows such an organization, which represents the account file as variable-length records. There is a header at the beginning of each block, containing the following information:
● The number of record entries in the header
● The end of free space in the block
● An array whose entries contain the location and size of each record
The actual records are allocated contiguously in the block, starting from the end of the block. The free space in the block is contiguous, between the final entry in the header array and the first record. If a record is inserted, space is allocated for it at the end of free space, and an entry containing its size and location is added to the header.
If a record is deleted, the space that it occupies is freed, and its entry is set to deleted (its size is set to
-1, for example). Further, the records in the block before the deleted record are moved, so that the free space
created by the deletion gets occupied and all free space is again between the final entry in the header array
and the first record. The end-of-free-space pointer in the header is appropriately updated as well. Records can
be grown or shrunk by similar techniques, as long as there is space in the block.
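The block layout above (a slotted page) can be modeled in a few lines of Python. This is a minimal sketch with made-up sizes; it tracks only offsets and sizes, not record contents:

```python
class SlottedBlock:
    """Sketch of the block header described above: an entry array of
    (location, size) pairs plus an end-of-free-space pointer; records
    grow from the end of the block toward the header."""
    def __init__(self, size=4096):
        self.free_end = size          # end of free space
        self.entries = []             # (location, size); size -1 = deleted

    def insert(self, record_len):
        self.free_end -= record_len   # allocate at the end of free space
        self.entries.append((self.free_end, record_len))
        return len(self.entries) - 1  # slot number

    def delete(self, slot):
        loc, rec_len = self.entries[slot]
        self.entries[slot] = (loc, -1)        # mark the entry as deleted
        # Slide records stored before the hole toward the end of the
        # block, so free space is again contiguous after the header.
        for i, (l, s) in enumerate(self.entries):
            if s != -1 and l < loc:
                self.entries[i] = (l + rec_len, s)
        self.free_end += rec_len

b = SlottedBlock()
b.insert(100)   # slot 0 at offset 3996
b.insert(200)   # slot 1 at offset 3796
b.delete(0)     # slot 1's record slides up; free space grows by 100
assert b.entries[0] == (3996, -1) and b.entries[1] == (3896, 200)
```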
Fixed-length Representation
Another way to implement variable-length records efficiently in a file system is to use one or more
fixed-length records to represent one variable-length record.
There are two ways of doing this:
● Reserved space: If there is a maximum record length that is never exceeded, then fixed-length records of that length are used. Unused space (for records shorter than the maximum) is filled with a special null, or end-of-record, symbol.
● List representation: Variable-length records can be represented by lists of fixed-length records, chained together by pointers.
The Figure: 5.15 shows how the account file would be represented if a maximum of three accounts per branch were allowed.
operation fetches related records from all the relations. For example, records of the two relations can
be considered to be related if they would match in a join of the two relations.
Heap file organization (heap files or unordered files):
In this simplest and most basic type of organization, records are placed in the file in the order in
which they are inserted, so new records are inserted at the end of the file. Such an organization is called a
heap or pile file. This organization is often used with additional access paths, such as secondary indexes.
It is also used to collect and store records for future use.
Inserting a new record is efficient: the last disk block of the file is copied into a buffer, the new record is added, and the block is then rewritten back to the disk. The address of the last file block is kept in the file header. However, searching for a record using any search condition involves a linear search through the file, block by block, which is an expensive procedure. If only one record satisfies the search condition, then, on average, a program will read into memory and search half the file blocks before it finds the record; a file of b blocks requires searching b/2 blocks on average. If no record satisfies the search condition, the program must read and search all b blocks in the file.
To delete a record, a program must first find its block, copy the block into the buffer, and finally
rewrite the block back to the disk. This leaves unused space in the disk block. Deleting a large number of
records in this way results in wasted storage space. Another technique used for record deletion is to have
an extra byte or bit, called a deletion marker, stored with each record. A record is deleted by setting the
deletion marker to a certain value. A different value of the marker indicates a valid (not deleted) record.
Search programs consider only valid records in a block when conducting their search. Both of these
deletion techniques require periodic reorganization of the file to reclaim the unused space of deleted
records. During reorganization, the file blocks are accessed consecutively, and records are packed by
removing deleted records. After such a reorganization, the blocks are filled to capacity once more. Another possibility is to reuse the space of deleted records when inserting new records, although this requires extra bookkeeping to keep track of empty locations.
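The average-case figure of b/2 block reads quoted above can be checked numerically; the exact expectation is (b + 1)/2 when the single matching record is equally likely to be in any block:

```python
def heap_search_cost(b, found=True):
    """Expected number of block reads for an equality search on a heap
    file of b blocks: about b/2 when exactly one record matches,
    exactly b when no record matches."""
    if not found:
        return b
    # Average over the matching record being in block 1, 2, ..., b.
    return sum(range(1, b + 1)) / b   # = (b + 1) / 2

print(heap_search_cost(1000))              # 500.5, roughly b/2
print(heap_search_cost(1000, found=False)) # 1000, the whole file
```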
Sequential File Organization (sorted files or ordered files):
A sequential file organization is designed for efficient processing of records in sorted order based on
some search key. A search key is any attribute or set of attributes. It need not be the primary key, or even a
super key. To permit fast retrieval of records in search-key order, the records are chained together by
pointers. The pointer in each record points to the next record in search-key order. Furthermore, to minimize
the number of block accesses in sequential file processing, the records are stored physically in search-key
order, or as close to search-key order as possible.
Figure: 5.18 shows a sequential file of account records taken from the banking example. In that example, the records are stored in search-key order, using branch-name as the search key.
The Figure: 5.19 shows the account file after the insertion of the record (North Town, A-888, 800).
The structure in the figure allows fast insertion of new records, but forces sequential file-processing
applications to process records in an order that does not match the physical order of the records. If relatively
few records need to be stored in overflow blocks, this approach works well. Eventually, however, the
correspondence between search-key order and physical order may be totally lost, in which case sequential
processing will become much less efficient. At this point, the file should be reorganized so that it is once
again physically in sequential order. Such reorganizations are costly and must be done during times when
the system load is low. The frequency with which reorganizations are needed depends on the frequency of
insertion of new records. In the extreme case in which insertions rarely occur, it is possible always to keep
the file in physically sorted order.
each record will reside on a different block, forcing us to do one block read for each record required by the
query. As an example,
A clustering file organization is a file organization, such as that illustrated in the Figure: 5.22, that stores related records of two or more relations in each block. Such a file organization allows us to read records that would satisfy the join condition using one block read.
The use of clustering enhances the processing of a particular join (depositor ⋈ customer), but it slows the processing of other types of queries. For example,
Select * from customer
requires more block accesses than it did in the scheme under which each relation is stored in a separate file: instead of several customer records appearing in one block, each record is located in a distinct block.
Indeed, simply finding all the customer records is not possible without some additional structure. To locate all tuples of the customer relation in the structure of the Figure: 5.22, we need to chain together all records of that relation using pointers, as in the Figure: 5.23. Whether to use clustering depends on the types of query that the database designer believes to be most frequent. Careful use of clustering can produce significant performance gains in query processing.
Several techniques exist for both ordered indexing and hashing. No one technique is the best. Rather,
each technique is best suited to particular database applications.
Ordered Indices
To gain fast random access to records in a file, an index structure is used. Each index structure is
associated with a particular search key. Just like the index of a book or a library catalog an ordered index
stores the values of the search keys in sorted order, and associates with each search key the records that
contain it.
Ordered indices can be categorized as primary index and secondary index.
Primary Index
In this index, it is assumed that all files are ordered sequentially on some search key. Such files, with
a primary index on the search key, are called index-sequential files. They represent one of the oldest index
schemes used in database systems. They are designed for applications that require both sequential processing
of the entire file and random access to individual records.
The Figure: 5.24 shows a sequential file of account records taken from the banking example. In the figure, the records are stored in search-key order, with branch-name used as the search key.
Dense index: an index record appears for every search-key value in the file. In a dense primary index, the
index record contains the search-key value and a pointer to the first data record with that search-key value.
The rest of the records with the same search-key value are stored sequentially after the first record since, because the index is a primary one, records are sorted on the search key.
Dense index implementations may store a list of pointers to all records with the same search-key
value; doing so is not essential for primary indices. The Figure: 5.25 shows the dense index for the account file.
follow that pointer. We then read the account file in sequential order until we find the first Perryridge record,
and begin processing at that point.
Thus, it is generally faster to locate a record with a dense index than with a sparse index. However, sparse indices have advantages over dense indices: they require less space and they impose less maintenance overhead for insertions and deletions.
Multilevel Indices
Even if the sparse index is used, the index itself may become too large for efficient processing. It is
not unreasonable, in practice, to have a file with 100,000 records, with 10 records stored in each block. If we
have one index record per block, the index has 10,000 records. Index records are smaller than data records,
so let us assume that 100 index records fit on a block. Thus, our index occupies 100 blocks. Such large
indices are stored as sequential files on disk.
If an index is sufficiently small to be kept in main memory, the search time to find an entry is low.
However, if the index is so large that it must be kept on disk, a search for an entry requires several disk block
reads. Binary search can be used on the index file to locate an entry, but the search still has a large cost. If
overflow blocks have been used, binary search will not be possible. In that case, a sequential search is
typically used, and that requires b block reads, which will take even longer. Thus, the process of searching a
large index may be costly.
To deal with this problem, we treat the index just as we would treat any other sequential file, and
construct a sparse index on the primary index, as in the Figure: 5.27
To locate a record, we first use binary search on the outer index to find the record for the largest
search-key value less than or equal to the one that we desire. The pointer points to a block of the inner index.
We scan this block until we find the record that has the largest search-key value less than or equal to the one
that we desire. The pointer in this record points to the block of the file that contains the record for which we are looking.
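The two-level search just described can be sketched in Python. The branch names and block numbers below are made-up examples; both levels find the largest entry with key less than or equal to the search key:

```python
import bisect

def two_level_lookup(outer, inner_blocks, key):
    """Sketch of a multilevel-index search. `outer` is a sorted list of
    (first_key, inner_block_no) entries; each inner block is a sorted
    list of (key, data_block_no) entries."""
    keys = [k for k, _ in outer]
    i = bisect.bisect_right(keys, key) - 1     # binary search on outer index
    inner = inner_blocks[outer[i][1]]
    ikeys = [k for k, _ in inner]
    j = bisect.bisect_right(ikeys, key) - 1    # search within inner block
    return inner[j][1]                         # data block to read

outer = [("Brighton", 0), ("Mianus", 1)]
inner_blocks = [[("Brighton", 10), ("Downtown", 11)],
                [("Mianus", 12), ("Perryridge", 13), ("Redwood", 14)]]
assert two_level_lookup(outer, inner_blocks, "Perryridge") == 13
```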
Secondary Indices:
Secondary indices must be dense, with an index entry for every search-key value and a pointer to
every record in the file. A primary index may be sparse, storing only some of the search-key values, since it
is always possible to find records with intermediate search-key values by a sequential access to a part of the
file. If a secondary index stores only some of the search-key values, records with intermediate search-key
values may be anywhere in the file and, in general, we cannot find them without searching the entire file.
A secondary index on a candidate key looks just like a dense primary index, except that the records
pointed to by successive values in the index are not stored sequentially. In general, however, secondary
indices may have a different structure from primary indices. If the search key of a primary index is not a
candidate key, it suffices if the index points to the first record with a particular value for the search key, since
the other records can be fetched by a sequential scan of the file.
In contrast, if the search key of a secondary index is not a candidate key, it is not enough to point to
just the first record with each search-key value. The remaining records with the same search-key value could
be anywhere in the file, since records are ordered by the search key of the primary index, rather than by the
search key of the secondary index. Therefore, a secondary index must contain pointers to all the records.
We can use an extra level of indirection to implement secondary indices on search keys that are not
candidate keys. The pointers in such a secondary index do not point directly to the file. Instead, each points
to a bucket that contains pointers to the file.
The Figure: 5.28 shows the structure of a secondary index that uses an extra level of indirection on the
account file, on the search key balance.
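The bucket indirection just described can be sketched as a mapping from each search-key value to a bucket of record pointers. The account records and balances below are illustrative values, not taken from the figure:

```python
from collections import defaultdict

def build_secondary_index(records):
    """Secondary index on a non-candidate key (here: balance) with one
    level of indirection: each index entry points to a bucket holding
    pointers (record ids) to every matching record, wherever it sits
    in the file."""
    buckets = defaultdict(list)
    for rid, rec in enumerate(records):
        buckets[rec["balance"]].append(rid)
    return dict(buckets)

accounts = [{"acct": "A-217", "balance": 750},
            {"acct": "A-101", "balance": 500},
            {"acct": "A-110", "balance": 600},
            {"acct": "A-215", "balance": 700},
            {"acct": "A-102", "balance": 400},
            {"acct": "A-201", "balance": 900},
            {"acct": "A-218", "balance": 700}]
idx = build_secondary_index(accounts)
assert idx[700] == [3, 6]   # one bucket points to every record with balance 700
```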
For a huge database, it can be nearly impossible to search all the index values through all the levels and then reach the destination data block to retrieve the desired data. Hashing is an effective technique for computing the direct location of a data record on the disk without using an index structure.
Hashing uses hash functions with search keys as parameters to generate the address of a data record.
Hash Organization
Bucket − A hash file stores data in buckets. A bucket is considered a unit of storage; it typically holds one complete disk block, which in turn can store one or more records.
Hash Function − A hash function, h, is a mapping function that maps all the set of search-keys K to
the address where actual records are placed. It is a function from search keys to bucket addresses.
Static Hashing
In static hashing, when a search-key value is provided, the hash function always computes the same address. For example, if a mod-4 hash function is used, it can generate only 4 values (0 through 3). The output address is always the same for a given key, and the number of buckets remains unchanged at all times.
Operation
Insertion − When a record is required to be entered using static hash, the hash function h computes the bucket
address for search key K, where the record will be stored.
Bucket address = h(K)
Search − When a record needs to be retrieved, the same hash function can be used to retrieve the address of
the bucket where the data is stored.
Bucket Overflow
The condition of bucket overflow is known as a collision. This is a problematic state for any static hash function, and overflow chaining can be used to handle it.
1. Overflow Chaining − When a bucket is full, a new bucket is allocated for the same hash result and is linked after the previous one. This mechanism is called closed hashing.
2. Linear Probing − When the hash function generates an address at which data is already stored, the next free bucket is allocated instead. This mechanism is called open hashing.
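Static hashing with overflow chaining can be sketched in Python. The bucket capacity of 2 and the mod-4 hash are toy choices to force an overflow quickly:

```python
class StaticHashFile:
    """Static hashing: a fixed number of buckets, h(K) = K mod n_buckets,
    and a chained overflow page when a primary bucket fills up."""
    BUCKET_CAPACITY = 2   # records per bucket page (toy value)

    def __init__(self, n_buckets=4):
        self.n_buckets = n_buckets
        # Each bucket is a chain of pages; overflow pages are appended.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _h(self, key):
        return key % self.n_buckets   # same address every time: static

    def insert(self, key, record):
        chain = self.buckets[self._h(key)]
        if len(chain[-1]) == self.BUCKET_CAPACITY:
            chain.append([])          # overflow chaining on a full bucket
        chain[-1].append((key, record))

    def search(self, key):
        return [rec for page in self.buckets[self._h(key)]
                for k, rec in page if k == key]

f = StaticHashFile()
for k in (3, 7, 11, 15):             # all hash to bucket 3 (mod 4)
    f.insert(k, f"rec-{k}")
assert len(f.buckets[3]) == 2        # one overflow page was chained
assert f.search(11) == ["rec-11"]
```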
Dynamic Hashing
The problem with static hashing is that it does not expand or shrink dynamically as the size of the database grows or shrinks. Dynamic hashing provides a mechanism in which data buckets are added and removed dynamically and on demand. Dynamic hashing is also known as extendible hashing.
In dynamic hashing, the hash function is made to produce a large number of values, of which only a few are used initially.
Organization
A prefix of the full hash value is taken as the hash index; only this portion of the hash value is used for computing bucket addresses. Every hash index has a depth value n to signify how many bits are used, and these bits can address 2^n buckets. When all these buckets are full, the depth value is increased by one and twice as many buckets are allocated.
Operation
Querying − Look at the depth value of the hash index and use those bits to compute the
bucket address.
Update − Perform a query as above and update the data.
Deletion − Perform a query to locate the desired data and delete the same.
Insertion − Compute the address of the bucket
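The prefix-and-depth idea above can be sketched in Python. This is a deliberately simplified model (a single global depth; real extendible hashing also keeps per-bucket local depths, which we omit), and the multiplicative hash constant is just a convenient way to get well-spread high bits:

```python
class ExtendibleDirectory:
    """Sketch of dynamic hashing: use the first `depth` bits of a 32-bit
    hash value to address 2**depth buckets; double the directory when a
    bucket fills up."""
    BUCKET_CAPACITY = 2

    def __init__(self):
        self.depth = 1                      # bits of the hash value in use
        self.buckets = [[] for _ in range(2 ** self.depth)]

    def _hash32(self, key):
        return (key * 2654435761) & 0xFFFFFFFF   # Knuth multiplicative hash

    def _index(self, key):
        return self._hash32(key) >> (32 - self.depth)  # first `depth` bits

    def insert(self, key):
        i = self._index(key)
        if len(self.buckets[i]) == self.BUCKET_CAPACITY:
            self._grow()                    # depth + 1 -> twice the buckets
            i = self._index(key)
        self.buckets[i].append(key)

    def _grow(self):
        old = [k for b in self.buckets for k in b]
        self.depth += 1
        self.buckets = [[] for _ in range(2 ** self.depth)]
        for k in old:                       # redistribute using one more bit
            self.buckets[self._index(k)].append(k)

    def lookup(self, key):
        return key in self.buckets[self._index(key)]

d = ExtendibleDirectory()
for k in range(1, 7):
    d.insert(k)
# After six insertions the directory has doubled, and every key is findable.
assert d.depth >= 2 and all(d.lookup(k) for k in range(1, 7))
```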
Hashing is not favorable when the data is organized in some ordering and the queries require a range of data; when data is discrete and random, hashing performs best. Hashing algorithms have higher implementation complexity than indexing, but all hash operations are done in expected constant time.
Special cases:
⮚ If the root is not a leaf, it has at least 2 children.
⮚ If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1)
values.
B+-Tree Node Structure
● Typical node
3. Eventually reach a leaf node. If, for some i, key Ki = k, follow pointer Pi to the desired record or bucket. Otherwise, no record with search-key value k exists.
Figure 5.43 Result of splitting node containing Brighton and Downtown on inserting Clearview
Insertion Example:
Figure: 5.6.1 - (a) B-tree leaf node, (b) non-leaf node
Figure: 5.6.2 – B-tree
● Advantages of B-Tree indices:
o May use fewer tree nodes than a corresponding B+-Tree.
o It is sometimes possible to find a search-key value before reaching a leaf node.
● Disadvantages of B-Tree indices:
o Only a small fraction of all search-key values are found early.
o Non-leaf nodes are larger, so fan-out is reduced; thus, B-Trees typically have greater depth than the corresponding B+-Tree.
o Insertion and deletion are more complicated than in B+-Trees.
o Implementation is harder than for B+-Trees.
● Typically, the advantages of B-Trees do not outweigh the disadvantages.
In large database systems, however, disk accesses (which we measure as the number of transfers of
blocks from disk) are usually the most important cost, since disk accesses are slow compared to in-memory
operations. Moreover, CPU speeds have been improving much faster than have disk speeds. Thus, it is likely
that the time spent in disk activity will continue to dominate the total time to execute a query. Finally,
estimating the CPU time is relatively hard compared to estimating the disk-access cost. Therefore, most
people consider the disk-access cost a reasonable measure of the cost of a query evaluation plan.
We use the number of block transfers from disk as a measure of the actual cost. To simplify our
computation of disk-access cost, we assume that all transfers of blocks have the same cost. This assumption
ignores the variance arising from rotational latency (waiting for the desired data to spin under the read-write
head) and seek time (the time that it takes to move the head over the desired track or cylinder). To get more
precise numbers, we need to distinguish between sequential I/O, where the blocks read are contiguous on
disk, and random I/O, where the blocks are noncontiguous, and an extra seek cost must be paid for each disk
I/O operation. We also need to distinguish between reads and writes of blocks, since it takes more time to
write a block to disk than to read a block from disk. A more accurate measure would therefore estimate
⮚ The number of seek operations performed
⮚ The number of blocks read
⮚ The number of blocks written
and then add up these numbers after multiplying them by the average seek time, average transfer time for
reading a block, and average transfer time for writing a block, respectively. Real-life query optimizers also
take CPU costs into account when computing the cost of an operation.
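The cost formula above can be sketched directly in code. The timing constants below are illustrative assumptions (a few milliseconds per seek, fractions of a millisecond per block transfer), not measured values:

```javascript
// Estimate query cost as described above:
// cost = (#seeks × avg seek time) + (#blocks read × avg read time)
//      + (#blocks written × avg write time)
// Default constants are illustrative assumptions, in seconds.
function diskAccessCost(numSeeks, blocksRead, blocksWritten,
                        avgSeek = 0.004, avgRead = 0.0001, avgWrite = 0.00015) {
  return numSeeks * avgSeek +
         blocksRead * avgRead +
         blocksWritten * avgWrite;
}

// Example: a plan performing 10 seeks and reading 1000 blocks, no writes.
console.log(diskAccessCost(10, 1000, 0)); // ≈ 0.14 seconds
```

Note that writes are weighted more heavily than reads, reflecting the observation above that writing a block takes longer than reading one.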
The cost estimates given above ignore the cost of writing the final result of an operation back to disk.
These are taken into account separately where required. The costs of all the algorithms that we consider
depend on the size of the buffer in main memory. In the best case, all data can be read into the buffers, and
the disk does not need to be accessed again. In the worst case, we assume that the buffer can hold only a few
blocks of data-approximately one block per relation. When presenting cost estimates, we generally assume
the worst case.
Overview of Query Evaluation
❖ DBMS keeps descriptive data in system catalogs.
❖ SQL queries are translated into an extended form of relational algebra:
▪ Query plan: a tree of relational operators, with a choice of algorithm for each operator.
▪ Index height, low/high key values (Low/High) for each tree index.
❖ Catalogs are updated periodically.
▪ Updated whenever a significant number of data changes has occurred;
▪ there is lots of approximation anyway, so slight inconsistency is acceptable.
How Catalogs are Stored
• The system catalog is itself a collection of tables.
• Attribute information, for example, is kept in a table Attr_Cat(attr_name, rel_name, type, position).
• Catalog tables describe all tables in the database, including the catalog tables themselves.
NoSQL is a type of database management system (DBMS) that is designed to handle and store large
volumes of unstructured and semi-structured data. Unlike traditional relational databases that use
tables with pre-defined schemas to store data, NoSQL databases use flexible data models that can
adapt to changes in data structures and are capable of scaling horizontally to handle growing
amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term has
since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a wide range
of different database architectures and data models.
However, NoSQL databases may not be suitable for all applications, as they may not provide the
same level of data consistency and transactional guarantees as traditional relational databases. It is
important to carefully evaluate the specific needs of an application when choosing a database
management system.
NoSQL, originally referring to “non-SQL” or “non-relational”, is a database that provides a mechanism for the storage and retrieval of data. This data is modeled in means other than the tabular relations used in relational databases. Such databases came into existence in the late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the early twenty-first century. NoSQL databases are used in real-time web applications and big data, and their use is increasing over time.
● NoSQL systems are also sometimes called Not only SQL to emphasize the fact that they
may support SQL-like query languages. A NoSQL database includes simplicity of
design, simpler horizontal scaling to clusters of machines and finer control over
availability. The data structures used by NoSQL databases are different from those used
by default in relational databases which makes some operations faster in NoSQL. The
suitability of a given NoSQL database depends on the problem it should solve.
● NoSQL databases, also known as “not only SQL” databases, are a new type of database
management system that have gained popularity in recent years. Unlike traditional
relational databases, NoSQL databases are designed to handle large amounts of
unstructured or semi-structured data, and they can accommodate dynamic changes to
the data model. This makes NoSQL databases a good fit for modern web applications,
real- time analytics, and big data processing.
● Data structures used by NoSQL databases are sometimes also viewed as more flexible
than relational database tables. Many NoSQL stores compromise consistency in favor of
availability, speed and partition tolerance. Barriers to the greater adoption of NoSQL
stores include the use of low-level query languages, lack of standardized interfaces, and
huge previous investments in existing relational databases.
● Most NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability)
transactions but a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE,
Google Spanner (though technically a NewSQL database), Symas LMDB, and
OrientDB have made them central to their designs.
● Most NoSQL databases offer a concept of eventual consistency, in which database changes are propagated to all nodes eventually. As a result, queries might not return updated data immediately, or might read data that is not accurate, a problem known as stale reads. Some NoSQL systems may also exhibit lost writes and other forms of data loss.
Some NoSQL systems provide concepts such as write-ahead logging to avoid data loss.
● One simple example of a NoSQL database is a document database. In a document
database, data is stored in documents rather than tables. Each document can contain a
different set of fields, making it easy to accommodate changing data requirements.
● For example, consider a database that holds data regarding employees. In a relational database, this information might be stored in tables, with one table for employee information and another table for department information. In a document database, each employee would be stored as a separate document, with all of their information contained within it.
● NoSQL databases are a relatively new type of database management system that have
gained popularity in recent years due to their scalability and flexibility. They are designed
to handle large amounts of unstructured or semi-structured data and can handle dynamic
changes to the data model. This makes NoSQL databases a good fit for modern web
applications, real-time analytics, and big data processing.
Key features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing data structures without the need for migrations or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes
to a database cluster, making them well-suited for handling large amounts of data and
high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-
based data model, where data is stored in semi-structured format, such as JSON or
BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data
model, where data is stored as a collection of key-value pairs.
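The two data models above can be sketched in plain JavaScript; the employee fields and the key format used below are illustrative assumptions, not anything prescribed by MongoDB or Redis:

```javascript
// Document model (MongoDB-style): one self-describing document per entity,
// with nested fields and no fixed schema.
const employeeDoc = {
  _id: "10025AE336",
  name: { first: "Radhika", last: "Sharma" },
  contact: { phone: "9848022338" }
};

// Key-value model (Redis-style): the store only sees opaque values
// addressed by a key; any structure lives inside the serialized value.
const kvStore = new Map();
kvStore.set("employee:10025AE336", JSON.stringify(employeeDoc));

// A key-value store must fetch by key and deserialize before it can
// look inside the value; a document store can query nested fields directly.
const fetched = JSON.parse(kvStore.get("employee:10025AE336"));
console.log(fetched.name.first); // Radhika
```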
Advantages of NoSQL: There are many advantages of working with NoSQL databases such as
MongoDB and Cassandra. The main advantages are high scalability and high availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Sharding is the partitioning of data and placing it on multiple machines in such a way that the order of the data is preserved. Vertical scaling means adding more resources to the existing machine, whereas horizontal scaling means adding more machines to handle the data. Vertical scaling is not easy to implement, but horizontal scaling is. Examples of horizontally scalable databases are MongoDB, Cassandra, etc. NoSQL can handle huge amounts of data because of this scalability; as the data grows, NoSQL scales itself to handle that data efficiently.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data,
which means that they can accommodate dynamic changes to the data model. This makes
NoSQL databases a good fit for applications that need to handle changing data
requirements.
3. High availability: The auto-replication feature in NoSQL databases makes them highly available, because in case of any failure the data can be recovered from a replica holding the last consistent state.
4. Scalability: NoSQL databases are highly scalable, which means that they can handle
large amounts of data and traffic with ease. This makes them a good fit for applications
that need to handle large amounts of data or traffic.
5. Performance: NoSQL databases are designed to handle large amounts of data and traffic,
which means that they can offer improved performance compared to traditional relational
databases.
6. Cost-effectiveness: NoSQL databases are often more cost-effective than traditional
relational databases, as they are typically less complex and do not require
expensive hardware or software.
Disadvantages of NoSQL:
1. Lack of standardization: There are many different types of NoSQL databases, each with its own unique strengths and weaknesses. This lack of standardization can make it difficult to choose the right database for a specific application.
2. Lack of ACID compliance: Most NoSQL databases are not fully ACID-compliant, which means that they do not guarantee the consistency, integrity, and durability of data. This can be a drawback for applications that require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus, as they are mainly designed for storage and provide very little functionality beyond it. Relational databases are a better choice than NoSQL in the field of transaction management.
4. Open-source: NoSQL databases are open source, and there is no reliable standard for NoSQL yet. In other words, two database systems are likely to be incompatible.
5. Lack of support for complex queries : NoSQL databases are not designed to handle
complex queries, which means that they are not a good fit for applications that require
complex data analysis or reporting.
6. Lack of maturity : NoSQL databases are relatively new and lack the maturity of
traditional relational databases. This can make them less reliable and less secure
than traditional databases.
7. Management challenge: The purpose of big data tools is to make managing a large amount of data as simple as possible, but it is not so easy. Data management in NoSQL is much more complex than in a relational database. NoSQL, in particular, has a reputation for being challenging to install and even harder to manage on a daily basis.
8. GUI is not available: Flexible GUI tools for accessing NoSQL databases are not widely available in the market.
9. Backup: Backup is a great weak point for some NoSQL databases. MongoDB, for example, has historically lacked a built-in approach for backing up data in a consistent manner.
10. Large document size: Some database systems, such as MongoDB and CouchDB, store data in JSON format. This means that documents are quite large (costing network bandwidth and speed), and having descriptive key names actually hurts, since they increase the document size.
Types of NoSQL database: The main types of NoSQL databases, and example database systems that fall in each category, are:
1. Document databases – MongoDB, CouchDB
2. Key-value stores – Redis, Amazon DynamoDB
3. Column-oriented databases – Cassandra, HBase
4. Graph databases – Neo4j, Amazon Neptune
NoSQL is often a good choice when the relationships between the data you store are not that important.
In conclusion, NoSQL databases offer several benefits over traditional relational databases, such as
scalability, flexibility, and cost-effectiveness. However, they also have several drawbacks, such as a
lack of standardization, lack of ACID compliance, and lack of support for complex queries. When
choosing a database for a specific application, it is important to weigh the benefits and drawbacks
carefully to determine the best fit.
5.9 MONGODB
MongoDB is a popular document-oriented NoSQL database. Its features include:
● Rich queries
● Fast in-place updates
● Professional support by MongoDB
For example, assume we are storing the details of employees in three different documents, namely Personal_details, Contact, and Address. In the embedded (denormalized) data model, you can embed all three documents in a single one, as shown below −
{
   _id: <ObjectId101>,
   Emp_ID: "10025AE336",
   Personal_details: {
      First_Name: "Radhika",
      Last_Name: "Sharma",
      Date_Of_Birth: "1995-09-26"
   },
   Contact: {
      "e-mail": "radhika_sharma.123@gmail.com",
      phone: "9848022338"
   },
   Address: {
      city: "Hyderabad",
      Area: "Madapur",
      State: "Telangana"
   }
}
In the normalized data model, the same information can instead be split into separate documents that reference the parent employee document:
Employee:
{
_id: <ObjectId101>,
Emp_ID: "10025AE336"
}
Personal_details:
{
_id: <ObjectId102>,
empDocID: "ObjectId101",
First_Name: "Radhika",
Last_Name: "Sharma",
Date_Of_Birth: "1995-09-26"
}
Contact:
{
_id: <ObjectId103>,
empDocID: "ObjectId101",
"e-mail": "radhika_sharma.123@gmail.com",
phone: "9848022338"
}
Address:
{
_id: <ObjectId104>,
empDocID: "ObjectId101",
city: "Hyderabad",
Area: "Madapur",
State: "Telangana"
}
Example
Suppose a client needs a database design for his blog/website; let us see the differences between the RDBMS and MongoDB schema designs. The website has the following requirements.
In the RDBMS schema, the design for the above requirements will have a minimum of three tables.
In the MongoDB schema, the design will have one collection, post, with the following structure −
{
   _id: POST_ID,
   title: TITLE_OF_POST,
   description: POST_DESCRIPTION,
   by: POST_BY,
   url: URL_OF_POST,
   tags: [TAG1, TAG2, TAG3],
   likes: TOTAL_LIKES,
   comments: [
      {
         user: 'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      },
      {
         user: 'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      }
   ]
}
So while showing the data, in an RDBMS you need to join three tables, whereas in MongoDB the data will be shown from one collection only.
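A plain-JavaScript sketch of this point, using illustrative data: because the comments and tags are embedded in the post document itself, a single lookup returns everything that an RDBMS would assemble with joins:

```javascript
// One blog post in the MongoDB-style shape shown above (illustrative values).
const post = {
  _id: 1,
  title: "Intro to NoSQL",
  tags: ["mongodb", "nosql"],
  likes: 2,
  comments: [
    { user: "alice", message: "Nice post", dateCreated: "2024-01-01", like: 1 },
    { user: "bob", message: "Thanks for sharing", dateCreated: "2024-01-02", like: 0 }
  ]
};

// A single read of the post document yields its tags and comments directly;
// no join across post, tag, and comment tables is needed.
console.log(post.comments.length);          // 2
console.log(post.tags.includes("mongodb")); // true
```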
MongoDB supports many datatypes. Some of them are −
String − This is the most commonly used datatype to store data. Strings in MongoDB must be UTF-8 valid.
Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit depending
upon your server.
Boolean − This type is used to store a boolean (true/ false) value.
Double − This type is used to store floating point values.
Min/ Max keys − This type is used to compare a value against the lowest and highest BSON
elements.
Arrays − This type is used to store arrays or list or multiple values into one key.
Timestamp − This type is used to store a timestamp. It can be handy for recording when a document has been modified or added.
Object − This datatype is used for embedded documents.
Null − This type is used to store a Null value.
Symbol − This datatype is used identically to a string; however, it's generally reserved for
languages that use a specific symbol type.
Date − This datatype is used to store the current date or time in UNIX time format. You can specify your own date and time by creating an object of Date and passing the day, month, and year into it.
Object ID − This datatype is used to store the document’s ID.
Binary data − This datatype is used to store binary data.
Code − This datatype is used to store JavaScript code into the document.
Regular expression − This datatype is used to store regular expression.
We can use MongoDB for various things, such as building an application (including web and mobile), analyzing data, or administering a MongoDB database. In all these cases, we need to interact with the MongoDB server to perform operations such as entering new data into the application, updating data, deleting data, and reading data. MongoDB provides a set of basic but essential operations that help you interact with the MongoDB server easily; these operations are known as CRUD operations.
Create Operations –
The create or insert operations are used to insert or add new documents to the collection. If a collection does not exist, a new collection is created in the database. You can perform create operations using the following methods provided by MongoDB:

Method − Description
db.collection.insertOne() − Inserts a single document into the collection.
db.collection.insertMany() − Inserts multiple documents into the collection.
Example 1: In this example, we insert the details of a single student as a document into the student collection using the db.collection.insertOne() method.
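A minimal mongosh sketch of such an insert; it assumes a running MongoDB instance, and the student field names (name, age, course) are illustrative:

```javascript
db.student.insertOne({
  name: "Sumit",
  age: 20,
  course: "BSc"
})
// On success the server acknowledges the insert and returns the
// generated _id of the new document.
```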
Example 2: In this example, we insert the details of multiple students as documents into the student collection using the db.collection.insertMany() method.
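A corresponding mongosh sketch, under the same assumptions (a running server, illustrative fields):

```javascript
db.student.insertMany([
  { name: "Sumit",   age: 20, course: "BSc"  },
  { name: "Radhika", age: 21, course: "BCom" },
  { name: "Vishal",  age: 22, course: "BCA"  }
])
// The server acknowledges the insert and returns one _id per document.
```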
Read Operations –
The read operations are used to retrieve documents from the collection; in other words, read operations query a collection for documents. You can perform read operations using the following method provided by MongoDB:

Method − Description
db.collection.find() − Retrieves the documents in the collection that match the query condition.
Example: In this example, we retrieve the details of students from the student collection using the db.collection.find() method.
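A mongosh sketch of such queries, assuming the illustrative student documents inserted earlier:

```javascript
db.student.find()              // retrieve every document in the collection
db.student.find({ age: 20 })   // retrieve only students whose age is 20
db.student.find().pretty()     // same as find(), with formatted output
```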
Update Operations –
The update operations are used to update or modify existing documents in the collection. You can perform update operations using the following methods provided by MongoDB:

Method − Description
db.collection.updateOne() − Updates a single document that matches the filter.
db.collection.updateMany() − Updates all documents that match the filter.
db.collection.replaceOne() − Replaces a single document that matches the filter.
Example 1: In this example, we update the age of Sumit in the student collection using the db.collection.updateOne() method.
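A mongosh sketch of this update (running server assumed; the new age value is illustrative):

```javascript
db.student.updateOne(
  { name: "Sumit" },      // filter: which document to update
  { $set: { age: 21 } }   // update operator: set the new age
)
```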
Example 2: In this example, we update the year of course in all the documents in the student collection using the db.collection.updateMany() method.
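A mongosh sketch, assuming the documents carry an illustrative year field:

```javascript
db.student.updateMany(
  {},                      // an empty filter matches every document
  { $set: { year: 3 } }    // set the new year of course in all of them
)
```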
Delete Operations –
The delete operations are used to delete or remove documents from a collection. You can perform delete operations using the following methods provided by MongoDB:

Method − Description
db.collection.deleteOne() − Deletes a single document that matches the filter.
db.collection.deleteMany() − Deletes all documents that match the filter.
Example 1: In this example, we delete a document from the student collection using the db.collection.deleteOne() method.
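A mongosh sketch (running server assumed; the filter value is illustrative):

```javascript
db.student.deleteOne({ name: "Sumit" })  // removes the first matching document
```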
Example 2: In this example, we delete all the documents from the student collection using the db.collection.deleteMany() method.
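A mongosh sketch of this bulk delete (running server assumed):

```javascript
db.student.deleteMany({})  // an empty filter matches, and deletes, every document
```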