Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 64

ASSOSA UNIVERSITY

College of Computing and Informatics


DEPARTMENT OF COMPUTER SCIENCE

Fundamental Database System

Chapter 5: Record Storage and Primary File Organization

DBMS- Fundamental Database System 1


Introduction
 Databases typically stored on magnetic disks
 Accessed using physical database file structures
 Storage hierarchy

 Primary storage:
 This category includes storage media that can be operated on directly by
the computer central processing unit (CPU), such as the computer main
memory and smaller but faster cache memories.
 Primary storage usually provides fast access to data but is of limited
storage capacity.
 CPU main memory, cache memory

06/22/2024 2
Contend…
 Secondary storage:
 Is called Auxiliary Memory

 This category includes magnetic disks, optical disks, and tapes.


 These devices usually have a larger capacity, cost less, and
provide slower access to data than do primary storage devices.
 Data in secondary storage cannot be processed directly by the
CPU; it must first be copied into primary storage.
 Magnetic disks, flash memory, solid-state drives

06/22/2024 3
Contend…
 Tertiary storage:
 It is used to store huge volumes of data.
 Since such storage devices are external to the computer
system, they are the slowest in speed.
 These storage devices are mostly used to take the back up of
an entire system.
 Optical disks and magnetic tapes are widely used as tertiary
storage.
 Removable media
06/22/2024 4
Memory Hierarchies and Storage Devices
 Cache memory: Is typically used by the CPU to speed up execution of
program instructions using techniques such as prefetching and pipelining.
 Static RAM
 DRAM: which provides the main work area for the CPU for keeping
program instructions and data.
 Mass storage
 Magnetic disks
 CD-ROM, DVD, tape drives
 Flash memory
 Nonvolatile

06/22/2024 5
Storage Hierarchy
 Storage media are organized
based on data accessing speed,
cost per unit of data to buy the
medium, and by medium's
reliability. cost per
Access
bit
 The higher levels are expensive time
but fast. On moving down, the
cost per bit is decreasing, and the
access time is increasing.
 Also, the storage media from the
main memory to up represents the
volatile nature, and below the
main memory, all are non-volatile
devices.
6 06/22/2024
Storage Organization of Databases
 Persistent data: Databases typically store large amounts of data
that must persist over long periods of time, and hence the data is
often referred to as persistent data.
 Most databases
 Transient data: The part of Persistent data are accessed and
processed repeatedly during the storage period.
 Exists only during program execution.
 File organization
 Determines how records are physically placed on the disk.
 Determines how records are accessed.

06/22/2024 7
File organization
 A database management system supports several file organization
techniques.
 The most important task of DBA is to choose a best organization for each
file, based on its use.
 The organization of records in a file is influenced by number of factors
that must be taken into consideration while choosing a particular
technique.
 These factors are
A. Fast retrival, updation and transfer of records.
B. Efficient use of disk space and high throughput
C. Type of use and efficient manipulation
D. Security from unauthorized access and reduction in cost.
E. Scalability and protection from failure

06/22/2024 8
Secondary Storage Devices
 Hardware Description of Disk Devices
 The most basic unit of data on the disk is a single bit of information.
 By magnetizing an area on disk in certain ways, one can make it
represent a bit value of either 0 (zero) or 1 (one).
 To code information, bits are grouped into bytes (or characters).
 Byte sizes are typically 4 to 8 bits, depending on the computer and
the device.
 The capacity of a disk is the number of bytes it can store, which is
usually very large.

06/22/2024 9
Secondary Storage Devices
 Disk capacity measures storage size
 Disks may be single or double-sided
 Concentric circles called tracks
 Tracks divided into blocks or sectors
 Disk packs
 Cylinder

06/22/2024 10
Single-Sided Disk and Disk Pack

A. A single-sided disk with read/write hardware


B. A disk pack with read/write hardware
11
06/22/2024
Sectors on a Disk

Different sector organizations on disk


A. Sectors subtending a fixed angle
B. Sectors maintaining a uniform recording density
06/22/2024 12
Magnetic Tape Storage Devices
 Magnetic tapes are sequential access devices; to access the nth block
on tape, we must first scan over the preceding n –1 block.
 Data is stored on reels of high-capacity magnetic tape, somewhat
similar to audio or videotapes.
 A tape drive is required to read the data from or to write the data to a
tape reel.
 A read/write head is used to read or write data on tape.
 Data records on tape are also stored in blocks-although the blocks
may be substantially larger than those for disks, and inter block gaps
are also quite large.

06/22/2024 13
Magnetic Tape Storage Devices (cont’d.…)
 The main characteristic of a tape is its requirement that we
access the data blocks in sequential order.
 To get to a block in the middle of a reel of tape, the tape is
mounted and then scanned until the required block gets under
the read/write head.
 For this reason, tape access can be slow and tapes are not used
to store online data, except for some specialized applications.
 However, tapes serve a very important function-that of backing
up the database.

06/22/2024 14
Storage Types and Characteristics

Table :Types of Storage with Capacity, Access Time, Max Bandwidth (Transfer
Speed), and Commodity Cost.

06/22/2024 15
Making Data Access More Efficient on Disk
 Techniques for efficient data Access.
 Buffering of data
 Proper organization of data on disk
 Reading data ahead of request.
 Proper scheduling of I/O requests.
 Use of log disks to temporarily hold writes .
 Use of SSDs or flash memory for recovery purposes.

06/22/2024 16
06/22/2024 17
Placing File Records on Disk
 A file is a collection of related sequence of records.
 A collection of field names and their corresponding data types constitutes a record.
 A data type, associated with each field, specifies the types of values a field can take.
 All records in a file are of the same record type.
 Record: collection of related data values or items
 Values correspond to record field
 Data types
 Numeric
 String
 Boolean
 Date/time
 Binary large objects (BLOBs)
 Unstructured objects

06/22/2024 18
Placing File Records on Disk (cont’d.…)
 Reasons for variable-length records
 One or more fields have variable length
 One or more fields are repeating
 One or more fields are optional
 File contains records of different types

06/22/2024 19
Record Blocking and Spanned Versus Un spanned Records
 File records allocated to disk blocks
 Spanned records
 Larger than a single block.
 Pointer at end of first block points to block containing
remainder of record.
 Un spanned
 Records not allowed to cross block boundaries.

06/22/2024 20
Record Blocking and Spanned Versus Un spanned
Records(cont’d.…)

Types of record organization (a) Un spanned (b)


Spanned

06/22/2024 21
Record Blocking and Spanned Versus Un spanned
Records (cont’d.…)
 Allocating file blocks on disk
 Contiguous allocation: The file blocks are allocated to
consecutive disk blocks.
 Linked allocation: Each file block contains a pointer to the next
file block.
 Indexed allocation: where one or more index blocks contain
pointers to the actual file blocks. It is also common to use
combinations of these techniques.
 File header (file descriptor)
 Contains file information needed by system programs
 Disk addresses
 Format descriptions

06/22/2024 22
Operations on Files
 Retrieval operations
 No change to file data
 Update operations
 File change by insertion, deletion, or modification
 Records selected based on selection condition(filtering condition)

06/22/2024 23
Operations on Files (cont’d.…)
 Examples of operations for accessing file records
 Open
 Find
 Read
 FindNext
 Delete
 Insert
 Close
 Scan

06/22/2024 24
Files of Unordered Records(Heap Files)
 In this simplest and most basic type of organization, records
are placed in the file in the order in which they are inserted, so
new records are inserted at the end of the file.
 Such an organization is called a heap or pile file.
 Inserting a new record is very efficient: the last disk block of
the file is copied into a buffer; the new record is added; and the
block is then rewritten back to disk.

06/22/2024 25
Files of Unordered Records(Heap Files)
 Deletion techniques
 To delete a record, a program must first find its block, copy
the block into a buffer, then delete the record from the buffer,
and finally rewrite the block back to the disk.
 This leaves unused space in the disk block.
 During reorganization, the file blocks are accessed
consecutively, and records are packed by removing deleted
records.
 To read all records in order of the values of some field, we
create a sorted copy of the file.
 Searching for a record requires linear search.

06/22/2024 26
Files of Ordered Records (Sorted Files)
 We can physically order the records of a file on disk based on the
values of one of their fields-called the ordering field.
 This leads to an ordered or sequential file.
 If the ordering field is also a key field of the file-a field
guaranteed to have a unique value in each record-then the field is
called the ordering key for the file.

06/22/2024 27
Files of Ordered Records (Sorted Files)
 Advantages over unordered records
 Reading records in order of ordering key value is extremely
efficient because no sorting is required.
 Finding the next record from the current one in order of the
ordering key usually requires no additional block accesses.
 Using a search condition based on the value of an ordering key
field results in faster access when the binary search technique is
used, which constitutes an improvement over linear searches.

06/22/2024 28
Access Times for Various File Organizations

 Average access times for a file of b blocks under basic file organizations

06/22/2024 29
Hashing Technique
 Another type of primary file organization is based on hashing, which
provides very fast access to records on certain search conditions. This
organization is usually called a hash file.
 The search condition must be an equality condition on a single field,
called the hash field of the file.
 In most cases, the hash field is also a key field of the file, in which case
it is called the hash key.
 The idea behind hashing is to provide a function h, called a hash
function or randomizing function, which is applied to the hash field
value of a record and yields the address of the disk block in which the
record is stored.
 Hashing also internal search structure
 Used when group of records accessed exclusively by one field
value.
06/22/2024 30
Hashing Techniques (cont’d.…)
 For internal files, hashing is typically implemented as a hash
table through the use of an array of records.
 Suppose that the array index range is from 0 to M – 1 then we
have M slots whose addresses correspond to the array indexes.
 We choose a hash function that transforms the hash field value
into an integer between 0 and M - 1.
 One common hash function is the h(K) = K mod M function,
which returns the remainder of an integer hash field value K
after division by M; this value is then used for the record
address.

06/22/2024 31
Hashing Techniques (cont’d.…)
 Other hashing functions can be used.
 One technique, called folding, involves applying an arithmetic
function such as addition or a logical function such as exclusive
or to different portions of the hash field value to calculate the
hash address.
 Another technique involves picking some digits of the hash field
value-for example, the third, fifth, and eighth digits-to form the
hash address.

 The problem with most hashing functions is that they do not


guarantee that distinct values will hash to distinct addresses.

06/22/2024 32
Hashing Techniques (cont’d.…)
 A collision occurs when the hash field value of a record that is
being inserted hashes to an address that already contains a
different record.
 In this situation, we must insert the new record in some other
position, since its hash address is occupied.
 The process of finding another position is called collision
resolution.
 There are numerous methods for collision resolution, including
the following:

06/22/2024 33
Hashing Techniques (cont’d.…)
Open addressing:
 Proceeding from the occupied position specified by the hash
address, the program checks the subsequent positions in order
until an unused (empty) position is found.
Chaining:
 For this method, various overflow locations are kept, usually by
extending the array with a number of overflow positions.
 In addition, a pointer field is added to each record location.
 A collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash
address location to the address of that overflow location.

06/22/2024 34
Hashing Techniques (cont’d.…)
 Multiple hashing: The program applies a second hash function
if the first results in a collision. If another collision results, the
program uses open addressing or applies a third hash function
and then uses open addressing if necessary.

06/22/2024 35
Hashing Techniques (cont’d.)
 Hashing techniques that allow dynamic file expansion
 Extendible hashing
 File performance does not degrade as file grows
 Dynamic hashing
 Maintains tree-structured directory.
 Linear hashing
 Allows hash file to expand and shrink buckets without
needing a directory

06/22/2024 37
Index Structure for Files
 Indexing is a data structure technique to efficiently
retrieve records from the database files based on some
attributes on which the indexing has been done.
 Indexing in database systems is like what we see in books.

06/22/2024 38
Indexes as Access Paths
 A single-level index is an auxiliary file that makes it more
efficient to search for a record in the data file.
 The index is usually specified on one field of the file (although
it could be specified on several fields)
 One form of an index is a file of entries <field value, pointer
to record>, which is ordered by field value
 The index is called an access path on the field.

Slide 14- 39
Indexes as Access Paths (contd.)
 The index file usually occupies considerably less disk blocks
than the data file because its entries are much smaller
 A binary search on the index yields a pointer to the file record
 Indexes can also be characterized as dense or sparse
 A dense index has an index entry for every search key value
(and hence every record) in the data file.
 A sparse (or nondense) index, on the other hand, has index
entries for only some of the search values

Slide 14- 40
Indexes as Access Paths (contd.)

FIGURE. Dense and sparse index.


06/22/2024 41
Indexes as Access Paths (contd.)
 Example: Given the following data file EMPLOYEE(NAME, SSN, ADDRESS,
JOB, SAL, ... )
 Suppose that:
 record size R=150 bytes block size B=512 bytes r=30000 records
 Then, we get:
 blocking factor Bfr= B div R= 512 div 150= 3 records/block
 number of file blocks b= (r/Bfr)= (30000/3)= 10000 blocks
 For an index on the SSN field, assume the field size V SSN=9 bytes, assume the
record pointer size PR=7 bytes. Then:
 index entry size R =(V + P )=(9+7)=16 bytes.
I SSN R
 index blocking factor Bfr = B div R = 512 div 16= 32 entries/block.
I I
 number of index blocks b= (r/ Bfr )= (30000/32)= 938 blocks.
I
 binary search needs log bI= log 938= 10 block accesses.
2 2
 This is compared to an average linear search cost of:
 (b/2)= 30000/2= 15000 block accesses.
 If the file records are ordered, the binary search cost would be:
 log b= log 30000= 15 block accesses.
2 2 Slide 14- 42
Types of Single-Level Indexes
 Primary Index
 Defined on an ordered data file
 The data file is ordered on a key field
 Includes one index entry for each block in the data file; the
index entry has the key field value for the first record in the
block, which is called the block anchor
 A similar scheme can use the last record in a block.
 A primary index is a non dense (sparse) index, since it includes
an entry for each disk block of the data file and the keys of its
anchor record rather than for every search value.

Slide 14- 43
Primary index on the ordering key field

Slide 14- 44
Types of Single-Level Indexes
 Clustering Index
 Defined on an ordered data file
 The data file is ordered on a non-key field unlike primary index,
which requires that the ordering field of the data file have a distinct
value for each record.
 Includes one index entry for each distinct value of the field; the
index entry points to the first data block that contains records with
that field value.
 It is another example of non dense index where Insertion and
Deletion is relatively straightforward with a clustering index.

Slide 14- 45
A Clustering Index Example

 FIGURE
A clustering index on
the DEPTNUMBER
ordering non-key field
of an EMPLOYEE
file.

Slide 14- 46
Another Clustering Index Example

Slide 14- 47
Types of Single-Level Indexes
 Secondary Index
 A secondary index provides a secondary means of accessing a file
for which some primary access already exists.
 The secondary index may be on a field which is a candidate key and
has a unique value in every record, or a non-key with duplicate
values.
 The index is an ordered file with two fields.
 The first field is of the same data type as some non-ordering field of
the data file that is an indexing field.
 The second field is either a block pointer or a record pointer.
 There can be many secondary indexes (and hence, indexing fields) for
the same file.
 Includes one entry for each record in the data file; hence, it is a
dense index
Slide 14- 48
Example of a Dense Secondary Index

Slide 14- 49
An Example of a Secondary Index

Slide 14- 50
Properties of Index Types

Slide 14- 51
Multi-Level Indexes
 Because a single-level index is an ordered file, we can create a
primary index to the index itself;
 In this case, the original index file is called the first-level index and
the index to the index is called the second-level index.
 We can repeat the process, creating a third, fourth, ..., top level
until all entries of the top-level fit in one disk block.
 A multi-level index can be created for any type of first-level index
(primary, secondary, clustering) as long as the first-level index
consists of more than one disk block.
Slide 14- 52
A Two-level Primary Index

Slide 14- 53
Multi-Level Indexes
 Such a multi-level index is a form of search tree.
 However, insertion and deletion of new index entries is a
severe problem because every level of the index is an
ordered file.

Slide 14- 54
A Node in a Search Tree with Pointers to Subtrees below It
 FIGURE

Slide 14- 55
A search tree of order p = 3.

Slide 14- 56
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
 Most multi-level indexes use B-tree or B+-tree data structures
because of the insertion and deletion problem.
 This leaves space in each tree node (disk block) to allow for new
index entries.
 These data structures are variations of search trees that allow
efficient insertion and deletion of new search values.
 In B-Tree and B+-Tree data structures, each node corresponds to a
disk block.
 Each node is kept between half-full and completely full.
Slide 14- 57
Dynamic Multilevel Indexes Using B-Trees and
B+-Trees (contd.)
 An insertion into a node that is not full is quite efficient
 If a node is full the insertion causes a split into two nodes
 Splitting may propagate to other tree levels.
 A deletion is quite efficient if a node does not become less than
half full
 If a deletion causes a node to become less than half full, it must
be merged with neighboring nodes

Slide 14- 58
Difference between B-tree and B+-tree
 In a B-tree, pointers to data records exist at all levels of
the tree
 In a B+-tree, all pointers to data records exists at the leaf-
level nodes
 A B+-tree can have less levels (or higher capacity of
search values) than the corresponding B-tree

Slide 14- 59
B-tree Structures

Slide 14- 60
The Nodes of a B+-tree
 FIGURE : The nodes of a B+-tree
 (a) Internal node of a B+-tree with q –1 search values.
 (b) Leaf node of a B+-tree with q – 1 search values and q – 1
data pointers.

Slide 14- 61
Reading Assignment
 Tree operation
 Insertion, deletion and update in B+ tree and B-Tree

06/22/2024 62
An Example of an Insertion in a B+-tree

Slide 14- 63
An Example of a Deletion in a B+-tree

Slide 14- 64
.

Thank You!

06/22/2024 65

You might also like