Professional Documents
Culture Documents
Chapter 5-Record Storage and Primary File Organization
Chapter 5-Record Storage and Primary File Organization
Primary storage:
This category includes storage media that can be operated on directly by
the computer central processing unit (CPU), such as the computer main
memory and smaller but faster cache memories.
Primary storage usually provides fast access to data but is of limited
storage capacity.
CPU main memory, cache memory
06/22/2024 2
Contend…
Secondary storage:
Is called Auxiliary Memory
06/22/2024 3
Contend…
Tertiary storage:
It is used to store huge volumes of data.
Since such storage devices are external to the computer
system, they are the slowest in speed.
These storage devices are mostly used to take the back up of
an entire system.
Optical disks and magnetic tapes are widely used as tertiary
storage.
Removable media
06/22/2024 4
Memory Hierarchies and Storage Devices
Cache memory: Is typically used by the CPU to speed up execution of
program instructions using techniques such as prefetching and pipelining.
Static RAM
DRAM: which provides the main work area for the CPU for keeping
program instructions and data.
Mass storage
Magnetic disks
CD-ROM, DVD, tape drives
Flash memory
Nonvolatile
06/22/2024 5
Storage Hierarchy
Storage media are organized
based on data accessing speed,
cost per unit of data to buy the
medium, and by medium's
reliability. cost per
Access
bit
The higher levels are expensive time
but fast. On moving down, the
cost per bit is decreasing, and the
access time is increasing.
Also, the storage media from the
main memory to up represents the
volatile nature, and below the
main memory, all are non-volatile
devices.
6 06/22/2024
Storage Organization of Databases
Persistent data: Databases typically store large amounts of data
that must persist over long periods of time, and hence the data is
often referred to as persistent data.
Most databases
Transient data: The part of Persistent data are accessed and
processed repeatedly during the storage period.
Exists only during program execution.
File organization
Determines how records are physically placed on the disk.
Determines how records are accessed.
06/22/2024 7
File organization
A database management system supports several file organization
techniques.
The most important task of DBA is to choose a best organization for each
file, based on its use.
The organization of records in a file is influenced by number of factors
that must be taken into consideration while choosing a particular
technique.
These factors are
A. Fast retrival, updation and transfer of records.
B. Efficient use of disk space and high throughput
C. Type of use and efficient manipulation
D. Security from unauthorized access and reduction in cost.
E. Scalability and protection from failure
06/22/2024 8
Secondary Storage Devices
Hardware Description of Disk Devices
The most basic unit of data on the disk is a single bit of information.
By magnetizing an area on disk in certain ways, one can make it
represent a bit value of either 0 (zero) or 1 (one).
To code information, bits are grouped into bytes (or characters).
Byte sizes are typically 4 to 8 bits, depending on the computer and
the device.
The capacity of a disk is the number of bytes it can store, which is
usually very large.
06/22/2024 9
Secondary Storage Devices
Disk capacity measures storage size
Disks may be single or double-sided
Concentric circles called tracks
Tracks divided into blocks or sectors
Disk packs
Cylinder
06/22/2024 10
Single-Sided Disk and Disk Pack
06/22/2024 13
Magnetic Tape Storage Devices (cont’d.…)
The main characteristic of a tape is its requirement that we
access the data blocks in sequential order.
To get to a block in the middle of a reel of tape, the tape is
mounted and then scanned until the required block gets under
the read/write head.
For this reason, tape access can be slow and tapes are not used
to store online data, except for some specialized applications.
However, tapes serve a very important function-that of backing
up the database.
06/22/2024 14
Storage Types and Characteristics
Table :Types of Storage with Capacity, Access Time, Max Bandwidth (Transfer
Speed), and Commodity Cost.
06/22/2024 15
Making Data Access More Efficient on Disk
Techniques for efficient data Access.
Buffering of data
Proper organization of data on disk
Reading data ahead of request.
Proper scheduling of I/O requests.
Use of log disks to temporarily hold writes .
Use of SSDs or flash memory for recovery purposes.
06/22/2024 16
06/22/2024 17
Placing File Records on Disk
A file is a collection of related sequence of records.
A collection of field names and their corresponding data types constitutes a record.
A data type, associated with each field, specifies the types of values a field can take.
All records in a file are of the same record type.
Record: collection of related data values or items
Values correspond to record field
Data types
Numeric
String
Boolean
Date/time
Binary large objects (BLOBs)
Unstructured objects
06/22/2024 18
Placing File Records on Disk (cont’d.…)
Reasons for variable-length records
One or more fields have variable length
One or more fields are repeating
One or more fields are optional
File contains records of different types
06/22/2024 19
Record Blocking and Spanned Versus Un spanned Records
File records allocated to disk blocks
Spanned records
Larger than a single block.
Pointer at end of first block points to block containing
remainder of record.
Un spanned
Records not allowed to cross block boundaries.
06/22/2024 20
Record Blocking and Spanned Versus Un spanned
Records(cont’d.…)
06/22/2024 21
Record Blocking and Spanned Versus Un spanned
Records (cont’d.…)
Allocating file blocks on disk
Contiguous allocation: The file blocks are allocated to
consecutive disk blocks.
Linked allocation: Each file block contains a pointer to the next
file block.
Indexed allocation: where one or more index blocks contain
pointers to the actual file blocks. It is also common to use
combinations of these techniques.
File header (file descriptor)
Contains file information needed by system programs
Disk addresses
Format descriptions
06/22/2024 22
Operations on Files
Retrieval operations
No change to file data
Update operations
File change by insertion, deletion, or modification
Records selected based on selection condition(filtering condition)
06/22/2024 23
Operations on Files (cont’d.…)
Examples of operations for accessing file records
Open
Find
Read
FindNext
Delete
Insert
Close
Scan
06/22/2024 24
Files of Unordered Records(Heap Files)
In this simplest and most basic type of organization, records
are placed in the file in the order in which they are inserted, so
new records are inserted at the end of the file.
Such an organization is called a heap or pile file.
Inserting a new record is very efficient: the last disk block of
the file is copied into a buffer; the new record is added; and the
block is then rewritten back to disk.
06/22/2024 25
Files of Unordered Records(Heap Files)
Deletion techniques
To delete a record, a program must first find its block, copy
the block into a buffer, then delete the record from the buffer,
and finally rewrite the block back to the disk.
This leaves unused space in the disk block.
During reorganization, the file blocks are accessed
consecutively, and records are packed by removing deleted
records.
To read all records in order of the values of some field, we
create a sorted copy of the file.
Searching for a record requires linear search.
06/22/2024 26
Files of Ordered Records (Sorted Files)
We can physically order the records of a file on disk based on the
values of one of their fields-called the ordering field.
This leads to an ordered or sequential file.
If the ordering field is also a key field of the file-a field
guaranteed to have a unique value in each record-then the field is
called the ordering key for the file.
06/22/2024 27
Files of Ordered Records (Sorted Files)
Advantages over unordered records
Reading records in order of ordering key value is extremely
efficient because no sorting is required.
Finding the next record from the current one in order of the
ordering key usually requires no additional block accesses.
Using a search condition based on the value of an ordering key
field results in faster access when the binary search technique is
used, which constitutes an improvement over linear searches.
06/22/2024 28
Access Times for Various File Organizations
Average access times for a file of b blocks under basic file organizations
06/22/2024 29
Hashing Technique
Another type of primary file organization is based on hashing, which
provides very fast access to records on certain search conditions. This
organization is usually called a hash file.
The search condition must be an equality condition on a single field,
called the hash field of the file.
In most cases, the hash field is also a key field of the file, in which case
it is called the hash key.
The idea behind hashing is to provide a function h, called a hash
function or randomizing function, which is applied to the hash field
value of a record and yields the address of the disk block in which the
record is stored.
Hashing also internal search structure
Used when group of records accessed exclusively by one field
value.
06/22/2024 30
Hashing Techniques (cont’d.…)
For internal files, hashing is typically implemented as a hash
table through the use of an array of records.
Suppose that the array index range is from 0 to M – 1 then we
have M slots whose addresses correspond to the array indexes.
We choose a hash function that transforms the hash field value
into an integer between 0 and M - 1.
One common hash function is the h(K) = K mod M function,
which returns the remainder of an integer hash field value K
after division by M; this value is then used for the record
address.
06/22/2024 31
Hashing Techniques (cont’d.…)
Other hashing functions can be used.
One technique, called folding, involves applying an arithmetic
function such as addition or a logical function such as exclusive
or to different portions of the hash field value to calculate the
hash address.
Another technique involves picking some digits of the hash field
value-for example, the third, fifth, and eighth digits-to form the
hash address.
06/22/2024 32
Hashing Techniques (cont’d.…)
A collision occurs when the hash field value of a record that is
being inserted hashes to an address that already contains a
different record.
In this situation, we must insert the new record in some other
position, since its hash address is occupied.
The process of finding another position is called collision
resolution.
There are numerous methods for collision resolution, including
the following:
06/22/2024 33
Hashing Techniques (cont’d.…)
Open addressing:
Proceeding from the occupied position specified by the hash
address, the program checks the subsequent positions in order
until an unused (empty) position is found.
Chaining:
For this method, various overflow locations are kept, usually by
extending the array with a number of overflow positions.
In addition, a pointer field is added to each record location.
A collision is resolved by placing the new record in an unused
overflow location and setting the pointer of the occupied hash
address location to the address of that overflow location.
06/22/2024 34
Hashing Techniques (cont’d.…)
Multiple hashing: The program applies a second hash function
if the first results in a collision. If another collision results, the
program uses open addressing or applies a third hash function
and then uses open addressing if necessary.
06/22/2024 35
Hashing Techniques (cont’d.)
Hashing techniques that allow dynamic file expansion
Extendible hashing
File performance does not degrade as file grows
Dynamic hashing
Maintains tree-structured directory.
Linear hashing
Allows hash file to expand and shrink buckets without
needing a directory
06/22/2024 37
Index Structure for Files
Indexing is a data structure technique to efficiently
retrieve records from the database files based on some
attributes on which the indexing has been done.
Indexing in database systems is like what we see in books.
06/22/2024 38
Indexes as Access Paths
A single-level index is an auxiliary file that makes it more
efficient to search for a record in the data file.
The index is usually specified on one field of the file (although
it could be specified on several fields)
One form of an index is a file of entries <field value, pointer
to record>, which is ordered by field value
The index is called an access path on the field.
Slide 14- 39
Indexes as Access Paths (contd.)
The index file usually occupies considerably less disk blocks
than the data file because its entries are much smaller
A binary search on the index yields a pointer to the file record
Indexes can also be characterized as dense or sparse
A dense index has an index entry for every search key value
(and hence every record) in the data file.
A sparse (or nondense) index, on the other hand, has index
entries for only some of the search values
Slide 14- 40
Indexes as Access Paths (contd.)
Slide 14- 43
Primary index on the ordering key field
Slide 14- 44
Types of Single-Level Indexes
Clustering Index
Defined on an ordered data file
The data file is ordered on a non-key field unlike primary index,
which requires that the ordering field of the data file have a distinct
value for each record.
Includes one index entry for each distinct value of the field; the
index entry points to the first data block that contains records with
that field value.
It is another example of non dense index where Insertion and
Deletion is relatively straightforward with a clustering index.
Slide 14- 45
A Clustering Index Example
FIGURE
A clustering index on
the DEPTNUMBER
ordering non-key field
of an EMPLOYEE
file.
Slide 14- 46
Another Clustering Index Example
Slide 14- 47
Types of Single-Level Indexes
Secondary Index
A secondary index provides a secondary means of accessing a file
for which some primary access already exists.
The secondary index may be on a field which is a candidate key and
has a unique value in every record, or a non-key with duplicate
values.
The index is an ordered file with two fields.
The first field is of the same data type as some non-ordering field of
the data file that is an indexing field.
The second field is either a block pointer or a record pointer.
There can be many secondary indexes (and hence, indexing fields) for
the same file.
Includes one entry for each record in the data file; hence, it is a
dense index
Slide 14- 48
Example of a Dense Secondary Index
Slide 14- 49
An Example of a Secondary Index
Slide 14- 50
Properties of Index Types
Slide 14- 51
Multi-Level Indexes
Because a single-level index is an ordered file, we can create a
primary index to the index itself;
In this case, the original index file is called the first-level index and
the index to the index is called the second-level index.
We can repeat the process, creating a third, fourth, ..., top level
until all entries of the top-level fit in one disk block.
A multi-level index can be created for any type of first-level index
(primary, secondary, clustering) as long as the first-level index
consists of more than one disk block.
Slide 14- 52
A Two-level Primary Index
Slide 14- 53
Multi-Level Indexes
Such a multi-level index is a form of search tree.
However, insertion and deletion of new index entries is a
severe problem because every level of the index is an
ordered file.
Slide 14- 54
A Node in a Search Tree with Pointers to Subtrees below It
FIGURE
Slide 14- 55
A search tree of order p = 3.
Slide 14- 56
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
Most multi-level indexes use B-tree or B+-tree data structures
because of the insertion and deletion problem.
This leaves space in each tree node (disk block) to allow for new
index entries.
These data structures are variations of search trees that allow
efficient insertion and deletion of new search values.
In B-Tree and B+-Tree data structures, each node corresponds to a
disk block.
Each node is kept between half-full and completely full.
Slide 14- 57
Dynamic Multilevel Indexes Using B-Trees and
B+-Trees (contd.)
An insertion into a node that is not full is quite efficient
If a node is full the insertion causes a split into two nodes
Splitting may propagate to other tree levels.
A deletion is quite efficient if a node does not become less than
half full
If a deletion causes a node to become less than half full, it must
be merged with neighboring nodes
Slide 14- 58
Difference between B-tree and B+-tree
In a B-tree, pointers to data records exist at all levels of
the tree
In a B+-tree, all pointers to data records exists at the leaf-
level nodes
A B+-tree can have less levels (or higher capacity of
search values) than the corresponding B-tree
Slide 14- 59
B-tree Structures
Slide 14- 60
The Nodes of a B+-tree
FIGURE : The nodes of a B+-tree
(a) Internal node of a B+-tree with q –1 search values.
(b) Leaf node of a B+-tree with q – 1 search values and q – 1
data pointers.
Slide 14- 61
Reading Assignment
Tree operation
Insertion, deletion and update in B+ tree and B-Tree
06/22/2024 62
An Example of an Insertion in a B+-tree
Slide 14- 63
An Example of a Deletion in a B+-tree
Slide 14- 64
.
Thank You!
06/22/2024 65