
File Structures Module 3

Module 3

Cosequential operations involve the coordinated processing of two or more sequential lists to
produce a single output list. Sometimes the processing results in a merging, or union, of the
items in the lists; at other times the goal is a matching, or intersection, of the items in the lists.

AN OBJECT-ORIENTED MODEL FOR IMPLEMENTING COSEQUENTIAL PROCESSES
We present a single, simple model that can be the basis for the construction of any kind of
cosequential process.
Matching Names in Two Lists
Suppose we want to output the names common to the two lists shown in Fig. 8.1. This operation
is usually called a match operation, or an intersection. We assume, for the moment, that we will
not allow duplicate names within a list and that the lists are sorted in ascending order.

At each step in the processing of the two lists, we can assume that we have two items to
compare: a current item from List 1 and a current item from List 2. Let's call these two current
items Item(1) and Item(2). We can compare the two items to determine whether Item(1) is
less than, equal to, or greater than Item(2):

 If Item(1) is less than Item(2), we get the next item from List 1;
 If Item(1) is greater than Item(2), we get the next item from List 2; and
 If the items are the same, we output the item and get the next items from both lists.


It turns out that this can be handled very cleanly with a single loop containing one three-way
conditional statement, as illustrated in the algorithm of Fig. 8.2. (Also refer to Program 7.)
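For concreteness, here is a minimal sketch of that three-way-test loop in C++, with the two lists held as sorted in-memory vectors. The Get/Next methods of Fig. 8.2 are replaced by simple index cursors, so the names and structure here are illustrative rather than the book's exact interface.

#include <iostream>
#include <string>
#include <vector>

// Sketch of the match (intersection) loop: one loop, one three-way test.
void Match(const std::vector<std::string>& list1,
           const std::vector<std::string>& list2)
{
    size_t i = 0, j = 0;                      // current items in List 1 and List 2
    while (i < list1.size() && j < list2.size()) {
        if (list1[i] < list2[j])              // Item(1) < Item(2): advance List 1
            ++i;
        else if (list1[i] > list2[j])         // Item(1) > Item(2): advance List 2
            ++j;
        else {                                // items match: output and advance both
            std::cout << list1[i] << '\n';
            ++i;
            ++j;
        }
    }
}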

Although the match procedure appears to be quite simple, there are a number of matters that
have to be dealt with to make it work reasonably well.
Initializing: we need to arrange things in such a way that the procedure gets going properly.
Getting and accessing the next list item: we need simple methods that support getting the
next list element and accessing it.
Synchronizing: we have to make sure that the current item from one list is never so far ahead
of the current item on the other list that a match will be missed. Sometimes this means getting
the next item from List 1, sometimes from List 2, sometimes from both lists.
Handling end-of-file conditions: when we get to the end of either List 1 or List 2, we need to
halt the program.
Recognizing errors: when an error occurs in the data (for example, duplicate items or items
out of sequence), we want to detect it and take some action.
Merging Two Lists
The three-way-test, single-loop model for cosequential processing can easily be modified to
handle merging of lists simply by producing output for every case of the if-then-else
construction, since a merge is a union of the list contents. Method Merge2Lists is given in Fig.
8.5. Once again, you should use this logic to work, step by step, through the lists provided in


Fig. 8.1 to see how the resynchronization is handled and how the use of HighValues forces
the procedure to finish both lists before terminating.
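A minimal sketch of the merge in the same style, using a HighValue sentinel so the loop only stops after both lists are exhausted. The sentinel and the vector-based interface are assumptions for illustration; they are not the book's Merge2Lists signature.

#include <iostream>
#include <string>
#include <vector>

// Sketch of Merge2Lists: produce output on every branch of the three-way test.
void Merge2Lists(std::vector<std::string> list1, std::vector<std::string> list2)
{
    const std::string HighValue(8, '~');      // sorts after any alphabetic key (assumes such keys)
    list1.push_back(HighValue);
    list2.push_back(HighValue);
    size_t i = 0, j = 0;
    while (list1[i] != HighValue || list2[j] != HighValue) {
        if (list1[i] < list2[j])      { std::cout << list1[i] << '\n'; ++i; }
        else if (list1[i] > list2[j]) { std::cout << list2[j] << '\n'; ++j; }
        else                          { std::cout << list1[i] << '\n'; ++i; ++j; }  // equal: output once
    }
}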

Summary of the Cosequential Processing Model


It is important to understand that the model makes certain general assumptions about the nature
of the data and the type of problem to be solved. Following is a list of the assumptions, together
with clarifying comments.


Given these assumptions, the essential components of the model are:


 Initialization. Previous item values for all files are set to the low value; then current
records for all files are read from the first logical records in the respective files.
 One main synchronization loop is used, and the loop continues as long as relevant
records remain.
 Within the body of the main synchronization loop is a selection based on comparison
of the record keys from the respective input file records. If there are two input files, the
selection takes the form given in function Match of Fig. 8.2.
 Input files and output files are sequence checked by comparing the previous item
value with the new item value when a record is read. After a successful sequence check,
the previous item value is set to the new item value to prepare for the next input
operation on the corresponding file.
 High values are substituted for actual key values when end-of-file occurs. The main
processing loop terminates when high values have occurred for all relevant input files.
The use of high values eliminates the need to add special code to deal with each end-of-file
condition.
 All possible I/O and error detection activities are relegated to supporting methods
so the details of these activities do not obscure the principal processing logic.
APPLICATION OF THE MODEL TO A GENERAL LEDGER PROGRAM
The Problem
Suppose we are given the problem of designing a general ledger posting program as part of an
accounting system. The system includes a journal file and a ledger file. The ledger contains
month-by-month summaries of the values associated with each of the bookkeeping accounts.
A sample portion of the ledger, containing only checking and expense accounts, is illustrated
in Fig. 8.6.
The journal file contains the monthly transactions that are ultimately to be posted to the ledger
file. Figure 8.7 shows these journal transactions.


Once the journal file is complete for a given month, meaning that it contains all of the
transactions for that month, the journal must be posted to the ledger. Posting involves
associating each transaction with its account in the ledger.
How is the posting process implemented? Clearly, it uses the account number as a key to relate
the journal transactions to the ledger records. One approach would be to work through the
journal one transaction at a time, seeking into the ledger for the matching account, but that
requires a random access for every transaction.
A better solution is to begin by collecting all the journal transactions that relate to a given
account. This involves sorting the journal transactions by account number, producing a list
ordered as in Fig. 8.9. Now we can create our output list by working through both the ledger
and the sorted journal cosequentially, meaning that we process the two lists sequentially and
in parallel. This concept is illustrated in Fig. 8.10.


Application of the Model to the Ledger Program


The monthly ledger posting program must perform two tasks:
 It needs to update the ledger file with the correct balance for each account for the
current month.
 It must produce a printed version of the ledger that not only shows the beginning and
current balance for each account but also lists all the journal transactions for the month.
As you can see in Figure 8.11, the printed output from the monthly ledger posting program
shows the balances of all ledger accounts, whether or not there were transactions for the
account.


In summary, there are three different steps in processing the ledger entries:
1. Immediately after reading a new ledger object, we need to print the header line and
initialize the balance for the next month from the previous month's balance.
2. For each transaction object that matches, we need to update the account balance.
3. After the last transaction for the account, the balance line should be printed. This is the
place where a new ledger record could be written to create a new ledger file.

Figure 8.13 has the code for the three-way-test loop of method PostTransactions.


The reasoning behind the three-way test is as follows:

1. If the ledger (master) account number (Item[1]) is less than the journal (transaction)
account number (Item[2]), then there are no more transactions to add to the ledger account
this month (perhaps there were none at all), so we print the ledger account balances
(ProcessEndMaster) and read in the next ledger account (NextItemInList(1)). If the account
exists (MoreMasters is true), we print the title line for the new account (ProcessNewMaster).

2. If the account numbers match, then we have a journal transaction that is to be posted to the
current ledger account. We add the transaction amount to the account balance for the new
month (ProcessCurrentMaster), print the description of the transaction (ProcessItem(2)), then
read the next journal entry (NextItemInList(2)).

3. If the journal account is less than the ledger account, then it is an unmatched journal
account, perhaps due to an input error. We print an error message (ProcessTransactionError)
and continue with the next transaction.
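The following is a compressed, runnable sketch of that three-way test, with the ledger and journal reduced to in-memory arrays of account numbers and the Process.../NextItemInList calls replaced by simple output statements. Everything here (the sample data, the HighValue sentinel) is illustrative; it is not the code of Fig. 8.13.

#include <iostream>

const int HighValue = 9999;                           // sentinel marking the end of either list
int ledger[]  = {101, 102, 505, HighValue};           // master (ledger) account numbers
int journal[] = {101, 101, 102, 510, 510, HighValue}; // sorted journal transaction accounts

int main()
{
    int m = 0, t = 0;                                  // current ledger and journal positions
    std::cout << "Account " << ledger[m] << '\n';      // ProcessNewMaster for the first account
    while (ledger[m] != HighValue || journal[t] != HighValue) {
        if (ledger[m] < journal[t]) {                  // 1. no more transactions for this account
            std::cout << "  balance line for " << ledger[m] << '\n';   // ProcessEndMaster
            ++m;                                       // NextItemInList(1)
            if (ledger[m] != HighValue)
                std::cout << "Account " << ledger[m] << '\n';          // ProcessNewMaster
        } else if (ledger[m] == journal[t]) {          // 2. post the transaction to this account
            std::cout << "  transaction for " << journal[t] << '\n';   // ProcessCurrentMaster, ProcessItem(2)
            ++t;                                       // NextItemInList(2)
        } else {                                       // 3. journal account with no ledger account
            std::cout << "  error: no ledger account " << journal[t] << '\n';
            ++t;
        }
    }
    return 0;
}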

EXTENSION OF THE MODEL TO INCLUDE MULTIWAY MERGING


The most common application of cosequential processes requiring more than two input files
is a K-way merge, in which we want to merge K input lists to create a single, sequentially
ordered output list. K is often referred to as the order of a K-way merge.
8.1.2 A K-way Merge Algorithm (Also refer to Program 8.)
Suppose we keep an array of lists and an array of the items (or keys) that are currently being
used from each list in the cosequential process:
list[0], list[1], list[2], ... , list[k-1]
Item[0], Item[1], Item[2], ... , Item[k-1]
The main loop for the merge processing requires a call to a MinIndex function to find the
index of the item with the minimum collating sequence value, and an inner loop that finds all
lists that are using that item:

int minItem = MinIndex(Item, k);       // find an index of the minimum item

ProcessItem(minItem);                  // Item[minItem] is the next output
for (i = 0; i < k; i++)                // look at each list
    if (Item[i] == Item[minItem])      // advance list i
        MoreItems[i] = NextItemInList(i);
Clearly, the expensive parts of this procedure are finding the minimum and testing to see in
which lists the item occurs and which files therefore need to be read. Note that because the
item can occur in several lists, every one of these if tests must be executed on every cycle


through the loop. However, it is often possible to guarantee that a single item, or key, occurs
in only one list. In this case, the procedure becomes simpler and more efficient.
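One possible MinIndex is sketched below, assuming (as the model does) that an exhausted list has had its current item replaced by a HighValue key so that it can never win the comparison. The function name follows the text; the body is an assumption.

#include <string>

// Return the index of the list whose current item has the smallest collating value.
int MinIndex(const std::string Item[], int k)
{
    int minIndex = 0;
    for (int i = 1; i < k; i++)
        if (Item[i] < Item[minIndex])
            minIndex = i;
    return minIndex;
}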

A Selection Tree for Merging Large Numbers of Lists


The K-way merge described earlier works nicely if K is no larger than 8 or so. When we begin
merging a larger number of lists, the set of sequential comparisons to find the key with the
minimum value becomes noticeably expensive.
The use of a selection tree is an example of the classic time-versus-space trade-off we so often
encounter in computer science. The concept underlying a selection tree can be readily
communicated through a diagram such as that in Fig. 8.15.

The selection tree is a kind of tournament tree in which each higher-level node represents the
"winner" (in this case the minimum key value) of the comparison between the two descendent
keys. The minimum value is always at the root node of the tree. If each key has an associated
reference to the list from which it came, it is a simple matter to take the key at the root, read
the next element from the associated list, then run the tournament again. Since the tournament
tree is a binary tree, its depth is ┌log_2 K┐ for a merge of K lists. The number of comparisons
required to establish a new tournament winner is, of course, related to this depth rather than
being a linear function of K.
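The tournament tree itself is usually stored as an array of winner/loser nodes; an equivalent and simpler way to get the same O(log K) behaviour in C++ is a min-heap (std::priority_queue). The sketch below is that heap-based equivalent, with in-memory lists purely for illustration; it is not the selection-tree code of Fig. 8.15.

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// K-way merge in which producing each output key costs about log2(K) comparisons.
void KWayMerge(const std::vector<std::vector<std::string>>& lists)
{
    using Entry = std::pair<std::string, size_t>;      // (current key, index of its list)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<size_t> pos(lists.size(), 0);          // next unread position in each list

    for (size_t i = 0; i < lists.size(); ++i)          // load one key from every list
        if (!lists[i].empty()) heap.push({lists[i][0], i});

    while (!heap.empty()) {
        auto [key, i] = heap.top();                    // the current "tournament winner"
        heap.pop();
        std::cout << key << '\n';
        if (++pos[i] < lists[i].size())                // rerun the tournament with the next key
            heap.push({lists[i][pos[i]], i});
    }
}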


A SECOND LOOK AT SORTING IN MEMORY


Consider the problem of sorting a disk file that is small enough to fit in memory. The
operation we described involves three separate steps:
1. Read the entire file from disk into memory.
2. Sort the records using a standard sorting procedure, such as shell sort.
3. Write the file back to disk.
The total time taken to sort the file is the sum of the times for the three steps. Can we
improve on the time that it takes for this memory sort?
Overlapping Processing and I/O: Heapsort
Most of the time when we use an internal sort, we have to wait until we have the whole file in
memory before we can start sorting. Is there an internal sorting algorithm that is reasonably fast
and that can begin sorting numbers immediately as they are read, rather than waiting for the
whole file to be in memory?
In fact there is: it is called heapsort.
Heapsort keeps all of the keys in a structure called a heap. A heap is a binary tree with the
following properties:
1. Each node has a single key, and that key is greater than or equal to the key at its parent
node.
2. It is a complete binary tree, which means that all of its leaves are on no more than two levels
and that all of the keys on the lower level are in the leftmost positions.
3. Because of properties 1 and 2, storage for the tree can be allocated sequentially as an array
in such a way that the root node is index 1 and the indexes of the left and right children of node
i are 2i and 2i + 1, respectively.
Figure 8.16 shows a heap in both its tree form and as it would be stored in an array.


Building the Heap While Reading the File


The algorithm for heapsort has two parts. First we build the heap; then we output the keys in
sorted order. The first stage can occur at virtually the same time that we read the data, so in
terms of elapsed time it comes essentially free. The Insert method that adds a string to the heap
is shown in Fig. 8.17. Figure 8.18 contains a sample application of this algorithm.

Figure 8.17: Insert method to insert key in heap
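Since the figure itself is not reproduced in these notes, here is a sketch of the usual sift-up insertion for a heap stored in an array with the root at index 1 (so the parent of node i is at i/2). The details are assumptions consistent with properties 1 to 3 above, not the exact code of Fig. 8.17.

#include <string>
#include <utility>

// Append the new key at the end of the heap and sift it up while it is
// smaller than its parent; count is the current number of keys in the heap.
void Insert(std::string heap[], int& count, const std::string& key)
{
    heap[++count] = key;                    // place the new key at the end of the heap
    int i = count;
    while (i > 1 && heap[i] < heap[i / 2]) {
        std::swap(heap[i], heap[i / 2]);    // move the smaller key toward the root
        i /= 2;
    }
}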


How do we make the input overlap with the heap-building procedure? We read a block of records
at a time into an input buffer and then operate on all of the records in the block before going
on to the next block. Each time we read a new block, we just append it to the end of the heap
(that is, the input buffer "moves" as the heap gets larger). The first new record is then at the
end of the heap array, as required by the Insert function (Fig. 8.17). Once that record is
absorbed into the heap, the next new record is at the end of the heap array, ready to be absorbed
into the heap, and so forth.
Use of an input buffer avoids an excessive number of seeks, but it still doesn't let input occur
at the same time that we build the heap. The way to make processing overlap with I/O is to
use more than one buffer. With multiple buffering, as we process the keys in one block from
the file, we can simultaneously read later blocks from the file.
Figure 8.19 illustrates the technique that we have just described, in which we append each new
block of records to the end of the heap, thereby employing a memory-sized set of input buffers.
Now we read new blocks as fast as we can, never having to wait for processing before reading
a new block.


Sorting While Writing to the File


The second and final step involves writing the heap in sorted order. Again, it is possible to
overlap I/O (in this case writing) with processing. First, let's look at how to output the sorted
keys. Retrieving the keys in order is simply a repetition of the following steps:

1. Determine the value of the key in the first position of the heap. This is the smallest value in
the heap.

2. Move the key in the last position of the heap into the first position, and decrease the number
of elements by one. The heap is now out of order at its root.

3. Reorder the heap by exchanging the root key with the smaller of its children and moving it
down the tree to its new position until the heap is back in order.

Each time these three steps are executed, the smallest value is retrieved and removed from the
heap. Figure 8.20 contains the code for method Remove that implements these steps.
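A sketch of such a Remove, again for a heap stored from index 1 with the smallest key at the root, is given below; the details are assumptions, not the exact code of Fig. 8.20.

#include <string>
#include <utility>

// Take the smallest key from the root, move the last key into the root,
// shrink the heap, and sift that key down until the heap is in order again.
std::string Remove(std::string heap[], int& count)
{
    std::string smallest = heap[1];         // step 1: the root holds the minimum key
    heap[1] = heap[count--];                // step 2: move the last key to the root
    int i = 1;
    while (2 * i <= count) {                // step 3: sift down
        int child = 2 * i;
        if (child + 1 <= count && heap[child + 1] < heap[child])
            child++;                        // choose the smaller of the two children
        if (heap[i] <= heap[child]) break;  // heap property restored
        std::swap(heap[i], heap[child]);
        i = child;
    }
    return smallest;
}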

First we see that we know immediately which record will be written first in the sorted file;
next, we know what will come second; and so forth. So as soon as we have identified a block


of records, we can write that block, and while we are writing that block, we can identify the
next block, and so forth.

8.2 MERGING AS A WAY OF SORTING LARGE FILES ON DISK


The multiway merge algorithm discussed in Section 8.3 provides the beginning of an attractive
solution to the problem of sorting large files.
As an example, consider a file with 8,000,000 records, each of which is 100 bytes long and
contains a key field that is 10 bytes long. We can create a sorted subset of our full file by
reading records into memory until the memory work area is almost full, sorting the records in
this work area, then writing the sorted records back to disk as a sorted subfile. We call such a
sorted subfile a run.
A schematic view of this run creation and merging process is provided in Fig. 8.21. This
solution to our sorting problem has the following features:
1. It can, in fact, sort large files and can be extended to files of any size.
2. Reading of the input file during the run-creation step is sequential and hence is much faster
than input that requires seeking for every record individually (as in a keysort).
3. Reading through each run during merging and writing the sorted records is also sequential.
Random accesses are required only as we switch from run to run during the merge operation.
4. If a heapsort is used for the in-memory part of the merge, as described in Section 8.4, we can
overlap these operations with I/O so the in-memory part does not add appreciably to the total
time for the merge.
5. Since I/O is largely sequential, tapes can be used if necessary for both input and output
operations.
8.2.1 How Much Time Does a Merge Sort Take?
We see in Fig. 8.21 that there are four times when I/O is performed. During the sort phase:
1. Reading all records into memory for sorting and forming runs, and
2. Writing sorted runs to disk.
During the merge phase:
3. Reading sorted runs into memory for merging, and
4. Writing the sorted file to disk.
Let's look at each of these in order.
Step 1: Reading Records into Memory for Sorting and Forming Runs
Since we sort the file in 10-megabyte chunks, we read 10 megabytes at a time from the file.
In a sense, memory is a 10-megabyte input buffer that we fill up eighty times to form the


eighty runs. In computing the total time to input each run, we need to include the amount of
time it takes to access each block (seek time + rotational delay), plus the amount of time it
takes to transfer each block. Seek and rotational delay times are 8 msec and 3 msec,
respectively, so the total time per seek is 11 msec. The transmission rate is approximately
14,500 bytes per msec. Total input time for the sort phase consists of the time required for 80
seeks, plus the time required to transfer 800 megabytes:
Access: 80 seeks x 11 msec ≈ 1 sec
Transfer: 800 megabytes @ 14,500 bytes/msec ≈ 60 sec
Total: 61 sec
Step 2: Writing Sorted Runs to Disk
In this case, writing is just the reverse of reading: the same number of seeks and the same
amount of data to transfer. So it takes another 61 seconds to write the 80 sorted runs.
Step 3: Reading Sorted Runs into Memory for Merging
Since we have 10 megabytes of memory for storing runs, we divide the 10 megabytes into 80
parts for buffering the 80 runs. In a sense, we are reallocating our 10 megabytes of memory as
80 input buffers. Each of the 80 buffers then holds 1/80th of a run (125,000 bytes), so we have
to access each run 80 times to read all of it. Because there are 80 runs, in order to complete the
merge operation (Fig. 8.22) we end up making
80 runs x 80 seeks/run = 6400 seeks.
Total seek and rotation time is then 6400 x 11 msec ≈ 70 seconds.
Since 800 megabytes is still transferred, transfer time is still 60 seconds.

Step 4: Writing the Sorted File to Disk

To compute the time for writing the file, we need to know how big our output buffers are. To
keep matters simple, let us assume that we can allocate two 200,000-byte output buffers. With
200,000-byte buffers, we need to make 800 megabytes / 200,000 bytes = 4000 seeks.


Total seek and rotation time is then 4000 x 11 msec = 44 seconds. Transfer time is still 60
seconds.
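Adding the four steps together gives a rough estimate of the whole sort-merge, using the figures above (the exact total depends on the buffer sizes assumed):
Sort phase (read + write the 80 runs): 61 + 61 = 122 seconds
Merge phase (read runs + write sorted file): (70 + 60) + (44 + 60) = 234 seconds
Total: about 356 seconds, or roughly 6 minutes.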


Module 3
Multilevel Indexing and B-Trees
R. Bayer and E. McCreight announced B-Trees to the world in 1972.

Why B-Trees came into existence:

When the index file is too large, it has to be stored on secondary storage. This raises two
main problems:

 Searching the index using binary search requires multiple accesses to secondary
storage.
 Insertions and deletions must be carried out on secondary storage, and the index must
be kept in sorted order, which is expensive.

Indexing with Binary Search Tree


To overcome the second problem, the keys can be represented in a binary search tree.

If the keys are in a sorted list, the middle element is taken as the root and a balanced
tree is formed.
Eg: If the list is –

DE, FB, HN, KF, PA, SD, WS.

If KF is taken as the root, a balanced tree is obtained:

                  KF
          FB              SD
      DE      HN      PA      WS

Using a tree structure, there is no problem of re-sorting the file when new records are added:
the new key is simply linked to the appropriate leaf node.

In a binary search tree, each node holds three pieces of information: a key value, a left link,
and a right link.
(Node layout: left link | KF | right link)


The above tree is represented in memory as follows:

ROOT = 5

Index   Key   Left child   Right child
0       FB    2            4
1       PA    -            -
2       DE    -            -
3       SD    1            6
4       HN    -            -
5       KF    0            3
6       WS    -            -

Suppose a key 'XF' is added to the tree. The key will be attached to the rightmost subtree, i.e.
as the right child of 'WS'.
The record contents for a linked representation of the binary tree will then be:

ROOT = 5

Index   Key   Left child   Right child
0       FB    2            4
1       PA    -            -
2       DE    -            -
3       SD    1            6
4       HN    -            -
5       KF    0            3
6       WS    -            7
7       XF    -            -

As keys are added to the binary search tree, it may become imbalanced. A tree is balanced
when the difference between the shortest path from the root to a leaf and the longest path
from the root to a leaf is not more than one level,
or
a tree is balanced when all the leaves of the tree are at the same level or differ by one
level.


AVL Trees
Suppose the keys are added in sorted order to a binary tree; then the tree grows in
one direction only. To handle such imbalances, a new class of trees known as AVL trees
was developed.

G. M. Adel'son-Vel'skii and E. M. Landis first defined AVL trees. An AVL tree is a height-
balanced-1 tree, or HB(1) tree, where the maximum allowable height difference between any
two leaves is one.

(Figure: example of an AVL tree.)

The two important features of AVL trees are:

 Search performance that is always close to optimal, since the tree can never become
badly unbalanced.
 Self-maintenance: when new nodes are added, one of four possible rotations is applied,
depending on where the growth occurred, to restore balance.

The four possible rotations are:

 Left-Left and Right-Right rotations (single rotations)
 Left-Right and Right-Left rotations (double rotations)

For a completely balanced tree, the worst-case search to find a key takes log2(N+1) comparisons.

For an AVL tree, the worst-case search takes at most 1.44 log2(N+2) comparisons.

So AVL tree search performance is very close to that of a completely balanced binary tree.
An AVL tree therefore gives fast searching while remaining balanced as keys are added,
avoiding the cost of keeping the index in sorted order.

Of the two problems discussed earlier,

 searching requires many seeks, and

 keeping an index in sorted order is expensive,

the second is solved by using AVL trees. The number of seeks required by a binary search
can be reduced by using paged binary trees.

Paged Binary Trees

The main problem with a binary search tree on disk is that disk utilization is extremely
inefficient. Each disk read returns only one node of data, consisting of only three useful pieces
of information: the key and the addresses of the left and right subtrees.

Each read operation reads one page of data, i.e. about 512 bytes. If one page holds only
one node of information (the key, left address, and right address), the remaining space is
wasted.

This can be solved by storing more than one node in a single page, so that many nodes of
information can be retrieved in one seek.

For example, if seven nodes are stored in a page, then by accessing that page we get the
information of seven nodes in a single access. A single access also returns the information
required for the next few moves down the tree, thus reducing the cost of disk access.


(Figure: a paged binary tree, seven nodes per page.)

Thus in one access we get the information of seven nodes, and in two accesses we get the
information of 63 nodes.

The worst-case number of seeks for a binary tree versus a paged binary tree is shown below:

Binary tree:          ┌log_2(N+1)┐ seeks            (N = number of keys)

Paged binary tree:    ┌log_(k+1)(N+1)┐ seeks        (N = number of keys, k = number of keys in a page)

If the number of keys N is 13,421:

Binary search requires ┌log_2(13421+1)┐ = 14 seeks
A paged binary tree with k = 7 requires ┌log_8(13421+1)┐ = 5 seeks
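As a quick check of these counts (not in the original notes): 2^13 = 8192 < 13,422 ≤ 16,384 = 2^14, so ┌log_2(13422)┐ = 14; and 8^4 = 4096 < 13,422 ≤ 32,768 = 8^5, so ┌log_8(13422)┐ = 5.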

Advantage:
The seek time required to reach many nodes is much smaller than with an ordinary binary
search tree.

Disadvantages:
 Much of the information read is unused: all the data in a page is read, so transfer time is
wasted on nodes that are never examined.
 The paged tree that is formed may not be balanced.
 It is difficult to apply the AVL rotation techniques to a paged binary tree.
 When the tree is built as keys arrive, we cannot tell in advance which key belongs in the
root page, so the tree can become unbalanced.

Multilevel Indexing
 A multilevel index is an index of the index file.
 Since index records form a sorted list of keys, one of the keys in each index record
(usually the largest key) is taken as the key of the whole record.
(Figure: index file 1 is indexed by index file 2, index file 2 by index file 3, and so on up to
index file n; the largest key of each lower-level file (K2, K3, ..., Kn), together with that file's
address, forms one entry in the next higher-level index file.)

Multilevel indexes help to reduce the number of disk accesses, and their overhead space costs
are minimal. An index file built on top of another index file gives multilevel indexing.
(Figure: a two-level index over keys 1 to 1000. The lower level is divided into blocks of 100
keys each (1-100, 101-200, ..., 901-1000); the largest key of each block (100, 200, ..., 1000)
becomes an entry in the upper-level index.)

B-Trees: Working up from the Bottom

B-trees are multilevel indexes that solve the problem of the linear cost of insertion and deletion.
Each node of a B-tree is an index record. Each of these nodes has the same maximum number of
key-reference pairs, called the order of the B-tree.

The order of a B-tree is the maximum number of keys in a node of the tree,

or
the maximum number of descendants of a node. It is usually denoted by 'm'.

While inserting a new key into a node:

 If the node is not full, simply insert the key.
 If the node is full, the node is split into two nodes, each with half of the keys.

Since a new node is created, the largest key of this new node must be added into the
next higher-level node. This process is called promotion of the key.

This promotion may cause an overflow at that higher level, which in turn causes the
split of that node and again a key promotion to the next higher level. This continued
addition of keys may eventually lead to the splitting of the root node. In this way, a B-tree
grows up from the leaves.

Formal definition of B-tree properties –


Properties of a B-tree of order m, are –
1. Every page has a maximum of ‘m’ descendants.
2. Every page, except for the root and the leaves, has at least ┌ m/2┐ descendants.
3. The root has at least two descendants.
4. All the leaves appear on the same level.
5. The leaf level forms a complete, ordered index of the associated data file.

Creation of a B-Tree
If the order of the B-tree is 'm', each node in the B-tree can store a maximum of m keys.
 Initially, keys are added to a single node.
 When a node is full (overflow), it is split into two nodes, each containing half of
the keys.
 When a new node is created, the largest key of the new node must be inserted
into the higher-level node (key promotion).
 Continue the process of insertion, splitting, and key promotion until all the elements
have been inserted into the B-tree.


Example: Suppose the keys arrive in the order

C S D T A M P I B W N G U R

1) Insert C, S:

   [C S]

2) Insert D, T:

   [C D S T]

3) Insertion of 'A' causes the node to split and creates a higher-level node containing the
largest key of each node:

            [D        T]
      [A C D]       [S T]

4) After the addition of 'M' and 'P':

            [D        T]
      [A C D]       [M P S T]

5) The addition of 'I' to the right leaf node causes that node to split into two nodes holding
I, M, P and S, T. The key P is promoted to the higher-level node:

            [D        P        T]
      [A C D]     [I M P]    [S T]

6) After insertion of B, W, N into the leaf nodes:

            [D          P          W]
      [A B C D]    [I M N P]    [S T W]


When 'W' is added, since 'W' becomes the largest key in its node, the corresponding key in
the higher-level node is updated (T is replaced by W).

7) Insertion of 'G' into the second leaf node splits that node into two (as the number of keys
would exceed 4). The promoted key (M) is added to the root node, so the root node becomes full:

            [D          M        P        W]
      [A B C D]    [G I M]    [N P]    [S T W]
8) Insertion of U and R causes the rightmost leaf node to split. This adds another key to the
root node and thus causes the root to split. The tree grows another level, and a new root node
is formed:

                        [P                  W]
            [D        M        P]        [T        W]
      [A B C D]   [G I M]    [N P]    [R S T]   [U W]

Worst-case search depth

The worst case occurs when every page of the tree has only the minimum number of
descendants. For a B-tree of order 'm', the minimum number of descendants from the root
page is 2, and the minimum number of descendants from an intermediate page is ┌m/2┐. The
general relation between the depth and the minimum number of descendants is:

Level        Minimum number of descendants

1 (root)     2

2            2 x ┌m/2┐

3            2 x ┌m/2┐ x ┌m/2┐  =  2 x ┌m/2┐^2

4            2 x ┌m/2┐ x ┌m/2┐ x ┌m/2┐  =  2 x ┌m/2┐^3

...

d            2 x ┌m/2┐^(d-1)

So, the minimum number of descendants at level d is 2 x ┌m/2┐^(d-1). If there are N keys at the
leaf level, which is at depth d, then

N ≥ 2 x ┌m/2┐^(d-1)

N/2 ≥ ┌m/2┐^(d-1)
or
d ≤ 1 + log_┌m/2┐(N/2).
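As an illustrative example (numbers assumed, not from the notes): for a B-tree of order m = 512 holding N = 1,000,000 keys, ┌m/2┐ = 256 and d ≤ 1 + log_256(500,000) ≈ 1 + 2.37, so in the worst case the tree is only 3 levels deep.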

Note: C++ code for the B-tree insertion() and search() functions can be found at the end of these notes.

Deletion, Merging and Redistribution

Deleting a key from a node can also affect that node's siblings. Nodes that have the same
parent node are called sibling nodes.

            [D        T]
      [A C D]       [S T]          (these two leaves are siblings)

                        [P                  W]
            [D        M]              [T        W]
      [A B C D]   [G I M]         [R S T]   [U W]

(here G I M and R S T are adjacent leaves but are NOT siblings, since they have different parents)

The rules for deleting a key 'k' from a node 'n' in a B-tree are as follows:

1) If node 'n' has more than the minimum number of keys:

a) If 'k' is not the largest key in 'n', simply delete 'k' from 'n'.
b) If 'k' is the largest key in 'n', delete 'k' and modify the higher-level indexes to reflect
the new largest key in 'n'.

2) If node 'n' has exactly the minimum number of keys and one of the siblings of 'n' has:
a) Few enough keys (only the minimum number): merge 'n' with its sibling and delete a
key from the parent node.
b) Extra keys (more than the minimum): redistribute by moving some keys from a sibling to
'n', and modify the higher-level indexes to reflect the new largest keys in the
affected nodes.

Merge: after deleting a key, if the node is left with fewer than the minimum number of keys,
an underflow occurs. The keys of this node can then be put into a sibling node
(if there is enough space there). This process is called merging.

Redistribution: after deleting a key from a node, if an underflow occurs (the node has
fewer than the minimum number of keys) and there is not enough space to merge the keys
into a sibling (i.e. the siblings are full), the keys are redistributed among the siblings.
This process is called redistribution.

Redistribution differs from both splitting (which creates a node) and merging (which deletes
a node): here the number of nodes in the tree is not changed.

Example of deletion in a B-tree

For merging:

            [15       30]
      [12 15]      [25 30]

Suppose the key to be deleted is '25'. The node that held it underflows, so it is merged with
its sibling to form a single node:

      [12 15 30]

For redistribution:

            [15             30]
      [5 10 12 15]      [25 30]

Suppose the key to be deleted is '25'. As the sibling node is already full, merging is not
possible, so redistribution takes place:

            [12          30]
      [5 10 12]      [15 30]

Redistribution during insertion: a way to improve storage utilization

During insertion, when a node is split, a new node is created, which uses more
memory space. This can be avoided by postponing the creation of the new node: rather than
dividing (splitting) the node to create two pages, redistribute the keys among the existing
sibling nodes.

The use of redistribution in place of splitting tends to make the B-tree more efficient in its
utilization of space.

B* Trees
Knuth, in 1973, extended the idea of redistribution during insertion and thus formed the B*
tree.

By postponing splitting through redistribution, a point is eventually reached at which a node
must be split and at least one of its siblings is also full. So instead of splitting one node into
two, the nodes are split two-into-three: two sibling nodes that are almost full are split to form
three nodes, each of which is about 2/3 full rather than 1/2 full. The result is a B* tree.
The properties of a B* tree are:

1. Every page has a maximum of 'm' descendants.

2. Every page except for the root has at least ┌(2m-1)/3┐ descendants.

3. The root has at least two descendants.

4. All the leaves appear on the same level.

Buffering of pages: Virtual B-trees

When using B-trees, it is sometimes not possible to hold the entire tree in memory.
Instead, a page buffer can be created in memory to hold two or more pages of the B-tree.

A B-tree that uses a memory buffer to hold some of its pages in memory is called a
virtual B-tree.

When a page is requested, it is taken from the buffer if it is present there, thereby avoiding
a disk access. Buffering works only if requested pages are likely to be in the buffer.
Sometimes, however, the requested page is not in the buffer. This situation, where the
requested page is not in the buffer (memory), is called a page fault. The page must then be
read from disk.

There are two causes of page faults:

1. The page has never been accessed before.

2. The page was once in the buffer but has since been replaced by another page.

The first case is unavoidable: a page that has never been read cannot already be in
memory. The second case, however, can be minimized.

One approach is to replace the page that was least recently used. This method is called
LRU replacement. The page chosen for replacement is the one that has gone the longest time
without a request. LRU replacement is based on the assumption that a recently used page is
more likely to be needed again than a page that has not been used for a long time.
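A minimal sketch of such an LRU page buffer in C++ follows (an illustration of the policy, not part of the notes; a page is identified here just by its disk address, and the actual disk read is left out):

#include <cstddef>
#include <list>
#include <unordered_map>

class LRUBuffer
{
    std::size_t capacity;                               // number of pages held in memory
    std::list<int> order;                               // most recently used page at the front
    std::unordered_map<int, std::list<int>::iterator> where;
public:
    explicit LRUBuffer(std::size_t pages) : capacity(pages) {}

    // Returns true if the page was already buffered, false on a page fault.
    bool Request(int pageAddress)
    {
        auto it = where.find(pageAddress);
        if (it != where.end()) {                        // hit: move the page to the front
            order.erase(it->second);
            order.push_front(pageAddress);
            it->second = order.begin();
            return true;
        }
        if (order.size() == capacity) {                 // fault with a full buffer:
            where.erase(order.back());                  // evict the least recently used page
            order.pop_back();
        }
        order.push_front(pageAddress);                  // (the page would be read from disk here)
        where[pageAddress] = order.begin();
        return false;
    }
};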

Another approach is replacement based on page height: always retain the pages that occur at
the highest levels of the tree. If a large enough buffer is available, even the second level of
the tree can be kept in the buffer.

Important questions:

1. Write a note on AVL Trees


2. What are paged binary trees? Explain the problems associated with paged binary trees.
3. Write the adv. and disadv. of paged binary tree

4. Give the formal definition of the properties of a B-tree. Why is it called a "bottom-up" tree?
5. Mention the four properties of B* trees.
6. What are the two major drawbacks of using binary search to search a simple sorted index on
secondary storage?
7. Show the B-tree of order 4 that results from loading the following sets of keys in order:
i) C G J X N S U O A E B H I F ii) C S D A M P I B W N G U R K E
8. Explain with an example the creation of B-trees.
9. Explain the following with respect to B-trees: i) worst-case search depth ii) redistribution
during insertion.
10. What is multilevel indexing? Explain the concept of B-trees in multilevel indexing with an
example.
11. Explain deletion, merging, and redistribution of elements in a B-tree.
12. What are B-trees? Explain in detail an object-oriented representation of B-trees.
13. Explain the virtual B-tree and how page faults are handled.
14. What is redistribution? Explain redistribution during deletion and insertion of a key in a B-tree
node.
15. Derive an expression for the worst-case search depth in a B-tree.
16. How is storage utilization improved by redistribution of keys during insertion?
17. How is a B* tree different from a B-tree? List the properties of B* trees.

C++ code for B Tree Creation and searching

B Tree Creation

B Tree node structure
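The struct below sizes its arrays using order+1, so a global order (and the related minimum, min) must already be defined; the notes do not show these, so the following values and includes are assumptions for illustration:

#include <iostream>
#include <cstdlib>
using std::cin;                 // avoid "using namespace std;" so the global 'min'
using std::cout;                // below does not clash with std::min

const int order = 4;            // maximum keys per node
const int min = order / 2;      // keys kept in the original node after a split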


struct node
{ int level;
int usage;
int numbers[order+1];
struct node* children[order+1];
struct node* parent;
};
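The routines also use several global node pointers and a getnode() allocator that the notes do not show; one possible version (assumed, using malloc so that the later free(newup) call is valid) is:

struct node *head = NULL, *cur, *prev, *temp, *up, *newup;

struct node* getnode()          // allocate and blank out a new node
{
    struct node* p = (struct node*) malloc(sizeof(struct node));
    p->level = 0;
    p->usage = 0;
    p->parent = NULL;
    for (int i = 0; i <= order; i++)
    {   p->numbers[i] = 999;    // 999 acts as an "empty" sentinel key
        p->children[i] = NULL;
    }
    return p;
}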

void insertnode()
{
int num,i,j,k,poi;
cout << "enter number to insert into btree\n";
cin >> num;
if (head == NULL)
{ head=getnode();

head->usage=1;
head->numbers[0]=num;
return;
}
else
{
cur=head;
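// walk down the tree to the leaf where num belongs, remembering the last node visited in prev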
while (cur != NULL)
{
for(i=0,j=0;i<=(cur->usage)-1;i++)
if (num > cur->numbers[i]) j++ ;
prev=cur;
if (j == cur->usage) cur=cur->children[j-1];
else cur=cur->children[j];
}
cur=prev;
poi=j;
if (cur->numbers[poi] == num)
{
cout << "element already present\n";
return;
}
if (poi != cur->usage)
{
for(i=cur->usage;i>=poi+1;i--)
{
cur->numbers[i]=cur->numbers[i-1];
cur->children[i]=cur->children[i-1];
}
}
cur->numbers[poi]=num;
cur->children[poi]=NULL;
cur->usage++;
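// work back up the tree: update parent keys and split any node that has overflowed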

while(cur!=NULL)
{
if (cur->usage <= order)
{
if (cur->parent == NULL) break;
up=cur->parent; i=0;
while (up->children[i] != cur) i++;

up->numbers[i]=cur->numbers[(cur->usage)-1];

cur=up;
continue;
}
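// overflow: split cur, moving the upper half of its keys into a new node temp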

temp=getnode();
for(k=min+1,j=0;k<=order;k++,j++)
{
temp->numbers[j]=cur->numbers[k];
cur->numbers[k]=999;
temp->children[j]=cur->children[k];
cur->children[k]=NULL;
}

cur->usage=min+1;
temp->usage=order-(cur->usage)+1;
temp->level=cur->level;
if (cur->children[0] != NULL)
{
for (i=0;i<=(cur->usage)-1;i++) (cur->children[i])->parent=cur;
for(i=0;i<=(temp->usage)-1;i++) (temp->children[i])->parent=temp;
}
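// the node that split was the root: create a new root holding the largest key of each half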

if (cur->parent == NULL)
{ up=getnode();
up->level=(cur->level)+1;
up->usage=2;
up->numbers[0]=cur->numbers[(cur->usage)-1];
up->numbers[1]=temp->numbers[(temp->usage)-1];
up->children[0]=cur;
up->children[1]=temp;
cur->parent = temp->parent = up;
head=up;
return;
}
else
{ up=cur->parent;
temp->parent=up;
newup=getnode();

i=0;
while (up->numbers[i] < cur->numbers[0])
{ newup->numbers[i] = up->numbers[i];
newup->children[i] = up->children[i];
i++;
}

newup->numbers[i]= cur->numbers[(cur->usage)-1];
newup->children[i] = cur;
i++;
newup->numbers[i]= temp->numbers[(temp->usage)-1];
newup->children[i] = temp;
i++;

j=0;
while (up->numbers[j] <= temp->numbers[(temp->usage)-1]) j++;
while (j <= (up->usage)-1)
{ newup->numbers[i] = up->numbers[j];
newup->children[i] = up->children[j];
i++,j++;
}

up->usage=i;
newup->usage=i;
for(i=0;i<=(newup->usage)-1;i++)
{ up->numbers[i] = newup->numbers[i];
up->children[i] = newup->children[i];
}

free(newup);
cur=up;
continue;
}
}
}
}

B Tree Search

void searchnode()
{
int num,i,j,poi;
cout << "enter number to search\n";
cin >> num;
cur=head;
while (cur != NULL)
{ for(i=0,j=0;i<=(cur->usage)-1;i++)
if (num > cur->numbers[i]) j++ ;
prev=cur;
if (j == cur->usage) cur=cur->children[j-1];
else cur=cur->children[j];
}
cur=prev;
poi=j;
if (cur->numbers[poi] == num)
cout << "search success\n";
else
cout << "search failure\n";
}
