1712375192026_faculty e Notes Unit 4 Ds 2


Additional Knowledge Material

(Faculty E-notes)

 COURSE BCA IVTH SEM


 UNIVERSITY MAHARSHI DAYANAND UNIVERSITY
 SUBJECT DATA STRUCTURE-II
 SUBJECT CODE BCA - 207
 UNIT NO & NAME UNIT IV:- FILES
 NAME OF FACULTY Mr. KHUSHAL NAGPAL

INDEX

S.NO. TOPIC

1. Files: Physical storage devices and their characteristics, Attributes of a file viz fields,
   records, Fixed and variable length records, Primary and secondary keys.
2. Classification of files, File operations, Comparison of various types of files.
3. File organization: Serial, Sequential, Indexed-sequential, Random-access/Direct, Inverted,
   Multilist file organization.
4. Hashing: Introduction, Hashing functions and Collision resolution methods.

What is File?
A file is a container in a computer system that stores data, information, settings, or commands
that are used by a computer program. Graphical user interfaces (GUIs), such as those of Microsoft
operating systems, represent files as icons that are associated with the program that opens the
file. For instance, a document associated with Microsoft Word is shown with a Word icon. If your
computer contains this file and you double-click the icon, it opens in the copy of Microsoft Word
installed on the computer.

There are several types of files, such as directory files, data files, text files, binary files, and
graphic files, and each kind contains a different type of information. In a computer system, files
are stored on hard drives, optical drives, discs, or other storage devices.
In most operating systems, a file must be saved with a name that is unique within its directory.
Certain characters cannot be used when naming a file because they are considered illegal. A
filename usually ends with a file extension, also called a suffix. The extension typically contains
two to four characters that follow the rest of the filename, and it helps to recognize the file
format, the type of file, and the attributes related to the file.
Most modern computer systems have the ability to protect files from corruption or damage.
A file can contain anything from system-generated information to user-specified information.
File management is sometimes done manually by the user and sometimes with the help of
third-party tools and the operating system.
The basic operations that can be performed on a file are given below:
o Closing or terminating a file operation
o Creation of programs
o Reading of data from the file
o Creation of a new file
o Opening the file in order to make the contents available to other programs
o Modification of data or file attributes
o Writing data to the file
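
The operations listed above can be tried out with a few lines of Python. This is only a minimal sketch: the file name notes.txt and the helper demo_basic_operations are made up for illustration, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile


def demo_basic_operations(directory):
    """Create, write, modify, read, and close a small text file."""
    path = os.path.join(directory, "notes.txt")  # hypothetical file name

    # Creation and writing: mode "w" creates the file if it does not exist.
    f = open(path, "w")
    f.write("first line\n")
    f.close()  # closing terminates the file operation

    # Modification: mode "a" appends to the existing contents.
    with open(path, "a") as f:
        f.write("second line\n")

    # Opening for reading makes the contents available to the program.
    with open(path, "r") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    contents = demo_basic_operations(tmp)
    print(contents)
```

The with statement closes the file automatically, which is why only the first step calls close() explicitly.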

How are the files created?


A file is created on the computer with the help of a software program. For instance, to create a
document file you would use a word processor, to create a C source file you would use a code
editor, and to create an image file you would use an image editor. Specific software is used to
create each particular kind of file.
Where are files stored?
Computer files are stored on media such as a hard drive, an optical disc like a DVD, or a floppy
disk. A file can also be placed in a folder that is stored on these drives.

Illegal file characters
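The characters given below are illegal in file names on Microsoft Windows, and several of them
cause trouble on other operating systems as well (Unix-like systems mainly forbid the forward
slash), hence they should not be used. If you try to create a file name with these characters, it
will generate an error or make the file inaccessible.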
:*?"<>|\/
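
A program can check a proposed file name against this list before trying to create the file. The sketch below is only illustrative: the names WINDOWS_ILLEGAL, illegal_characters and is_valid_name are made up here, and the check covers only the characters listed above, not other rules an operating system may apply (such as reserved names).

```python
# The nine characters listed above, as a set for fast membership tests.
WINDOWS_ILLEGAL = set(':*?"<>|\\/')


def illegal_characters(name):
    """Return the illegal characters found in a proposed file name, sorted."""
    return sorted(set(name) & WINDOWS_ILLEGAL)


def is_valid_name(name):
    """A name passes this simple check if it is non-empty and has no illegal character."""
    return bool(name) and not illegal_characters(name)


print(is_valid_name("report.txt"))
print(illegal_characters("what?.txt"))
```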

File management
File management, usually provided by a file system, is the process of creating an organized
structure for storing files on, and retrieving them from, a storage medium such as a hard drive.
File management software usually arranges files into groups called directories. It is designed to
handle individual files or groups of files, such as records and special office documents. It is
able to display report details such as creation date, state of completion, owner, and other similar
details that are useful in an office environment.
Nowadays, NTFS (New Technology File System) is the most widely used file system with
Windows. Without file management, files could not be organized, and two files could never share
the same name, because there would be no directories to tell them apart. Files are often managed
hierarchically, which allows users to view files in the current directory and then navigate into
any subdirectories.

File Format
The file format is the structure of a file that arranges the data logically within the file. It
allows a program to represent the information correctly, retrieve data, and continue with
processing. For instance, a Microsoft Word document saved in the .doc file format is best viewed in
Microsoft Word. Other software may be able to open this file, but it may not have all the features
needed to display the document as Microsoft Word does. Programs that are compatible with the file
format may be able to give an overview of a file without being able to display all of the file's
features.
Additionally, a program that does not support a file format may show garbage when opening the
file. For example, if you open an .xls file in a program like Notepad, it will not display the
document properly and will show garbage instead. A file format can also reduce the required storage
space, because it defines how the data is encoded. For instance, video and picture formats often
build in compression of the pixel data that makes up each image.
Furthermore, a file format also includes presentation information. For example, a Microsoft .xls
file includes both the document's text and its final form, as well as tables, colors, calculations,
font sizes, charts, and other information that must be organized in a standard form inside the file.

Common file formats
Below is a table of common file formats you are most likely to see while working on a computer.

FILE TYPE     FILE EXTENSION
Image         .bmp .eps .gif .jpg .pict .png .psd .tif
Text          .asc .doc .docx .rtf .msg .txt .wpd .wps
Video         .avi .mp4 .mpg .mov .wmv
Compressed    .arc .arj .gz .hqx .rar .sit .tar .z .zip
Program       .bat .com .exe
Sound         .aac .au .mid .mp3 .ra .snd .wma .wav

File Extension
A file extension is an identifier that helps identify the type of a file in operating systems such
as Microsoft Windows. It can be classified as a type of metadata, and it helps the operating system
to understand the intended use of a file and its characteristics. The filename extension usually
contains one to four characters and is used as a suffix to the file name. For example, in Microsoft
Windows, the file extension is often three characters long.

A dot (.) symbol is used to separate the file extension from the rest of the filename. A filename
is generally considered incomplete without its extension, so the extension should be included as
part of the filename. By default, file extensions are hidden from users in Windows operating
systems. Although a file extension can be renamed, renaming the extension does not convert the
file from one format to another. File extensions are helpful for both users and the file system in
two ways:
1. They help in identifying the type of data that a file holds.
2. They allow the operating system to select the proper program or application with which to open
a file.
What makes a valid file name extension?
A filename extension always comes at the end of the file name and starts with a period (the dot
symbol). Although it is often between one and three characters long, some programs also support
extensions of more than three characters. For instance, the latest versions of Microsoft Word save
files with the .docx extension, and some web pages use the .html file extension.
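
As a small illustration, Python's standard os.path.splitext function separates a file name at its last dot. The helper name split_extension and the sample file names are made up for this sketch.

```python
import os.path


def split_extension(filename):
    """Split a file name into (name without extension, extension including the dot)."""
    stem, extension = os.path.splitext(filename)
    return stem, extension


# Only the text after the LAST dot counts as the extension.
for sample in ["report.docx", "index.html", "archive.tar.gz", "README"]:
    print(sample, "->", split_extension(sample))
```

A name with no dot, such as README, simply has an empty extension.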

Can a file extension be more than three or four characters?
Yes, a file extension can be more than three or four characters long. It depends on how the program
was designed. Some programs are designed to identify and open files with a longer (more than three
or four characters) file extension. However, most programs do not exceed four characters, to keep
the overall file name short.

Limit of a file extension


As long as the file name, path, and extension combined stay within the maximum file name length,
the extension itself has no separate limit. The list below gives the filename character limit for
several versions of the Microsoft Windows operating system.
o Windows XP: It contains a limit of 255 characters.
o Windows 7: It includes a limit of 260 characters.
o Windows 2000: Its limit is 254 characters.
o Windows Vista: Its limit is 260 characters.
o Windows 8: It includes a limit of 260 characters.
o Windows 10: It contains a limit of 260 characters.
Different types of file extension
There are various types of file extensions, each of which can be associated with one or more
applications. The list below contains some of the more common file extensions, grouped by the kind
of program they relate to.
Music and sound files
.wav
.mp3
Picture files
.bmp
.jpg
.gif
Text and word processing documents
.doc
.rtf
.docx
.txt

Operating system files
.dll
.exe
Web Page files
.htm
.html
Spreadsheet files
.xls
.xlr
.xlsx
.csv
File Compression
File compression is also known as file zipping. It is a data compression method that packs one
or more files or directories into a form that is smaller than their original size. It is used to
reduce file size in order to save storage space and provide faster transmission over a network or
the Internet. Compressed files allow more data to be stored on removable media and make
downloading faster. The common types of compressed file extensions are .RAR, .ARJ, .ZIP, .TGZ,
and .TAR.GZ.

File compression is carried out by data or file compression software, which processes the files
and creates a compressed version. Generally, it scans an entire file, recognizes repetitive
patterns and data, and replaces duplicates with a short unique identifier. Because the identifier
is much smaller than the data it replaces, the resulting file is smaller than the original.
Although there is no fixed size for a compressed file, compression can often reduce the size by 50
to 90 percent of the original, depending on how repetitive the data is.
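
The effect of repetitive data on compression can be seen with Python's standard zlib module. This is a rough sketch on made-up sample data, and the helper compression_ratio is a name chosen here for illustration. The ratio achieved on real files depends entirely on their contents.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)


# Highly repetitive sample data: the same four bytes repeated 1000 times.
repetitive = b"ABCD" * 1000
packed = zlib.compress(repetitive)

print(len(repetitive), "bytes before,", len(packed), "bytes after")
# Compression is lossless: decompressing gives back the original bytes.
print(zlib.decompress(packed) == repetitive)
```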
There are various types of compressed file extensions; the table below contains some common
ones:
.pf .b64 .cab .z .tar.gz
.rar .btoa .cpt .zip .tbz
.rpm .bz .gz .zipx .tbz2
.lzh .arc .hqx .zoo .sea
.mim .arj .iso .tgz .sit
.mme .as .lha .uu .sitx
.pak .uue

How to copy files
A file name must be unique within its folder. If you copy a file into the same directory or folder
under the same name, a number will be appended to the end of the file name, or '- Copy' may be
added instead of a number. For example, if a file named smith.doc is copied in the same directory
or folder, the copied file may be named smith(1).doc or smith - Copy.doc.
It is very easy to copy computer documents from one location to another. To copy files, follow
the steps below:
Copy a file in Microsoft Windows
o First, go to the files or folders that you want to copy, then select them with the help of clicking
the mouse. If you want to copy more than one file, you need to select all files. You can highlight
more than one file by holding down the Ctrl or Shift keys on your keyboard and clicking the
mouse together.
o After selecting the files, you are required to right-click on the selected files and choose the
copy option from the opened list. You can also press the Ctrl+C shortcut key. Also, in Windows
Explorer, you can click on the edit option at the top of the program window and select copy.
o Now, you need to open the destination folder where you want to copy the files, and right-click
an empty space in the destination folder and select the paste option.
How to move files or folders on the computer?
There are numerous methods to move files, folders or directories from one location to another on
the computer.
Move a file in Windows
In Windows, files can be moved by different methods, such as cut and paste, drag-and-drop, or
the Move to Folder option. All of these methods are described below, and you can choose whichever
suits you.
o Cut and paste method
To use the cut and paste method, first, you are required to select the file that you want to
move. Then, right-click on the selected file and choose the cut option from the opened list.
Now, open the destination folder where you want to move the file and right-click on the
empty space in the folder and select the paste option from the list that appeared.
Alternatively, select the file, click on the Edit option in the file menu, and select the
Cut option. Then browse to the folder where you want to move the files, click on the
Edit option in the file menu, and select the Paste option to move the file.
Alternatively, you can also use shortcut keys to move the files. For that, you are required to
highlight the file that you want to move, then press the shortcut key Ctrl+X. Now, browse
the folder where you want to move the files and press the shortcut key Ctrl+V to paste the
files.
o Drag-and-drop method
First, you are required to select the files that you want to move, then hold down the right mouse
button on the file, and drag the files while continuing to hold down the right mouse button, and
release the mouse button on the location where you want to move the files.
o 'Move to folder' method
Highlight the file by clicking on the file name, then click on the Edit from the file menu and
click the Move to Folder option. In the new window, browse the folder in which you want to
move the files, then you only need to click on the move button to move the file to the
browsed folder.
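
The same copy and move operations can also be performed from a program. The sketch below uses Python's standard shutil module and imitates the "(1)" naming described earlier. The helper copy_with_unique_name, the folder name archive, and the sample file are made up for illustration, and the numbering scheme is an assumption, not the exact rule Windows applies.

```python
import os
import shutil
import tempfile


def copy_with_unique_name(src, dest_dir):
    """Copy src into dest_dir, appending (1), (2), ... if the name is taken."""
    base = os.path.basename(src)
    stem, ext = os.path.splitext(base)
    target = os.path.join(dest_dir, base)
    n = 1
    while os.path.exists(target):
        target = os.path.join(dest_dir, f"{stem}({n}){ext}")
        n += 1
    shutil.copy2(src, target)  # copy2 also copies timestamps where possible
    return target


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "smith.doc")
    with open(original, "w") as f:
        f.write("demo")

    # Copying into the same folder forces a new name.
    copy_path = copy_with_unique_name(original, tmp)
    copy_name = os.path.basename(copy_path)

    # Moving: the file disappears from the old location.
    archive = os.path.join(tmp, "archive")
    os.mkdir(archive)
    moved_path = shutil.move(copy_path, archive)
    moved_ok = os.path.exists(moved_path) and not os.path.exists(copy_path)
    print(copy_name, moved_ok)
```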

Attributes of a file viz fields, records-
What is a File?
A file can be defined as a data structure which stores the sequence of records. Files are stored in a
file system, which may exist on a disk or in the main memory. Files can be simple (plain text) or
complex (specially-formatted).
A collection of files is known as a Directory. The collection of directories at the different
levels is known as the File System.

Attributes of the File


1. Name
Every file carries a name by which the file is recognized in the file system. One directory cannot
have two files with the same name.
2. Identifier
Along with the name, each file has its own extension, which identifies the type of the file. For
example, a text file has the extension .txt, and a video file can have the extension .mp4.
3. Type
In a file system, files are classified into different types such as video files, audio files, text
files, executable files, etc.
4. Location
In the file system, there are several locations in which files can be stored. Each file carries
its location as an attribute.
5. Size
The size of the file is one of its most important attributes. By the size of the file, we mean the
number of bytes occupied by the file in storage.
6. Protection
The administrator of the computer may want different protections for different files. Therefore,
each file carries its own set of permissions for the different groups of users.
7. Time and Date
Every file carries a time stamp, which contains the time and date at which the file was last
modified.
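
Most of these attributes can be inspected from a program. The sketch below uses Python's standard os.stat call. The helper file_attributes and the sample file sample.txt are made up for illustration, and the exact permission bits reported depend on the operating system.

```python
import os
import tempfile
import time


def file_attributes(path):
    """Collect the attributes discussed above for one file."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "extension": os.path.splitext(path)[1],
        "location": os.path.dirname(os.path.abspath(path)),
        "size_bytes": info.st_size,
        "permissions": oct(info.st_mode & 0o777),
        "modified": time.ctime(info.st_mtime),
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file
    with open(path, "wb") as f:
        f.write(b"hello")                   # five bytes
    attrs = file_attributes(path)
    for key, value in attrs.items():
        print(key, ":", value)
```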

Operations on the File
A file is a collection of logically related data that is recorded on secondary storage in the form
of a sequence of records. The content of the file is defined by its creator. The various operations
that can be performed on a file, such as read, write, open, and close, are called file operations.
These operations are performed by the user through the commands provided by the operating system.
Some common operations are as follows:
1. Create operation:
This operation is used to create a file in the file system. It is one of the most widely used
operations performed on the file system. To create a new file of a particular type, the associated
application program calls the file system, which allocates space to the file.
2. Open operation:
This is a common operation performed on a file. Once the file is created, it must be opened before
any file processing operations can be performed. When the user wants to open a file, they provide
the file name. The operating system then invokes the open system call and passes the file name to
the file system.
3. Write operation:
This operation is used to write information into a file. A write system call is issued that
specifies the name of the file and the length of the data to be written. After each write, the
file length increases by the specified amount and the file pointer is repositioned after the last
byte written.
4. Read operation:
This operation reads the contents of a file. A read pointer is maintained by the OS, pointing to
the position up to which the data has been read.
5. Re-position or Seek operation:
The seek system call re-positions the file pointer from the current position to a specific place in
the file, i.e. forward or backward, depending on the user's requirement. This operation is generally
performed with file management systems that support direct access files.
6. Delete operation:
This operation removes a file from the file system and releases the space it occupied.
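
The operations above map closely onto Python's built-in file objects. The following is only a sketch on a made-up file called demo.bin. Python calls the operating system's open, write, read, seek, and delete services on the program's behalf.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")    # hypothetical file name

    # Create + open + write: the pointer ends up after the last byte written.
    with open(path, "wb") as f:
        f.write(b"0123456789")
        end_position = f.tell()

    # Open + read: the pointer advances past the bytes that were read.
    with open(path, "rb") as f:
        first = f.read(4)
        after_read = f.tell()

        # Seek: jump directly to byte 7, then read to the end of the file.
        f.seek(7)
        tail = f.read()

    # Delete: remove the file from the file system.
    os.remove(path)
    deleted = not os.path.exists(path)

print(end_position, first, after_read, tail, deleted)
```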

Define File, Record and Field


Field
A combination of one or more characters is called a field. It is the smallest unit of data that can
be accessed by the user. The name of each field in a record is unique. The data type of a field
indicates the type of data that can be stored in the field. Each field contains one specific piece
of information. The field size defines the maximum number of characters that can be stored in a
field. For example, Employee Number, Employee Name, Grade and Designation are fields.

Fields
Employee Number 0007
Employee Name Umar kamal
Grade ¼
Designation Senior Manager

Record
A collection of related fields treated as a single unit is called a record. For example, an
employee's record includes a set of fields that contains Employee Number, Employee Name, Grade
and Designation etc.

File
A collection of related records treated as a single unit is called a file. A file is also known as
a data set. Files are stored on disks such as a hard disk, CD-ROM or DVD-ROM. A Student file may
contain the records of hundreds of students. Each student's record consists of the same fields, but
each field contains different data.
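
The relationship between fields, records, and a file can be sketched in Python as follows. The class name EmployeeRecord and all sample values are made up for illustration. A real file would be kept on disk rather than in a list.

```python
from dataclasses import dataclass, fields


@dataclass
class EmployeeRecord:
    """One record: a group of related fields treated as a single unit."""
    employee_number: str   # field
    employee_name: str     # field
    grade: str             # field
    designation: str       # field


# A "file" is a collection of records that all share the same fields.
employee_file = [
    EmployeeRecord("0007", "Umar Kamal", "A", "Senior Manager"),   # sample values
    EmployeeRecord("0008", "Sample Person", "B", "Clerk"),         # sample values
]

field_names = [f.name for f in fields(EmployeeRecord)]
print(field_names)
print(len(employee_file), "records")
```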

Fixed and variable length records-


File Organization Storage
There are different ways of storing data in a database, and storing data in files is one of them.
A user can store data in files in an organized manner. These files are organized logically as a
sequence of records and reside permanently on disk. Each file is divided into fixed-length storage
units known as blocks. Blocks are the units of both storage allocation and data transfer. Although
the default block size in a database is typically 4 to 8 kilobytes, many databases allow the size
to be specified when the database instance is created.
Usually, the record size is smaller than the block size, although for large data items such as
images the size can vary. To access data quickly, each complete record should reside in one block
only and should not be split between two blocks. In an RDBMS, the size of tuples varies between
relations, so the files must be structured to hold records of different lengths. In file
organization, there are two possible ways of representing the records:
o Fixed-length records
o Variable-length records

Fixed-Length Records
Fixed-length records means choosing one record length and storing every record in the file with
that length. If a record does not fit within a single block, it gets divided across more than one
block. The fixed size leads to the following two problems:
1. A record stored partially in more than one block requires access to all the blocks containing
its parts in order to read or write it.
2. It is difficult to delete a record in such a file organization, because the space left by the
deleted record has to be filled by another record or by part of one.
However, reserving a certain number of bytes at the beginning of the file solves these problems.
This area is known as the File Header. The file header carries a variety of information about the
file, such as the address of the first record. The address of the second record is stored in the
first record, and so on, much like a chain of pointers. Insertion and deletion are easy with
fixed-length records, because the space freed by a deleted record is exactly the space required to
insert a new record. But this approach fails when the records have variable lengths.
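
Because every record has the same size, the position of record number n can be computed directly as n times the record size. The sketch below illustrates this with Python's standard struct module. The layout (a 4-byte number followed by a 20-byte name) and the helper names are assumptions chosen for the example, and the file header described above is left out for brevity.

```python
import io
import struct

# Assumed layout: 4-byte integer id + 20-byte name = 24 bytes per record.
RECORD = struct.Struct("<i20s")


def pack_record(emp_id, name):
    """Build one fixed-length record; short names are padded with zero bytes."""
    return RECORD.pack(emp_id, name.encode("ascii"))


def read_record(f, n):
    """Read record number n (counting from 0) by seeking to n * record size."""
    f.seek(n * RECORD.size)
    emp_id, raw_name = RECORD.unpack(f.read(RECORD.size))
    return emp_id, raw_name.rstrip(b"\x00").decode("ascii")


# An in-memory buffer stands in for a file on disk.
buf = io.BytesIO()
for emp_id, name in [(1, "Asha"), (2, "Bob"), (3, "Chitra")]:
    buf.write(pack_record(emp_id, name))

print(RECORD.size)
print(read_record(buf, 1))
```

A deleted record could be overwritten in place by a new one, since both occupy exactly RECORD.size bytes.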

Variable-Length Records
Variable-length records are records that vary in size, so the space needed to store them differs
from record to record. Variable-length records arise in a database system in the following ways:
1. Storage of multiple record types in a file.
2. Record types that allow repeating fields, such as multisets or arrays.
3. Record types that allow variable lengths for one or more fields.
With variable-length records, the following two problems have to be solved:
1. Defining a way of representing a single record so that the individual attributes can be
extracted easily.
2. Defining a way of storing variable-length records within a block so that a record in a block
can be extracted easily.
Thus, the representation of a variable-length record can be divided into two parts:
1. An initial part of the record with fixed-length attributes, such as numeric values, dates, and
fixed-length character attributes, which stores their values directly.
2. The data for variable-length attributes, such as the varchar type, which is represented in the
initial part of the record by an (offset, length) pair. The offset refers to the place within the
record where the data for that attribute begins, and the length refers to the length of the
variable-size attribute. Thus, the initial part stores fixed-size information about each attribute,
whether it is a fixed-length or a variable-length attribute.
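
The two-part representation can be sketched as follows. The layout is an assumption made for this example: one fixed-length integer attribute, followed by an (offset, length) pair for each of two variable-length attributes (a name and an address), followed by the variable-length data itself. The names HEADER, encode_record and decode_record are made up.

```python
import struct

# Fixed-size initial part: id, (name offset, name length), (address offset, address length)
HEADER = struct.Struct("<iHHHH")


def encode_record(emp_id, name, address):
    """Build a variable-length record with (offset, length) pairs up front."""
    name_bytes = name.encode("utf-8")
    address_bytes = address.encode("utf-8")
    name_offset = HEADER.size                      # data starts right after the fixed part
    address_offset = name_offset + len(name_bytes)
    fixed_part = HEADER.pack(emp_id, name_offset, len(name_bytes),
                             address_offset, len(address_bytes))
    return fixed_part + name_bytes + address_bytes


def decode_record(record):
    """Use the (offset, length) pairs to pull each attribute out of the record."""
    emp_id, n_off, n_len, a_off, a_len = HEADER.unpack_from(record)
    name = record[n_off:n_off + n_len].decode("utf-8")
    address = record[a_off:a_off + a_len].decode("utf-8")
    return emp_id, name, address


short = encode_record(7, "Umar", "12 Main St")
long_ = encode_record(8, "A much longer employee name", "Flat 4, Some Long Street")
print(len(short), len(long_))
print(decode_record(short))
```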

Slotted-page Structure
Storing variable-length records within a block is a problem in its own right. Such records are
therefore organized in a slotted-page structure within the block. In the slotted-page structure, a
header is present at the start of each block. This header holds information such as:
1. The number of record entries in the header
2. The end of the free space in the block
3. An array whose entries contain the location and size of each record.

Inserting and Deleting Method
The variable-length records are stored contiguously within the block, starting from the end of the
block.
When a new record is to be inserted, it is placed at the end of the free space, which is possible
because the free space is contiguous as well. The header also gains an entry holding the size and
location of the newly inserted record.
When an existing record is deleted, its space is freed and its header entry is marked as deleted.
The records stored before the deleted record are then moved so that they occupy the freed space,
and the end-of-free-space value is updated. As a result, all the free space again lies between the
final header entry and the first record.
The key rule of the slotted-page structure is that no pointer should point directly to a record.
Instead, it should point to the header entry that contains the record's location. This allows
records to be moved to prevent fragmentation of space inside the block, while still supporting
indirect pointers to the record.
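
A much simplified slotted page can be sketched in Python as follows. The class SlottedPage and its method names are made up for illustration, and for brevity the header is kept in ordinary Python attributes rather than inside the block's own bytes, so the space the header would occupy is ignored.

```python
class SlottedPage:
    """Simplified slotted page: records are packed at the end of the block."""

    def __init__(self, size=64):
        self.data = bytearray(size)
        self.slots = []          # header array of [offset, length]; length -1 = deleted
        self.free_end = size     # end of free space; records live from here onward

    def insert(self, record: bytes) -> int:
        if len(record) > self.free_end:      # simplification: header space ignored
            raise ValueError("not enough free space in the block")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append([self.free_end, len(record)])
        return len(self.slots) - 1           # callers keep the slot number, not the offset

    def read(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        if length < 0:
            raise KeyError("record was deleted")
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        offset, length = self.slots[slot]
        if length < 0:
            raise KeyError("record was already deleted")
        # Shift the records stored before the deleted one to fill the hole.
        start = self.free_end
        self.data[start + length:offset + length] = self.data[start:offset]
        for entry in self.slots:
            if entry[1] >= 0 and entry[0] < offset:
                entry[0] += length           # only the header entry changes
        self.slots[slot] = [0, -1]           # mark the entry as deleted
        self.free_end += length


page = SlottedPage(32)
a = page.insert(b"alpha")
b = page.insert(b"be")
c = page.insert(b"gamma!")
page.delete(b)
print(page.read(a), page.read(c), page.free_end)
```

Because callers hold only slot numbers, the record for slot c can move inside the block during the deletion without any outside pointer becoming invalid.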

Primary and secondary keys-


What Is A Primary Key?
A primary key, also referred to as a primary keyword, is a key in a relational database that is
unique for each record. It is a unique identifier such as a driver's license number, social
security number, telephone number (including area code), or vehicle identification number. Each
table in a relational database should have one and only one primary key. A primary key typically
appears as one or more columns in a relational database table. Primary keys must contain unique
values, and a primary key column cannot have NULL values. A table can have only one primary key,
which may consist of a single field or multiple fields. When multiple fields are used as a primary
key, they are referred to as a composite key.

Facts About Primary Key


o A primary key is used to ensure that the data in the specific column is unique.
o It uniquely identifies a record in the relational database table.
o Only one primary key is allowed in a table.
o It is a combination of the UNIQUE and NOT NULL constraints.
o It does not allow NULL values and cannot be deleted from the parent table.
o Its constraint can be implicitly defined on temporary tables.
o Examples of primary keys include: unique last name, social security number, online
username.
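
The uniqueness and NOT NULL rules can be observed with Python's built-in sqlite3 module. This is a sketch only: the table employee, its columns, and the helper try_insert are made up, and NOT NULL is written out explicitly because SQLite does not always enforce it for primary key columns on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute(
    "CREATE TABLE employee (emp_no TEXT PRIMARY KEY NOT NULL, name TEXT)"
)
conn.execute("INSERT INTO employee VALUES ('0007', 'Umar Kamal')")


def try_insert(emp_no, name):
    """Return True if the row was accepted, False if a key constraint rejected it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?)", (emp_no, name))
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert("0007", "Someone Else")   # same primary key value
null_ok = try_insert(None, "No Key")                # NULL primary key
new_ok = try_insert("0008", "New Person")           # fresh key value
row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(duplicate_ok, null_ok, new_ok, row_count)
```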

What Is A Secondary Key?
A secondary key is an additional value that can be used to identify or retrieve a record. You may
have a primary key that is system generated and a secondary key that comes from the source or
from some other process. For example, you might have an invoice number that is generated by the
system, together with a client-specific identifier that is guaranteed to be unique. The
client-specific identifier is a secondary key.
In other words, a secondary key provides a secondary reference point for objects whose primary
keys do not adequately distinguish them for reference purposes. In the event that a primary key is
not enough to distinguish an object, a secondary key can be used to render that object unique. It
is processed and sorted in relation to the primary key, clarifying search terms so that only the
desired results appear when a table is consulted. This creates distinct, cleaner databases.

Facts about Secondary Key


o A secondary key provides a secondary reference point for objects whose primary keys do not
adequately distinguish them for reference purposes.
o It is used for identification of rows but is not usually unique.
o We can have multiple secondary keys per table.
o In the event that a primary key is not enough to distinguish an object, a secondary key can be
used to render that object unique.
o The attributes used for a secondary key are not the ones used for a super key, i.e. a secondary
key need not even be one of the super keys.
o Examples of secondary keys include: street address number, phone number, middle name
etc.
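
Since a secondary key is mainly used for retrieval and is usually not unique, one secondary key value can lead to several records. The sketch below builds a simple secondary index in Python. The records, the field name city, and the helper build_secondary_index are all made-up sample data for illustration.

```python
from collections import defaultdict

# Records stored under their primary key (employee number). Sample data only.
records = {
    "0007": {"name": "Umar Kamal", "city": "Rohtak"},
    "0008": {"name": "Sample Person", "city": "Delhi"},
    "0009": {"name": "Another Person", "city": "Rohtak"},
}


def build_secondary_index(records, field):
    """Map each value of a secondary key field to the primary keys that have it."""
    index = defaultdict(list)
    for primary_key, record in records.items():
        index[record[field]].append(primary_key)
    return dict(index)


city_index = build_secondary_index(records, "city")
print(city_index["Rohtak"])                              # several records share this value
print([records[pk]["name"] for pk in city_index["Rohtak"]])
```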

Difference between Primary Key and Secondary Key in Tabular Form


BASIS OF COMPARISON | PRIMARY KEY | SECONDARY KEY
Description | The attribute that uniquely identifies a row or record in a relation is known as the primary key. | A field or combination of fields that is the basis for retrieval is known as a secondary key (mainly used for finding details from large data).
Use | It uniquely identifies a record in the relational database table. | It is used for identification of rows but is not usually unique.
NULL Values | It does not allow NULL values. | Allows NULL values.
Number Of Keys | Only one primary key is allowed in a table. | We can have multiple secondary keys per table.
Examples | Examples of primary keys include: unique last name, social security number, online username. | Examples of secondary keys include: street address number, phone number, middle name etc.
Deletion | Cannot be deleted from the parent table. | Can be deleted from the parent table.

Classification of Files:
There are five methods of classification. They are:
1. Alphabetical Classification
2. Numerical Classification
3. Geographical Classification
4. Subject Classification and
5. Chronological Classification.

1. Alphabetical Classification:
Alphabetical classification is based on the order of the letters in the alphabet, as is done in a
dictionary.
A telephone directory is another example. If several names have the same first letter, the
arrangement takes the subsequent letters into account as well, for example:

A, AB, AC ------------ etc

B, Bb, BC --------------etc.

Under alphabetical classification, papers and documents are filed either by the names of the
correspondents or by subject. In a large office, it is advisable to divide the 26 letters of the
English alphabet into smaller sections, giving more room to the letters that are bound to have
more names. This method of classification can be used for correspondence filing, contracts, orders
and staff records.

2. Numerical Classification
In this method of classification, each folder or record is given a number, and the files are placed
in strict numerical order. For example, Mr. Gnanasekar Ltd may be assigned No. 25. If they deal
in a number of lines, each line may be classified with a number beginning with 25, for example,
25.1, 25.2, 25.3, etc.

The system of numerical classification is generally recommended for filing orders, sales,
invoices, contracts (where numbered) and committee minutes.

3. Geographical Classification
As the name implies, this classification is based on the geographical origin of a document or paper.
This system is combined with one of the two systems already discussed. The classification can be
town-wise, district-wise, state-wise, country-wise or continent-wise.

The steps in geographical classification are outlined as follows:


(i). First of all, geographical limits are set and the areas that will each make one unit are
defined; for example, in export-import trade, each country may form one unit.
(ii). The next step is to arrange these countries in alphabetical order, for example, Algeria,
Bolivia, Canada, France, Great Britain, USA, USSR, etc.
(iii). Within each sub-division, the different parties may be arranged alphabetically or
numerically. This method of classification is very useful for customers' orders in a given area and
for filing correspondence according to town.

4. Subject Classification:
It is a method of classification in which all documents relating to a subject are brought together
in one file, even though they may have come from different sources and from many different people.
The following steps are taken to set up subject classification:
1. Defining the subjects
2. Sub-dividing subjects into smaller sub-subjects
3. Assigning numbers or arranging subjects in alphabetical order, including sub-subjects, and
4. Making miscellaneous folders for subjects which have not been classified.

Example:
Main Subjects Classified:
Purchases
Sales
Advertising

Sub-division of classified subjects:
Purchases -------------- Scooter Parts.
Purchases ------------- Tractor parts.
Purchases -------------- Motor Parts

5. Chronological Classification

Under this method, records are identified and arranged in strict date order, and sometimes even
according to the time of day. It is a useful method for filing invoices and other vouchers
associated with accounts.
This system is most useful when combined with some other system. The records may be arranged
alphabetically first and then date-wise within each folder. For this reason, the system is
generally not used on its own.

File operations
A file is a collection of logically related data recorded on secondary storage as a sequence of bits, bytes, lines or records. The content of a file is defined by its creator. The operations that can be performed on a file, such as create, open, read, write and close, are called file operations. The user performs them through commands provided by the operating system. Some common operations are as follows:

1. Create operation:
This operation creates a file in the file system. To create a new file of a particular type, the associated application program calls the file system, which allocates space for the file. Because the file system knows the format of the directory structure, it makes an entry for the new file in the appropriate directory.
2. Open operation:
Once a file has been created, it must be opened before any file processing can be performed on it. To open a file, the user supplies the file name; the operating system then invokes the open system call and passes the file name to the file system.
3. Write operation:
This operation writes information into a file. A write system call is issued that specifies the name of the file and the length of the data to be written. If the write extends the file, the file length is increased by the specified amount, and the file pointer is repositioned just after the last byte written.
4. Read operation:
This operation reads the contents of a file. The OS maintains a read pointer that marks the position up to which the data has been read.
5. Re-position or Seek operation:
The seek system call moves the file pointer from its current position to a specific place in the file, forward or backward, depending on the user's requirement. This operation is generally available in file management systems that support direct access files.
6. Delete operation:
Deleting a file removes all the data stored in it and frees the disk space it occupied. To delete a file, the directory is searched for it; when the directory entry is located, all the associated file space and the directory entry itself are released.
7. Truncate operation:
Truncating deletes the contents of a file but keeps its attributes. The file itself is not deleted; only the information stored inside it is removed, so the file can then be filled with new data.
8. Close operation:
When processing of a file is complete, it should be closed so that all changes are made permanent and all the resources it occupied are released. Closing a file deallocates the internal descriptors that were created when the file was opened.
9. Append operation:
This operation adds data to the end of the file.
10. Rename operation:
This operation renames an existing file.
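The operations listed above can be tried out with a short program. The following is only a minimal sketch in Python, assuming the standard `open`, `os.rename` and `os.remove` calls; the file names and the function name `demo_file_operations` are hypothetical and chosen for illustration.

```python
import os
import tempfile

def demo_file_operations():
    """Run the common file operations on a scratch file and report what was seen."""
    results = {}
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "notes.dat")      # hypothetical file name

        # Create + write: mode "wb" creates the file and opens it for writing.
        with open(path, "wb") as f:
            f.write(b"hello world")
        # Leaving the "with" block closes the file, making the changes permanent.

        # Append: mode "ab" adds data at the end of the file.
        with open(path, "ab") as f:
            f.write(b"!")

        # Open + read + seek.
        with open(path, "rb") as f:
            results["all"] = f.read()                 # read the whole file
            f.seek(6)                                 # reposition the file pointer
            results["after_seek"] = f.read(5)         # read 5 bytes from offset 6

        # Truncate: remove the contents but keep the file itself.
        with open(path, "r+b") as f:
            f.truncate(0)
        results["size_after_truncate"] = os.path.getsize(path)

        # Rename.
        new_path = os.path.join(folder, "renamed.dat")
        os.rename(path, new_path)
        results["renamed_exists"] = os.path.exists(new_path)

        # Delete: remove the directory entry and free the space.
        os.remove(new_path)
        results["deleted"] = not os.path.exists(new_path)
    return results
```

Calling `demo_file_operations()` returns a dictionary showing the bytes read back, the size after truncation, and whether the rename and delete succeeded.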

Comparison of various types of files
Files: Computers store information permanently in files, which hold users' data over a long period of time. A file can contain any type of information: text, images or pictures, or data in any other format. There must therefore be some mechanism for storing the information, accessing it, and performing operations on the files.
Every file has its own name and type. When we store a file in the system, we must specify both. The name can be any valid name, and the type identifies the application with which the file is linked.
In other words, every file belongs to a particular type of application software. When we give a file a name we also specify its extension, which the system uses to open the contents of the file in the matching application. For example, a file that contains a painting will be opened in the Paint software.
1) Ordinary files or simple files: An ordinary file may belong to any type of application, for example Notepad, Paint, a C program or a song. All files created by a user are ordinary files. They store the user's programs and data, and can hold text, databases, images or any other type of information.
2) Directory files: A directory file describes the contents of a particular directory or folder, that is, the names of the files stored in it. For example, a folder named Songs that contains many songs is represented by a directory file listing all of those song files.
3) Special files: Special files are not created by the user. They are created by the system and are necessary to run it; the files of the operating system itself are special files. There are several kinds of special files, such as system files, Windows files and input/output files. On Windows, many system files are stored with the .sys extension.
4) FIFO files: First In First Out files are used by the system to process requests in order. When users request services from the system, the requests are queued in such files and the system serves them in the sequence in which they were entered: the request that arrives first is executed first. This is called First In First Out or FIFO order.

File organization
• File organization refers to the way data is stored in a file. File organization is very important because it determines the methods of access, efficiency, flexibility and storage devices to use.
There are four methods of organizing files on storage media:
• Sequential
• Random
• Serial
• Indexed-sequential

1. Sequential file organization

• Records are stored and accessed in a particular order, sorted on a key field.
• Simple retrieval searches the file sequentially, record by record, until the wanted record is found or the end of the file is reached.
• Because the records are sorted, better searching methods such as binary search can be used to reduce the search time.
• Since the records are sorted, it is possible to know in which half of the file the record being searched for is located. Binary search therefore repeatedly divides the set of records into two halves and continues only in the half that can contain the record.
• For example, if the file has records with key fields 20, 30, 40, 50, 60 and the computer is searching for the record with key field 50, it starts at the middle key 40, finds that 50 is larger, and searches only upwards from 40, ignoring the first half of the set.
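The halving idea described above can be sketched as follows. This is an illustrative Python version that works on an in-memory list of sorted key fields rather than on a real file; the function name `binary_search` and the returned probe count are assumptions made for the example.

```python
def binary_search(keys, target):
    """Search a sorted list of keys; return (position, number of keys examined)."""
    low, high = 0, len(keys) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2      # middle of the remaining range
        probes += 1
        if keys[mid] == target:
            return mid, probes
        if keys[mid] < target:
            low = mid + 1            # discard the lower half
        else:
            high = mid - 1           # discard the upper half
    return -1, probes                # -1 means the key is not present
```

With the keys 20, 30, 40, 50, 60 from the example, `binary_search([20, 30, 40, 50, 60], 50)` examines 40 first and then 50, so the record is found after two comparisons.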

Advantages of sequential file organization

• The sorting makes it easy to access records.
• The binary chop (binary search) technique reduces search time, because each comparison discards half of the remaining records.

Disadvantages of sequential file organization
• Sorting does not remove the need to pass over other records while searching for a particular record.
• Sequential files cannot support modern applications that require fast access to stored records.
• The requirement that all records be of the same size is sometimes difficult to enforce.

2. Random or direct file organization

• Records are stored randomly but accessed directly.
• To access a record in a randomly organized file, the record key is used to determine where the record is stored on the storage media.
• Magnetic and optical disks allow data to be stored and accessed randomly.

Advantages of random file access

• Quick retrieval of records.
• The records can be of different sizes.
3. Serial file organization
• Records in a file are stored and accessed one after another.
• The records are not stored in any particular order on the storage medium. This type of organization is mainly used on magnetic tapes.

Advantages of serial file organization

• It is simple.
• It is cheap.
Disadvantages of serial file organization
• It is cumbersome to access, because all preceding records must be read before retrieving the one being searched for.
• Space on the medium is wasted in the form of inter-record gaps.
• It cannot support modern high-speed requirements for quick record access.
4. Indexed-sequential file organization method
• This method is almost the same as the sequential method, except that an index is used to enable the computer to locate individual records on the storage media. For example, on a magnetic drum, records are stored sequentially on the tracks. However, each record is assigned an index entry that can be used to access it directly.

Direct, Inverted, Multilist file organization


File organization
• Sequential
• Random
• Linked organization
• Inverted files
• Cellular partitions

Sample Employee File

1. Sequential Organization
• In sequential organization the records are placed sequentially onto the storage media, i.e. they occupy consecutive locations; in the case of tape, this means placing records adjacent to each other.
• In addition, the physical sequence of records is ordered on some key, called the primary key.
• Sequential organization is also possible on a direct access storage device (DASD) such as a disk. Even though disk storage is really two-dimensional (cylinder x surface), it may be mapped down into a one-dimensional memory.
• If the disk has c cylinders and s surfaces, one possibility is to view disk memory as in the figure.
• Using the notation $t_{ij}$ to represent the jth track of the ith surface, the sequence is $t_{11}, t_{21}, t_{31}, \ldots, t_{s1}, t_{12}, t_{22}, \ldots, t_{s2}$, etc.

Interpreting disk memory as sequential memory

• The sequential interpretation in the figure is particularly efficient for batched update and retrieval, as the tracks are accessed in order: all tracks on cylinder 1, followed by all tracks on cylinder 2, and so on. As a result, the read/write heads are moved one cylinder at a time, and this movement is needed only once for every s tracks.
• Its main advantages are:
  o It is easy to implement;
  o It provides fast access to the next record using lexicographic order.
• Its disadvantages:
  o It is difficult to update: inserting a new record may require moving a large proportion of the file;
  o Random access is extremely slow.
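The mapping of two-dimensional disk storage onto one-dimensional memory described above can be written as a small calculation. In this sketch, track (i, j) stands for surface i and cylinder j, both counted from 1, and the function names are hypothetical.

```python
def linear_position(i, j, s):
    """0-based position of track (surface i, cylinder j) on a disk with s surfaces."""
    return (j - 1) * s + (i - 1)

def track_order(s, c):
    """List the tracks (i, j) in sequential order for s surfaces and c cylinders."""
    # All surfaces of cylinder 1 come first, then all surfaces of cylinder 2, etc.
    return [(i, j) for j in range(1, c + 1) for i in range(1, s + 1)]
```

For a disk with 3 surfaces and 2 cylinders, `track_order(3, 2)` lists the three tracks of cylinder 1 before the three tracks of cylinder 2, so the heads move only once for every s tracks.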

2. Random File organization

• Records are stored at random locations on the disk. This randomization can be achieved by any one of several techniques: direct addressing, directory lookup, or hashing.
Direct addressing: in direct addressing with equal-size records, the available disk space is divided into nodes large enough to hold a record. The numeric value of the primary key is used to determine the node into which a particular record is to be stored.

Directory lookup: the index is not of the direct access type but is a dense index maintained using a structure suitable for index operations. Retrieving a record involves searching the index for the record address and then accessing the record itself. The storage management scheme depends on whether fixed-size or variable-size nodes are being used. Directory lookup requires more accesses for retrieval and update, since index searching will generally require more than one access. In both direct addressing and directory lookup, some provision must be made to handle collisions.
Hashing: the available file space is divided into buckets and slots. Some space may have to be set aside for an overflow area in case chaining is being used to handle overflows. When variable-size records are present, the number of slots per bucket is only a rough indicator of the number of records a bucket can hold. The actual number varies dynamically with the size of the records in a particular bucket.
Random organization on the primary key using any of the above three techniques overcomes the difficulties of sequential organization: insertions and deletions become easy. However, batch processing of queries becomes inefficient, as records are not maintained in order of primary key, and handling range queries becomes very inefficient except in the case of directory lookup.
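Direct addressing with equal-size records can be illustrated with a short sketch. It assumes a hypothetical record layout (a 4-byte key followed by a 16-byte name) and uses an in-memory byte buffer in place of a disk file; the node for a record is found by multiplying the key by the record size.

```python
import io
import struct

# Hypothetical fixed-size record: 4-byte integer key + 16-byte name field.
RECORD = struct.Struct("<i16s")

def new_file(nodes):
    """Create an empty 'file' divided into the given number of equal-size nodes."""
    return io.BytesIO(bytes(RECORD.size * nodes))

def write_record(f, key, name):
    """Store a record in the node whose number equals the numeric key."""
    f.seek(key * RECORD.size)                  # jump straight to the node
    f.write(RECORD.pack(key, name.encode("ascii")))

def read_record(f, key):
    """Fetch the record for a key with a single seek, without any searching."""
    f.seek(key * RECORD.size)
    stored_key, name = RECORD.unpack(f.read(RECORD.size))
    return stored_key, name.rstrip(b"\x00").decode("ascii")
```

A record written with `write_record(f, 7, "Asha")` is read back by `read_record(f, 7)` no matter how many other records the file holds. The names used here are made-up sample data.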

3. Linked organization
• Linked organizations differ from sequential organizations essentially in that the logical sequence of records is generally different from the physical sequence.
• In sequential organization, if the ith record is placed at location $l_i$, then the (i+1)st record is placed at $l_i + c$, where c is the length of the ith record or some fixed constant.
• In linked organization, the next logical record is obtained by following a link value from the present record. Linking records in order of increasing primary key eases insertion and deletion.
• Searching for a particular record is difficult since no index is available, so only a sequential search is possible.
• We can facilitate searching by maintaining indexes corresponding to ranges of employee numbers, e.g. 501-700, 701-900. All records in the same range are linked together in a list.
• We can generalize this idea to secondary keys as well. We simply set up an index for each key and allow records to be in more than one list. This leads to the multilist structure for file representation.

4. Inverted files
• Inverted files are similar to multilists. In multilists, records with the same key value are linked together, with the link information kept in the individual records. In inverted files, the link information is kept in the index itself.
• For example, assume that every key is dense. Since the index entries are variable length, index maintenance becomes more complex than for multilists. The benefit is that Boolean queries require only one access per record satisfying the query, plus the accesses needed to process the indexes. Queries of the type k1=xx and k2=yy can be handled by intersecting the two lists obtained from the indexes.
• Retrieval works in two steps. In the first step, the indexes are processed to obtain a list of records satisfying the query; in the second, these records are retrieved using the list. The number of disk accesses needed is equal to the number of records being retrieved plus the number needed to process the indexes.
• Inverted files represent one extreme of file organization, in which only the index structures are important. The records themselves can be stored in any way.
• Inverted files may also save space compared with other file structures when record retrieval does not require retrieval of the key fields. In that case the key fields may be deleted from the records, which is not possible with multilist structures.
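The two-step retrieval described above can be sketched in a few lines. In this illustrative Python version the record addresses are simply list positions, and the field names and function names are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(records, key_field):
    """Map each value of key_field to the list of record addresses holding it."""
    index = defaultdict(list)
    for address, record in enumerate(records):
        index[record[key_field]].append(address)
    return index

def query_and(index1, value1, index2, value2):
    """Step 1 of retrieval: intersect two address lists (k1=xx and k2=yy)."""
    return sorted(set(index1.get(value1, [])) & set(index2.get(value2, [])))

def fetch(records, addresses):
    """Step 2 of retrieval: one access per record that satisfies the query."""
    return [records[a] for a in addresses]
```

Only the addresses that survive the intersection are fetched, so the query costs one access per matching record plus the work of processing the two indexes.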

5. Cellular partitions
• To reduce file search times, the storage media may be divided into cells. A cell may be an entire disk pack or it may simply be a cylinder. Lists are localized to lie within a cell.
• Thus, if we had a multilist organization in which the list for key1=prog included records on several different cylinders, we could break that list into several smaller lists, each containing only the prog records in one cylinder. The index entry for prog would then contain several entries of the type (addr, length), where addr is a pointer to the start of a list of records with key1=prog and length is the number of records on that list. In this way all the records in the same cell can be accessed without moving the read/write heads.

Hashing: Introduction, Hashing functions and Collision resolution methods


If you are transferring a file from one computer to another, how do you ensure that the copied file
is the same as the source? One method you could use is called hashing, which is essentially a
process that translates information about the file into a code. Two hash values (of the original file
and its copy) can be compared to ensure the files are equal.
What is Hashing?
Hashing is an algorithm that calculates a fixed-size bit string value from a file. A file basically contains blocks of data. Hashing transforms this data into a far shorter fixed-length value or key which represents the original data. The hash value can be considered the distilled summary of everything within that file.
A good hashing algorithm exhibits a property called the avalanche effect, where the resulting hash output changes significantly or entirely even when a single bit or byte of data within the file is changed. A hash function that does not do this is considered to have poor randomization, which makes it easier for attackers to break.
A hash is usually written as a hexadecimal string of several characters. Hashing is also a one-way process: it is not practical to work backwards from a hash value to the original data.
Because many different inputs map to a fixed number of possible hash values, two different inputs can occasionally produce the same hash value. This is known as a hash collision. A hash algorithm can only be considered good and acceptable if it offers a very low chance of collision.

What are the benefits of Hashing?


One main use of hashing is to compare two files for equality. Without opening two document files to compare them word for word, the calculated hash values of these files allow the owner to know immediately if they are different.
Hashing is also used to verify the integrity of a file after it has been transferred from one place to another, typically in a file backup program like SyncBack. To ensure the transferred file is not corrupted, a user can compare the hash values of both files. If they are the same, the transferred file is almost certainly an identical copy.
In some situations, an encrypted file may be designed to never change the file size nor the last modification date and time (for example, virtual drive container files). In such cases, it would be impossible to tell at a glance if two similar files are different or not, but the hash values would easily tell these files apart if they are different.
Types of Hashing
There are many different hash algorithms, such as RipeMD, Tiger and xxhash, but the most common ones used for file integrity checks are MD5, SHA-2 and CRC32.
MD5 - The MD5 hash function encodes a string of information into a 128-bit fingerprint. MD5 is often used as a checksum to verify data integrity. However, MD5 is known to suffer from extensive hash collision vulnerabilities, although it is still one of the most widely used algorithms in the world.
SHA-2 – SHA-2, developed by the National Security Agency (NSA), is a cryptographic hash
function. SHA-2 includes significant changes from its predecessor, SHA-1. The SHA-2 family
consists of six hash functions with digests (hash values) that are 224, 256, 384 or 512 bits: SHA-
224, SHA-256, SHA-384, SHA-512, SHA-512/224, SHA-512/256.
CRC32 – A cyclic redundancy check (CRC) is an error-detecting code often used for detection of
accidental changes to data. Encoding the same data string using CRC32 will always result in the
same hash output, thus CRC32 is sometimes used as a hash algorithm for file integrity checks.
These days, CRC32 is rarely used outside of Zip files and FTP servers.
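As an illustration of the three algorithms named above, the following sketch computes MD5, SHA-256 (a member of the SHA-2 family) and CRC32 values with Python's standard `hashlib` and `zlib` modules. It works on byte strings held in memory rather than on real files, and the function names are hypothetical.

```python
import hashlib
import zlib

def file_hashes(data):
    """Return MD5, SHA-256 and CRC32 values of a block of bytes as hex strings."""
    return {
        "md5": hashlib.md5(data).hexdigest(),        # 128 bits = 32 hex digits
        "sha256": hashlib.sha256(data).hexdigest(),  # 256 bits = 64 hex digits
        "crc32": format(zlib.crc32(data), "08x"),    # 32 bits = 8 hex digits
    }

def same_content(original, copy):
    """Compare two blocks of data by their SHA-256 hash values."""
    return hashlib.sha256(original).digest() == hashlib.sha256(copy).digest()
```

Changing even one byte of the input gives a different SHA-256 value, so `same_content` reports a corrupted copy without comparing the data byte by byte.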
Using Hashing in 2BrightSparks software
In the backup and synchronization software, SyncBackPro/SE/Free, hashing is mainly used for file
integrity checks during or after a data transfer session. For example, a SyncBack user can turn on
file verification (Modify profile > Copy/Delete) or use a slower but more reliable method
(Modify profile > Compare Options) which will enable hashing to check for file differences.
Different hash functions will be used depending on which option is used and where the backup
files are located.
Other areas where hashing is used are resuming in FTP, data integrity checking, scripting and
occasionally for authentication in Cloud profiles (scripting and cloud backup is supported by
SyncBackPro only).

Hashing
There are many ways of representing a dictionary, and one of the best is hashing. Hashing uses few key comparisons: searching for an element takes O(1) time in the average case and O(n) time in the worst case. The method uses a hash function to map keys into a table, which is called a hash table.
1) Hash table
A hash table is a data structure used for storing and accessing data very quickly. Insertion of data into the table is based on a key value, so every entry in the hash table is associated with some key. Using this key, data can be found in the hash table with few key comparisons; the searching time then depends on the size of the hash table relative to the number of entries stored.

2) Hash function
A hash function is a function applied to a key to produce an integer that can be used as an address in the hash table. The same hash function is therefore used both for storing data in the hash table and for accessing it. The integer returned by the hash function is called the hash key.
Types of hash function
There are various types of hash function which are used to place the data in a hash table:
1. Division method
In this method the hash function is the remainder of a division.
For example, suppose the records 52, 68, 99, 84 are to be placed in a hash table of size 10.
Then:
h(key) = key % table size
h(52) = 52 % 10 = 2
h(68) = 68 % 10 = 8
h(99) = 99 % 10 = 9
h(84) = 84 % 10 = 4

2. Mid square method

In this method the key is first squared and then the middle part of the result is taken as the index. For example, suppose we want to place the record 3101 and the size of the table is 1000. Then
3101 * 3101 = 9616201, so h(3101) = 162 (the middle 3 digits).
3. Digit folding method
In this method the key is divided into separate parts, and these parts are combined using some simple operation to produce a hash key. For example, the record 12465512 is divided into the parts 124, 655 and 12, which are then added together:
H(key) = 124 + 655 + 12
= 791
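The three hash functions above can be written out as a rough sketch. The function names and the `digits` and `part_len` parameters are assumptions chosen to match the worked examples, not a standard API.

```python
def division_hash(key, table_size):
    """Division method: the remainder of key divided by the table size."""
    return key % table_size

def mid_square_hash(key, digits):
    """Mid square method: square the key and keep the middle `digits` digits."""
    squared = str(key * key)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])

def folding_hash(key, part_len):
    """Digit folding method: split the key into parts and add them together."""
    text = str(key)
    parts = [int(text[i:i + part_len]) for i in range(0, len(text), part_len)]
    return sum(parts)
```

Here `mid_square_hash(3101, 3)` gives 162 and `folding_hash(12465512, 3)` gives 791, matching the examples. If the folded sum could exceed the table size, one common follow-up is to apply the division method to the result.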
Characteristics of good hashing function
1. The hash function should generate different hash values for similar strings.
2. The hash function should be easy to understand and simple to compute.
3. The hash function should produce keys that are distributed uniformly over the array.
4. The number of collisions should be small while placing the data in the hash table.
5. The hash function should use all of the input data when computing the hash value.

In hashing, a collision is a situation in which the hash values of two keys are the same.
Suppose we want to add a new record with key k to a hash table, but the index address H(k) is already occupied by another record. This situation is known as a collision.
The following example illustrates a collision.
We have a hash table of size 10, so it has 10 indexes, denoted by {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
We have to insert the values {9, 7, 17, 13, 12, 8} into the hash table, using hashing to calculate the slot/index in which to store each item.

The hash function is h(k) = k mod m, where m is the size of the table.

Step 1:
First draw an empty hash table of size 10.
The possible range of hash values will be [0, 9].

Then insert the given keys one by one in the hash table.

First key to be inserted in the hash table = 9.
h(k) = k mod m
h(9) = 9 mod 10 = 9
So, key 9 will be inserted at index 9 of the hash table.

Second key to be inserted in the hash table = 7.
h(k) = k mod m
h(7) = 7 mod 10 = 7
So, key 7 will be inserted at index 7 of the hash table.

Third key to be inserted in the hash table = 17.
h(k) = k mod m
h(17) = 17 mod 10 = 7
So, key 17 should be inserted at index 7 of the hash table.
But index 7 already holds key 7.
So a collision has occurred.
To overcome this situation we have various collision resolution techniques.
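The walk-through above can be reproduced with a small sketch that inserts the keys using h(k) = k mod m and records every collision instead of resolving it. The function name `find_collisions` is hypothetical.

```python
def find_collisions(keys, m):
    """Insert keys with h(k) = k mod m; report collisions without resolving them."""
    table = [None] * m
    collisions = []
    for k in keys:
        index = k % m
        if table[index] is None:
            table[index] = k                 # slot is free, store the key
        else:
            # slot already occupied: (new key, index, key already stored there)
            collisions.append((k, index, table[index]))
    return table, collisions
```

For the keys 9, 7, 17, 13, 12, 8 and m = 10, the only collision reported is key 17 meeting key 7 at index 7.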

Collision resolution techniques are


1). Open Addressing
a. Linear Probing
b. Quadratic Probing
c. Double Hashing Technique
2). Closed Addressing
a) Chaining

1). Open Addressing
In open addressing, all the keys are stored inside the hash table; no key is stored outside it.
a). Linear Probing
This is a very simple method of handling collisions. A colliding record is placed in the next empty position found by moving linearly down the table. The method suffers from clustering, which means that blocks of occupied positions build up in parts of the hash table.
Example: Consider a hash table of size 10 with the hash function H(key) = key % table size, and suppose the keys 56, 64, 36, 71 are to be inserted.

Keys 56 and 36 both hash to bucket 6. With linear probing the colliding record is placed in the next empty position below, so 36 is placed at index 7.
• The simplest approach to resolving a collision is linear probing. If a value is already stored at the location generated by h(k), a collision has occurred, and we do a sequential search to find an empty location.
• The idea is to place the value in the next available position. Because the search is performed sequentially, the technique is known as linear probing.
• The array or hash table is treated as circular: when the last slot is reached without finding an empty location, the search continues from the first location of the array.
Clustering is a major drawback of linear probing.
The following hash function calculates the next location to try. If the location is empty the value is stored there; otherwise the next location is tried:
h(k, i) = [h(k) + i] mod m
where
m = size of the hash table,
h(k) = (k mod m),
i = the probe number, which varies from 0 to m-1.
For a given key k, the first location is generated by [h(k) + 0] mod m, since i = 0 the first time.
If that location is free, the value is stored there and the probe count is 1, meaning the location was found on the first attempt.
If the location is not free, the second probe generates the address [h(k) + 1] mod m.
Similarly, if the generated location is occupied, subsequent probes generate the addresses [h(k) + 2] mod m, [h(k) + 3] mod m, [h(k) + 4] mod m, [h(k) + 5] mod m, and so on, until a free location is found.
The probe count is the number of locations examined before a free location is found for a value.
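The probe sequence h(k, i) = [h(k) + i] mod m can be sketched as follows. The table is a Python list in which `None` marks an empty slot; the function name and the returned probe count are assumptions made for illustration.

```python
def linear_probe_insert(table, key):
    """Insert key using linear probing; return (index used, probe count)."""
    m = len(table)
    for i in range(m):                       # i is the probe number, 0 to m-1
        index = (key % m + i) % m            # wraps around: the table is circular
        if table[index] is None:
            table[index] = key
            return index, i + 1
    raise OverflowError("hash table is full")
```

Inserting 56, 64, 36 and 71 into an empty table of size 10 places 56 at index 6 and 64 at index 4; 36 collides with 56 and lands at index 7 on its second probe, and 71 goes to index 1.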
b). Quadratic Probing
This method reduces the clustering problem. The hash function is defined by H(key, x) = (H(key) + x*x) % table size, where x = 0, 1, 2, ... is the probe number. Suppose we have to insert the following elements: 67, 90, 55, 17, 49.

The keys 67, 90 and 55 can be inserted directly, but 17 collides with 67: with x = 0, (17 + 0*0) % 10 = 7, and index 7 is already occupied. Incrementing x to 1 gives (17 + 1*1) % 10 = 8. Bucket 8 is empty, so 17 is placed at index 8.

In this technique, if a value is already stored at the location generated by h(k), the following hash function is used to resolve the collision:
h(k, i) = (h(k) + i^2) mod m
where m is the size of the hash table,
h(k) = (k mod m), and i is the probe number, which varies from 0 to m-1.
• Quadratic probing reduces the clustering problem of linear probing, because the distance between successive probes grows quadratically instead of by one slot at a time.
• For a given key k, the first location is generated by [h(k) + 0] mod m, where i is 0. If the location is free, the value is stored there; otherwise a new location is generated using [h(k) + 1^2] mod m.
• The value of i, and with it the probe count, is increased until a free location is found.
• Quadratic probing generally performs better than linear probing and makes better use of the hash table.
• The disadvantage of quadratic probing is that it does not search all locations of the table.
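A minimal sketch of quadratic probing, using the formula h(k, i) = (h(k) + i^2) mod m, is shown here. The function name is hypothetical, and `None` is returned when no free slot is reached, which reflects the disadvantage that not every location is examined.

```python
def quadratic_probe_insert(table, key):
    """Insert key using quadratic probing; return the index used, or None."""
    m = len(table)
    for i in range(m):                        # i is the probe number
        index = (key % m + i * i) % m
        if table[index] is None:
            table[index] = key
            return index
    return None                               # no free slot reached by the probes
```

For the keys 67, 90, 55, 17, 49 and a table of size 10, the keys are stored at indexes 7, 0, 5, 8 and 9 respectively, with 17 moving from index 7 to index 8 on its second probe.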

c). Double Hashing

Using the same keys as in the quadratic probing example, 67, 90 and 55 can be inserted in the hash table using the first hash function, but for 17 the bucket is again full. In this case we use the second hash function, H2(key) = P - (key mod P), where P is a prime number smaller than the size of the hash table; here P = 7.
H2(17) = 7 - (17 % 7) = 7 - 3 = 4, which means we take 4 jumps from index 7 to place 17. Therefore 17 is placed at index (7 + 4) % 10 = 1.
• Double hashing is a collision resolution technique used in conjunction with open addressing in hash tables.
• In this technique, we use two hash functions to find an empty slot in which to store a value.
• In the case of a collision we take the second hash function h2(k) and, in the ith iteration, probe the slot i * h2(k) positions away from h1(k).
• Double hashing requires more computation time because two hash functions need to be computed.
The hash function in the case of double hashing can be given as:
h(k, i) = [h1(k) + i*h2(k)] mod m
where m is the size of the hash table,
h1(k) and h2(k) are two hash functions,
h1(k) = k mod m,
h2(k) = P - (k mod P), with P a prime smaller than m as in the example above,
i is the probe number, which varies from 0 to m-1.
When we have to insert a key k in the hash table, we first probe the location given by h1(k) mod m, because during the first probe i = 0. If the location is vacant, the key is inserted into it.
If the location is not vacant, the value of i is increased to calculate the next location, using
h(k,1) = [h1(k) + 1*h2(k)] mod m.
For the next key, probing starts again from i = 0:
h(k,0) = [h1(k) + 0*h2(k)] mod m.
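Double hashing can be sketched with the second hash function taken from the worked example, H2(key) = P - (key mod P) with P = 7. The function name and the `prime` parameter are assumptions made for illustration.

```python
def double_hash_insert(table, key, prime=7):
    """Insert key using double hashing; return the index used, or None."""
    m = len(table)
    h1 = key % m
    h2 = prime - (key % prime)                # between 1 and prime, never 0
    for i in range(m):                        # i is the probe number
        index = (h1 + i * h2) % m
        if table[index] is None:
            table[index] = key
            return index
    return None                               # no free slot reached by the probes
```

With a table of size 10, the keys 67, 90 and 55 go to indexes 7, 0 and 5. Key 17 collides at index 7, takes a jump of H2(17) = 4, and is stored at index 1.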
2). Closed Addressing
In closed addressing, colliding keys are stored outside the hash table itself: each slot of the hash table is linked to a linked list, and if a collision occurs the key is stored in that linked list.
a). Chaining
In this method an additional link field, the chain, is stored with the data. A chain is maintained at the home bucket: when a collision occurs, a linked list is maintained for the colliding data.

Example: Consider a hash table of size 10 with the hash function H(key) = key % size of table, and let the keys to be inserted be 31, 33, 77, 61. Keys 31 and 61 both hash to bucket 1, so bucket 1 holds two records, which are maintained in a linked list, that is, by the chaining method.
• Chaining is a collision resolution technique in hash tables.
• A collision occurs when two keys are hashed to the same index in a hash table.
• It means the calculated hash value for the two keys is the same. Collisions are a problem because every slot in a hash table is supposed to store a single element.
• In the chaining approach, the hash table is an array of linked lists: each index of the hash table has its own linked list.
• If a collision arises, the new value is stored in the linked list of that index. At each index this linked list looks like a chain, which is why the technique is known as chaining.

Time complexity of Chaining

For Searching
In the worst case all the values are present in the linked list of the same index, so a sequential search is performed to find a value.
So in the worst case, the time complexity of searching is O(n).
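Chaining can be illustrated with a short sketch. As a simplifying assumption, each chain is represented by a Python list rather than a hand-built linked list, and the function names are hypothetical.

```python
def make_table(size):
    """Create a hash table in which every slot holds its own (empty) chain."""
    return [[] for _ in range(size)]

def chain_insert(table, key):
    """Add the key to the end of the chain at its home bucket."""
    table[key % len(table)].append(key)

def chain_search(table, key):
    """Search sequentially through the chain at the key's home bucket."""
    return key in table[key % len(table)]
```

After inserting 31, 33, 77 and 61 into a table of size 10, bucket 1 holds the chain [31, 61]. If every key hashed to the same bucket, a search would have to scan one chain of n keys, which gives the O(n) worst case.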
