DA Unit 2


Unit – 2

1. Data Processing
Data in its raw form is not useful to any organization.
Data processing is the method of collecting raw data and translating it into usable
information.
It is usually performed in a step-by-step process by a team of data
scientists and data engineers in an organization. The raw data is collected, filtered,
sorted, processed, analyzed, stored, and then presented in a readable format.
Data processing is essential for organizations to create better business strategies
and increase their competitive edge.
By converting the data into readable formats like graphs, charts, and documents,
employees throughout the organization can understand and use the data.
Now that we’ve established what we mean by data processing, let’s examine the
data processing cycle.
2. Data Processing Cycle
The data processing cycle consists of a series of steps where raw data (input) is fed
into a system to produce actionable insights (output). Each step is taken in a
specific order, but the entire process is repeated in a cyclic manner. The first data
processing cycle's output can be stored and fed as the input for the next cycle.

Generally, there are six main steps in the data processing cycle:
Step 1: Collection
The collection of raw data is the first step of the data processing cycle. The type of
raw data collected has a huge impact on the output produced. Hence, raw data
should be gathered from defined and accurate sources so that the subsequent
findings are valid and usable. Raw data can include monetary figures, website
cookies, profit/loss statements of a company, user behavior, etc.
Step 2: Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw data
to remove unnecessary and inaccurate data. Raw data is checked for errors,
duplication, miscalculations or missing data, and transformed into a suitable form
for further analysis and processing. This is done to ensure that only the highest
quality data is fed into the processing unit.
The purpose of this step is to remove bad data (redundant, incomplete, or incorrect
data) and begin assembling high-quality information that can be used effectively
for business intelligence.
Step 3: Input
In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner
or any other input source.
Step 4: Data Processing
In this step, the raw data is subjected to various data processing methods using
machine learning and artificial intelligence algorithms to generate a desirable
output. This step may vary slightly from process to process depending on the
source of data being processed (data lakes, online databases, connected devices,
etc.) and the intended use of the output.
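As a toy continuation of the preparation example above (hypothetical data again,
not a specific machine-learning pipeline), a processing step might aggregate the
cleaned records into a summary that the output step can then present:

    import pandas as pd

    cleaned = pd.DataFrame({
        "region": ["north", "south", "north", "south"],
        "amount": [250.0, 99.5, 310.0, 75.5],
    })

    # A simple processing step: aggregate cleaned records into per-region
    # totals and averages, ready to be charted or tabulated at the output step.
    summary = cleaned.groupby("region")["amount"].agg(["sum", "mean"])
    print(summary)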
Step 5: Output
The data is finally transmitted and displayed to the user in a readable form like
graphs, tables, vector files, audio, video, documents, etc. This output can be stored
and further processed in the next data processing cycle.
Step 6: Storage
The last step of the data processing cycle is storage, where data and metadata are
stored for further use. This allows for quick access and retrieval of information
whenever needed, and also allows it to be used as input in the next data processing
cycle directly.
3. Types of Data Processing
There are different types of data processing based on the source of data and the
steps taken by the processing unit to generate an output. There is no
one-size-fits-all method that can be used for processing raw data.

Batch Processing: Data is collected and processed in batches. Used for large
amounts of data. Eg: a payroll system.

Real-time Processing: Data is processed within seconds of the input being given.
Used for small amounts of data. Eg: withdrawing money from an ATM.

Online Processing: Data is automatically fed into the CPU as soon as it becomes
available. Used for continuous processing of data. Eg: barcode scanning.

Multiprocessing: Data is broken down into frames and processed using two or
more CPUs within a single computer system. Also known as parallel processing.
Eg: weather forecasting.

Time-sharing: Allocates computer resources and data in time slots to several users
simultaneously.
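To make the batch/real-time distinction above concrete, here is a small Python
sketch (the function and variable names are illustrative, not from any particular
system): the batch function runs once over an accumulated dataset, while the
real-time function reacts to each input as it arrives:

    def process_batch(transactions):
        # Batch processing: operate on the whole stored dataset in one run,
        # e.g. a nightly payroll computation.
        return sum(transactions)

    def process_realtime(withdrawal, balance):
        # Real-time processing: respond to a single input within seconds,
        # e.g. an ATM withdrawal updating an account balance.
        return balance - withdrawal

    print(process_batch([120.0, 75.5, 310.0]))    # run once over stored data
    print(process_realtime(50.0, balance=200.0))  # run per event as it arrives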

4. Data Processing Methods


There are three main data processing methods: manual, mechanical, and
electronic.
4.1. Manual Data Processing
This data processing method is handled manually. The entire process of data
collection, filtering, sorting, calculation, and other logical operations is carried out
with human intervention, without the use of any electronic device or automation
software. It is a low-cost method that requires little to no tooling, but it produces
high error rates, high labor costs, and a great deal of time and tedium.
4.2. Mechanical Data Processing
Data is processed mechanically through the use of devices and machines. These
can include simple devices such as calculators, typewriters, printing press, etc.
Simple data processing operations can be achieved with this method. It produces
far fewer errors than manual data processing, but the growth of data has made this
method more complex and difficult.
4.3. Electronic Data Processing
Data is processed with modern technologies using data processing software and
programs. A set of instructions is given to the software to process the data and
yield output. This method is the most expensive but provides the fastest processing
speeds with the highest reliability and accuracy of output.
Examples of Data Processing
Data processing occurs in our daily lives whether we are aware of it or not.
Here are some real-life examples of data processing:
• Stock trading software that converts millions of stock data points into a simple
graph
• An e-commerce company using the search history of customers to recommend
similar products
• A digital marketing company using demographic data to strategize
location-specific campaigns
• A self-driving car using real-time sensor data to detect pedestrians and other
cars on the road
4.4. Batch Data Processing
Batch Data Processing is when processing and analysis happen on data that has
been stored for a period of time. It is often applied to large datasets such as
payroll, credit card, or banking transactions.
4.5. Real-time Data Processing
Real-time Data Processing is when data is processed within a short period of time.
It is used when results are required quickly, for example in stock trading.
4.6. Automatic Data Processing
Automatic Data Processing is when a tool or software is used to store, organize,
filter and analyze the data. It is also known as Automated Data Processing.
Moving From Data Processing to Analytics
If we had to pick one thing that stands out as the most significant game-changer in
today’s business world, it’s big data. Although it involves handling a staggering
amount of information, the rewards are undeniable. That’s why companies that
want to stay competitive in the 21st-century marketplace need an effective data
processing strategy.
Analytics, the process of finding, interpreting, and communicating meaningful
patterns in data, is the next logical step after data processing. Whereas data
processing changes data from one form to another, analytics takes those newly
processed forms and makes sense of them.
But no matter which of these processes data scientists are using, the sheer volume
of data and the analysis of its processed forms require greater storage and access
capabilities, which leads us to the next section!
Future of Data Processing
The future of data processing can best be summed up in one short phrase: cloud
computing.
While the six steps of data processing remain immutable, cloud technology has
provided spectacular advances that give data analysts and scientists the fastest,
most advanced, most cost-effective, and most efficient data processing methods
available today.
The cloud lets companies blend their platforms into one centralized system that’s
easy to work with and adapt. Cloud technology allows seamless integration of new
upgrades and updates to legacy systems while offering organizations immense
scalability.
Cloud platforms are also affordable and serve as a great equalizer between large
organizations and smaller companies.
So, the same IT innovations that created big data and its associated challenges have
also provided the solution. The cloud can handle the huge workloads that are
characteristic of big data operations.
Data contains a lot of useful information for organizations, researchers,
institutions, and individual users. With the increase in the amount of data being
generated every day, there is a growing need for data scientists and data engineers
to help make sense of this data.

4.7. Online Processing
Online Processing is a method that utilizes Internet connections and equipment
directly attached to a computer. It is used mainly for information recording and
research. Real-Time Processing is a technique that has the ability to respond almost
immediately to various signals in order to acquire and process information.
4.8. Distributed Processing
Distributed data processing is the distribution of massive amounts of data across
several nodes running in a cluster. All the nodes execute their allotted tasks in
parallel, working in conjunction with one another over a network. The entire
setup is scalable and highly available.
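The sketch below imitates this idea on a single machine (a simplification: Python
worker processes stand in for cluster nodes): the data is split into chunks, each
worker processes its chunk in parallel, and the partial results are combined:

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Each worker (standing in for a cluster node) handles its share.
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Split the data across four workers and process the chunks in parallel.
        chunks = [data[i::4] for i in range(4)]
        with Pool(processes=4) as pool:
            partials = pool.map(process_chunk, chunks)
        # Combine the partial results into the final answer.
        print(sum(partials))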
5. Files
Computers store information permanently: files hold users' data for long periods
of time. Files can contain any type of information, meaning they can store text,
images or pictures, or data in any other format. There must therefore be some
mechanism for storing information, accessing it, and performing operations on
files.
Every file has its own type and its own name. When we store a file in the system,
we must specify both its name and its type. The name can be any valid name, and
the type indicates the application with which the file is linked.
In other words, every file belongs to a particular type of application software.
When we give a file a name, we also specify its extension, because the system uses
the extension to open the file's contents in the corresponding application. For
example, a file that contains a painting will be opened in a paint program.
1) Ordinary files (simple files): An ordinary file may belong to any type of
application, for example Notepad, Paint, a C program, songs, etc. All the files
created by a user are ordinary files. Ordinary files are used for storing the
information of user programs; with their help we can store information containing
text, a database, an image, or any other type of data.
2) Directory files: Files that are stored in a particular directory or folder are
directory files: they belong to a directory and are stored within it. For example,
all the files inside a folder named Songs are directory files.
3) Special files: Special files are those that are not created by the user; they are
the files necessary to run the system, created by the system itself. All the files of
an operating system such as Windows are special files. There are many kinds of
special files: system files, Windows files, and input/output files. System files are
stored with the .sys extension.
4) FIFO files: First-in, first-out files are used by the system to execute processes
in a particular order: the requests that arrive first are executed first. When users
request services from the system, the requests are arranged in a file and performed
in the sequence in which they were received. This ordering is known as first-in,
first-out (FIFO) order.
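A minimal Python sketch of this ordering (a queue of hypothetical user requests,
not an actual operating-system structure):

    from collections import deque

    # Requests are served strictly in arrival order: first in, first out.
    requests = deque()
    requests.append("print job from user A")  # arrives first
    requests.append("print job from user B")
    requests.append("print job from user C")

    while requests:
        current = requests.popleft()  # the oldest request is executed first
        print("executing:", current)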
6. File Operations
Files are not made just for reading their contents; we can also perform several
other operations on files, as explained below:
1) Read: retrieve the information stored in the file.
2) Write: insert new contents into the file.
3) Rename: change the name of the file.
4) Copy: copy the file from one location to another.
5) Sort: arrange the contents of the file.
6) Move (cut): move the file from one place to another.
7) Delete: remove the file.
8) Execute: run the file so that it displays its output.
We can also link a file with any other file. Such links are called symbolic links:
the files are linked using some text or an alias. When a user clicks on the special
text or alias, the linked file opens. In this way, files are linked with each other
using names and locations. Always remember that removing a symbolic link from
the system does not affect the actual file; the original file remains safely in its
location.
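The standard library sketch below runs through most of these operations on a
scratch file (the file names are made up for the example; note that creating
symbolic links may require extra privileges on Windows):

    import os
    import shutil
    from pathlib import Path

    path = Path("notes.txt")
    path.write_text("first line\n")        # write: insert new contents
    print(path.read_text())                # read: retrieve stored contents

    path.rename("notes_renamed.txt")       # rename the file
    shutil.copy("notes_renamed.txt", "backup.txt")  # copy to another location
    shutil.move("backup.txt", "archive.txt")        # move/cut the file

    os.symlink("notes_renamed.txt", "alias.txt")  # symbolic link via an alias
    os.remove("alias.txt")   # removing the link leaves the original intact

    os.remove("archive.txt")               # delete the files
    os.remove("notes_renamed.txt")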
7. File Organization
• A file is a collection of records. Using the primary key, we can access the
records. The type and frequency of access is determined by the type of file
organization used for a given set of records.
• File organization is a logical relationship among various records. It defines how
file records are mapped onto disk blocks.
• File organization describes the way records are stored in terms of blocks, and
the way blocks are placed on the storage medium.
• The first approach to mapping the database to files is to use several files and
store only fixed-length records in any given file. An alternative approach is to
structure the files so that they can hold records of multiple lengths.
• Files of fixed-length records are easier to implement than files of variable-length
records.
7.1. Objective Of File Organization
• It provides an optimal selection of records, i.e., records can be selected as
quickly as possible.
• Insert, delete, and update transactions on the records should be quick and easy.
• Duplicate records should not be induced as a result of an insert, update, or
delete.
• Records should be stored efficiently, so that the cost of storage is minimal.
7.2. Types of File Organization
Sequential File
A sequential file contains records organized by the order in which they were
entered. The order of the records is fixed. Records in sequential files can be read
or written only sequentially. After you place a record into a sequential file, you
cannot shorten, lengthen, or delete the record.
7.2.1. Sequential File Organization
1. A sequential file is designed for efficient processing of records in sorted order on
some search key.
• Records are chained together by pointers to permit fast retrieval in search-key
order.
• Each pointer points to the next record in order.
• Records are stored physically in search-key order (or as close to it as possible).
• This minimizes the number of block accesses.
2. It is difficult to maintain physical sequential order as records are inserted and
deleted.
• Deletion can be managed with the pointer chains.
• Insertion poses problems if there is no space where the new record should go.
• If space is available, use it; otherwise, put the new record in an overflow block.
• Adjust the pointers accordingly.
• Problem: we now have some records out of physical sequential order.
• If very few records are in overflow blocks, this works well.
• If the order is lost, reorganize the file.
• Reorganizations are expensive and are done when the system load is low.
3. If insertions rarely occur, we could keep the file in physically sorted order and
reorganize it when an insertion occurs. In this case, the pointer fields are no longer
required.
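A minimal Python sketch of these ideas (heavily simplified: list positions stand in
for disk addresses, appended records play the role of an overflow block, and the
head of the chain is assumed to keep the smallest key):

    # Records are kept in search-key order via a pointer chain; "next" holds
    # the position of the next record in key order, or None at the end.
    records = [
        {"key": 10, "data": "A", "next": 1},
        {"key": 20, "data": "B", "next": 2},
        {"key": 40, "data": "C", "next": None},
    ]

    def insert(key, data):
        # Place the new record in the "overflow" area (the end of the list)
        # and splice the pointer chain so logical key order is preserved.
        # (Assumes the new key is larger than the head record's key.)
        prev, i = None, 0
        while i is not None and records[i]["key"] < key:
            prev, i = i, records[i]["next"]
        records.append({"key": key, "data": data, "next": i})
        records[prev]["next"] = len(records) - 1

    insert(30, "D")  # stored out of physical order, but chained correctly

    i = 0
    while i is not None:  # traverse in search-key order via the pointers
        print(records[i]["key"], records[i]["data"])
        i = records[i]["next"]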
Indexed File
An indexed file contains records ordered by a record key. A record key
uniquely identifies a record and determines the sequence in which it is accessed
with respect to other records. Each record contains a field that contains the record
key.
7.2.2. Indexed File Organization
Indexed file organization is the storage of records either sequentially or non-
sequentially with an index that allows software to locate individual records.
An index is a table or other data structure used to determine the location of
rows in a file that satisfy some condition. Each index entry matches a key
value with one or more records. An index can point to unique records (a
primary key index) or potentially to more than one record. A secondary key is
one field or a combination of fields for which more than one record may have
the same combination of values; it is also called a non-unique key. When the
terms primary and secondary index are used, there are four types of
indexes:
1. Unique primary index (UPI) is an index on a unique field, possibly the
primary key of the table, which not only is used to find table rows based on this
field value but also is used by the DBMS to determine where to store a row based
on the primary index field value.
2. Non-unique primary index (NUPI) is an index on a non-unique field, which
not only is used to find table rows based on this field value but also is used by the
DBMS to determine where to store a row based on the primary index field value.
3. Unique secondary index (USI) is an index on a unique field, which is used
only to find table rows based on this field value.
4. Non-unique secondary index (NUSI) is an index on a non-unique field, which
is used only to find table rows based on this field value.
One of the most powerful capabilities of indexed file organizations is the ability
to create multiple indexes.
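A toy Python version of a unique primary index (the file name and record layout
are invented for the example): the index maps each key to the byte offset of its
record, so a lookup can seek straight to the record instead of scanning the file:

    records = [("101", "Alice"), ("205", "Bob"), ("309", "Carol")]

    # Build the index while writing: key -> byte offset of the record.
    index = {}
    with open("employees.dat", "wb") as f:
        for key, name in records:
            index[key] = f.tell()       # remember where this record starts
            f.write(f"{key},{name}\n".encode())

    def lookup(key):
        with open("employees.dat", "rb") as f:
            f.seek(index[key])          # jump directly to the record
            return f.readline().decode().strip()

    print(lookup("205"))                # -> 205,Bob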

7.2.3. Random Or Direct File Organization
Records are stored randomly but accessed directly. To access a file stored
randomly, a record key is used to determine where a record is stored on the storage
media. Magnetic and optical disks allow data to be stored and accessed randomly.
Records are accessed quickly (i.e., there is fast access to records), and files are
easily updated (adding, deleting, and amending records is easily achieved). The
method does not require the use of indexes, hence saving space.
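A small Python sketch of the idea (hash buckets stand in for disk addresses; the
key-to-address rule alone locates a record, with no index involved):

    NUM_BUCKETS = 8
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def address(key):
        # Compute the storage location directly from the record key.
        return hash(key) % NUM_BUCKETS

    def store(key, value):
        buckets[address(key)].append((key, value))

    def fetch(key):
        # Direct access: only the one computed bucket is examined.
        for k, v in buckets[address(key)]:
            if k == key:
                return v

    store("ACC-1001", "Alice")
    store("ACC-2002", "Bob")
    print(fetch("ACC-2002"))  # -> Bob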
