
Chapter 2

Data
Introduction
● Data science is now one of the most influential fields around.
● Companies and enterprises are investing heavily in recruiting data science talent.

● Data science is a multi-disciplinary field that uses
 Scientific methods,
 Processes,
 Algorithms, and
 Systems
to extract knowledge and insights from structured, semi-structured, and unstructured data.

● Example: The data involved in buying a box of cereal from the store or supermarket
Data Science vs Data scientist
• Data science is defined as the extraction of actionable knowledge directly from data through:-
 The process of discovery,
 Hypothesis formulation, and
 Analysis of hypotheses.

• It is a process of effectively producing some tool, method, or product.

• A data scientist (a job title) is a person who engages in a systematic activity to acquire knowledge from data.
• A data scientist may refer to an individual who uses the scientific method on
existing data.
Role of a Data Scientist
• Advanced skills in analyzing large amounts of data, data mining, and programming.

• The processed and filtered data is fed to various analytics programs and machine learning models, together with statistical methods, to generate results that are then used in predictive analysis and other fields.
Data Science
• The scientific method requires data in order to iterate towards a more convincing hypothesis.

• Science doesn’t exist without data.

• A data scientist possesses:
• A strong quantitative background in statistics and linear algebra
• Programming knowledge with a focus on data warehousing, mining, and modeling, to build and analyze algorithms
Algorithms
• An algorithm is a set of instructions designed to perform a specific task.

• This can be a simple process, such as multiplying two numbers, or a complex operation, such as playing a compressed video file.

• Search engines use proprietary algorithms to display the most relevant results from their search index for specific queries.
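As a minimal illustration (not from the slides), the following Python sketch implements a simple algorithm: a linear search that scans a list for a query, loosely analogous in spirit to how a search engine looks up matches in its index (real engines use far more sophisticated, proprietary algorithms). The page titles are invented for the example.

```python
def linear_search(items, query):
    """Return the index of the first element equal to query, or -1 if absent."""
    for index, item in enumerate(items):
        if item == query:
            return index
    return -1

# Example usage: search a tiny "index" of page titles for a query.
pages = ["cereal prices", "data science intro", "hadoop tutorial"]
print(linear_search(pages, "hadoop tutorial"))  # 2
print(linear_search(pages, "big data"))         # -1
```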
Data vs. Information
• Data
• Can be defined as a representation of facts, concepts, or instructions in a
formalized manner, which should be suitable for:-
 Communication,
 Interpretation, or
 Processing, by humans or electronic machines.

• It can be described as unprocessed facts and figures

• It is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
• Information
• The processed data on which decisions and actions are based
• Information is interpreted data; created from organized, structured, and processed
data in a particular context
Data Processing Cycle
• Data processing is the conversion of raw data to meaningful information
through a process.

• The process includes activities like:-
 Data entry/input,
 Calculation/process,
 Output, and
 Storage
Input is the task where verified data is coded or converted into machine-readable form so that it can be processed by a computer.
• Data entry is done through the use of a keyboard, digitizer, scanner, or data entry
from an existing source.
Data Processing Cycle
Processing is the manipulation of data.
 It is the point where a computer program is being executed,
 involving the program code and its current activity.
Output is the stage where processed information is transmitted to the user.
 Output is presented to users in various formats such as printed reports, audio, video, or on a monitor.

Storage is the last stage in the data processing cycle, where data, instructions, and information are held for future use.
 The importance of this stage is that it allows quick access and retrieval of the processed information, so it can be passed directly to the next stage when needed.
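A minimal sketch of the four stages in Python (illustrative only), assuming a hypothetical list of exam scores as the input and the invented file name results.txt for storage.

```python
# Input: raw data entered (hard-coded here instead of keyboard/scanner entry).
raw_scores = ["55", "72", "90", "48"]

# Process: convert to machine-usable numbers and compute an average.
scores = [int(s) for s in raw_scores]
average = sum(scores) / len(scores)

# Output: present the processed information to the user.
print(f"Average score: {average:.1f}")

# Storage: hold the result for future use.
with open("results.txt", "w") as f:
    f.write(f"average={average}\n")
```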
Data types
• A data type is a way to tell the compiler what type of data is supposed to be stored and how much memory will be allocated to it.
• It restricts the compiler from storing anything outside that value range.

• Common data types include (see the sketch after this list):
• Integers (int): used to store whole numbers, mathematically known as integers
• Booleans (bool): used to represent values restricted to one of two options: true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of characters and numbers
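A short illustration of these data types in Python; note that Python infers types at runtime rather than requiring declarations, and it has no separate char type, so a one-character string stands in for it.

```python
count = 42        # integer (int): whole numbers
is_valid = True   # boolean (bool): restricted to True or False
grade = "A"       # character: represented in Python as a 1-character string
price = 19.99     # floating-point number (float): real numbers
label = "Item42"  # string (str): a combination of characters and digits

# Print each value together with the type Python assigned to it.
for value in (count, is_valid, grade, price, label):
    print(value, type(value).__name__)
```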
Data representation
• All computers represent data as nothing more than strings of ones and zeroes.

• In order for said ones and zeroes to convey any meaning, they
need to be contextualized.

• Data types provide that context.


• E.g. 01100001 can be read as the integer 97 or as the character 'a', depending on the data type.
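A small Python sketch (illustrative only) showing how the same bit pattern takes on different meanings depending on the type used to interpret it.

```python
bits = "01100001"

# The same eight bits mean different things under different data types.
as_int = int(bits, 2)   # interpreted as an unsigned integer -> 97
as_char = chr(as_int)   # interpreted as an ASCII character  -> 'a'

print(as_int, as_char)  # 97 a
```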
Data types from Data Analytics perspective
• Data analytics (DA) is the process of examining data sets in order to find trends and draw conclusions about the information they contain.

• Data analytics is done with the aid of specialized systems and software.

• From a data analytics point of view, it is important to understand that there are three common data types or structures:

• Structured,
• Semi-structured, and
• Unstructured data
Structured Data
• Structured data is data that adheres to a pre-defined data model and is
therefore straightforward to analyze.

• Structured data concerns all data that can be stored in a SQL database, in tables with rows and columns.
• It has relational keys and can easily be mapped into pre-designed fields.
• Structured data is highly organized information that loads neatly into a relational database.

• Structured data is relatively simple to enter, store, query, and analyze, but it must be strictly defined in terms of field name and type.
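As an illustration, a minimal sketch using Python's built-in sqlite3 module; the table name, column definitions, and sample rows are invented for the example.

```python
import sqlite3

# In-memory database; fields must be strictly defined by name and type.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO customers (id, name, age) VALUES (1, 'Abebe', 30)")
conn.execute("INSERT INTO customers (id, name, age) VALUES (2, 'Sara', 25)")

# Querying is straightforward because every row follows the same schema.
for row in conn.execute("SELECT name, age FROM customers WHERE age > 26"):
    print(row)  # ('Abebe', 30)
```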
Unstructured Data
• Unstructured data is information that either does not have a predefined data
model or is not organized in a pre-defined manner.

• Unstructured data may have its own internal structure, but it does not fit neatly into a spreadsheet or database.

• Most business interactions, in fact, are unstructured in nature.


• Today more than 80% of the data generated is unstructured.

• The fundamental challenge of unstructured data sources is that they are difficult
for business users and data analysts alike to understand.
Semi-structured Data
• Semi-structured data is information that doesn’t reside in a relational
database but that does have some organizational properties that
make it easier to analyze.

• Examples of semi-structured data: CSV, XML, and JSON documents are semi-structured, and NoSQL databases are also considered semi-structured.
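A minimal sketch of working with semi-structured data using Python's json module; the records and field names are invented for the example. The data has organizational properties (keys, nesting), but different records may carry different fields.

```python
import json

records = json.loads("""
[
  {"name": "Abebe", "email": "abebe@example.com"},
  {"name": "Sara", "phone": "+251-11-000000", "tags": ["vip"]}
]
""")

for record in records:
    # Fields are accessed by key; missing fields must be handled explicitly.
    print(record["name"], record.get("email", "no email on file"))
```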
Metadata – Data about Data
• Metadata is data about data.
 It is data that describes other data.
 It provides additional information about a specific set of data.
• Metadata summarizes basic information about data, which can make
finding and working with particular instances of data easier.
• For example, author, date created, date modified, and file size are examples of very basic document metadata.
• Having the ability to filter through that metadata makes it much easier for
someone to locate a specific document.
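As an illustration (not part of the original slides), a minimal Python sketch that reads basic file metadata; the file name example.txt is hypothetical and is created here only so the snippet runs.

```python
import os
import time

# Create a small document so the example is self-contained.
path = "example.txt"
with open(path, "w") as f:
    f.write("hello metadata")

info = os.stat(path)

# Basic metadata: data about the file, as opposed to its contents.
print("size (bytes):", info.st_size)
print("last modified:", time.ctime(info.st_mtime))
```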
Data value chain
• The Data Value Chain is introduced to describe the information flow in a big data system as a series of steps needed to generate value and useful insights from data.

Steps in the Data Value Chain
• Data acquisition, Data analysis, Data curation, Data storage, Data usage

Data acquisition is the process of digitizing data so that it can be displayed, analyzed, and stored in a computer.
• It is the process of bringing data that has been created by a source outside the organization into the organization for production use.
Data value chain
Data analysis is the process of evaluating data using
analytical and logical reasoning to examine each component
of the data provided.

• Data from various sources is gathered, reviewed, and then analyzed to form some sort of finding or conclusion.

• Data analytics is the process of extracting information from data in order to make a decision and subsequently act on it.
Data value chain
Data curation is about managing data throughout its lifecycle, including:-
Collecting,
Organizing, and
Cleaning.
• Data curators manage the data through various stages and make the data
usable for data analysts and scientists.

Data storage is defined as a way of keeping information in memory or on storage media for use by a computer.
• An example of data storage is a folder for storing Microsoft Word
documents.
Data usage is the amount of data (things like images, movies, photos,
videos, and other files) that you send, receive, download and/or upload.
Big Data Definition
• No single standard definition…

“Big Data”
 Refers to data sets that are too large or complex to be dealt with by traditional
data-processing application software
 Is data whose scale, diversity, and complexity require new:-
architecture,
techniques,
algorithms, and
analytics to manage it and extract value.

Big data
• Big Data is associated with the concept of the 3 Vs: volume, velocity, and variety.

• Big data is characterized by the 3 Vs and more:
• Volume: large amounts of data (zettabytes / massive datasets)
• Velocity: Data is live streaming or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing
• Cluster computing addresses the latest results in the fields that support High Performance Distributed Computing (HPDC).
• Clustering methods have been identified as HPC IaaS (Infrastructure as a Service) and HPC PaaS (Platform as a Service), which are more expensive and difficult to set up and maintain than a single computer.

• In HPDC environments, parallel and/or distributed computing techniques are applied to the solution of computationally intensive applications across networks of computers.
Clustered Computing
• “Computer cluster” basically refers to a set of connected computers working together.

• The cluster represents one system, and the objective is to improve performance.

• The computers are generally connected in a LAN (Local Area Network).

• So, when this cluster of computers works to perform some tasks and
gives an impression of only a single entity, it is called “cluster
computing”.
Clustered Computing
• Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
• Resource Pooling:
• Combining the available storage space to hold data is a clear benefit,
but CPU and memory pooling are also extremely important. Processing
large datasets requires large amounts of all three of these resources.

• Object pooling is a technique that keeps a group of objects (called the pool) in memory.

• Whenever a new object needs to be created, the pool is checked first; if an object is available, it is reused. This reuse of objects and system resources improves the scalability of a program.
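A minimal, simplified object-pool sketch in Python (not any particular library's API); the ConnectionPool class, its methods, and the stand-in connection objects are invented for illustration.

```python
class ConnectionPool:
    """A minimal object pool: reuse objects instead of creating new ones each time."""

    def __init__(self, factory, size):
        # Pre-create a fixed number of objects and keep them available.
        self._available = [factory() for _ in range(size)]

    def acquire(self):
        # Check the pool first; reuse an object if one is available.
        if self._available:
            return self._available.pop()
        raise RuntimeError("pool exhausted")

    def release(self, obj):
        # Return the object to the pool so it can be reused later.
        self._available.append(obj)


# Example usage with a hypothetical (stand-in) connection object.
pool = ConnectionPool(factory=lambda: object(), size=2)
conn = pool.acquire()
# ... use conn ...
pool.release(conn)
```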
Clustered Computing
• High Availability: In computing, the term availability is used to describe the
period of time when a service is available, as well as the time required by a
system to respond to a request made by a user.
• High availability is a quality of a system or component that assures a high
level of operational performance for a given period of time.
• Clusters can provide varying levels of fault tolerance and availability
guarantees to prevent hardware or software failures from affecting access to
data and processing. This becomes increasingly important as we continue to
emphasize the importance of real-time analytics.

• Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources on a machine.
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with
big data easier.
• It is a framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming models.
• The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical as ordinary computers can be used
for data processing.
• Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
• Flexible: It is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.
Hadoop and its Ecosystem
● Hadoop has an ecosystem that has evolved from its four core components:
● data management, access, processing, and storage.
● It is continuously growing to meet the needs of Big Data.
● It comprises the following components and many others:

○ HDFS: Hadoop Distributed File System

○ YARN: Yet Another Resource Negotiator

○ MapReduce: Programming-based data processing (a word-count sketch follows this list)

○ Spark: In-Memory data processing

○ PIG, HIVE: Query-based processing of data services

○ HBase: NoSQL Database

○ Zookeeper: Managing cluster
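The following is a plain-Python sketch of the MapReduce idea behind the classic word-count example, not Hadoop's actual API (which is typically written in Java and runs across a cluster); the documents are invented for the example.

```python
from collections import defaultdict

documents = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: each document is turned into (word, 1) pairs independently of the
# others, which is why this step can run in parallel across cluster nodes.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/reduce phase: pairs are grouped by key and their counts are summed.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # e.g. {'big': 3, 'data': 2, ...}
```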


Big Data Life Cycle
THANKS
End of chapter 2
Does anyone have any questions?
