
21ITE06 / Big Data Analytics

III Year / VI Semester


MODULE I INTRODUCTION TO BIG
DATA AND HADOOP FRAMEWORK
Introduction to Big Data: Types of Digital Data-Characteristics of Data – Evolution of
Big Data - Definition of Big Data – Challenges with Big Data – 3Vs of Big Data – Non-
Definitional traits of Big Data – Business Intelligence vs. Big Data - Data warehouse
and Hadoop environment – Coexistence. Big Data Analytics: Classification of analytics
– Data Science - Terminologies in Big Data – CAP Theorem – BASE Concept. NoSQL:
Types of Databases – Advantages – NewSQL - SQL vs. NoSQL vs. NewSQL. Introduction
to Hadoop: Features – Advantages - Versions – Overview of Hadoop Ecosystems –
Hadoop distributions – Hadoop vs. SQL – RDBMS vs. Hadoop - Hadoop Components –
Architecture - HDFS – Map Reduce: Mapper – Reducer - Combiner -Partitioner –
Searching – Sorting – Compression. Hadoop 2 (YARN): Architecture – Interacting with
Hadoop Ecosystems.
Objectives

• To understand the need of Big Data, challenges, and different analytical architectures
• Installation and understanding of Hadoop Architecture and its ecosystems
Course Outcome
1. Understand the evolution of big data, its characteristics, and the challenges it poses to traditional business intelligence.
2. Distinguish between big data analysis and analytics in optimizing business decisions.
3. Make use of appropriate components for processing, scheduling, and knowledge extraction from large volumes in the distributed Hadoop Ecosystem.
Unit I - INTRODUCTION TO BIG
DATA
Evolution of Big data - Best Practices for Big data Analytics -
Big data characteristics - Validating - The Promotion of the
Value of Big Data - Big Data Use Cases- Characteristics of
Big Data Applications - Perception and Quantification of
Value -Understanding Big Data Storage - A General
Overview of High-Performance Architecture - HDFS -
MapReduce and YARN - Map Reduce Programming Model.
DATA

The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
BIG DATA

Big Data is also data, but with a huge size. Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
BIG DATA
Units of Memory:
Byte
Kilobyte (KB)
Megabyte (MB)
Gigabyte (GB)
Terabyte (TB)
Petabyte (PB)
Exabyte (EB)
Zettabyte (ZB)
Yottabyte (YB)
BIG DATA - Sources
BIG DATA - Sources

• Primary sources of Big Data
  • Social data:
    • Likes
    • Tweets & Retweets
    • Comments
    • Video uploads and general media
BIG DATA - Sources

• Primary sources of Big Data
  • Machine data:
    • Industrial equipment
    • Sensors installed in machinery
    • Web logs that track user behavior
    • Sensors such as medical devices, smart meters, road cameras, satellites, games
BIG DATA - Sources

• Primary sources of Big Data
  • Transactional data:
    • Invoices
    • Payment orders
    • Storage records
    • Delivery receipts
BIG DATA – Data Structures
BIG DATA – Data Structures

• Structured data: Data containing a defined data type, format, and structure.
BIG DATA – Data Structures

• Semi-structured data: Information that does not reside in a relational database but has some organizational properties that make it easier to analyze.
  • Example: XML data
BIG DATA – Data Structures

• Quasi-structured data: Textual data with erratic data formats that can be formatted with effort, software tools, and time. An example of quasi-structured data is the data about which webpages a user visited and in what order.
BIG DATA – Data Structures

• Quasi-structured data:
BIG DATA – Data Structures

• Unstructured data: Data that has no inherent structure, which may include text documents, PDFs, images, and video.
BIG DATA – Data Structures

• A clickstream that can be parsed and mined by data scientists to discover usage patterns and uncover relationships among clicks and areas of interest on a website or group of sites.
Types of Data Repositories, from
an Analyst Perspective

Data Repository: Spreadsheets and data marts
Characteristics:
• Spreadsheets and low-volume databases for recordkeeping
• Analyst depends on data extracts
Types of Data Repositories, from
an Analyst Perspective
Data Repository: Data Warehouses
Characteristics:
• Centralized data containers in a purpose-built space
• Supports BI and reporting, but restricts robust analyses
• Analyst dependent on IT and DBAs for data access and schema changes
• Analysts must spend significant time to get aggregated and disaggregated data extracts from multiple sources
Types of Data Repositories, from
an Analyst Perspective
Data Repository: Analytic Sandbox (workspaces)
Characteristics:
• Data assets gathered from multiple sources and technologies for analysis
• Enables flexible, high-performance analysis in a nonproduction environment; can leverage in-database processing
• Reduces costs and risks associated with data replication into “shadow” file systems
• “Analyst owned” rather than “DBA owned”
State of the Practice in Analytics
Business Driver – Examples
• Optimize business operations – Sales, pricing, profitability, efficiency
• Identify business risk – Customer churn, fraud, default
• Predict new business opportunities – Upsell, cross-sell, best new customer prospects
• Comply with laws or regulatory requirements – Anti-Money Laundering, Fair Lending, Basel II-III, Sarbanes-Oxley (SOX)
BI Versus Data Science
BI Versus Data Science

• BI systems make it easy to answer questions related to:
  • Quarter-to-date revenue
  • Progress toward quarterly targets
  • How much of a given product was sold in a prior quarter or year
BI Versus Data Science

• Data Science tends to use disaggregated data in a more forward-looking, exploratory way, focusing on analyzing the present and enabling informed decisions about the future.
BI Versus Data Science

• BI problems tend to require highly structured data organized in rows and columns for accurate reporting.
• Data Science projects tend to use many types of data sources, including large or unconventional datasets.
Current Analytical Architecture

• Most organizations still have data warehouses that provide excellent support for traditional reporting and simple data analysis activities, but unfortunately have a more difficult time supporting more robust analyses.
Current Analytical Architecture
Current Analytical Architecture

• For data sources to be loaded into the data warehouse, data needs to be well understood, structured, and normalized with the appropriate data type definitions.
Current Analytical Architecture

• Although this kind of centralization enables security, backup, and failover of highly critical data, it also means that data typically must go through significant preprocessing and checkpoints before it can enter this sort of controlled environment.
Current Analytical Architecture

• As a result of this level of control on the EDW, additional local systems may emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis.
Current Analytical Architecture

• Once in the data warehouse, data is read by additional applications across the enterprise for BI and reporting purposes.
• These are high-priority operational processes getting critical data feeds from the data warehouses and repositories.
Current Analytical Architecture

• Analysts create data extracts from the EDW to analyze data offline in R or other local analytical tools.
Current Analytical Architecture

• Because new data sources slowly accumulate in the EDW due to the rigorous validation and data structuring process, data is slow to move into the EDW, and the data schema is slow to change.
Current Analytical Architecture

• Departmental data warehouses may have been originally designed for a specific purpose and set of business needs, some of which may be forced into existing schemas to enable BI and the creation of OLAP cubes for analysis and reporting.
Drivers of Big Data
Drivers of Big Data

• The data now comes from multiple sources, such as these:
  • Medical information, such as genomic sequencing and diagnostic imaging
  • Photos and video footage uploaded to the World Wide Web
Drivers of Big Data
• The data now comes from multiple sources, such as these:
  • Video surveillance, such as the thousands of video cameras spread across a city
  • Mobile devices, which provide geospatial location data of the users, as well as metadata about text messages, phone calls, and application usage on smart phones
Drivers of Big Data
• The data now comes from multiple sources, such as these:
  • Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures
  • Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing
Emerging Big Data Ecosystem and
a New Approach to Analytics
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data devices
  • “Sensornets” gather data from multiple locations and continuously generate new data about this data.
  • A video game provider captures data about the skill and levels attained by the player.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data devices
  • As a consequence, the game provider can fine-tune the difficulty of the game, suggest other related games that would most likely interest the user, and offer additional equipment and enhancements for the character based on the user’s age, gender, and interests.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data collectors
  • Retail stores tracking the path a customer takes through their store while pushing a shopping cart with an RFID chip, so they can gauge which products get the most foot traffic using geospatial data collected from the RFID chips.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data aggregators
  • Organizations compile data from the devices and usage patterns collected by government agencies, retail stores, and websites.
  • In turn, they can choose to transform and package the data as products to sell to list brokers, who may want to generate marketing lists of people who may be good targets for specific ad campaigns.
Emerging Big Data Ecosystem and
a New Approach to Analytics
• Data users and buyers
  • These groups directly benefit from the data collected and aggregated by others within the data value chain.
Characteristics of Big Data
• Volume: This refers to data that is tremendously large.
• Variety: The data is coming from different sources in various formats.
• Velocity: The speed of data accumulation also plays a role in determining whether the data is categorized as big data or normal data.
Characteristics of Big Data

• Value: Deals with mechanisms to derive the correct meaning from data.
• Veracity: Trustworthiness and quality of data.
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS
• There are a number of factors that need to be considered before making a decision regarding adopting big data technology.
• Even if big data is feasible within the organization, it does not necessarily mean that it is reasonable.
• A sample framework can be used for determining a score for each of these factors, ranging from 0 (lowest level) to 4 (highest level).
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Feasibility:
  • Is the enterprise aligned in a way that allows for new and emerging technologies to be brought into the organization, tested out, and assessed without overbearing organizational constraints?
  • If not, what steps can be taken to create an environment that is suited to the introduction and assessment of innovative technologies?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Reasonability:
  • When evaluating the feasibility of adopting big data technologies, have you considered whether your organization faces business challenges whose resource requirements exceed the capability of the existing or planned environment?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Reasonability:
  • If not currently, do you anticipate that the environment will change in the near-, medium-, or long-term to be more data-centric and require augmentation of the resources necessary for analysis and reporting?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Value:
  • Is there an expectation that the resulting quantifiable value that can be enabled as a result of big data warrants the resource and effort investment in development and productionalization of the technology?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Integrability:
  • What steps need to be taken to evaluate the means by which big data can be integrated as part of the enterprise?
VALIDATING THE HYPE: ORGANIZATIONAL
FITNESS

• Sustainability:
  • The costs associated with maintenance, configuration, skills maintenance, and adjustments to the level of agility in development may not be sustainable within the organization.
  • How would you plan to fund continued management and maintenance of a big data environment?
Quantifying Organizational Readiness

Feasibility
0: Evaluation of new technology is not officially sanctioned
1: Organization tests new technologies in reaction to market pressure
2: Organization evaluates and tests new technologies after market evidence of successful use
3: Organization is open to evaluation of new technology; adoption of technology on an ad hoc basis based on convincing business justifications
4: Organization encourages evaluation and testing of new technology; clear decision process for adoption or rejection; organization supports allocation of time to innovation
Quantifying Organizational Readiness
Reasonability
0: Organization’s resource requirements for near-, mid-, and long-terms are satisfactorily met
1: Organization’s resource requirements for near- and mid-terms are satisfactorily met; unclear as to whether long-term needs are met
2: Organization’s resource requirements for near-term are satisfactorily met; unclear as to whether mid- and long-term needs are met
3: Business challenges are expected to have resource requirements in the mid- and long-terms that will exceed the capability of the existing and planned environment
4: Business challenges have resource requirements that clearly exceed the capability of the existing and planned environment; organization’s go-forward business model is highly information-centric
Quantifying Organizational Readiness

Value
0: Investment in hardware resources, software tools, skills training, and ongoing management and maintenance exceeds the expected quantifiable value
1: The expected quantifiable value is evenly balanced by the investment in hardware resources, software tools, skills training, and ongoing management and maintenance
2: Selected instances of perceived value may suggest a positive return on investment
3: Expectations for some quantifiable value for investing in limited aspects of the technology
4: The expected quantifiable value widely exceeds the investment in hardware resources, software tools, skills training, and ongoing management and maintenance
Quantifying Organizational Readiness

Integrability
0: Significant impediments to incorporating any nontraditional technology into the environment
1: Willingness to invest effort in determining ways to integrate technology, with some successes
2: New technologies can be integrated into the environment within limitations and with some level of effort
3: Clear processes exist for migrating or integrating new technologies, but they require dedicated resources and level of effort
4: No constraints or impediments to fully integrating technology into the operational environment
Quantifying Organizational Readiness

Sustainability
0: No plan in place for acquiring funding for ongoing management and maintenance costs; no plan for managing skills inventory
1: Continued funding for maintenance and engagement is given on an ad hoc basis; sustainability is at risk on a continuous basis
2: Need for year-by-year business justifications for continued funding
3: Business justifications ensure continued funding and investments in skills
4: Program management office effective in absorbing and remunerating management and maintenance costs; program for continuous skills enhancement and training
The Promotion of the Value of Big Data
 A thoughtful approach must differentiate
between hype and reality, and one way to do
this is to review the difference between what is
being said about big data and what is being
done with big data.
The Promotion of the Value of Big Data
 A scan of existing content on the “value of big
data” sheds interesting light on what is being
promoted as the expected result of big data
analytics and, more interestingly, how familiar
those expectations sound.
The Promotion of the Value of Big Data
• The Center for Economics and Business Research (CEBR) speaks to the cumulative value of:
  • Optimized consumer spending as a result of improved targeted customer marketing
  • Improvements to research and analytics within the manufacturing sectors to lead to new product development
The Promotion of the Value of Big Data
• The Center for Economics and Business Research (CEBR) speaks to the cumulative value of:
  • Improvements in strategizing and business planning leading to innovation and new start-up companies
The Promotion of the Value of Big Data
• The Center for Economics and Business Research (CEBR) speaks to the cumulative value of:
  • Predictive analytics for improving supply chain management to optimize stock management, replenishment, and forecasting
  • Improving the scope and accuracy of fraud detection
The Promotion of the Value of Big Data
• These are the benefits promoted by business intelligence and data warehouse tools vendors and system integrators for the past 15-20 years, namely:
  • Better targeted customer marketing, improved product analytics, improved business planning, improved supply chain management, and improved analysis for fraud, waste, and abuse
Big Data Use Cases

• A scan of the list allows us to group most of those applications into these categories:
  • Business intelligence, querying, reporting, searching
  • Improved performance for common data management operations
Big Data Use Cases

• A scan of the list allows us to group most of those applications into these categories:
  • Non-database applications
  • Data mining and analytical applications
Big Data Use Cases

• The big data application can be further abstracted into more fundamental categories:
  • Counting – Filtering and aggregation
  • Scanning – Sorting, transformation, and searching
  • Modeling – Analysis and prediction
  • Storing – Large datasets, rapid access
Characteristics of Big Data
Applications
• The big data approach is mostly suited to addressing or solving business problems that are subject to one or more of the following criteria:
  • Data throttling, computation-restricted throttling, large data volumes, significant data variety, and benefits from data parallelization
Characteristics of Big Data
Applications
Application: Energy network monitoring and optimization
Characteristics: Data throttling; computation throttling; large data volumes
Sample Data Sources: Sensor data from smart meters and network components

Application: Credit fraud detection
Characteristics: Data throttling; computation throttling; large data volumes; parallelization; data variety
Sample Data Sources: Point-of-sale data; customer profiles; transaction histories; predictive models
Characteristics of Big Data
Applications
Application: Data profiling
Characteristics: Large data volumes; parallelization
Sample Data Sources: Sources selected for downstream repurposing

Application: Clustering and customer segmentation
Characteristics: Data throttling; computation throttling; large data volumes; parallelization; data variety
Sample Data Sources: Customer profiles; transaction histories; enhancement datasets
Characteristics of Big Data
Applications
Application: Recommendation engines
Characteristics: Data throttling; computation throttling; large data volumes; parallelization; data variety
Sample Data Sources: Customer profiles; transaction histories; enhancement datasets; social network data
Characteristics of Big Data
Applications
Application: Price modeling
Characteristics: Data throttling; computation throttling; large data volumes; parallelization
Sample Data Sources: Point-of-sale data; customer profiles; transaction histories; predictive models
Perception and Quantification of
Value
• Big data significantly contributes to adding value to the organization by:
  • Increasing revenues
  • Lowering costs (operating costs)
  • Increasing productivity
  • Reducing risk
Understanding Big Data Storage

• Most, if not all, big data applications achieve their performance and scalability through deployment on a collection of storage and computing resources bound together within a runtime environment.
Understanding Big Data Storage

• The ability to design, develop, and implement a big data application is directly dependent on an awareness of the architecture of the underlying computing platform (both hardware and software).
Understanding Big Data Storage -
Resource
• Processing capability
  • CPU, processor, or node
  • Modern processing nodes often incorporate multiple cores, which are individual CPUs that share the node’s memory and are managed and scheduled together, allowing multiple tasks to be run simultaneously (multithreading).
Understanding Big Data Storage -
Resource
• Memory, which holds the data that the processing node is currently working on. Most single-node machines have a limit to the amount of memory.
Understanding Big Data Storage -
Resource
• Storage, the place where datasets are loaded, and from which the data is loaded into memory to be processed.
Understanding Big Data Storage -
Resource
• Network, which provides the “pipes” through which datasets are exchanged between different processing and storage nodes.
Understanding Big Data Storage -
Resource
• Single-node computers are limited in their capacity; they cannot easily accommodate massive amounts of data.
Understanding Big Data Storage -
Resource
• High-performance platforms are composed of collections of computers in which the massive amounts of data and requirements for processing can be distributed among a pool of resources.
A General View of High
Performance Architecture
• Multiple nodes are connected together via a variety of network topologies.
• The general architecture distinguishes the management of computing resources from the management of the data across the network of storage nodes.
A General View of High
Performance Architecture
A General View of High
Performance Architecture
• A master job manager oversees the pool of processing nodes, assigns tasks, and monitors the activity.
• A storage manager oversees the data storage pool and distributes datasets across the collection of storage resources.
A General View of High
Performance Architecture
• To get a better understanding of the layering and interactions within a big data platform, we will examine the Apache Hadoop software stack.
HDFS
• HDFS attempts to enable the storage of large files, and does this by distributing the data among a pool of data nodes.
• A single name node runs in a cluster, associated with one or more data nodes, and provides the management of a typical hierarchical file organization and namespace.
HDFS
• The name node effectively coordinates the interaction with the distributed data nodes.
• The name node maintains metadata about each file. That metadata includes an enumeration of the managed files, properties of the files, and the file system, as well as the mapping of blocks to files at the data nodes.
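
A rough sketch in Python of the kind of bookkeeping the name node performs (illustrative only; the dictionary layout and names such as block_locations are hypothetical, not HDFS internals):

# Toy model of name node metadata: files -> blocks, blocks -> data nodes holding replicas.
namenode_metadata = {
    "files": {
        "/logs/clickstream.txt": {
            "size_bytes": 400 * 1024 * 1024,   # file properties
            "blocks": ["blk_0001", "blk_0002", "blk_0003", "blk_0004"],
        }
    },
    "block_locations": {                        # which data nodes hold each block
        "blk_0001": ["datanode-1", "datanode-3", "datanode-7"],
        "blk_0002": ["datanode-2", "datanode-5", "datanode-6"],
        "blk_0003": ["datanode-1", "datanode-4", "datanode-8"],
        "blk_0004": ["datanode-3", "datanode-5", "datanode-9"],
    },
}

# The name node can then answer: which data nodes hold the blocks of this file?
for block in namenode_metadata["files"]["/logs/clickstream.txt"]["blocks"]:
    print(block, "->", namenode_metadata["block_locations"][block])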
HDFS
HDFS
• The data node itself does not manage any information about the logical HDFS file.
• Rather, it treats each data block as a separate file and shares the critical information with the name node.
HDFS
• Once a file is created, as data is written to the file, it is actually cached in a temporary file.
HDFS
• When the amount of data in that temporary file is enough to fill a block in an HDFS file, the name node is alerted to transition that temporary file into a block that is committed to a permanent data node, which is also then incorporated into the file management scheme.
HDFS
• HDFS provides a level of fault tolerance through data replication.
• An application can specify the degree of replication (i.e., the number of copies made) when a file is created.
HDFS
• HDFS provides performance through distribution of data and fault tolerance through replication.
• The result is a level of robustness for reliable massive file storage.
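
A minimal sketch of the idea in Python, assuming a toy cluster of named data nodes and a caller-chosen replication factor; real HDFS placement is rack-aware and more sophisticated than this round-robin scheme:

from itertools import cycle

def place_blocks(num_blocks, data_nodes, replication=3):
    """Toy block placement: spread each block's replicas across distinct nodes."""
    placement = {}
    node_cycle = cycle(data_nodes)
    for b in range(num_blocks):
        replicas = set()
        while len(replicas) < min(replication, len(data_nodes)):
            replicas.add(next(node_cycle))
        placement[f"blk_{b:04d}"] = sorted(replicas)
    return placement

# A 512 MB file with 128 MB blocks -> 4 blocks, each stored on 3 of the 5 nodes.
nodes = [f"datanode-{i}" for i in range(1, 6)]
for block, holders in place_blocks(4, nodes, replication=3).items():
    print(block, holders)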
HDFS - Key tasks for failure
management
• Monitoring:
  • There is continuous “heartbeat” communication from the data nodes to the name node.
  • If a data node’s heartbeat is not heard by the name node, the data node is considered to have failed and is no longer available.
  • A replica is employed to replace the failed node, and a change is made to the replication scheme.
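
A toy sketch of heartbeat-based failure detection; the timeout value and function names are illustrative assumptions, not the actual HDFS settings:

import time

HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a node is treated as dead (illustrative)

# Last time each data node was heard from; datanode-2 has been silent for two minutes.
last_heartbeat = {"datanode-1": time.time(), "datanode-2": time.time() - 120}

def dead_nodes(last_seen, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Return data nodes whose last heartbeat is older than the timeout."""
    now = now if now is not None else time.time()
    return [node for node, seen in last_seen.items() if now - seen > timeout]

# Any node reported here would have its blocks re-replicated from surviving replicas.
print(dead_nodes(last_heartbeat))   # -> ['datanode-2']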
HDFS - Key tasks for failure
management
• Rebalancing:
  • A process of automatically migrating blocks of data from one data node to another when there is free space, or when there is an increased demand for the data and moving it may improve performance.
HDFS - Key tasks for failure
management
• Managing integrity:
  • HDFS uses checksums, which are effectively “digital signatures” associated with the actual data stored in a file, that can be used to verify that the data stored corresponds to the data shared or received.
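
A small sketch of the checksum principle using Python's hashlib; HDFS itself records per-block CRC-style checksums rather than SHA-256, so this only illustrates the verification idea:

import hashlib

def checksum(data: bytes) -> str:
    """Digest of the stored bytes (SHA-256 here is illustrative, not HDFS's actual scheme)."""
    return hashlib.sha256(data).hexdigest()

stored_block = b"row1,row2,row3"
recorded = checksum(stored_block)          # computed when the block is written

# Later, on read, recompute and compare to detect corruption.
received_block = b"row1,rowX,row3"         # pretend a bit flipped on disk or in transit
print("intact:", checksum(stored_block) == recorded)     # True
print("corrupt:", checksum(received_block) == recorded)  # False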
HDFS - Key tasks for failure
management
• Metadata replication:
  • The metadata files are also subject to failure, and HDFS can be configured to maintain replicas of the corresponding metadata files to protect against corruption.
HDFS - Benefits
• Decreasing the cost of specialty large-scale storage systems
• Providing the ability to rely on commodity components
• Enabling the ability to deploy using cloud-based services
• Reducing system management costs
MapReduce

• MapReduce originally combined both job management and the programming model for execution.
• The MapReduce execution environment employs a master/slave execution model.
MapReduce

• One master node (called the JobTracker) manages a pool of slave computing resources (called TaskTrackers) that are called upon to do the actual work.
MapReduce – JobTracker
• Responsibilities:
  • Managing the TaskTrackers
  • Monitoring their accessibility and availability
  • Job management - scheduling tasks, tracking the progress of assigned tasks, reacting to identified failures, and ensuring fault tolerance of the execution
MapReduce – TaskTracker

• Responsibilities:
  • Wait for a task assignment
  • Initiate and execute the requested task
  • Provide status back to the JobTracker on a periodic basis
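
A toy, single-process sketch of this master/slave interaction; the class names mirror the JobTracker/TaskTracker roles, but nothing here is Hadoop's actual implementation:

from collections import deque

class JobTracker:
    """Toy master: hands out tasks and records their completion."""
    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.completed = []

    def assign_task(self):
        return self.pending.popleft() if self.pending else None

    def report_status(self, task, status):
        if status == "done":
            self.completed.append(task)
        else:                      # a failed task is re-queued for fault tolerance
            self.pending.append(task)

class TaskTracker:
    """Toy slave: asks for an assignment, executes it, and reports back."""
    def __init__(self, name):
        self.name = name

    def run(self, tracker):
        task = tracker.assign_task()
        while task is not None:
            tracker.report_status(task, "done")
            task = tracker.assign_task()

jt = JobTracker(tasks=[f"map-{i}" for i in range(4)] + ["reduce-0"])
TaskTracker("tt-1").run(jt)
print(jt.completed)   # ['map-0', 'map-1', 'map-2', 'map-3', 'reduce-0']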
MapReduce

• Limitations
  • Applications that demand data movement will rapidly become bogged down by network latency issues.
  • Not all applications are easily mapped to the MapReduce model.
MapReduce

• Limitations
  • The allocation of processing nodes within the cluster is fixed through allocation of certain nodes as “map slots” versus “reduce slots.”
  • While one phase is running, the nodes assigned to the other phase are largely unused, resulting in processor underutilization.
YARN

• YARN – Yet Another Resource Negotiator
• Overall resource management has been centralized, while management of resources at each node is now performed by a local NodeManager.
YARN

• YARN introduces the concept of an ApplicationMaster, associated with each application, that directly negotiates with the central ResourceManager for resources while taking over the responsibility for monitoring progress and tracking status.
YARN
• This allows applications to be better aware of the data allocation across the topology of the resources within a cluster.
• It allows for improved colocation of compute and data resources, reducing data motion and, consequently, reducing delays associated with data access latencies.
MapReduce Programming Model

• MapReduce is used to develop applications that read, analyze, transform, and share massive amounts of data.
• Application development in MapReduce is a combination of the familiar procedural/imperative approaches used by Java or C++ programmers.
MapReduce Programming Model

• Operations:
  • Map, which describes the computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs.
  • Reduce, in which the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results.
MapReduce Programming Model
• Processes huge amounts of data in a parallel, reliable, and efficient way in cluster environments.
• Uses a divide-and-conquer technique to process large amounts of data.
• It divides the input task into smaller, manageable sub-tasks and executes them in parallel.
MapReduce Programming Model

Steps:
• Map function
• Shuffle function
• Reduce function
MapReduce Programming Model

Map function
• It takes input tasks and divides them into smaller sub-tasks.
• Sub-steps:
  • Splitting - takes the input dataset from the source and divides it into smaller sub-datasets.
  • Mapping - takes those smaller sub-datasets and performs the required action or computation on each sub-dataset.
MapReduce Programming Model

Map function
• The output of this Map function is a set of key and value pairs, as <Key, Value>.
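
A minimal word-count Map step sketched in Python (illustrative only, not Hadoop's Java API): Splitting breaks the input into lines, and Mapping emits a <word, 1> pair for every word:

def split_input(text):
    """Splitting: divide the input dataset into smaller sub-datasets (here, lines)."""
    return text.splitlines()

def map_function(line):
    """Mapping: emit a <Key, Value> pair for every word in one sub-dataset."""
    return [(word.lower(), 1) for word in line.split()]

text = "Deer Bear River\nCar Car River\nDeer Car Bear"
mapped = [pair for line in split_input(text) for pair in map_function(line)]
print(mapped)   # [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ...]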
MapReduce Programming Model
Shuffle function
• Sub-steps:
  • Merging - combines all key-value pairs that have the same keys.
  • Sorting - takes the input from the Merging step and sorts all key-value pairs by their keys.
• The Shuffle function returns a list of <Key, List<Value>> sorted pairs to the next step.
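
Continuing the word-count sketch, the Shuffle step merges pairs that share a key and sorts the groups, yielding <Key, List<Value>> pairs:

from collections import defaultdict

# Output of the Map step above (repeated here so the snippet runs on its own).
mapped = [('deer', 1), ('bear', 1), ('river', 1), ('car', 1), ('car', 1),
          ('river', 1), ('deer', 1), ('car', 1), ('bear', 1)]

def shuffle_function(mapped_pairs):
    """Merging: group values that share a key; Sorting: order the groups by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

print(shuffle_function(mapped))
# [('bear', [1, 1]), ('car', [1, 1, 1]), ('deer', [1, 1]), ('river', [1, 1])]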
MapReduce Programming Model

• Reduce function:
  • Takes the list of <Key, List<Value>> sorted pairs from the Shuffle function and performs the reduce operation.
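
Finally, the Reduce step combines the values for each key; applied to the shuffled word-count pairs it produces the final counts (again just an in-memory model of the idea, not a distributed Hadoop job):

def reduce_function(key, values):
    """Reduce: combine all values associated with one key (here, sum the counts)."""
    return key, sum(values)

shuffled = [('bear', [1, 1]), ('car', [1, 1, 1]), ('deer', [1, 1]), ('river', [1, 1])]
results = [reduce_function(key, values) for key, values in shuffled]
print(results)   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]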