
What is Big Data?

Big Data refers to large data sets that come from many sources and in many data formats, and that can be processed and analyzed to find insights and patterns used to make informed decisions.

According to the American IT research and advisory firm Gartner Inc.,

“Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.”

Slide 1
3 Vs of Big Data

Volume (data size)
➢ Terabytes
➢ Records
➢ Transactions
➢ Tables/Files

Velocity (speed of change)
➢ Batch
➢ Near-Time
➢ Real-Time
➢ Streams

Variety (data sources / complexity)
➢ Structured
➢ Unstructured
➢ Semi-structured
➢ All of the above

Big Data is a high-volume, high-velocity, and high-variety information asset that demands cost-effective, innovative forms of information processing for enhanced insight and decision making.

Slide 2
Big Data Challenges

➢ Big Data Storage

➢ Big Data Processing

➢ Distributed Computing

Slide 3
Big Data and the Hadoop Ecosystem

➢ Big Data is characterized by the magnitude of digital information that can come from many sources and data formats (structured and unstructured), and
➢ Data that can be processed and analyzed to find insights and patterns used to make informed decisions.

➢ Analyzing Big Data requires lots of storage and large computations that demand a great deal of processing power.

Slide 4
Big Data and the Hadoop Ecosystem

➢ Instead of utilizing a single, highly configured machine, it is feasible to distribute tasks over multiple commodity machines.

➢ Google introduced the Google File System (GFS) and MapReduce, a programming model and distributed processing platform for large data sets.

➢ Apache introduced the generalized Hadoop framework, with the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce.

Slide 5
Hadoop Framework

Hadoop is different from previous distributed approaches in the following ways:

➤ Data is distributed in advance.
➤ Data is replicated throughout a cluster of computers for reliability and availability.
➤ Data processing tries to occur where the data is stored, thus eliminating bandwidth bottlenecks.

Slide 6
Power of Hadoop
Vast amount of storage
— Hadoop enables applications to work with thousands of computers and petabytes
of data. Over the past decade, computer professionals have realized that low-cost
“commodity” systems can be used together for high-performance computing
applications that once could be handled only by supercomputers.

Distributed processing with fast data access
— Hadoop moves execution toward the data.
— Hadoop applications are typically organized so that they process data sequentially.

Reliability, failover, and scalability
— Hadoop frequently monitors the entire cluster, detects failures, and retries execution (by utilizing different nodes).

Slide 7
Hadoop Cluster

(Diagram: a Hadoop cluster consisting of master machine(s) and slave machines)

Slide 8
Hadoop Ecosystem

Hadoop is classified as an ecosystem composed of many components that range from data storage, to data integration, to data processing, to specialized tools for data analysts.

Slide 9
YARN

Slide 10
HDFS

Slide 11
HDFS

Slide 12
HDFS Architecture

Slide 13
Hadoop Philosophies
There are 3 basic philosophies on which Hadoop works:
a. All the basic software that starts and runs a Hadoop cluster consists of software daemons.
b. The daemons are based on a master/slave architecture.
c. The Hadoop framework has 2 broad parts: storage (HDFS) and processing (MapReduce).
1. HDFS (storage part)
   ○ Master daemon - NameNode (high-end admin machine) (1 in number)
   ○ Backup daemon - Secondary NameNode (high-end admin machine) (1 in number)
   ○ Slave daemons - DataNode (commodity machines) (many in number)
2. YARN - MapReduce (processing part)
   ○ Master daemon - ResourceManager (high-end admin machine) (1 in number)
   ○ Slave daemons - NodeManager (commodity machines) (many in number)
Slide 14
Core Components of Hadoop Ecosystem

Slide 15
Core Components of Hadoop Ecosystem

➤ HDFS — A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System
(HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster
of computers, and data is written once, but read many times for analytics. It provides the foundation
for other tools, such as HBase.
➤ MapReduce — Hadoop’s main execution framework is MapReduce, a programming model for
distributed, parallel data processing, breaking jobs into mapping phases and reduce phases (thus
the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data
access. Because of the nature of how MapReduce works, Hadoop brings the processing to the
data in a parallel fashion, resulting in fast implementation.
➤ HBase — A column-oriented NoSQL database built on top of HDFS, HBase is used for fast
read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure
that all of its components are up and running.
Slide 16
Core Components of Hadoop Ecosystem

➤ Zookeeper — Zookeeper is Hadoop’s distributed coordination service. Designed to run over a cluster
of machines, it is a highly available service used for the management of Hadoop operations, and
many components of Hadoop depend on it.
➤ Oozie — A scalable workflow system, Oozie is integrated into the Hadoop stack, and is used to
coordinate execution of multiple MapReduce jobs. It is capable of managing a significant amount of
complexity, basing execution on external events that include timing and presence of required data.
➤ Pig — An abstraction over the complexity of MapReduce programming, the Pig platform includes an
execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its
compiler translates Pig Latin into sequences of MapReduce programs.
➤ Hive — An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables
developers not familiar with MapReduce to write data queries that are translated into MapReduce
jobs in Hadoop. Like Pig, Hive was developed as an abstraction layer, but it is geared more toward database analysts who are more familiar with SQL than with Java programming.

Slide 17
Core Components of Hadoop Ecosystem

The Hadoop ecosystem also contains several frameworks for integration with the rest of the enterprise:

➤ Sqoop is a connectivity tool for moving data between relational databases and data warehouses and Hadoop. Sqoop leverages the database to describe the schema for the imported/exported data, and MapReduce for parallelized operation and fault tolerance.

➤ Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines to HDFS. It is based on a simple and flexible architecture, and provides streaming data flows. It leverages a simple, extensible data model, allowing you to move data from multiple machines within an enterprise into Hadoop.

Slide 18
Hadoop Distribution
In a Big Data project that involves Hadoop-related ecosystem technologies, you have two choices:

• Download the project you need separately and try to create or assemble the technologies in a
coherent, resilient, and consistent architecture.

• Use one of the most popular Hadoop distributions, which assemble or create the technologies for you.

A packaged Hadoop distribution ensures compatibility between all installed components, as well as ease of installation, configuration-based deployment, monitoring, and support.

Hortonworks and Cloudera are the main actors in this field.

There are a couple of differences between the two vendors, but for starting a Big Data project they are equivalent, as long as you don’t pay attention to the proprietary add-ons.

Slide 19
Hadoop Distribution
Cloudera CDH Cloudera adds a set of in-house components to the Hadoop-based components; these
components are designed to give you better cluster management and search experiences.

The following is a list of some of these components:

• Impala: A real-time, parallelized, SQL-based engine that searches for data in HDFS (Hadoop Distributed File System) and HBase. Impala is considered to be the fastest querying engine within the Hadoop distribution vendors market, and it is a direct competitor of Spark from UC Berkeley.

• Cloudera Manager: This is Cloudera’s console to manage and deploy Hadoop components within your
Hadoop cluster.

• Hue: A console that lets the user interact with the data and run scripts for the different Hadoop
components contained in the cluster.

Slide 20
Hadoop Distribution - Cloudera

Slide 21
Hadoop Distribution - Hortonworks

Hortonworks is 100-percent open source and packages stable components, rather than the latest versions of the Hadoop projects, in its distribution.

Slide 22
Creating the Foundation of a Long-Term Big Data Architecture

Basically, big data applications involve three major tasks:

1. Data acquisition or ingestion can start from different sources.

2. Processing the data.

3. Data analytics and visualizing it.

Nowadays, apart from these packaged Hadoop distributions, a number of Hadoop ecosystem core components and their supporting tools have evolved in the big data analytics field:

Slide 23
Data Acquisition

Data acquisition or ingestion can start from different sources.

It can be large log files, streamed data, ETL processing outcomes, online unstructured data, or offline structured data.

Supporting tools:

1. Apache Flume
2. Apache Sqoop
3. Apache HBase

Slide 24
Data Acquisition – Apache Flume

Collect and aggregate log data

External source triggers the event, and it starts streaming data flows

Passive storage system

Designed around an intuitive, event-driven programming model

3 basic components: 1) Sources 2) Channels 3) Sinks

Use case:

Forming a new working model for the employees of an enterprise
- Collect and analyze the employees’ log data

Slide 25
Architecture of Apache Flume Pipeline

1. An external event triggers the component.
2. The source starts receiving log data.
3. The received data is stored inside the channel.
4. The sink is triggered.
5. The sink accesses the data from the channel.
6. The data is moved to the target, i.e., HDFS.

Slide 26
Data Acquisition – Sqoop
There are 2 basic types of data storage systems:

a. Structured data stores (Oracle, MySQL, Postgres, etc.)

b. HDFS data management (Hive for an SQL-like query language, HBase for a NoSQL database)

Sqoop is used to:

c. Import and export data between structured data stores and HDFS

d. Manage periodic transfers of data to HDFS, after which analysis of the data can start

Data Acquisition – HBase


Maintains a column-oriented NoSQL database (unstructured data)

Slide 27
Data Acquisition – HBase
HBase supports Create, Read, Update, and Delete (CRUD) operations on wide, sparse tables.

HBase leverages HDFS for its


persistent data storage.

HBase data management is


implemented by distributed region
servers, which are managed by HBase
master (HMaster).

The memstore is HBase’s implementation of an in-memory data cache.

HFile is a specialized HDFS file format for HBase. The HFile implementation in a region server is responsible for reading and writing HFiles to and from HDFS.

Slide 28
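
To make the CRUD discussion above concrete, here is a minimal sketch (not taken from the slides) using the third-party Python client happybase; the Thrift gateway address, the table name employees, and the column family cf are illustrative assumptions only.

```python
# Hedged sketch: CRUD against a wide, sparse HBase table via the
# third-party 'happybase' client. Host, table name, and column family
# are illustrative assumptions, not values from the slides.
import happybase

connection = happybase.Connection('localhost')   # HBase Thrift gateway
table = connection.table('employees')

# Create / Update: write a row into the wide, sparse table
table.put(b'row-001', {b'cf:name': b'Alice', b'cf:dept': b'HR'})

# Read: fetch the row back by key
row = table.row(b'row-001')
print(row.get(b'cf:name'))

# Delete: remove the row
table.delete(b'row-001')
```

Behind the scenes, the put lands in the memstore of the owning region server and is later flushed to HFiles on HDFS, as described above.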
Processing Languages - YARN
YARN – Yet Another Resource Negotiator

It is an umbrella framework that accommodates multiple processing frameworks and utilizes the cluster resources to the maximum.

a. Developer-friendly tool

b. HDFS is the base layer – YARN sits on top of it

c. Receive input data → group related data (Map) → aggregate the data (Reduce)

d. MapReduce is the main process of the framework

e. MR jobs can be implemented in any of the languages/tools like Java, Python, Pig, Hive, etc.

f. YARN is enriched with other processing models, but the base is MR

Slide 29
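
As a concrete illustration of the Map and Reduce phases described above, here is a minimal word-count sketch for Hadoop Streaming in Python; the file names mapper.py and reducer.py and the job parameters are assumptions for illustration, not part of the slides.

```python
# mapper.py - hedged sketch of the Map phase for Hadoop Streaming:
# reads raw input lines from stdin and emits word<TAB>1 pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - hedged sketch of the Reduce phase: input arrives sorted
# by key, so counts for the same word are adjacent and can be summed.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such a job would typically be submitted with the hadoop-streaming jar, passing -mapper mapper.py, -reducer reducer.py, and the HDFS input/output directories; YARN then schedules the map and reduce tasks across the NodeManagers.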
YARN – Map Reduce Phases

Slide 30
Processing Tool - Oozie

It is used to coordinate the execution of multiple MapReduce jobs.


It is capable of managing a significant amount of complexity.
It can handle the process execution based on external events that include timing and presence of required data.

Slide 31
Processing Languages – Hive – Batch Processing Tool

a. High-level, SQL-like language

b. An alternative to writing DB connectivity code directly inside MR code in Java- or Python-like languages

DB connectivity code written directly in MR code  |  Hive
SQL query                                         |  Different formulation of the query (HiveQL)
High-priority jobs                                |  Low-priority, long-running processing jobs
High performance                                  |  Lower performance
Real-time processing                              |  Batch processing
Slide 32
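
As a hedged sketch of how a client can submit such a batch query to Hive (assuming a HiveServer2 instance on localhost:10000 and a hypothetical web_logs table, neither of which comes from the slides), the third-party PyHive client can be used from Python:

```python
# Hedged sketch: run a HiveQL query through HiveServer2 using the
# third-party PyHive client. Host, port, and table name are assumptions.
from pyhive import hive

conn = hive.Connection(host='localhost', port=10000)
cursor = conn.cursor()

# Hive compiles this SQL-like query into one or more MapReduce jobs,
# which is why it behaves as a batch (not real-time) workload.
cursor.execute("SELECT status, COUNT(*) FROM web_logs GROUP BY status")
for status, cnt in cursor.fetchall():
    print(status, cnt)
```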
Processing Languages – SPARK Streaming

a. Useful for receiving input from high-throughput data sources

 Social networks (Twitter)


 Clickstream logs
 Web access logs, etc.

b. Collects data from a variety of sources

c. Streams the data into the processing pipeline

d. High-performance system

e. Fault tolerance is possible with the support of Apache Kafka

Slide 33
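
A minimal PySpark Streaming sketch is shown below; it counts words arriving on a local socket in one-second micro-batches. The socket source merely stands in for the high-throughput sources listed above (this example is illustrative, not from the slides).

```python
# Hedged sketch: DStream word count over a socket source in
# 1-second micro-batches (host/port are illustrative assumptions).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 1)                      # 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)    # stand-in for a log/click stream
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each micro-batch result

ssc.start()
ssc.awaitTermination()
```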
Processing Languages – Apache Kafka

a. Message-oriented middleware component

b. A distributed publish-subscribe messaging service
c. High-throughput system
d. Persistent messaging system
e. Producer – Broker – Consumer components
f. Producer – Partitioned Topic – Consumer
g. The producer publishes messages; the consumer subscribes to them
h. The broker partitions a topic into a number of message blocks (partitions)
i. External data source – Apache Kafka – Spark Streaming
j. ZooKeeper subcomponent for cluster management

Slide 34
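
The producer–broker–consumer flow described above can be sketched with the third-party kafka-python client; the broker address, topic name, and message payload are assumptions for illustration.

```python
# Hedged sketch: publish/subscribe through a Kafka broker using the
# third-party 'kafka-python' client (broker and topic are assumptions).
from kafka import KafkaProducer, KafkaConsumer

# Producer publishes messages to a partitioned topic on the broker
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('clickstream', b'{"page": "/home", "ip": "10.0.0.1"}')
producer.flush()

# Consumer subscribes to the topic and reads messages from its partitions
consumer = KafkaConsumer('clickstream',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.value)
    break   # stop after the first message in this sketch
```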
Processing Languages – Apache Kafka - Continued

Slide 35
Machine Learning Algorithms – Spark MLlib
a. Set of APIs

b. Implementation of Machine Learning algorithms


• Basic statistics
• Logistic regression
• K-means clustering
• Gaussian mixtures
• Multinomial Naive Bayes

c. Train on your data and build a prediction model – a few lines of code

Slide 36
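
As a hedged sketch of the “few lines of code” point above, the following trains a K-means model with Spark MLlib and predicts the cluster of a new point; the toy data set is invented for illustration.

```python
# Hedged sketch: train a K-means clustering model with Spark MLlib
# on a toy data set and predict the cluster of a new observation.
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansSketch")
points = sc.parallelize([
    array([0.0, 0.0]), array([1.0, 1.0]),     # one cluster near the origin
    array([9.0, 8.0]), array([8.0, 9.0]),     # another cluster far away
])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.predict(array([0.5, 0.5])))        # cluster id of a new point
```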
NoSQL Stores
Characteristics of NoSQL data
a. Large amount of data
b. Scalability
c. Resiliency
d. High availability

Supporting tools
• HBase
  - Column-oriented data maintenance
• Couchbase
  - Document data store
  - Relies on RDBMSs
  - Front-end query handler
• ElasticSearch
  - Real-time data analytics
  - Full-text search

Slide 37
NoSQL Stores - ElasticSearch + Logstash + Kibana
ELK Platform (ElasticSearch + Logstash + Kibana)

The three products work together to provide the best end-to-end platform for collecting, storing, and
visualizing data:

• Logstash lets you collect data from many kinds of sources—such as social data, logs, message queues, or sensors—then supports data enrichment and transformation, and finally transports the data to an indexation system such as ElasticSearch.

• ElasticSearch indexes the data in a distributed, scalable, and resilient system. It’s schemaless and provides libraries for multiple languages so they can easily and quickly enable real-time search and analytics in your application.

• Kibana is a customizable user interface in which you can build a simple to complex dashboard to
explore and visualize data indexed by ElasticSearch.

Slide 38
ElasticSearch Products

ElasticSearch acts as a search engine that holds the data produced by Spark.

After being processed and aggregated, the data is indexed into ElasticSearch to enable a third-party system to query the data through the ElasticSearch querying engine.

On the other side, we also use ELK for processing logs and visualizing analytics with the support of Kibana.

Slide 39
NoSQL Landscape

NoSQL technologies are schemaless and highly scalable, and a couple of them are also highly distributed and high-performance.

Most of the time, they complement an architecture with an existing RDBMS technology by, for example, playing the role of a cache, search engine, unstructured store, or volatile information store.

They are divided into four main categories:

1. Key/value data store


2. Column data store
3. Document-oriented data store
4. Graph data store

Slide 40
NoSQL Landscape

Key/Value

The first and easiest NoSQL data stores to understand are key/value data stores.

They act like a dictionary and work by matching a key to a value.

They are often used for high-performance use cases in which basic information needs to be
stored—for example, when session information may need to be written and retrieved very
quickly.

Slide 41
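
A toy sketch of the key/value model (a plain Python dictionary, not a real data store) shows the write-and-read-by-key pattern used for session information:

```python
# Toy illustration only: a dictionary standing in for a key/value store,
# holding session information that must be written and read very quickly.
session_store = {}

def put(key, value):
    session_store[key] = value

def get(key):
    return session_store.get(key)

put("session:42", {"user": "alice", "cart": ["book", "pen"]})
print(get("session:42"))
```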
NoSQL Landscape

Column

Column-oriented data stores are used to store a very large number of records with a very large amount of information that goes beyond the simple nature of the key/value store.

Whereas data is stored in rows in an RDBMS, it is stored in columns in column data stores.

The main benefit of using columnar databases is that you can quickly access a large amount of data.

In columnar databases, all cells that are part of a column are stored contiguously.

Slide 42
NoSQL Landscape

Graph

Graph databases are really different from the other types of databases.

They use a different paradigm to represent the data—a tree-like structure with nodes and edges that are connected to each other through paths called relations.

Slide 43
Couchbase

Slide 44
ElasticSearch
ElasticSearch is a NoSQL technology that allows you to store, search, and analyze
data.

ElasticSearch was made to be distributed and to scale out.

ElasticSearch is a schemaless engine; data is stored in JSON and is partitioned


into what we call shards.

A shard is actually a Lucene index and is the smallest unit of scale in


ElasticSearch.

Shards are organized in indexes in ElasticSearch with which an application can


make read and write interactions.

In the end, an index is just a logical namespace in ElasticSearch that regroups a collection of shards, and when a request comes in, ElasticSearch routes it to the appropriate shards.

Slide 45
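
A minimal sketch with the official Python client illustrates writing to and reading from an index (the index name, document fields, and local cluster address are assumptions, not values from the slides):

```python
# Hedged sketch: index a JSON document and query it back with the
# 'elasticsearch' Python client; index name and fields are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()                      # defaults to localhost:9200

# Index a document; ElasticSearch routes it to one of the index's shards
es.index(index="clickstream", id=1,
         body={"page": "/home", "bytes": 512, "ip": "10.0.0.1"})

# Search the logical index; the request is fanned out to the relevant shards
result = es.search(index="clickstream",
                   body={"query": {"match": {"page": "/home"}}})
print(result["hits"]["total"])
```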
ElasticSearch
In the end, an index is just a logical
namespace in ElasticSearch that
regroups a collection of shards, and
when a request comes in,
ElasticSearch routes it to the
appropriate shards.

Replica shards are made at start for


failover; when a primary shard dies, a
replica is promoted to become the
primary to ensure continuity in the
cluster.

Replica shards have the same load that


primary shards do at index time; this
means that once the document is
indexed in the primary shard, it’s
indexed in the replica shards.
Slide 46
ElasticSearch

Monitoring ElasticSearch
Elastic provides a plug-in called Marvel for ElasticSearch that aims to monitor an
ElasticSearch cluster. This plug-in is part of Elastic’s commercial offer, but you
can use it for free in Development mode.

Marvel relies on Kibana, the visualization console of Elastic, and comes with a
bunch of visualization techniques that let an operator be really precise about
what happens in the cluster.

Slide 47
Data Analytics with Elasticsearch – Aggregation Framework
Elasticsearch comes with a powerful set of APIs that let users get the best out of their data.

The aggregation framework groups and enables real-time analytics queries on small to large sets of data. What we have done with Spark in terms of simple analytics can be done as well in Elasticsearch, at scale, using the aggregation framework.

This framework is divided into two types of aggregations:

• Bucket aggregations, which aim to group a set of documents based on a key common to the documents and a criterion. A document that meets the condition falls into the bucket.
• Metric aggregations, which aim to calculate metrics—such as the average, maximum, minimum, or even date histograms—over a set of documents.

Slide 48
Bucket Aggregation - ElasticSearch
It is used to create groups of documents.

As opposed to metric aggregations, bucket aggregations can be nested, so that you can get real-time multilevel aggregation.

Slide 49
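
A hedged sketch of a bucket aggregation from Python: documents are grouped into one bucket per distinct value of a keyword field (the index and field names are illustrative assumptions):

```python
# Hedged sketch: terms (bucket) aggregation grouping documents by page;
# index and field names are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()
response = es.search(index="clickstream", body={
    "size": 0,                                   # no hits, aggregation only
    "aggs": {
        "pages": {
            "terms": {"field": "page.keyword"}   # one bucket per distinct page
        }
    }
})
for bucket in response["aggregations"]["pages"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```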
Metric Aggregation - ElasticSearch

Metric aggregations are the last level of the aggregation set and are used to compute metrics over a set of documents.

Some of these computations are considered single-value metric aggregations because they compute one specific operation across documents, such as the average, min, max, or sum; others are multi-value aggregations because they return more than one statistical metric for a specific field across documents, such as the stats aggregation.

Slide 50
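
The stats aggregation mentioned above can be sketched as follows (the index name is an assumption); it returns the count, min, max, avg, and sum of the bytes field in a single call:

```python
# Hedged sketch: multi-value 'stats' metric aggregation over the bytes
# field; the index name is an illustrative assumption.
from elasticsearch import Elasticsearch

es = Elasticsearch()
response = es.search(index="clickstream", body={
    "size": 0,
    "aggs": {
        "bytes_stats": {"stats": {"field": "bytes"}}
    }
})
stats = response["aggregations"]["bytes_stats"]
print(stats["count"], stats["min"], stats["max"], stats["avg"], stats["sum"])
```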
Metric Aggregation - ElasticSearch

Slide 51
Metric Aggregation - ElasticSearch

The results give the computation of different metrics for the bytes field:

• The count represents the number of documents the query ran over; here, more than 2.5 million

• The min and max are, respectively, the minimum and maximum values of bytes

• The avg is the average bytes value over all documents

• The sum is the sum of the bytes field across all documents
Slide 52
Clickstream Data

Clickstream data is the key data you can get from a website in terms of visitor activity, such as:
• Site traffic
• Unique visitors
• Conversion rate
• Pay-per-click traffic volume
• Time on site

Slide 53
Anatomy of Clickstream Data

Slide 54
Clickstream Data

The clickstream data will give, directly or indirectly:

• Dates, such as the timestamp when the event occurred and the time spent by the visitor
• The user agent used to browse the website, its version, and the device
• The session attached to the visitor, which will help to correlate the different lines of data: the session start time, end time, and duration
• The page: what the visitor browsed, what type of request was sent to the server, and the domain/subdomain
• Information about visitor IP addresses, which indirectly gives information about location
• Depending on the type of information we get, it can be feasible to identify and map customers to the clickstream data (see the parsing sketch below)

Slide 55
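
As a hedged sketch only (the real log format is not specified in these slides), a hypothetical clickstream line can be parsed into the fields listed above:

```python
# Hedged sketch: parse one hypothetical clickstream log line into the
# fields discussed above; the line format itself is an assumption.
from datetime import datetime

line = '2015-10-01T12:30:05 sess-42 10.0.0.1 GET /products "Mozilla/5.0"'

timestamp, session_id, ip, method, page, user_agent = line.split(" ", 5)
event = {
    "timestamp": datetime.fromisoformat(timestamp),   # when the event occurred
    "session": session_id,                            # correlates lines of data
    "ip": ip,                                         # indirect location info
    "request": method,                                # type of request sent
    "page": page,                                     # what the visitor browsed
    "user_agent": user_agent.strip('"'),              # browser / device info
}
print(event)
```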
The Log Generator

Slide 56
The Log Generator contd.,

Slide 57
The Log Generator contd.,

Slide 58
