
What is Big Data?

Big Data refers to large data sets that come from many sources and in many data formats, and that can be processed and analyzed to find insights and patterns used to make informed decisions.

According to the American IT research and advisory firm Gartner Inc.,

“Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.”

Slide 1
3 Vs of Big Data

Volume (data size)
➢ Terabytes
➢ Records
➢ Transactions
➢ Tables/Files

Velocity (speed of change)
➢ Batch
➢ Near-Time
➢ Real-Time
➢ Streams

Variety (data sources / complexity)
➢ Structured
➢ Unstructured
➢ Semi-structured
➢ All of the above

Big Data is a high-volume, high-velocity, and high-variety information asset that demands cost-effective, innovative forms of information processing for enhanced insight and decision making.

Slide 2
Big Data Challenges

➢ Big Data Storage

➢ Big Data Processing

➢ Distributed Computing

Slide 3
Big Data and the Hadoop Ecosystem

➢ Big Data is characterized by the magnitude of digital information that can come from many sources and data formats (structured and unstructured), and
➢ Data that can be processed and analyzed to find insights and patterns used to make informed decisions.

➢ Analyzing Big Data requires lots of storage and large computations that demand a great deal of processing power.

Slide 4
Big Data and the Hadoop Ecosystem

➢ Instead of utilizing a single, highly configured machine, it is feasible to distribute tasks over multiple commodity machines.

➢ Google introduced the Google File System (GFS) and MapReduce, a programming model and distributed processing platform for large data sets.

➢ Apache introduced the generalized Hadoop framework, with the Hadoop Distributed File System (HDFS) and Hadoop’s implementation of MapReduce.

Slide 5
Hadoop Framework

Hadoop is different from previous distributed approaches in the following ways:

➤ Data is distributed in advance.
➤ Data is replicated throughout a cluster of computers for reliability and availability.
➤ Data processing tries to occur where the data is stored, thus eliminating bandwidth bottlenecks.

Slide 6
Power of Hadoop
Vast amount of storage
— Hadoop enables applications to work with thousands of computers and petabytes
of data. Over the past decade, computer professionals have realized that low-cost
“commodity” systems can be used together for high-performance computing
applications that once could be handled only by supercomputers.

Distributed processing with fast data access
— Hadoop moves execution toward the data.
— Hadoop applications are typically organized so that they process data sequentially.

Reliability, failover, and scalability
— Hadoop frequently monitors the entire cluster, detects failures, and retries execution (by utilizing different nodes).

Slide 7
Hadoop Cluster

(Diagram: a Hadoop cluster consisting of master machine(s) and slave machines)

Slide 8
Hadoop Ecosystem

Hadoop is classified as an ecosystem composed of many components that range from data storage, to data integration, to data processing, to specialized tools for data analysts.

Slide 9
YARN

Slide 10
HDFS

Slide 11
HDFS

Slide 12
HDFS Architecture

Slide 13
Hadoop Philosophies
There are 3 basic philosophies on which Hadoop works:
a. All the basic software that starts and runs a Hadoop cluster consists of software daemons.
b. The daemons are based on a master/slave architecture.
c. The Hadoop framework has 2 broad parts: storage (HDFS) and processing (MapReduce).
1. HDFS (storage part)
   ○ Master daemon - NameNode (high-end admin machine) (1 in number)
   ○ Backup daemon - Secondary NameNode (high-end admin machine) (1 in number)
   ○ Slave daemons - DataNode (commodity machines) (many in number)
2. YARN - MapReduce (processing part)
   ○ Master daemon - ResourceManager (high-end admin machine) (1 in number)
   ○ Slave daemons - NodeManager (commodity machines) (many in number)
Slide 14
Core Components of Hadoop Ecosystem

Slide 15
Core Components of Hadoop Ecosystem

➤ HDFS — A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System
(HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster
of computers, and data is written once, but read many times for analytics. It provides the foundation
for other tools, such as HBase.
➤ MapReduce — Hadoop’s main execution framework is MapReduce, a programming model for
distributed, parallel data processing, breaking jobs into mapping phases and reduce phases (thus
the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data
access. Because of the nature of how MapReduce works, Hadoop brings the processing to the
data in a parallel fashion, resulting in fast implementation.
➤ HBase — A column-oriented NoSQL database built on top of HDFS, HBase is used for fast
read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure
that all of its components are up and running.
Slide 16
Core Components of Hadoop Ecosystem

➤ Zookeeper — Zookeeper is Hadoop’s distributed coordination service. Designed to run over a cluster
of machines, it is a highly available service used for the management of Hadoop operations, and
many components of Hadoop depend on it.
➤ Oozie — A scalable workflow system, Oozie is integrated into the Hadoop stack, and is used to
coordinate execution of multiple MapReduce jobs. It is capable of managing a significant amount of
complexity, basing execution on external events that include timing and presence of required data.
➤ Pig — An abstraction over the complexity of MapReduce programming, the Pig platform includes an
execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its
compiler translates Pig Latin into sequences of MapReduce programs.
➤ Hive — An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables
developers not familiar with MapReduce to write data queries that are translated into MapReduce
jobs in Hadoop. Like Pig, Hive was developed as an abstraction layer, but it is geared more toward database analysts who are more familiar with SQL than with Java programming.

Slide 17
Core Components of Hadoop Ecosystem

The Hadoop ecosystem also contains several frameworks for integration with the rest of the enterprise:

➤ Sqoop is a connectivity tool for moving data between relational databases and data warehouses and Hadoop. Sqoop leverages the database to describe the schema for the imported/exported data, and MapReduce for parallelized operation and fault tolerance.

➤ Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines to HDFS. It is based on a simple and flexible architecture, and provides streaming data flows. It leverages a simple, extensible data model, allowing you to move data from multiple machines within an enterprise into Hadoop.

Slide 18
Hadoop Distribution
In a Big Data project that involves Hadoop-related ecosystem technologies, you have two choices:

• Download the project you need separately and try to create or assemble the technologies in a
coherent, resilient, and consistent architecture.

• Use one of the most popular Hadoop distributions, which assemble or create the technologies for you.

A packaged Hadoop distribution ensures compatibility between all installed components, as well as ease of installation, configuration-based deployment, monitoring, and support.

Hortonworks and Cloudera are the main actors in this field.

There are a couple of differences between the two vendors, but for starting a Big Data project they are equivalent, as long as you don’t pay attention to the proprietary add-ons.

Slide 19
Hadoop Distribution
Cloudera CDH Cloudera adds a set of in-house components to the Hadoop-based components; these
components are designed to give you better cluster management and search experiences.

The following is a list of some of these components:

• Impala: A real-time, parallelized, SQL-based engine that searches for data in HDFS (Hadoop Distributed File System) and HBase. Impala is considered to be the fastest querying engine within the Hadoop distribution vendors market, and it is a direct competitor of Spark from UC Berkeley.

• Cloudera Manager: This is Cloudera’s console to manage and deploy Hadoop components within your
Hadoop cluster.

• Hue: A console that lets the user interact with the data and run scripts for the different Hadoop
components contained in the cluster.

Slide 20
Hadoop Distribution - Cloudera

Slide 21
Hadoop Distribution - Hortonworks

Hortonworks is 100-percent open source and packages stable components, rather than the latest versions of the Hadoop projects, in its distribution.

Slide 22
Creating the Foundation of a Long-Term Big Data Architecture

Basically, big data applications involve three major tasks:

1. Data acquisition or ingestion can start from different sources.

2. Processing the data.

3. Data analytics and visualizing it.

Nowadays, apart from these packaged Hadoop distributions, a number of Hadoop ecosystem core components and their supporting tools have evolved in the big data analytics field:

Slide 23
Data Acquisition

Data acquisition or ingestion can start from different sources.

It can be large log files, streamed data, ETL processing outcomes, online unstructured data, or offline structured data.

Supporting tools:

1. Apache Flume
2. Apache Sqoop
3. Apache HBase

Slide 24
Data Acquisition – Apache Flume

Collect and aggregate log data

External source triggers the event, and it starts streaming data flows

Passive storage system

Designed around an intuitive, event-driven programming model

3 basic components: 1) Sources 2) Channels 3) Sinks

Use case:

Forming a new working model for the employees of an enterprise
- Collect and analyze the employees’ log data

Slide 25
Architecture of Apache Flume Pipeline

1. An external event triggers the component.
2. The source starts receiving log data.
3. The received data is stored inside the channel.
4. The sink is triggered.
5. The sink accesses the data from the channel.
6. The data is moved to the target, i.e., HDFS.

Slide 26
Data Acquisition – Sqoop
There are 2 basic types of data storage systems:

a. Structured data stores (Oracle, MySQL, Postgres, etc.)

b. HDFS data management (Hive for an SQL-like query language, HBase for a NoSQL database)

Sqoop is used to:

c. Import and export data between structured data stores and HDFS

d. Manage periodic transfers of data to HDFS, after which analysis of the data can start

Data Acquisition – HBase


Maintains a column-oriented NoSQL database (unstructured data)

Slide 27
Data Acquisition – HBase
HBase supports Create, Read, Update, and Delete (CRUD) operations on wide, sparse tables.

HBase leverages HDFS for its


persistent data storage.

HBase data management is


implemented by distributed region
servers, which are managed by HBase
master (HMaster).

The memstore is HBase’s implementation of an in-memory data cache.

HFile is a specialized HDFS file format for HBase. The HFile implementation in a region server is responsible for reading and writing HFiles to and from HDFS.

Slide 28
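
To make the CRUD discussion above concrete, here is a minimal sketch (not taken from the slides) using the third-party Python client happybase; the Thrift gateway address, the table name employees, and the column family cf are illustrative assumptions only.

```python
# Hedged sketch: CRUD against a wide, sparse HBase table via the
# third-party 'happybase' client. Host, table name, and column family
# are illustrative assumptions, not values from the slides.
import happybase

connection = happybase.Connection('localhost')   # HBase Thrift gateway
table = connection.table('employees')

# Create / Update: write a row into the wide, sparse table
table.put(b'row-001', {b'cf:name': b'Alice', b'cf:dept': b'HR'})

# Read: fetch the row back by key
row = table.row(b'row-001')
print(row.get(b'cf:name'))

# Delete: remove the row
table.delete(b'row-001')
```

Behind the scenes, the put lands in the memstore of the owning region server and is later flushed to HFiles on HDFS, as described above.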
Processing Languages - YARN
YARN – Yet Another Resource Negotiator

It is an umbrella framework that accommodates multiple processing frameworks and utilizes the cluster resources to the maximum.

a. Developer-friendly tool

b. HDFS is the base layer – YARN sits on top of it

c. Receive input data → group related data (Map) → aggregate the data (Reduce)

d. MapReduce is the main process of the framework

e. MR jobs can be implemented in any of the languages/tools like Java, Python, Pig, Hive, etc.

f. YARN is enriched with other processing models, but the base is MR

Slide 29
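
As a concrete illustration of the Map and Reduce phases described above, here is a minimal word-count sketch for Hadoop Streaming in Python; the file names mapper.py and reducer.py and the job parameters are assumptions for illustration, not part of the slides.

```python
# mapper.py - hedged sketch of the Map phase for Hadoop Streaming:
# reads raw input lines from stdin and emits word<TAB>1 pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py - hedged sketch of the Reduce phase: input arrives sorted
# by key, so counts for the same word are adjacent and can be summed.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Such a job would typically be submitted with the hadoop-streaming jar, passing -mapper mapper.py, -reducer reducer.py, and the HDFS input/output directories; YARN then schedules the map and reduce tasks across the NodeManagers.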
YARN – Map Reduce Phases

Slide 30
Processing Tool - Oozie

It is used to coordinate the execution of multiple MapReduce jobs.


It is capable of managing a significant amount of complexity.
It can handle the process execution based on external events that include timing and presence of required data.

Slide 31
Processing Languages – Hive – Batch Processing Tool

a. High-level, SQL-like language

b. An alternative to writing DB connectivity code directly inside MR code in Java- or Python-like languages

DB connectivity code written directly in MR code  |  Hive
SQL query                                         |  Different formulation of the query (HiveQL)
High-priority jobs                                |  Low-priority, long-running processing jobs
High performance                                  |  Lower performance
Real-time processing                              |  Batch processing
Slide 32
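
As a hedged sketch of how a client can submit such a batch query to Hive (assuming a HiveServer2 instance on localhost:10000 and a hypothetical web_logs table, neither of which comes from the slides), the third-party PyHive client can be used from Python:

```python
# Hedged sketch: run a HiveQL query through HiveServer2 using the
# third-party PyHive client. Host, port, and table name are assumptions.
from pyhive import hive

conn = hive.Connection(host='localhost', port=10000)
cursor = conn.cursor()

# Hive compiles this SQL-like query into one or more MapReduce jobs,
# which is why it behaves as a batch (not real-time) workload.
cursor.execute("SELECT status, COUNT(*) FROM web_logs GROUP BY status")
for status, cnt in cursor.fetchall():
    print(status, cnt)
```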
Processing Languages – SPARK Streaming

a. Useful for receiving input from high-throughput data sources

 Social networks (Twitter)


 Clickstream logs
 Web access logs, etc.

b. Collects data from a variety of sources

c. Streams the data into the processing pipeline

d. High-performance system

e. Fault tolerance is possible with the support of Apache Kafka

Slide 33
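
A minimal PySpark Streaming sketch is shown below; it counts words arriving on a local socket in one-second micro-batches. The socket source merely stands in for the high-throughput sources listed above (this example is illustrative, not from the slides).

```python
# Hedged sketch: DStream word count over a socket source in
# 1-second micro-batches (host/port are illustrative assumptions).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, 1)                      # 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)    # stand-in for a log/click stream
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each micro-batch result

ssc.start()
ssc.awaitTermination()
```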
Processing Languages – Apache Kafka

a. Message-oriented middleware component

b. A distributed publish-subscribe messaging service
c. High-throughput system
d. Persistent messaging system
e. Producer – Broker – Consumer components
f. Producer – Partitioned Topic – Consumer
g. The producer publishes messages; the consumer subscribes to them
h. The broker partitions a topic into a number of message blocks (partitions)
i. External data source – Apache Kafka – Spark Streaming
j. ZooKeeper subcomponent for cluster management

Slide 34
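
The producer–broker–consumer flow described above can be sketched with the third-party kafka-python client; the broker address, topic name, and message payload are assumptions for illustration.

```python
# Hedged sketch: publish/subscribe through a Kafka broker using the
# third-party 'kafka-python' client (broker and topic are assumptions).
from kafka import KafkaProducer, KafkaConsumer

# Producer publishes messages to a partitioned topic on the broker
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('clickstream', b'{"page": "/home", "ip": "10.0.0.1"}')
producer.flush()

# Consumer subscribes to the topic and reads messages from its partitions
consumer = KafkaConsumer('clickstream',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.value)
    break   # stop after the first message in this sketch
```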
Processing Languages – Apache Kafka - Continued

Slide 35
Machine Learning Algorithms – Spark MLlib
a. Set of APIs

b. Implementation of Machine Learning algorithms


• Basic statistics
• Logistic regression
• K-means clustering
• Gaussian mixtures
• Multinomial Naive Bayes

c. Train on your data and build a prediction model – a few lines of code

Slide 36
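
As a hedged sketch of the “few lines of code” point above, the following trains a K-means model with Spark MLlib and predicts the cluster of a new point; the toy data set is invented for illustration.

```python
# Hedged sketch: train a K-means clustering model with Spark MLlib
# on a toy data set and predict the cluster of a new observation.
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansSketch")
points = sc.parallelize([
    array([0.0, 0.0]), array([1.0, 1.0]),     # one cluster near the origin
    array([9.0, 8.0]), array([8.0, 9.0]),     # another cluster far away
])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.predict(array([0.5, 0.5])))        # cluster id of a new point
```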
NoSQL Stores
Characteristics of NoSQL data
a. Large amount of data
b. Scalability
c. Resiliency
d. High availability

Supporting tools
• HBase
  - Column-oriented data maintenance
• Couchbase
  - Document data store
  - Relies on RDBMSs
  - Front-end query handler
• ElasticSearch
  - Real-time data analytics
  - Full-text search

Slide 37
NoSQL Stores - ElasticSearch + Logstash + Kibana
ELK Platform (ElasticSearch + Logstash + Kibana)

The three products work together to provide the best end-to-end platform for collecting, storing, and
visualizing data:

• Logstash lets you collect data from many kinds of sources—such as social data, logs, message queues, or sensors—then supports data enrichment and transformation, and finally transports the data to an indexation system such as ElasticSearch.

• ElasticSearch indexes the data in a distributed, scalable, and resilient system. It’s schemaless and provides libraries for multiple languages so they can easily and quickly enable real-time search and analytics in your application.

• Kibana is a customizable user interface in which you can build a simple to complex dashboard to
explore and visualize data indexed by ElasticSearch.

Slide 38
ElasticSearch Products

ElasticSearch acts as a search engine that holds the data produced by Spark.

After being processed and aggregated, the data is indexed into ElasticSearch to enable a third-party system to query the data through the ElasticSearch querying engine.

On the other side, we also use ELK for processing logs and visualizing analytics with the support of Kibana.

Slide 39
NoSQL Landscape

NoSQL technologies are schemaless and highly scalable, and a couple of them are also highly distributed and high-performance.

Most of the time, they complement an architecture with an existing RDBMS technology by, for example, playing the role of a cache, search engine, unstructured store, or volatile information store.

They are divided into four main categories:

1. Key/value data store


2. Column data store
3. Document-oriented data store
4. Graph data store

Slide 40
NoSQL Landscape

Key/Value

The first and easiest NoSQL data stores to understand are key/value data stores.

They act like a dictionary and work by matching a key to a value.

They are often used for high-performance use cases in which basic information needs to be
stored—for example, when session information may need to be written and retrieved very
quickly.

Slide 41
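
A toy sketch of the key/value model (a plain Python dictionary, not a real data store) shows the write-and-read-by-key pattern used for session information:

```python
# Toy illustration only: a dictionary standing in for a key/value store,
# holding session information that must be written and read very quickly.
session_store = {}

def put(key, value):
    session_store[key] = value

def get(key):
    return session_store.get(key)

put("session:42", {"user": "alice", "cart": ["book", "pen"]})
print(get("session:42"))
```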
NoSQL Landscape

Column

Column-oriented data stores are used to store a very large number of records with a very large amount of information that goes beyond the simple nature of the key/value store.

Whereas data is stored in rows in an RDBMS, it is stored in columns in column data stores.

The main benefit of using columnar databases is that you can quickly access a large amount of data.

In columnar databases, all cells that are part of a column are stored contiguously.

Slide 42
NoSQL Landscape

Graph

Graph databases are really different from the other types of databases.

They use a different paradigm to represent the data—a tree-like structure with nodes and edges that are connected to each other through paths called relations.

Slide 43
Couchbase

Slide 44
ElasticSearch
ElasticSearch is a NoSQL technology that allows you to store, search, and analyze
data.

ElasticSearch was made to be distributed and to scale out.

ElasticSearch is a schemaless engine; data is stored in JSON and is partitioned


into what we call shards.

A shard is actually a Lucene index and is the smallest unit of scale in


ElasticSearch.

Shards are organized in indexes in ElasticSearch with which an application can


make read and write interactions.

In the end, an index is just a logical namespace in ElasticSearch that regroups a collection of shards, and when a request comes in, ElasticSearch routes it to the appropriate shards.

Slide 45
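
A minimal sketch with the official Python client illustrates writing to and reading from an index (the index name, document fields, and local cluster address are assumptions, not values from the slides):

```python
# Hedged sketch: index a JSON document and query it back with the
# 'elasticsearch' Python client; index name and fields are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()                      # defaults to localhost:9200

# Index a document; ElasticSearch routes it to one of the index's shards
es.index(index="clickstream", id=1,
         body={"page": "/home", "bytes": 512, "ip": "10.0.0.1"})

# Search the logical index; the request is fanned out to the relevant shards
result = es.search(index="clickstream",
                   body={"query": {"match": {"page": "/home"}}})
print(result["hits"]["total"])
```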
ElasticSearch
In the end, an index is just a logical
namespace in ElasticSearch that
regroups a collection of shards, and
when a request comes in,
ElasticSearch routes it to the
appropriate shards.

Replica shards are made at start for


failover; when a primary shard dies, a
replica is promoted to become the
primary to ensure continuity in the
cluster.

Replica shards have the same load that


primary shards do at index time; this
means that once the document is
indexed in the primary shard, it’s
indexed in the replica shards.
Slide 46
ElasticSearch

Monitoring ElasticSearch
Elastic provides a plug-in called Marvel for ElasticSearch that aims to monitor an
ElasticSearch cluster. This plug-in is part of Elastic’s commercial offer, but you
can use it for free in Development mode.

Marvel relies on Kibana, the visualization console of Elastic, and comes with a
bunch of visualization techniques that let an operator be really precise about
what happens in the cluster.

Slide 47
Data Analytics with Elasticsearch – Aggregation Framework
Elasticsearch comes with a powerful set of APIs that let users get the best out of their data.

The aggregation framework groups and enables real-time analytics queries on small to large sets of data. What we have done with Spark in terms of simple analytics can be done as well in Elasticsearch, at scale, using the aggregation framework.

This framework is divided into two types of aggregations:

• Bucket aggregations, which aim to group a set of documents based on a key common to the documents and a criterion. A document that meets the condition falls into the bucket.
• Metric aggregations, which aim to calculate metrics—such as the average, maximum, minimum, or even date histograms—over a set of documents.

Slide 48
Bucket Aggregation - ElasticSearch
It is used to create groups of documents.

As opposed to metric aggregations, bucket aggregations can be nested, so that you can get real-time multilevel aggregation.

Slide 49
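
A hedged sketch of a bucket aggregation from Python: documents are grouped into one bucket per distinct value of a keyword field (the index and field names are illustrative assumptions):

```python
# Hedged sketch: terms (bucket) aggregation grouping documents by page;
# index and field names are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()
response = es.search(index="clickstream", body={
    "size": 0,                                   # no hits, aggregation only
    "aggs": {
        "pages": {
            "terms": {"field": "page.keyword"}   # one bucket per distinct page
        }
    }
})
for bucket in response["aggregations"]["pages"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```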
Metric Aggregation - ElasticSearch

Metric aggregations are the last level of the aggregation set and are used to compute metrics over a set of documents.

Some of these computations are considered single-value metric aggregations because they compute one specific operation across documents, such as the average, min, max, or sum; others are multi-value aggregations because they return more than one statistical metric for a specific field across documents, such as the stats aggregation.

Slide 50
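
The stats aggregation mentioned above can be sketched as follows (the index name is an assumption); it returns the count, min, max, avg, and sum of the bytes field in a single call:

```python
# Hedged sketch: multi-value 'stats' metric aggregation over the bytes
# field; the index name is an illustrative assumption.
from elasticsearch import Elasticsearch

es = Elasticsearch()
response = es.search(index="clickstream", body={
    "size": 0,
    "aggs": {
        "bytes_stats": {"stats": {"field": "bytes"}}
    }
})
stats = response["aggregations"]["bytes_stats"]
print(stats["count"], stats["min"], stats["max"], stats["avg"], stats["sum"])
```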
Metric Aggregation - ElasticSearch

Slide 51
Metric Aggregation - ElasticSearch

The results give the computation of different metrics for the bytes field:

• The count represents the number of documents the query ran over; here, more than 2.5 million

• The min and max are, respectively, the minimum and maximum values of bytes

• The avg is the average bytes value over all documents

• The sum is the sum of the bytes field across all documents
Slide 52
Clickstream Data

Clickstream data is the key data you can get from a website in terms of visitor activity, such as:
• Site traffic
• Unique visitors
• Conversion rate
• Pay-per-click traffic volume
• Time on site

Slide 53
Anatomy of Clickstream Data

Slide 54
Clickstream Data

The clickstream data will give, directly or indirectly:

• Dates, such as the timestamp when the event occurred and the time spent by the visitor
• The user agent used to browse the website, its version, and the device
• The session attached to the visitor, which will help to correlate the different lines of data: the session start time, end time, and duration
• The page: what the visitor browsed, what type of request was sent to the server, and the domain/subdomain
• Information about visitor IP addresses, which indirectly gives information about location
• Depending on the type of information we get, it can be feasible to identify and map customers to the clickstream data (see the parsing sketch below)

Slide 55
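
As a hedged sketch only (the real log format is not specified in these slides), a hypothetical clickstream line can be parsed into the fields listed above:

```python
# Hedged sketch: parse one hypothetical clickstream log line into the
# fields discussed above; the line format itself is an assumption.
from datetime import datetime

line = '2015-10-01T12:30:05 sess-42 10.0.0.1 GET /products "Mozilla/5.0"'

timestamp, session_id, ip, method, page, user_agent = line.split(" ", 5)
event = {
    "timestamp": datetime.fromisoformat(timestamp),   # when the event occurred
    "session": session_id,                            # correlates lines of data
    "ip": ip,                                         # indirect location info
    "request": method,                                # type of request sent
    "page": page,                                     # what the visitor browsed
    "user_agent": user_agent.strip('"'),              # browser / device info
}
print(event)
```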
The Log Generator

Slide 56
The Log Generator contd.,

Slide 57
The Log Generator contd.,

Slide 58
