Unit 1-2
Big Data refers to large data sets that come from many sources and in many formats, and that can be processed and analyzed to find insights and patterns used to make informed decisions.
Slide 1
3 Vs of Big Data
Volume (Data Size)
➢ Terabytes
➢ Records
➢ Transactions
➢ Tables/Files
Variety (Data Sources, Complexity)
➢ Structured
➢ Unstructured
➢ Semi-structured
➢ All of the above
Velocity (Data Speed)
➢ Batch
➢ Near time
➢ Real time
➢ Streams
Slide 2
Big Data Challenges
➢ Distributed Computing
Slide 3
Big Data and the Hadoop Ecosystem
Slide 4
Big Data and the Hadoop Ecosystem
Slide 6
Power of Hadoop
Vast amount of storage
— Hadoop enables applications to work with thousands of computers and petabytes
of data. Over the past decade, computer professionals have realized that low-cost
“commodity” systems can be used together for high-performance computing
applications that once could be handled only by supercomputers.
Slide 8
(Diagram: a Hadoop cluster with master machine(s) and slave machines)
Hadoop Ecosystem
Slide 9
YARN
Slide 10
HDFS
Slide 11
HDFS
Slide 12
HDFS Architecture
Slide 13
Hadoop Philosophies
There are 3 basic philosophies on which Hadoop works:
a. All the basic software that starts and runs a Hadoop cluster consists of software daemons.
b. The daemons are based on a master/slave architecture.
c. The Hadoop framework has 2 broad parts: storage (HDFS) and processing (MapReduce).
1. HDFS (Storage Part)
○ Master Daemon - NameNode (high-end admin machine; 1 in number)
○ Back-up Daemon - Secondary NameNode (high-end admin machine; 1 in number; it checkpoints the NameNode's metadata rather than serving as a hot standby)
○ Slave Daemons - DataNode (commodity machines; many in number)
2. YARN - MapReduce (Processing Part)
○ Master Daemon - ResourceManager (high-end admin machine; 1 in number)
○ Slave Daemons - NodeManager (commodity machines; many in number)
Slide 14
Core Components of Hadoop Ecosystem
Slide 15
Core Components of Hadoop Ecosystem
➤ HDFS — A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System
(HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster
of computers, and data is written once, but read many times for analytics. It provides the foundation
for other tools, such as HBase.
➤ MapReduce — Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing, breaking jobs into mapping phases and reduce phases (thus the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the nature of how MapReduce works, Hadoop brings the processing to the data in a parallel fashion, resulting in fast implementation (see the word-count sketch after this list).
➤ HBase — A column-oriented NoSQL database built on top of HDFS, HBase is used for fast
read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure
that all of its components are up and running.
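To make the map/shuffle/reduce flow concrete, here is a minimal, self-contained Python sketch of the programming model (an illustration of the idea, not Hadoop's actual Java API): the classic word count, with the framework's sort-and-group step simulated in-process.

```python
# A self-contained word count illustrating the MapReduce model. The
# shuffle() step stands in for the sort-and-group that Hadoop performs
# between the map and reduce phases.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair -- here (word, 1) -- for every word.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: sort by key and group, so each reducer sees one key at a time.
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # Reduce: aggregate all values sharing a key -- here, sum the counts.
    for word, group in grouped:
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    data = ["big data needs big storage", "hadoop stores big data"]
    for word, count in reduce_phase(shuffle(map_phase(data))):
        print(word, count)
```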
Slide 16
Core Components of Hadoop Ecosystem
➤ Zookeeper — Zookeeper is Hadoop’s distributed coordination service. Designed to run over a cluster
of machines, it is a highly available service used for the management of Hadoop operations, and
many components of Hadoop depend on it.
➤ Oozie — A scalable workflow system, Oozie is integrated into the Hadoop stack, and is used to
coordinate execution of multiple MapReduce jobs. It is capable of managing a significant amount of
complexity, basing execution on external events that include timing and presence of required data.
➤ Pig — An abstraction over the complexity of MapReduce programming, the Pig platform includes an
execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its
compiler translates Pig Latin into sequences of MapReduce programs.
➤ Hive — An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop. Like Pig, Hive was developed as an abstraction layer, but one geared more toward database analysts familiar with SQL than with Java programming.
Slide 17
Core Components of Hadoop Ecosystem
The Hadoop ecosystem also contains several frameworks for integration with the rest of the enterprise:
➤ Sqoop is a connectivity tool for moving data between relational databases or data warehouses and Hadoop. Sqoop leverages the database to describe the schema of the imported/exported data, and MapReduce for parallelization and fault tolerance.
➤ Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines to HDFS. It is based on a simple and flexible architecture, provides streaming data flows, and leverages a simple, extensible data model, allowing you to move data from multiple machines within an enterprise into Hadoop.
Slide 18
Hadoop Distribution
In a Big Data project that involves Hadoop-related ecosystem technologies, you have two choices:
• Download each project you need separately, and try to assemble the technologies into a coherent, resilient, and consistent architecture.
• Use one of the most popular Hadoop distributions, which assemble the technologies for you.
A packaged Hadoop distribution ensures compatibility between all installed components, as well as ease of installation, configuration-based deployment, monitoring, and support.
There are a couple of differences between the two vendors (Cloudera and Hortonworks), but for a starting Big Data package they are equivalent, as long as you don’t pay attention to the proprietary add-ons.
Slide 19
Hadoop Distribution
Cloudera CDH
Cloudera adds a set of in-house components to the Hadoop-based components; these components are designed to give you better cluster management and search experiences.
• Impala: A real-time, parallelized, SQL-based engine that searches for data in HDFS (Hadoop Distributed File System) and HBase. Impala is considered to be the fastest querying engine within the Hadoop distribution vendors market, and it is a direct competitor of Spark from UC Berkeley (see the sketch after this list).
• Cloudera Manager: This is Cloudera’s console to manage and deploy Hadoop components within your
Hadoop cluster.
• Hue: A console that lets the user interact with the data and run scripts for the different Hadoop
components contained in the cluster.
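As a hedged illustration of how Impala is typically queried, here is a sketch using the third-party impyla Python client; the host, port, table, and column names are assumptions, not part of the slides.

```python
# Sketch only: assumes an Impala daemon reachable at impala-host:21050
# and a web_logs table; all names here are illustrative.
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cur = conn.cursor()
# Impala runs the SQL directly over data in HDFS/HBase, bypassing MapReduce.
cur.execute("SELECT COUNT(*) FROM web_logs WHERE status_code = 404")
print(cur.fetchone()[0])
```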
Slide 20
Hadoop Distribution - Cloudera
Slide 21
Hadoop Distribution - Hortonworks
Hortonworks is 100-percent open source and packages stable versions of the Hadoop projects, rather than the latest ones, in its distribution.
Slide 22
Creating the Foundation of a Long-Term Big Data Architecture
Basically, big data applications involve three major tasks: acquiring, processing, and storing data.
Nowadays, apart from these packaged Hadoop distributions, a number of Hadoop ecosystem core components and their supporting tools have evolved in the big data analytics field:
Slide 23
Data Acquisition
It can be large log files, streamed data, ETL processing outcomes, online unstructured data, or offline structured data.
Supportive Tools:
1. Apache Flume
2. Apache Sqoop
3. Apache HBase
Slide 24
Data Acquisition – Apache Flume
An external source triggers an event, and Flume starts streaming the data flow.
Use-case:
Slide 25
Architecture of Apache Flume Pipeline
1. An external event triggers the component.
2. The source starts receiving log data.
3. The source stores the received data in the channel.
4. The channel triggers the sink.
5. The sink reads the data from the channel.
6. The sink moves the data to the target, i.e., HDFS.
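The pipeline above maps one-to-one onto a Flume agent definition. Below is a minimal sketch in Python that writes such a configuration and launches the agent; the agent/source/channel/sink names (a1, r1, c1, k1), the port, and the HDFS path are illustrative assumptions.

```python
# A sketch only: writes a minimal Flume agent config (netcat source ->
# memory channel -> HDFS sink) and launches it. flume-ng must be on PATH.
import subprocess

config = """\
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.fileType = DataStream

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

with open("example.conf", "w") as f:
    f.write(config)

# --name must match the agent name used in the config file.
subprocess.run(["flume-ng", "agent", "--conf-file", "example.conf",
                "--name", "a1"], check=True)
```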
Slide 26
Data Acquisition – Sqoop
There are 2 basic types of data storage systems:
a. Structured data stores, i.e., traditional RDBMSs
b. HDFS data management (Hive for an SQL-like query language, HBase for a NoSQL database)
Sqoop is used to:
c. Import and export data between structured data stores and HDFS
d. Manage periodic transfers of data to HDFS and start analyzing the data
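As a sketch of item c, the following Python snippet shells out to the Sqoop CLI to import an RDBMS table into HDFS; the JDBC URL, credentials, table, and target directory are illustrative assumptions.

```python
# A sketch: invoking the Sqoop CLI from Python; all names are illustrative.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # source RDBMS
    "--username", "etl",
    "--table", "orders",                       # table to import
    "--target-dir", "/user/etl/orders",        # HDFS destination
    "--num-mappers", "4",                      # parallel map tasks
], check=True)
```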
Slide 27
Data Acquisition – HBase
HBase focuses on Create, Read, Update, and Delete (CRUD) operations on wide, sparse tables.
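A brief sketch of those CRUD operations using the third-party happybase Python client; it assumes a running HBase Thrift server, and the table, row key, and column family are illustrative.

```python
# A hedged sketch of CRUD against an HBase table via happybase.
import happybase

conn = happybase.Connection("localhost")
table = conn.table("clicks")

# Create/Update: put writes cells into a wide, sparse row.
table.put(b"user1", {b"info:page": b"/home", b"info:bytes": b"512"})

# Read: fetch a single row by key.
print(table.row(b"user1"))

# Delete: remove the row (or individual columns).
table.delete(b"user1")
```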
Slide 28
YARN
It is an umbrella framework that accommodates multiple processing frameworks and utilizes the cluster resources to the maximum.
➢ Receive input data, group related data together (Map), then aggregate the grouped data (Reduce)
➢ MR jobs can be implemented in any of the languages like Java, Python, Pig, Hive, etc.
Slide 29
YARN – Map Reduce Phases
Slide 30
Processing Tool - Oozie
Slide 31
Processing Languages – HIVE – Batch Processing Tool
a. High-level, SQL-like language
b. Used instead of writing DB-connectivity and MR code directly in Java- or Python-like languages
MR code is preferred for high-priority jobs; Hive for low-priority and long-running processing jobs.
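As a sketch of the Hive side of this comparison, the query below is plain SQL-like HiveQL submitted from Python via the third-party PyHive client; Hive compiles it into MapReduce jobs behind the scenes. The host, database, and table name are illustrative assumptions.

```python
# A sketch using PyHive; assumes HiveServer2 is reachable on port 10000.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cur = conn.cursor()

# Hive translates this SQL-like query into MapReduce jobs on the cluster.
cur.execute("SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page")
for page, hits in cur.fetchall():
    print(page, hits)
```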
Slide 32
Processing Languages – SPARK Streaming
Slide 33
Processing Languages – Apache Kafka
Slide 34
Processing Languages – Apache Kafka - Continued
Slide 35
Machine Learning Algorithms – Spark MLlib
➢ A set of machine learning APIs
➢ Train on your data and build a prediction model in a few lines of code
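A minimal sketch of the "few lines of code" claim using PySpark's DataFrame-based ML API; the toy training data and app name are assumptions purely for illustration.

```python
# A hedged sketch: fit a classifier with PySpark's ML API in a few lines.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training set: (label, feature vector). Real jobs would load from HDFS.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"],
)

# Fit a model in one line, then inspect the learned coefficients.
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)

spark.stop()
```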
Slide 36
NoSQL Stores
Characteristics of NoSQL data
a. Large amount of data
b. Scalability
c. Resiliency
d. High availability
Supportive tools
• HBase
- Column-oriented data maintenance
• Couchbase
- Document data store
- Relieves the RDBMS
- Front-end query handler
• ElasticSearch
- Real-time data analytics
- Full-text search process
Slide 37
NoSQL Stores - ElasticSearch + Logstash + Kibana
ELK Platform (ElasticSearch + Logstash + Kibana)
The three products work together to provide the best end-to-end platform for collecting, storing, and
visualizing data:
• Logstash lets you collect data from many kinds of sources, such as social data, logs, message queues, or sensors; it then supports data enrichment and transformation, and finally it transports the data to an indexing system such as ElasticSearch.
• ElasticSearch indexes the data in a distributed, scalable, and resilient system. It’s schemaless and provides libraries for multiple languages, so you can easily and quickly enable real-time search and analytics in your application.
• Kibana is a customizable user interface in which you can build simple to complex dashboards to explore and visualize data indexed by ElasticSearch.
Slide 38
ElasticSearch Products
Slide 39
NoSQL Landscape
NoSQL technologies are schemaless and highly scalable, and a couple of them are also highly distributed and high-performance.
Most of the time, they complement an architecture that includes an existing RDBMS technology by, for example, playing the role of cache, search engine, unstructured store, or volatile information store.
Slide 40
NoSQL Landscape
Key/Value
The first and easiest NoSQL data stores to understand are key/value data stores.
They are often used for high-performance use cases in which basic information needs to be
stored—for example, when session information may need to be written and retrieved very
quickly.
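A tiny, dependency-free Python sketch of the key/value idea applied to the session example above; a real key/value store adds persistence, distribution, and expiry on top of this access pattern.

```python
# A minimal in-memory stand-in for a key/value store: values are opaque
# blobs addressed only by key, which is what makes lookups so fast.
import json, time

store = {}

def put_session(session_id, data):
    # Write the session blob under its key (real stores add TTL/expiry here).
    store[session_id] = json.dumps({"ts": time.time(), **data})

def get_session(session_id):
    # Read back by key: a single hash lookup, no query planning involved.
    return json.loads(store[session_id])

put_session("sess-42", {"user": "alice", "cart": ["sku-1"]})
print(get_session("sess-42"))
```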
Slide 41
NoSQL Landscape
Column
The main benefit of using columnar databases is that you can quickly
access a large amount of data.
In columnar databases, all cells that are part of a column are stored contiguously.
Slide 42
NoSQL Landscape
Graph
Slide 43
Couchbase
Slide 44
ElasticSearch
ElasticSearch is a NoSQL technology that allows you to store, search, and analyze
data.
Monitoring ElasticSearch
Elastic provides a plug-in called Marvel for ElasticSearch that aims to monitor an
ElasticSearch cluster. This plug-in is part of Elastic’s commercial offer, but you
can use it for free in Development mode.
Marvel relies on Kibana, the visualization console of Elastic, and comes with a
bunch of visualization techniques that let an operator be really precise about
what happens in the cluster.
Slide 47
Data Analytics with Elasticsearch – Aggregation Framework
Elasticsearch comes with a powerful set of APIs that let users get the best of their data.
Bucket aggregations, as opposed to metric aggregations, can be nested, so you can compute real-time, multilevel aggregations.
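A hedged sketch of such a nested aggregation using the official elasticsearch-py client: a terms bucket per country, with a metric nested inside each bucket. The "logs" index and its fields are assumptions, and the call style assumes a reasonably recent client version.

```python
# A sketch: one terms bucket per country, with an avg metric in each bucket.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = {
    "size": 0,
    "aggs": {
        "per_country": {                      # bucket aggregation
            "terms": {"field": "country"},
            "aggs": {                         # metric nested in each bucket
                "avg_bytes": {"avg": {"field": "bytes"}}
            },
        }
    },
}
resp = es.search(index="logs", body=query)
for bucket in resp["aggregations"]["per_country"]["buckets"]:
    print(bucket["key"], bucket["avg_bytes"]["value"])
```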
Slide 49
Metric Aggregation - ElasticSearch
Metric aggregations are the last level of a set of aggregations and are used to compute metrics over a set of documents.
Slide 50
Metric Aggregation - ElasticSearch
Slide 51
Metric Aggregation - ElasticSearch
• The min and max aggregations return, respectively, the minimum and maximum of the bytes field.
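For completeness, here is that min/max metric aggregation as an elasticsearch-py sketch, again assuming a "logs" index with a numeric bytes field.

```python
# A sketch with the official elasticsearch-py client; recent clients prefer
# keyword arguments over body=, but body= remains widely supported.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = {
    "size": 0,  # we only want the aggregation results, not the documents
    "aggs": {
        "min_bytes": {"min": {"field": "bytes"}},
        "max_bytes": {"max": {"field": "bytes"}},
    },
}
resp = es.search(index="logs", body=query)
print(resp["aggregations"]["min_bytes"]["value"],
      resp["aggregations"]["max_bytes"]["value"])
```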
Slide 52
Clickstream Data
Slide 53
Anatomy of Clickstream Data
Slide 54
Clickstream Data
Slide 56
The Log Generator (contd.)
Slide 57
The Log Generator (contd.)
Slide 58