Bigdata-7 Ahs Merged
Agenda
• Hadoop Ecosystem
• Hadoop Distributed File System
• Concepts of Blocks in HDFS Architecture
• Role of Namenodes and Datanodes
Presented by : A.H.Shanthakumara
Big Data
Hadoop Ecosystem:
➢ Hadoop plays an integral part in almost all big data processing.
➢ It is almost impossible to work with big data at scale without the tools and
techniques of Hadoop.
➢ The Hadoop ecosystem is a framework of various complex and evolving tools
and components, such as HDFS and its architecture, MapReduce, YARN,
HBase and Hive.
➢ These tools may differ greatly from each other in architecture, but they all
derive their functionality from the scalability and power of Hadoop.
➢ They enable users to process large data sets in real time and provide tools
to support various types of Hadoop projects.
Hadoop Ecosystem:
[diagram of the Hadoop ecosystem components]
Hadoop Architecture:
➢ HDFS follows a master-slave architecture: the Namenode is the master that
manages the various Datanodes (slaves)
➢ The Namenode manages the HDFS cluster metadata; the Datanodes store
the actual data
➢ Clients present files and directories to the Namenode, and operations on
files and directories are performed by the Namenode
➢ A file is divided into one or more blocks which are stored in a group of
Datanodes; the Datanodes serve read and write requests from clients
➢ Datanodes can execute operations such as creation, deletion and
replication of blocks, depending on instructions from the Namenode
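The block model above can be sketched in a few lines of Python. The block size, replication factor and datanode names below are made-up illustrations (real HDFS defaults to 128 MB blocks and a replication factor of 3, and the Namenode's placement policy is far more sophisticated):

```python
# Minimal sketch of HDFS-style block placement (not the real Namenode logic).
# A file is split into fixed-size blocks; each block is replicated across
# several datanodes, and the namenode keeps only the block-to-node mapping.

BLOCK_SIZE = 4          # bytes, tiny for illustration (HDFS default: 128 MB)
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical node names

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split file contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` datanodes, round-robin."""
    placement = {}
    for idx, _ in enumerate(blocks):
        placement[idx] = [datanodes[(idx + r) % len(datanodes)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
print(len(blocks))              # 4 blocks of 4 bytes each
print(place_blocks(blocks)[0])  # ['dn1', 'dn2', 'dn3']
```

Note that the mapping (the "metadata") is kept separately from the block contents, mirroring the Namenode/Datanode split described above.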
Hadoop Architecture:
[HDFS architecture diagram]
Agenda
• Hadoop Ecosystem (continued)
– HDFS High availability
– Features of HDFS
– Other tools in Hadoop Ecosystem
• MapReduce
• YARN
• HIVE
• PIG
Features of HDFS:
➢ Three key features: data replication, data resilience and data integrity
➢ A file is divided into blocks, and replicated copies of the blocks are
distributed across the different Datanodes of a cluster
➢ Replication automatically provides resilience in case of unexpected data
loss or damage
➢ HDFS ensures data integrity throughout the cluster with the help of:
1. Maintaining transaction logs - helps to monitor every operation and
carry out effective auditing and recovery
Features of HDFS:
2. Validating checksums - an effective error-detection technique
➢ The receiver verifies the checksum of a message to ensure that it is
the same as that of the sent message
➢ The checksum is hidden to avoid tampering
3. Creating data blocks - maintains replicated copies of data blocks to
avoid corruption of a file due to the failure of a server
➢ Datanodes are sometimes called block servers and perform the
following functions:
1. Storage and retrieval of data on a local file system
2. Storage of the metadata of a block on a local file system
3. Conducting periodic validation of file checksums
Features of HDFS:
4. Reporting to the Namenode on a regular basis about the availability of
blocks
5. On-demand supply of metadata and data
6. Movement of data to connected nodes on the basis of the pipelining
model
➢ A connection between multiple Datanodes that supports movement of
data across servers is termed a pipeline
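Checksum validation (point 2 above) can be illustrated with a short sketch. Real HDFS computes a CRC per fixed-size chunk of each block; the CRC32 function and dict storage below are simplifying assumptions, not HDFS's actual on-disk format:

```python
import zlib

# Sketch of checksum-based integrity checking, in the spirit of HDFS
# (which stores a CRC checksum alongside each chunk of block data).

def write_with_checksum(data: bytes):
    """Store data together with its CRC32 checksum."""
    return {"data": data, "crc": zlib.crc32(data)}

def read_and_verify(stored):
    """Recompute the checksum on read; raise if the data was corrupted."""
    if zlib.crc32(stored["data"]) != stored["crc"]:
        raise IOError("checksum mismatch: block is corrupt")
    return stored["data"]

block = write_with_checksum(b"some block contents")
assert read_and_verify(block) == b"some block contents"

block["data"] = b"tampered contents"   # simulate corruption on disk
try:
    read_and_verify(block)
except IOError as e:
    print(e)   # checksum mismatch: block is corrupt
```

When a Datanode detects such a mismatch, HDFS can re-replicate the block from a healthy replica, which is where replication and integrity checking work together.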
MapReduce:
➢ Takes data as input, processes it, generates the output and returns the
required answers
➢ Based on a parallel programming framework to process large amounts of
data dispersed across different systems
➢ Facilitates the processing and analysis of both unstructured and
semi-structured data collected from different sources
➢ Primarily supports two operations: map and reduce
➢ Works on a master-worker approach, in which the master process
controls and directs the entire activity
➢ The master collects, segregates and delegates data among the
different workers
MapReduce:
➢ Can be summed up in the following steps:
1. Each worker receives data from the master, processes it and sends
the generated result back to the master
2. Workers run the same code on the received data; however, they are
not aware of their co-workers and do not communicate or interact
with them
3. The master receives the results from each worker process, integrates
and processes them, and generates the final output
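The steps above can be sketched as a toy word count, the canonical MapReduce example. Here the "workers" are plain function calls in one process; a real framework would run the map and reduce tasks on separate nodes and handle the grouping over the network:

```python
from collections import defaultdict

# Toy MapReduce word count: map emits (word, 1) pairs, the framework groups
# pairs by key, and reduce sums the counts for each word.

def map_fn(document: str):
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(documents):
    # "Master" delegates each document to a map worker and collects the pairs.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in grouped.items())

result = mapreduce(["big data big hadoop", "hadoop big"])
print(result["big"])   # 3
```

Note how the map workers never talk to one another (step 2 above); all coordination goes through the grouping step performed by the framework.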
MapReduce:
[MapReduce data-flow diagram]
Agenda
• Hadoop Ecosystem (continued)
– Introducing HBase
– HBase Architecture
– HBase and HDFS
– HBase and RDBMS
– HBase Read and Write
Introducing HBase:
➢ A distributed, column-oriented database built on top of the Hadoop file
system
➢ A canonical example is the webtable: a table of crawled web pages
accessed by web page URL
➢ The webtable is large, containing over a billion rows; parsing and batch
analytics are MapReduce jobs that continuously run against it
➢ HBase is not relational, but it still has the capacity to do what an
RDBMS cannot:
➢ Host large, sparsely populated tables on clusters built from
commodity hardware
Introducing HBase:
➢ Stores data in tables with rows and columns; the intersection of a row
and a column is called a cell
➢ Each cell in an HBase table has an associated attribute termed a
"version": a timestamp that uniquely identifies the cell
➢ Facilitates reading/writing big data randomly and efficiently in
real time
➢ Versioning helps keep track of, and allows access to, previous versions
of the cell contents
➢ Provides various useful data-processing features
➢ Supports distributed environments and multidimensional maps
➢ Allows storage of results for later analytical processing
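Cell versioning can be pictured as a multidimensional map from (row, column, timestamp) to value. The dict-based sketch below illustrates only the data model; it is not HBase's API or storage format, and the `max_versions` pruning imitates HBase's per-column-family VERSIONS setting:

```python
# Sketch of HBase's data model: a sparse, multidimensional map where each
# (row, column) cell keeps several timestamped versions of its value.

class VersionedTable:
    def __init__(self, max_versions=3):
        self.cells = {}            # (row, column) -> {timestamp: value}
        self.max_versions = max_versions

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), {})
        versions[timestamp] = value
        # Keep only the newest max_versions entries (like HBase's VERSIONS).
        for ts in sorted(versions)[:-self.max_versions]:
            del versions[ts]

    def get(self, row, column, timestamp=None):
        versions = self.cells[(row, column)]
        if timestamp is None:          # default read: latest version
            timestamp = max(versions)
        return versions[timestamp]

t = VersionedTable()
t.put("com.example", "contents:html", 1, "<html>v1</html>")
t.put("com.example", "contents:html", 2, "<html>v2</html>")
print(t.get("com.example", "contents:html"))      # latest: <html>v2</html>
print(t.get("com.example", "contents:html", 1))   # older version by timestamp
```

A read with no timestamp returns the newest version, while older versions remain addressable until they age out, which is the "keeping track of previous versions" behavior described above.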
HBase Architecture:
➢ The HBase architecture consists mainly of four components:
➢ HMaster, HRegionServer, HRegions and ZooKeeper
HBase Architecture:
HMaster:
➢ It is the implementation of the Master server in the HBase architecture.
➢ It acts as a monitoring agent for all Region Server instances of the cluster
and as an interface for all metadata changes.
➢ The following are important roles performed by HMaster in HBase:
1. Plays a vital role in performance and in maintaining the nodes of the
cluster.
2. Provides administrative operations and distributes services to the
different region servers.
3. Assigns regions to region servers.
4. Controls load balancing and handles the load over the nodes present
in the cluster.
5. Takes responsibility when a client wants to change a schema or
perform any metadata operations.
HBase Architecture:
HRegions:
➢ Tables are automatically partitioned horizontally into regions by
HBase
➢ Regions are the units that get spread over a cluster in HBase
➢ Each region consists of a subset of the rows of a table
➢ A region contains multiple stores, one for each column family
➢ Each store consists mainly of two components: the MemStore and
HFiles
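The MemStore/HFile split can be sketched as a write buffer that flushes to immutable files. The flush threshold and dict-based "files" below are made-up simplifications; real HBase flushes a MemStore of around 128 MB into an HFile stored on HDFS:

```python
# Sketch of a store's write path: recent writes go to an in-memory MemStore;
# when it grows past a threshold it is flushed to an immutable "HFile".
# Reads consult the MemStore first, then the HFiles, newest first.

class Store:
    def __init__(self, flush_threshold=3):
        self.memstore = {}       # in-memory, mutable
        self.hfiles = []         # immutable snapshots, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.hfiles.append(dict(self.memstore))  # flush: freeze a snapshot
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):          # newest flush wins
            if key in hfile:
                return hfile[key]
        raise KeyError(key)

s = Store()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    s.put(k, v)
print(s.get("a"))   # 9 (from the MemStore, shadowing the flushed value 1)
print(s.get("b"))   # 2 (served from the flushed HFile)
```

Because flushed files are immutable, newer writes simply shadow older ones at read time; HBase later compacts many small HFiles into fewer large ones.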
HBase Architecture:
HRegionServer:
➢ When a Region Server receives write and read requests from a client, it
assigns the request to the specific region where the actual column family
resides.
➢ It is responsible for serving and managing the regions, i.e. the data
present in a distributed cluster.
➢ The region servers run on the Datanodes present in the Hadoop cluster.
➢ HMaster can contact multiple HRegionServers, which perform the
following functions:
✓ Hosting and managing regions
✓ Splitting regions automatically
✓ Handling read and write requests
✓ Communicating with the client directly
HBase Architecture:
ZooKeeper:
➢ It is a centralized monitoring server which maintains configuration
information and provides distributed synchronization.
➢ If a client wants to communicate with regions, it has to approach
ZooKeeper first.
➢ Services provided by ZooKeeper:
✓ Maintains configuration information
✓ Provides distributed synchronization
✓ Establishes client communication with region servers
✓ Provides ephemeral nodes which represent the different
region servers
✓ Master servers use the ephemeral nodes to discover available
servers in the cluster
✓ Tracks server failures and network partitions
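Ephemeral-node discovery can be mimicked with a small registry: each live region server owns a session, and its node disappears when the session ends. This is only an analogy for ZooKeeper's behavior, not its actual API, and the paths below are hypothetical:

```python
# Sketch of ZooKeeper-style ephemeral nodes: a node exists only while the
# session that created it is alive, so the master discovers live region
# servers simply by listing the registry.

class Registry:
    def __init__(self):
        self.ephemeral = {}        # node path -> owning session id

    def register(self, session_id, path):
        self.ephemeral[path] = session_id

    def session_closed(self, session_id):
        # All ephemeral nodes owned by the dead session vanish automatically.
        self.ephemeral = {p: s for p, s in self.ephemeral.items()
                          if s != session_id}

    def live_servers(self):
        return sorted(self.ephemeral)

zk = Registry()
zk.register("session-1", "/hbase/rs/server1")
zk.register("session-2", "/hbase/rs/server2")
zk.session_closed("session-1")     # server1 crashed or its session timed out
print(zk.live_servers())           # ['/hbase/rs/server2']
```

This is why a session timeout is enough to detect a failed server: no explicit "deregister" call is ever needed.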
Features of HBase:
1. Consistency: not strictly an ACID implementation, but supports
consistent read and write operations
2. Sharding: allows distribution of data using the underlying file
system
3. High availability: the implementation of region servers ensures
recoverability
4. Client API: supports programmatic access using Java APIs
5. Support for IT operations: provides a set of built-in web pages for
viewing detailed operational insights
Agenda
• Hadoop Ecosystem (continued)
– Read and Write Operations in HDFS
– Access Using the Command-Line Interface
– Access Using the Java API
[Diagrams: copying a file from the local filesystem to HDFS, and reading a file from HDFS]