Introduction to Hadoop
What is Hadoop?
 Apache Hadoop is a framework that allows for the distributed processing of
large data sets across clusters of commodity computers using a simple
programming model.

 It is an open-source data management framework with scale-out storage and
distributed processing.

Hadoop Key Characteristics
Economical:
1. It is open source and freely available.

2. No license is required.

Reliable:
1. High availability of data.

2. Data lost due to node failure can be recovered, since HDFS keeps
replicated copies of each block on other nodes.



Flexible:
1. The number of nodes is not fixed; you can add any number of nodes to the
cluster.

Scalable:
1. You can process large data sets.

2. Your data may range from kilobytes (KB) and megabytes (MB) up through
gigabytes (GB), terabytes (TB), petabytes (PB), exabytes (EB), zettabytes
(ZB), and yottabytes (YB).


Apache Hadoop Ecosystem


COMPONENTS OF THE HADOOP ECOSYSTEM


HDFS (Hadoop Distributed File System)

 The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware.

 It has many similarities with existing distributed file systems.

 HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.

 HDFS provides high-throughput access to application data and is suitable
for applications that have large data sets.

 The default HDFS block size is 64 MB (raised to 128 MB in Hadoop 2.x), and
it is configurable.

 HDFS was originally built as infrastructure for the Apache Nutch web
search engine project.

 HDFS is now an Apache Hadoop subproject.
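
As a concrete illustration, here is a minimal sketch of writing and reading
a file through the HDFS Java API; the cluster address and file path are
placeholders, not values from these notes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // fs.defaultFS normally comes from core-site.xml; it is set
            // here explicitly for clarity (host and port are placeholders).
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file (true = overwrite if it already exists).
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("Hello, HDFS!\n");
            }

            // Read the same file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }

Behind the scenes the file is split into blocks and each block is
replicated across DataNodes; the client code above never needs to know
any of that.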


Distributed Processing (MapReduce)

 Hadoop MapReduce is a software framework for easily writing applications
that process vast amounts of data (multi-terabyte data sets) in parallel
on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.

 A MapReduce job usually splits the input data set into independent chunks,
which are processed by the map tasks in a completely parallel manner.

 The framework sorts the outputs of the maps, which are then input to the
reduce tasks.

 Typically both the input and the output of the job are stored in a file
system.
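
The canonical MapReduce example is word count. The sketch below follows the
structure of the standard Hadoop tutorial program: the mapper emits
(word, 1) pairs and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The input and output paths are passed as command-line arguments and both
live in HDFS, matching the point above that a job's input and output are
stored in a file system.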


Pig
 Apache Pig is a high-level data-flow platform for executing the MapReduce
programs of Hadoop.

 The language for Pig is Pig Latin.

 Pig scripts are internally converted to MapReduce jobs and executed on
data stored in HDFS.

 Every task that can be achieved using Pig can also be achieved by writing
MapReduce code directly in Java.

 Its key strengths are ease of programming, optimization opportunities, and
extensibility, as the sketch below illustrates.
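
The Java snippet below drives Pig through its PigServer API to run a Pig
Latin word count; the HDFS paths and alias names are illustrative, not
taken from these notes.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // Run the script as MapReduce jobs on the cluster.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            // Each registerQuery call adds one Pig Latin statement.
            pig.registerQuery(
                "lines = LOAD '/user/demo/input' AS (line:chararray);");
            pig.registerQuery(
                "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery(
                "counts = FOREACH grouped GENERATE group, COUNT(words);");

            // Triggers the actual MapReduce execution and writes to HDFS.
            pig.store("counts", "/user/demo/pig-output");
        }
    }

Four Pig Latin statements replace what would be dozens of lines of
hand-written Java MapReduce code, which is the "ease of programming" point
above.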


Hive
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.

 Hive was initially developed by Facebook; later the Apache Software
Foundation took it up and developed it further as open source under the
name Apache Hive.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores the schema in a database (the metastore) and the processed data
in HDFS.

 It is designed for OLAP.

 It provides an SQL-like query language called HiveQL or HQL.

 It is familiar, fast, scalable, and extensible.
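
As a sketch of how HiveQL is used from an application: Hive ships a JDBC
driver, so a HiveServer2 instance can be queried much like any SQL
database. The host, port, table, and credentials below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 conventionally listens on port 10000.
            Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
            try (Statement stmt = conn.createStatement()) {
                // HiveQL looks like SQL but runs as batch jobs on Hadoop.
                stmt.execute("CREATE TABLE IF NOT EXISTS pageviews "
                    + "(url STRING, hits INT)");
                ResultSet rs = stmt.executeQuery(
                    "SELECT url, SUM(hits) FROM pageviews GROUP BY url");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
            conn.close();
        }
    }

Note how this fits the points above: the query language is familiar SQL,
but because queries compile to batch jobs, Hive suits OLAP-style analysis
rather than OLTP or row-level updates.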


HBase
 HBase is known as the Hadoop database.

 HBase is a column-oriented database management system that runs on top of
the Hadoop Distributed File System (HDFS).

 It is well suited for sparse data sets, which are common in many big data
use cases.

 HBase does not support a structured query language like SQL.

 HBase does support writing applications in Apache Avro, REST, and Thrift.
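
Instead of SQL, applications talk to HBase through client APIs. Below is a
minimal sketch using the HBase Java client; the table name, column family,
and values are placeholders, and the table is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads cluster settings from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row "row1", column family "info",
                // qualifier "name". Absent cells cost nothing to store,
                // which is why sparse data suits HBase.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
                table.put(put);

                // Read the cell back by row key.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"),
                    Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }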


Sqoop
The name Sqoop comes from SQL + Hadoop.
 Sqoop is a tool designed to transfer data between Hadoop and relational
database servers.

 It is used to import data from relational databases such as MySQL and
Oracle into Hadoop HDFS, and to export data from the Hadoop file system
back to relational databases; the commands sketched below show both
directions.

 Sqoop occupies a place in the Hadoop ecosystem to provide feasible
interaction between relational database servers and Hadoop's HDFS.
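
A minimal sketch of the two directions from the command line; the
connection string, credentials, table names, and HDFS directories are all
placeholders.

    # Import a relational table into HDFS.
    sqoop import \
      --connect jdbc:mysql://dbserver/sales \
      --username dbuser --password dbpass \
      --table orders \
      --target-dir /user/demo/orders

    # Export results from HDFS back into a relational table.
    sqoop export \
      --connect jdbc:mysql://dbserver/sales \
      --username dbuser --password dbpass \
      --table order_summary \
      --export-dir /user/demo/summary

Under the hood, Sqoop generates MapReduce jobs that read from or write to
the database in parallel.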


Flume (data streaming)

 Apache Flume is a system used for moving massive quantities of streaming
data into HDFS.

 Collecting log data from web servers' log files and aggregating it in
HDFS for analysis is one common example use case of Flume.
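
Flume agents are defined in a properties file that wires a source, a
channel, and a sink together. The sketch below, with placeholder names and
paths, tails a web server's log and delivers the events to HDFS.

    # One agent ("agent1") with one source, one channel, one sink.
    agent1.sources = logsrc
    agent1.channels = mem
    agent1.sinks = hdfssink

    # Source: tail the web server's access log.
    agent1.sources.logsrc.type = exec
    agent1.sources.logsrc.command = tail -F /var/log/httpd/access_log
    agent1.sources.logsrc.channels = mem

    # Channel: buffer events in memory between source and sink.
    agent1.channels.mem.type = memory

    # Sink: write the events into an HDFS directory for later analysis.
    agent1.sinks.hdfssink.type = hdfs
    agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:9000/logs/web/
    agent1.sinks.hdfssink.channel = mem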

Oozie (scheduler system)

 Apache Oozie is a scheduler system to run and manage Hadoop jobs in a
distributed environment.

 It allows multiple complex jobs to be combined and run in sequential
order to achieve a bigger task.

 Within a sequence of tasks, two or more jobs can also be programmed to
run parallel to each other.
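
Oozie workflows are described in XML as a graph of actions. The sketch
below chains two jobs sequentially (clean the output directory, then run a
MapReduce job); the names, paths, and schema version are illustrative.

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="clean-output"/>

        <!-- First step: remove any old output directory. -->
        <action name="clean-output">
            <fs>
                <delete path="${nameNode}/user/demo/output"/>
            </fs>
            <ok to="word-count"/>
            <error to="fail"/>
        </action>

        <!-- Second step runs only after the first succeeds. -->
        <action name="word-count">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/demo/input</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/demo/output</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Workflow failed at ${wf:lastErrorNode()}</message>
        </kill>
        <end name="end"/>
    </workflow-app>

Parallel execution is expressed the same way with fork and join nodes,
which split the flow into branches and wait for all of them to finish.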


ZooKeeper (reliable cluster coordination service)

 The ZooKeeper framework was originally built at Yahoo! for accessing
their applications in an easy and robust manner.

 Later, Apache ZooKeeper became a standard for coordination services used
by Hadoop, HBase, and other distributed frameworks.

 Apache ZooKeeper is an open-source project that deals with maintaining
configuration information, naming, and providing distributed
synchronization and group services for various distributed applications.
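
A minimal sketch of the ZooKeeper Java client storing and reading a piece
of shared configuration; the ensemble address, znode path, and value are
placeholders.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble (3-second session timeout).
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        System.out.println("state: " + event.getState());
                    }
                });

            // Publish a configuration value at a znode if absent.
            String path = "/app-config";
            if (zk.exists(path, false) == null) {
                zk.create(path, "replicas=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any client of the same ensemble now reads the same value.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }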

Ambari (Hadoop cluster manager)

 Ambari is a completely open-source management platform for provisioning,
managing, monitoring, and securing Apache Hadoop clusters.

 Ambari enables system administrators to provision, manage, and monitor a
Hadoop cluster, and also to integrate Hadoop with the existing enterprise
infrastructure.