Introduction to Hadoop
What is Hadoop?
 Apache Hadoop is a framework that allows for the distributed processing of
large data sets across clusters of commodity computers using a simple
programming model.

 It is an open-source data management framework with scale-out storage and
distributed processing.

Hadoop Key Characteristics
Economical:
1. It is open source and freely available.

2. No license is required.

Reliable:
1. High availability of data.

2. Data lost due to node failure can be recovered, since HDFS keeps
replicated copies of each block on other nodes.



Flexible:
1. The number of nodes is not fixed; you can add any number of nodes to the
cluster.

Scalable:
1. You can process large data sets.

2. Your data may range from kilobytes (KB) and megabytes (MB) up through
gigabytes (GB), terabytes (TB), petabytes (PB), exabytes (EB), zettabytes
(ZB), and yottabytes (YB).


Apache Hadoop Ecosystem


COMPONENTS OF THE HADOOP ECOSYSTEM


HDFS (Hadoop Distributed File System)

 The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware.

 It has many similarities with existing distributed file systems.

 HDFS is highly fault-tolerant and is designed to be deployed on low-cost
hardware.

 HDFS provides high-throughput access to application data and is suitable
for applications that have large data sets.

 The default HDFS block size is 64 MB (raised to 128 MB in Hadoop 2.x), and
it is configurable.

 HDFS was originally built as infrastructure for the Apache Nutch web
search engine project.

 HDFS is now an Apache Hadoop subproject.
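
As a concrete illustration, here is a minimal sketch of writing and reading
a file through the HDFS Java API; the cluster address and file path are
placeholders, not values from these notes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // fs.defaultFS normally comes from core-site.xml; it is set
            // here explicitly for clarity (host and port are placeholders).
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file (true = overwrite if it already exists).
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("Hello, HDFS!\n");
            }

            // Read the same file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
            fs.close();
        }
    }

Behind the scenes the file is split into blocks and each block is
replicated across DataNodes; the client code above never needs to know
any of that.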


Distributed Processing (MapReduce)

 Hadoop MapReduce is a software framework for easily writing applications
that process vast amounts of data (multi-terabyte data sets) in parallel
on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.

 A MapReduce job usually splits the input data set into independent chunks,
which are processed by the map tasks in a completely parallel manner.

 The framework sorts the outputs of the maps, which are then input to the
reduce tasks.

 Typically both the input and the output of the job are stored in a file
system.
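
The canonical MapReduce example is word count. The sketch below follows the
structure of the standard Hadoop tutorial program: the mapper emits
(word, 1) pairs and the reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The input and output paths are passed as command-line arguments and both
live in HDFS, matching the point above that a job's input and output are
stored in a file system.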


Pig
 Apache Pig is a high-level data-flow platform for executing the MapReduce
programs of Hadoop.

 The language for Pig is Pig Latin.

 Pig scripts are internally converted to MapReduce jobs and executed on
data stored in HDFS.

 Every task that can be achieved using Pig can also be achieved by writing
MapReduce code directly in Java.

 Its key strengths are ease of programming, optimization opportunities, and
extensibility, as the sketch below illustrates.
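
The Java snippet below drives Pig through its PigServer API to run a Pig
Latin word count; the HDFS paths and alias names are illustrative, not
taken from these notes.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // Run the script as MapReduce jobs on the cluster.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);

            // Each registerQuery call adds one Pig Latin statement.
            pig.registerQuery(
                "lines = LOAD '/user/demo/input' AS (line:chararray);");
            pig.registerQuery(
                "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery(
                "counts = FOREACH grouped GENERATE group, COUNT(words);");

            // Triggers the actual MapReduce execution and writes to HDFS.
            pig.store("counts", "/user/demo/pig-output");
        }
    }

Four Pig Latin statements replace what would be dozens of lines of
hand-written Java MapReduce code, which is the "ease of programming" point
above.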


Hive
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.

 Hive was initially developed by Facebook; later the Apache Software
Foundation took it up and developed it further as open source under the
name Apache Hive.

Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores the schema in a database (the metastore) and the processed data
in HDFS.

 It is designed for OLAP.

 It provides an SQL-like query language called HiveQL or HQL.

 It is familiar, fast, scalable, and extensible.
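
As a sketch of how HiveQL is used from an application: Hive ships a JDBC
driver, so a HiveServer2 instance can be queried much like any SQL
database. The host, port, table, and credentials below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 conventionally listens on port 10000.
            Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
            try (Statement stmt = conn.createStatement()) {
                // HiveQL looks like SQL but runs as batch jobs on Hadoop.
                stmt.execute("CREATE TABLE IF NOT EXISTS pageviews "
                    + "(url STRING, hits INT)");
                ResultSet rs = stmt.executeQuery(
                    "SELECT url, SUM(hits) FROM pageviews GROUP BY url");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
            conn.close();
        }
    }

Note how this fits the points above: the query language is familiar SQL,
but because queries compile to batch jobs, Hive suits OLAP-style analysis
rather than OLTP or row-level updates.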


HBase
 HBase is known as the Hadoop database.

 HBase is a column-oriented database management system that runs on top of
the Hadoop Distributed File System (HDFS).

 It is well suited for sparse data sets, which are common in many big data
use cases.

 HBase does not support a structured query language like SQL.

 HBase does support writing applications in Apache Avro, REST, and Thrift.
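
Instead of SQL, applications talk to HBase through client APIs. Below is a
minimal sketch using the HBase Java client; the table name, column family,
and values are placeholders, and the table is assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads cluster settings from hbase-site.xml on the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row "row1", column family "info",
                // qualifier "name". Absent cells cost nothing to store,
                // which is why sparse data suits HBase.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
                table.put(put);

                // Read the cell back by row key.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] value = result.getValue(Bytes.toBytes("info"),
                    Bytes.toBytes("name"));
                System.out.println(Bytes.toString(value));
            }
        }
    }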


Sqoop
The name Sqoop comes from SQL + Hadoop.
 Sqoop is a tool designed to transfer data between Hadoop and relational
database servers.

 It is used to import data from relational databases such as MySQL and
Oracle into Hadoop HDFS, and to export data from the Hadoop file system
back to relational databases; the commands sketched below show both
directions.

 Sqoop occupies a place in the Hadoop ecosystem to provide feasible
interaction between relational database servers and Hadoop's HDFS.
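
A minimal sketch of the two directions from the command line; the
connection string, credentials, table names, and HDFS directories are all
placeholders.

    # Import a relational table into HDFS.
    sqoop import \
      --connect jdbc:mysql://dbserver/sales \
      --username dbuser --password dbpass \
      --table orders \
      --target-dir /user/demo/orders

    # Export results from HDFS back into a relational table.
    sqoop export \
      --connect jdbc:mysql://dbserver/sales \
      --username dbuser --password dbpass \
      --table order_summary \
      --export-dir /user/demo/summary

Under the hood, Sqoop generates MapReduce jobs that read from or write to
the database in parallel.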


Flume (data streaming)

 Apache Flume is a system used for moving massive quantities of streaming
data into HDFS.

 Collecting log data from web servers' log files and aggregating it in
HDFS for analysis is one common example use case of Flume.
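
Flume agents are defined in a properties file that wires a source, a
channel, and a sink together. The sketch below, with placeholder names and
paths, tails a web server's log and delivers the events to HDFS.

    # One agent ("agent1") with one source, one channel, one sink.
    agent1.sources = logsrc
    agent1.channels = mem
    agent1.sinks = hdfssink

    # Source: tail the web server's access log.
    agent1.sources.logsrc.type = exec
    agent1.sources.logsrc.command = tail -F /var/log/httpd/access_log
    agent1.sources.logsrc.channels = mem

    # Channel: buffer events in memory between source and sink.
    agent1.channels.mem.type = memory

    # Sink: write the events into an HDFS directory for later analysis.
    agent1.sinks.hdfssink.type = hdfs
    agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:9000/logs/web/
    agent1.sinks.hdfssink.channel = mem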

Oozie (scheduler system)

 Apache Oozie is a scheduler system to run and manage Hadoop jobs in a
distributed environment.

 It allows multiple complex jobs to be combined and run in sequential
order to achieve a bigger task.

 Within a sequence of tasks, two or more jobs can also be programmed to
run parallel to each other.
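
Oozie workflows are described in XML as a graph of actions. The sketch
below chains two jobs sequentially (clean the output directory, then run a
MapReduce job); the names, paths, and schema version are illustrative.

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="clean-output"/>

        <!-- First step: remove any old output directory. -->
        <action name="clean-output">
            <fs>
                <delete path="${nameNode}/user/demo/output"/>
            </fs>
            <ok to="word-count"/>
            <error to="fail"/>
        </action>

        <!-- Second step runs only after the first succeeds. -->
        <action name="word-count">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/demo/input</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/demo/output</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>Workflow failed at ${wf:lastErrorNode()}</message>
        </kill>
        <end name="end"/>
    </workflow-app>

Parallel execution is expressed the same way with fork and join nodes,
which split the flow into branches and wait for all of them to finish.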


ZooKeeper (reliable cluster coordination service)

 The ZooKeeper framework was originally built at Yahoo! for accessing
their applications in an easy and robust manner.

 Later, Apache ZooKeeper became a standard for coordination services used
by Hadoop, HBase, and other distributed frameworks.

 Apache ZooKeeper is an open-source project that deals with maintaining
configuration information, naming, and providing distributed
synchronization and group services for various distributed applications.
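
A minimal sketch of the ZooKeeper Java client storing and reading a piece
of shared configuration; the ensemble address, znode path, and value are
placeholders.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble (3-second session timeout).
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        System.out.println("state: " + event.getState());
                    }
                });

            // Publish a configuration value at a znode if absent.
            String path = "/app-config";
            if (zk.exists(path, false) == null) {
                zk.create(path, "replicas=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any client of the same ensemble now reads the same value.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }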

Ambari (Hadoop cluster manager)

 Ambari is a completely open-source management platform for provisioning,
managing, monitoring, and securing Apache Hadoop clusters.

 Ambari enables system administrators to provision, manage, and monitor a
Hadoop cluster, and also to integrate Hadoop with the existing enterprise
infrastructure.