
PRESENTED BY-

-NISHA CHOUDHARY
-PRIYA KAMTI
-CHANDRA KANTA SINGHA
CONTENTS
 Big Data
 Hadoop history

 Hadoop

 HDFS

 MapReduce

 YARN

 Why Hadoop

 Hadoop ecosystem

 Hive

 Pig

 Features of Hadoop

 Distributed cache in Hadoop

 Limitations and solutions


BIG DATA
Use-cases
 Facebook

 Twitter

 YouTube

 Digital media
 Healthcare / life sciences

 Financial services

 Law enforcement

 Retail (marketing)
HADOOP HISTORY
 Hadoop was created by Doug Cutting and Mike Cafarella; it
became a separate Apache project in 2006.
 The name comes from a yellow toy elephant that belonged to
Doug Cutting's son.
HADOOP
 It is an open-source distributed processing
framework.
 It manages data processing and storage for big
data applications.
 It runs on a cluster of machines.

 Core components of Hadoop are:

1) HDFS
2) MapReduce
3) YARN
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
 Primary data storage layer in Hadoop.
 Used in distributed data processing environments.

 Works in a master-slave topology.

 Runs two daemons: the NameNode (master) and the DataNode (slave).
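
The master-slave split can be illustrated with a small sketch: the NameNode keeps only metadata (which DataNodes hold which block), while the DataNodes hold the blocks themselves. The function name `plan_blocks`, the node names, the 128 MB block size, the replication factor of 3, and the round-robin placement are all assumptions made for illustration; real HDFS placement is rack-aware and more involved.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return NameNode-style metadata: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement is a simplification, not the real HDFS policy.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
layout = plan_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file
for block, replicas in layout.items():
    print(block, replicas)
```

Because every block lives on several DataNodes, losing one machine does not lose the file; the NameNode only needs to know where the surviving replicas are.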
MAPREDUCE
 Data processing layer of Hadoop.
 Processes data in two phases:

1) Map phase: applies the business logic to the input data
and emits key-value pairs.
2) Reduce phase: takes the output of the map phase as its
input and aggregates it.
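
The two phases can be sketched with the classic word-count example. This is plain Python that imitates the flow on a single machine, not the Hadoop API; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and the grouping step in between stands in for what the framework does between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data needs hadoop", "hadoop stores big data"])
print(counts)
```

On a real cluster each input split is mapped on a different node, which is why the map function only ever sees one record at a time.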
YARN (YET ANOTHER RESOURCE NEGOTIATOR)
Components are:
 Resource manager
Runs on the master node.
Knows the location and resources of each slave.
 Node manager
Runs on the slave machines.
Monitors the resource utilization of each container.
 Job submitter
The client submits the job to the resource manager.
The resource manager then contacts the relevant nodes.
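
The interaction between these components can be sketched as a toy model. The class and method names below are hypothetical and do not mirror the YARN API, and the "most free memory" placement rule is an assumption chosen for simplicity, not YARN's actual scheduler.

```python
class NodeManager:
    """Runs on a slave machine and tracks the memory used by its containers."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.memory_mb = memory_mb
        self.containers = []  # memory of each running container, in MB

    def free_mb(self):
        return self.memory_mb - sum(self.containers)

    def launch(self, memory_mb):
        self.containers.append(memory_mb)


class ResourceManager:
    """Runs on the master node and knows the resources of every slave."""

    def __init__(self, node_managers):
        self.nodes = node_managers

    def submit(self, memory_mb):
        """Place one container on the node with the most free memory."""
        best = max(self.nodes, key=lambda node: node.free_mb())
        if best.free_mb() < memory_mb:
            return None  # no node can host the request right now
        best.launch(memory_mb)
        return best.name


rm = ResourceManager([NodeManager("slave1", 4096), NodeManager("slave2", 8192)])
first = rm.submit(6144)
second = rm.submit(3072)
third = rm.submit(4096)
print(first, second, third)
```

The client only ever talks to the resource manager; which slave ends up running the container is decided from the resource information the node managers report.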
WHY HADOOP (11 REASONS)
HADOOP ECOSYSTEM
HIVE
PIG
FEATURES OF HADOOP
 Open source
 Distributed Processing
 Fault tolerance
 Reliability
 High availability
 Scalability
 Economic
 Easy to use
 Data Locality
DISTRIBUTED CACHE IN HADOOP
 It is a facility provided by the Hadoop
MapReduce framework.
 It can cache read-only text files, archives, jar
files, etc. that tasks need at run time.
 Benefits:

1) Store complex data

2) Data consistency

3) No single point of failure
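
A common use of the distributed cache is a map-side join: a small read-only lookup file is made available on every node, each task loads it once, and the mapper enriches its records locally. The sketch below imitates that pattern in plain Python; the file contents, the record layout, and the function names are hypothetical, and no Hadoop API is used.

```python
import csv
import io

# Stand-in for a small read-only file that the cache would copy to each node.
CACHED_FILE = "IN,India\nUS,United States\n"


def load_cache(text):
    """Read the cached file once per task into an in-memory lookup table."""
    return {code: name for code, name in csv.reader(io.StringIO(text))}


def map_record(record, lookup):
    """Map-side join: enrich a (user, country_code) record from the lookup."""
    user, code = record
    return user, lookup.get(code, "unknown")


lookup = load_cache(CACHED_FILE)
records = [("alice", "IN"), ("bob", "US"), ("carol", "CN")]
joined = [map_record(record, lookup) for record in records]
print(joined)
```

Since every task reads the same unchanged file, all mappers see identical lookup data, which is the consistency benefit listed above.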


LIMITATIONS AND SOLUTIONS

 Issue with small files


 Slow processing speed

 Support for Batch processing only

 No real time data processing

 No delta iteration

 Latency
 Not easy to use
 Security

 No Abstraction

 Vulnerable by nature

 No caching

 Lengthy code

 Uncertainty
THANK YOU…
