Hadoop Week 1


Course Topics

 Week 1 – Introduction to HDFS
 Week 2 – Setting Up Hadoop Cluster
 Week 3 – Map-Reduce Basics, Types and Formats
 Week 4 – PIG
 Week 5 – HIVE
 Week 6 – HBASE
 Week 7 – ZOOKEEPER
 Week 8 – SQOOP
What are we going to cover today?
Part 1
• Understand what Big Data is
• What is Hadoop
• Limitations of existing EDW solutions
• Hadoop differentiating factors & why Hadoop
• Hadoop Eco-System components
Part 2
• Introduction to HDFS
• HDFS Anatomy
Takeaways from Week 1 Training

• Understanding Big Data, its challenges and how to address them

• Basics of Hadoop

• Basics of HDFS
Part 1
What is Big Data?
 Lots of data (terabytes or even petabytes)
 Systems and enterprises generate huge amounts of data, from terabytes to
petabytes of information.
 An airline jet collects 10 terabytes of sensor data for every 30 minutes
of flying time.
 NYSE generates about one terabyte of new trade data per day, used to
perform stock-trading analytics to determine trends for optimal trades.
Facebook Example

 Facebook users spend 10.5 billion minutes (almost 20,000 years) online on
the social network.
 An average of 3.2 billion likes and comments are posted on Facebook every
day.
Twitter Example
 Twitter has over 500 million registered users.
 The USA's 141.8 million accounts represent 27.4 percent of all Twitter
users, well ahead of Brazil, Japan, the UK and Indonesia.
 79% of US Twitter users are more likely to recommend brands they follow.
 67% of US Twitter users are more likely to buy from brands they follow.
 57% of all companies that use social media for business use Twitter.
Instagram Example

 More than 50 million users gained over the past 2 years
 300 million pictures uploaded to Facebook a day (via Instagram)
 Instagram gains one new user every second
 One billion photos have been taken with the app
 58 photos are uploaded every minute, with 575 likes and 81 comments posted
by Instagram users every second
Data volume is growing exponentially

• Estimated global data volume:
  – 2011: 1.8 ZB
  – 2015: 7.9 ZB
• The world's information doubles every two years
• Over the next 10 years:
  – The number of servers worldwide will grow by 10x
  – The amount of information managed by enterprise data centers will grow
by 50x
  – The number of "files" enterprise data centers handle will grow by 75x

Source: http://www.emc.com/leadership/programs/digital-universe.htm, based
on the 2011 IDC Digital Universe Study
Hidden Treasure

 Insight into data can provide business advantage.
 Some key early indicators can mean fortunes to a business.
 More data enables more precise analysis.

Defining Big Data

• IBM's definition – Big Data characteristics
  – http://www-01.ibm.com/software/data/bigdata/

• Volume: 12 terabytes of Tweets created each day
• Velocity: Scrutinize 5 million trade events created each day to identify
potential fraud
• Variety: Sensor data, audio, video, click streams, log files and more
Unstructured data is exploding
What is Hadoop?

• An open-source framework for large-scale data processing

• Framework written in Java
  – Designed to solve problems that involve analyzing large data sets
(petabytes)
  – Programming model based on Google's MapReduce (see the sketch below)
  – Infrastructure based on Google's distributed file system, GFS
  – Handles large files / high data throughput and supports data-intensive
distributed applications
  – Scalable: applications can work with thousands of nodes and petabytes
of data
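
To make the MapReduce programming model concrete, here is a minimal sketch
of the classic word-count job written against Hadoop's Java MapReduce API.
It is not part of the original slides, and the class and path names are
illustrative: the map phase emits a (word, 1) pair for every word in its
input split, and the reduce phase sums those counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this would be launched with something along
the lines of "hadoop jar wordcount.jar WordCount /input /output", where the
input and output paths live in HDFS.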
What is Hadoop? (continued)

• Runs on a collection of commodity, shared-nothing servers

• Two key components:
  – HDFS: the Hadoop Distributed File System
  – MapReduce: a programming model for processing and generating large
datasets

[Architecture diagram: an HDFS NameNode (with a Secondary NameNode) managing
DataNodes, and a MapReduce JobTracker coordinating TaskTrackers co-located
with the DataNodes]
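
As a sketch of how a client program talks to these components, the short
Java program below reads a file from the cluster. The NameNode address
(hdfs://namenode:8020) and the file path (/data/sample.txt) are placeholder
assumptions: the client asks the NameNode where the file's blocks live, then
streams the actual bytes from the DataNodes, with the FileSystem API hiding
those details.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the NameNode; the address is a placeholder for this sketch.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    // open() asks the NameNode for block locations; the returned stream
    // then reads the bytes directly from the DataNodes holding the blocks.
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}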
Problems with the Current System
Solution: A Combined Storage and Compute Layer
Differentiating Factors

– Accessible: Hadoop runs on large clusters of commodity machines or on
cloud computing services such as Amazon EC2
– Robust: Because Hadoop runs on commodity clusters, it is designed with
the assumption of frequent hardware failure; it handles such failures
gracefully, and computation doesn't stop because of a few failed devices
or systems
– Scalable: Hadoop scales linearly to handle larger data sets by adding
more slave nodes to the cluster
– Simple: It's easy to write efficient parallel programs with Hadoop
Why Should I Care?

• Fault-tolerant hardware is expensive, whereas Hadoop is designed to run
on cheap commodity hardware
• Instead of complicated data replication and failover systems, Hadoop
automatically handles data replication and node failure (see the sketch
below)
• It does the hard work – you can focus on processing your data
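
To illustrate the replication behind that fault tolerance, here is a minimal
sketch using the HDFS Java API; the file path and replication factors are
illustrative. Each HDFS block is stored on several DataNodes (the
cluster-wide default copy count is the dfs.replication property, commonly
3), and when a node fails the NameNode re-replicates its blocks elsewhere.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for files created by this client; the
    // cluster-wide default is set in hdfs-site.xml (dfs.replication).
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);
    // Raise the replication factor of one important file to 5 copies;
    // the NameNode schedules the extra block copies in the background.
    fs.setReplication(new Path("/data/important.log"), (short) 5);
    fs.close();
  }
}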
What are the current challenges?
 Limitations of existing IT infrastructure and resources.
 Vertical scalability (upgrading servers and storage) is not always the
solution.
 RDBMSs are not designed to scale out.
 RDBMSs cannot handle unstructured data.
 A new approach to the problem is required, one that can:
  ◦ Process structured and unstructured data
  ◦ Analyse huge data sets running into several terabytes or petabytes
  ◦ Process and manage data economically
Hadoop Users
Hadoop Eco-System