Hadoop Week 1


Course Topics

 Week 1 – Introduction to HDFS
 Week 2 – Setting Up Hadoop Cluster
 Week 3 – Map-Reduce Basics, Types and Formats
 Week 4 – PIG
 Week 5 – HIVE
 Week 6 – HBASE
 Week 7 – ZOOKEEPER
 Week 8 – SQOOP
What are we going to cover today?
Part 1
• Understand what Big Data is
• What is Hadoop
• Limitations of existing EDW solutions
• Hadoop differentiating factors & why Hadoop
• Hadoop Eco-System components
Part 2
• Introduction to HDFS
• HDFS Anatomy
Takeaways from Week 1 Training

• Understanding Big Data, its challenges and how to address them

• Basics of Hadoop

• Basics of HDFS
Part 1
What is Big Data?
 Lots of data (terabytes or even petabytes)
 Systems and enterprises generate huge amounts of data, from terabytes to
petabytes of information.
 An airline jet collects 10 terabytes of sensor data for every 30 minutes
of flying time.
 NYSE generates about one terabyte of new trade data per day, used to
perform stock-trading analytics to determine trends for optimal trades.
Facebook Example

 Facebook users spend 10.5 billion minutes (almost 20,000 years) online on
the social network.
 An average of 3.2 billion likes and comments are posted on Facebook every
day.
Twitter Example
 Twitter has over 500 million registered users.
 The USA's 141.8 million accounts represent 27.4 percent of all Twitter
users, well ahead of Brazil, Japan, the UK and Indonesia.
 79% of US Twitter users are more likely to recommend brands they follow.
 67% of US Twitter users are more likely to buy from brands they follow.
 57% of all companies that use social media for business use Twitter.
Instagram Example

 More than 50 million users gained over the past 2 years
 300 million pictures uploaded to Facebook a day (via Instagram)
 Instagram gains one new user every second
 One billion photos have been taken with the app
 58 photos are uploaded every minute, with 575 likes and 81 comments posted
by Instagram users every second
Data volume is growing exponentially

• Estimated global data volume:
  – 2011: 1.8 ZB
  – 2015: 7.9 ZB
• The world's information doubles every two years
• Over the next 10 years:
  – The number of servers worldwide will grow by 10x
  – The amount of information managed by enterprise data centers will grow
by 50x
  – The number of "files" enterprise data centers handle will grow by 75x

Source: http://www.emc.com/leadership/programs/digital-universe.htm, based
on the 2011 IDC Digital Universe Study
Hidden Treasure

 Insight into data can provide business advantage.
 Some key early indicators can mean fortunes to a business.
 More data enables more precise analysis.

Defining Big Data

• IBM's definition – Big Data characteristics
  – http://www-01.ibm.com/software/data/bigdata/

• Volume: 12 terabytes of Tweets created each day
• Velocity: Scrutinize 5 million trade events created each day to identify
potential fraud
• Variety: Sensor data, audio, video, click streams, log files and more
Unstructured data is exploding
What is Hadoop?

• An open-source framework for large-scale data processing

• Framework written in Java
  – Designed to solve problems that involve analyzing large data sets
(petabytes)
  – Programming model based on Google's MapReduce (see the sketch below)
  – Infrastructure based on Google's distributed file system, GFS
  – Handles large files / high data throughput and supports data-intensive
distributed applications
  – Scalable: applications can work with thousands of nodes and petabytes
of data
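
To make the MapReduce programming model concrete, here is a minimal sketch
of the classic word-count job written against Hadoop's Java MapReduce API.
It is not part of the original slides, and the class and path names are
illustrative: the map phase emits a (word, 1) pair for every word in its
input split, and the reduce phase sums those counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this would be launched with something along
the lines of "hadoop jar wordcount.jar WordCount /input /output", where the
input and output paths live in HDFS.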
What is Hadoop? (continued)

• Runs on a collection of commodity, shared-nothing servers

• Two key components:
  – HDFS: the Hadoop Distributed File System
  – MapReduce: a programming model for processing and generating large
datasets

[Architecture diagram: an HDFS NameNode (with a Secondary NameNode) managing
DataNodes, and a MapReduce JobTracker coordinating TaskTrackers co-located
with the DataNodes]
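
As a sketch of how a client program talks to these components, the short
Java program below reads a file from the cluster. The NameNode address
(hdfs://namenode:8020) and the file path (/data/sample.txt) are placeholder
assumptions: the client asks the NameNode where the file's blocks live, then
streams the actual bytes from the DataNodes, with the FileSystem API hiding
those details.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the NameNode; the address is a placeholder for this sketch.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    // open() asks the NameNode for block locations; the returned stream
    // then reads the bytes directly from the DataNodes holding the blocks.
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}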
Problems with the Current System
Solution: A Combined Storage and Compute Layer
Differentiating Factors

– Accessible: Hadoop runs on large clusters of commodity machines or on
cloud computing services such as Amazon EC2
– Robust: Because Hadoop runs on commodity clusters, it is designed with
the assumption of frequent hardware failure; it handles such failures
gracefully, and computation doesn't stop because of a few failed devices
or systems
– Scalable: Hadoop scales linearly to handle larger data sets by adding
more slave nodes to the cluster
– Simple: It's easy to write efficient parallel programs with Hadoop
Why Should I Care?

• Fault-tolerant hardware is expensive, whereas Hadoop is designed to run
on cheap commodity hardware
• Instead of complicated data replication and failover systems, Hadoop
automatically handles data replication and node failure (see the sketch
below)
• It does the hard work – you can focus on processing your data
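
To illustrate the replication behind that fault tolerance, here is a minimal
sketch using the HDFS Java API; the file path and replication factors are
illustrative. Each HDFS block is stored on several DataNodes (the
cluster-wide default copy count is the dfs.replication property, commonly
3), and when a node fails the NameNode re-replicates its blocks elsewhere.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for files created by this client; the
    // cluster-wide default is set in hdfs-site.xml (dfs.replication).
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);
    // Raise the replication factor of one important file to 5 copies;
    // the NameNode schedules the extra block copies in the background.
    fs.setReplication(new Path("/data/important.log"), (short) 5);
    fs.close();
  }
}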
What are the current challenges?
 Limitations of existing IT infrastructure and resources.
 Vertical scalability (upgrading servers and storage) is not always the
solution.
 RDBMSs are not designed to scale out.
 RDBMSs cannot handle unstructured data.
 A new approach to the problem is required, one that can:
  ◦ Process structured and unstructured data
  ◦ Analyse huge data sets running into several terabytes or petabytes
  ◦ Process and manage data economically
Hadoop Users
Hadoop Eco-System