Another Intro to Hadoop

Context Optional
April 2, 2010
By Adeel Ahmad
About Me
Too much data

User-generated, social networks, logging and

Google, Yahoo and others need to index the entire
internet and return search results in milliseconds

NYSE generates 1 TB data/day

Facebook has 400 terabytes of stored data and
ingests 20 terabytes of new data per day. Hosts
approx. 10 billion photos, 1 petabyte (2009)
Can't scale

Challenge to both store and analyze datasets

Slow to process

Unreliable machines (CPUs and disks can go down)

Not affordable (faster, more reliable machines are expensive)
Solve it through software

Split up the data

Run jobs in parallel

Sort and combine to get the answer

Schedule across arbitrarily-sized cluster

Handle fault-tolerance

Since even the best systems break down, use cheap
commodity computers
Enter Hadoop

Open-source Apache project written in Java

MapReduce implementation for parallelizing

Distributed filesystem for redundant data

Many other sub-projects

Meant for cheap, heterogeneous hardware

Scale up by simply adding more cheap hardware

Open-source Apache project

Grew out of Apache Nutch project, an open-source
search engine

Two Google papers

MapReduce (2004): programming model for parallel
processing of large amounts of data

Google File System (2003): fault-tolerant storage
of large amounts of data
MapReduce

Operates exclusively on <key, value> pairs

Split the input data into independent chunks

Processed by the map tasks in parallel

Sort the outputs of the maps

Send to the reduce tasks

Write to output files
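
The flow above can be sketched in a few lines of plain Ruby. This is only an in-memory illustration under simplifying assumptions, not how Hadoop is implemented: `map_reduce`, `mapper`, and `reducer` are hypothetical names, and the sample log lines are made up.

```ruby
# Minimal in-memory sketch of the MapReduce flow (not Hadoop's implementation).
# `map_reduce`, `mapper`, and `reducer` are hypothetical names for illustration.

def map_reduce(chunks, mapper, reducer)
  # Map: each chunk is independent, so a real cluster could run these in parallel.
  pairs = chunks.flat_map { |chunk| mapper.call(chunk) }

  # Sort: order the intermediate <key, value> pairs by key.
  sorted = pairs.sort_by { |key, _value| key }

  # Reduce: hand each key and all of its values to the reducer.
  sorted.group_by { |key, _value| key }.map do |key, group|
    reducer.call(key, group.map { |_key, value| value })
  end
end

# Example job: longest line length seen for each log level.
mapper  = ->(chunk) { chunk.map { |line| [line.split.first, line.length] } }
reducer = ->(key, values) { [key, values.max] }

# Made-up input, already split into two independent chunks.
chunks = [
  ["INFO started", "WARN disk almost full"],
  ["INFO done", "WARN retrying"]
]

map_reduce(chunks, mapper, reducer).each do |key, value|
  puts "#{key}\t#{value}"
end
```

On a real cluster the map calls run on different machines and the sorted pairs are shipped over the network to the reduce tasks; the sketch only keeps the order of the phases.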

Hadoop Distributed File System

Files split into large blocks

Designed for streaming reads and appending writes,
not random access

3 replicas for each piece of data by default

Data can be in encoded/archived formats
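
A rough sketch of what block splitting and replication mean for storage. The 64 MB block size is an assumption (block size is configurable), the 1 GB file is hypothetical, and the method names are made up for illustration; only the replication factor of 3 comes from the slide.

```ruby
# Back-of-the-envelope HDFS storage arithmetic.
# ASSUMPTION: 64 MB blocks (configurable in practice).
# The replication factor of 3 is the default mentioned above.

MB = 1024 * 1024
BLOCK_SIZE = 64 * MB
REPLICATION = 3

# Number of blocks a file of `file_size` bytes is split into (rounded up).
def block_count(file_size, block_size = BLOCK_SIZE)
  (file_size + block_size - 1) / block_size
end

# Raw disk space used across the cluster once every block is replicated.
def raw_storage(file_size, replication = REPLICATION)
  file_size * replication
end

file_size = 1024 * MB # hypothetical 1 GB log file
puts "blocks:        #{block_count(file_size)}"
puts "block copies:  #{block_count(file_size) * REPLICATION}"
puts "raw storage:   #{raw_storage(file_size) / MB} MB"
```

The point of the arithmetic: every byte stored costs three bytes of cluster disk by default, which is the price paid for surviving disk and machine failures.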
Self-managing and self-healing

Bring the computation as physically close to the data
as possible for best bandwidth, instead of copying the data

Tries to use same node, then same rack, then same
data center

Auto-replication if data lost

Auto-kill and restart of tasks on another node if
taking too long or flaky
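
The "same node, then same rack, then same data center" preference can be sketched as a simple ranking. This is a toy illustration, not Hadoop's scheduler: the `Node` struct, `distance`, `pick_node`, and the node names are all hypothetical.

```ruby
# Sketch of locality-aware task placement.
# Node, distance, and pick_node are hypothetical, not Hadoop scheduler APIs.

Node = Struct.new(:name, :rack, :data_center)

# Lower score means closer to the node that holds the data.
def distance(candidate, data_node)
  same_dc = candidate.data_center == data_node.data_center
  return 0 if candidate.name == data_node.name          # same node
  return 1 if same_dc && candidate.rack == data_node.rack # same rack
  return 2 if same_dc                                    # same data center
  3                                                      # anywhere else
end

# Pick the free node that is closest to the data.
def pick_node(free_nodes, data_node)
  free_nodes.min_by { |node| distance(node, data_node) }
end

data_node  = Node.new("n1", "rack-a", "dc-1")
free_nodes = [
  Node.new("n7", "rack-c", "dc-2"),
  Node.new("n4", "rack-b", "dc-1"),
  Node.new("n2", "rack-a", "dc-1")
]

puts pick_node(free_nodes, data_node).name
```

If the node holding the data is busy, the task lands on a rack neighbor; only when the whole rack is busy does it move further away.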
Hadoop Streaming

Don't need to write mappers and reducers in Java

Text-based API that exposes stdin and stdout

Use any language

Ruby gems: Wukong, Mandy
Example: Word count
# mapper.rb
STDIN.each_line do |line|
  word_count = {}
  line.split.each do |word|
    word_count[word] ||= 0
    word_count[word] += 1
  end

  word_count.each do |k,v|
    puts "#{k}\t#{v}"
  end
end

# reducer.rb
word = nil
count = 0
STDIN.each_line do |line|
  wordx, countx = line.strip.split
  if wordx != word
    puts "#{word}\t#{count}" unless word.nil?
    word = wordx
    count = 0
  end
  count += countx.to_i
end
puts "#{word}\t#{count}" unless word.nil?
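
Before submitting a streaming job it can help to dry-run the logic locally. The sketch below wraps the same mapper and reducer logic in methods and puts a plain sort between them to stand in for Hadoop's sort step. The method names and the sample sentences are hypothetical; Hadoop itself only sees stdin and stdout.

```ruby
# Local dry run of the streaming contract: map, sort, reduce.
# map_line and reduce_lines are hypothetical wrappers for illustration.

# Same logic as the mapper, but returning lines instead of printing them.
def map_line(line)
  word_count = Hash.new(0)
  line.split.each { |word| word_count[word] += 1 }
  word_count.map { |word, count| "#{word}\t#{count}" }
end

# Same logic as the reducer: relies on the input lines being sorted by key,
# so all counts for one word arrive next to each other.
def reduce_lines(sorted_lines)
  output = []
  word = nil
  count = 0
  sorted_lines.each do |line|
    wordx, countx = line.strip.split
    if wordx != word
      output << "#{word}\t#{count}" unless word.nil?
      word = wordx
      count = 0
    end
    count += countx.to_i
  end
  output << "#{word}\t#{count}" unless word.nil?
  output
end

input  = ["the quick fox", "the lazy dog", "the fox"]
mapped = input.flat_map { |line| map_line(line) }
puts reduce_lines(mapped.sort) # the sort stands in for Hadoop's sort step
```

With the two scripts saved as files, the same check is the shell pipeline `cat input.txt | ruby mapper.rb | sort | ruby reducer.rb`.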
Who Uses Hadoop?

Yahoo 

Facebook 

Netflix 

eHarmony 

LinkedIn 

NY Times 

Digg 
Lots more...
Developing With Hadoop

Don't need a whole cluster to start

Standalone
– Non-distributed
– Single Java process

Pseudo-distributed
– Just like full-distributed
– Components in separate processes

Full distributed
– Now you need a real cluster
How to Run Hadoop

Linux, OSX, Windows, Solaris

Just need Java, SSH access to nodes

XML config files

Download core Hadoop

Can do everything we mentioned

Still needs user to play with config files and
create scripts
How to Run Hadoop

Cloudera Inc. provides their own distributions and
enterprise support and training for Hadoop

Core Hadoop plus patches

Bundled with command-line scripts, Hive, Pig

Publish AMI and scripts for EC2

Best option for your own cluster
How to Run Hadoop

Amazon Elastic MapReduce (EMR)

GUI or command-line cluster management

Supports Streaming, Hive, Pig

Grabs data and MapReduce code from S3 buckets and
puts it into HDFS

Auto-shutdown EC2 instances

Cloudera now has scripts for EMR

Easiest option

Pig

High-level scripting language developed by Yahoo

Describes multi-step jobs

Translated into MapReduce tasks

Grunt command-line interface
Ex: Find top 5 most visited pages by users aged 18 to 25
Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >=18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;
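
For readers new to this style of script, here is roughly what the steps above compute, written as plain Ruby over tiny in-memory data. It is only a mental model: the sample rows and the `top_pages` method are made up, and on a cluster each step is translated into MapReduce tasks instead of running in one process.

```ruby
# Plain-Ruby walk-through of the same pipeline on tiny in-memory data.
# The sample rows and top_pages are made up for illustration.

users = [
  { name: "amy", age: 22 },
  { name: "bob", age: 40 },
  { name: "cat", age: 18 }
]
pages = [
  { user: "amy", url: "/home" },
  { user: "amy", url: "/news" },
  { user: "bob", url: "/home" },
  { user: "cat", url: "/home" }
]

def top_pages(users, pages, limit = 5)
  # FILTER: keep users aged 18 to 25.
  names = users.select { |u| u[:age].between?(18, 25) }.map { |u| u[:name] }
  # JOIN: keep the page views made by those users.
  joined = pages.select { |p| names.include?(p[:user]) }
  # GROUP + COUNT: clicks per url.
  clicks = joined.group_by { |p| p[:url] }.map { |url, views| [url, views.size] }
  # ORDER ... DESC, then keep the top `limit` rows.
  clicks.sort_by { |_url, count| -count }.first(limit)
end

top_pages(users, pages).each { |url, count| puts "#{url}\t#{count}" }
```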

Hive

High-level interface created by Facebook

Gives db-like structure to data

HiveQL declarative language for querying

Queries get turned into MapReduce jobs

Command-line interface
CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;

Mahout

Machine-learning libraries for Hadoop
– Collaborative filtering
– Clustering
– Frequent pattern recognition
– Genetic algorithms
Applications
– Product/friend recommendation
– Classify content into defined groups
– Find associations, patterns, behaviors
– Identify important topics in conversations
More stuff

HBase – database based on Google's Bigtable

Sqoop – database import tool

ZooKeeper – coordination service for distributed
apps to keep track of servers, like a filesystem

Avro – data serialization system

Scribe – logging system developed by Facebook

