Unit 5: Data Analytics
Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-V:
BIG DATA TOOLS AND TECHNIQUES: Installing and Running Pig, Comparison with Databases, Pig Latin, User-Defined Functions, Data Processing Operators, Installing and Running Hive, HiveQL, Querying Data, User-Defined Functions, Oracle Big Data.
Example task: find the top 5 most-visited pages by users aged 18-25. Writing this directly in MapReduce takes many lines of Java; in Pig Latin it is nine statements:

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
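To make the semantics of each statement concrete, here is a toy in-memory Python sketch (not Pig itself, and with no distribution) of what the pipeline above computes: filter users aged 18-25, join with page views, count clicks per URL, and keep the top 5.

```python
# In-memory illustration of the Pig Latin pipeline above.
from collections import Counter

def top5_sites(users, pages):
    """users: list of (name, age); pages: list of (user, url)."""
    # filter Users by age >= 18 and age <= 25
    filtered = {name for name, age in users if 18 <= age <= 25}
    # join Filtered by name, Pages by user
    joined = [(user, url) for user, url in pages if user in filtered]
    # group Joined by url, then count clicks per group
    clicks = Counter(url for _, url in joined)
    # order by clicks desc, limit 5
    return clicks.most_common(5)

users = [("alice", 20), ("bob", 30), ("carol", 19)]
pages = [("alice", "a.com"), ("carol", "a.com"),
         ("alice", "b.com"), ("bob", "a.com")]
print(top5_sites(users, pages))  # [('a.com', 2), ('b.com', 1)]
```

Bob's click is dropped at the join because he fails the age filter, which is exactly what the filter-then-join in the Pig script does.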
Ease of Translation

Each logical step maps directly to a Pig Latin statement, and Pig compiles the script into a chain of MapReduce jobs:

Logical step      Pig Latin                MapReduce job
Load users        Users = load …
Filter by age     Fltrd = filter …
Load pages        Pages = load …
Join on name      Joined = join …          Job 1
Group on url      Grouped = group …        Job 2
Count clicks      Summed = … count() …     Job 2
Order by clicks   Sorted = order …         Job 3
Take top 5        Top5 = limit …           Job 3
HBase
Modeled on Google’s Bigtable
HBase - What?
Row/column store
Billions of rows, millions of columns
Column-oriented - nulls are free
Untyped - stores byte[]
HBase - Data Model

Row key      Timestamp   Column family animal:         Column family repairs:
                         animal:type    animal:size    repairs:cost
enclosure1   t2          zebra                         1000 EUR
             t1          lion           big
enclosure2   …           …              …              …
HBase - Data Storage (column family animal:)

(enclosure1, t2, animal:type) → zebra
(enclosure1, t1, animal:size) → big
(enclosure1, t1, animal:type) → lion
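This layout can be illustrated with a minimal Python sketch (a toy model, not the HBase implementation): each cell is an entry in a map keyed by (row, column, timestamp), so a column that is absent for some row simply has no entry, which is why nulls are free.

```python
# Toy model of column-family storage: (row, column, timestamp) -> bytes.
# Mirrors the animal: cells shown above; values are untyped byte[].
store = {
    ("enclosure1", "animal:type", 2): b"zebra",
    ("enclosure1", "animal:size", 1): b"big",
    ("enclosure1", "animal:type", 1): b"lion",
}

def get_row(store, row):
    """Return the latest value per column for one row (like a row read)."""
    latest = {}
    for (r, col, ts), value in store.items():
        if r == row and (col not in latest or ts > latest[col][0]):
            latest[col] = (ts, value)
    return {col: value for col, (ts, value) in latest.items()}

print(get_row(store, "enclosure1"))
# {'animal:type': b'zebra', 'animal:size': b'big'}
```

Note how the newer timestamp t2 shadows the t1 value for animal:type, which matches the versioned-cell behaviour of the data model.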
Retrieve a row:

RowResult = table.getRow("enclosure1");
Using a Hadoop Streaming Mapper Script
SELECT TRANSFORM(page_views.userid, page_views.date)
USING 'map_script.py'
AS dt, uid
CLUSTER BY dt
FROM page_views;
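The contents of map_script.py are not shown in the source; a plausible minimal version is sketched below. Hive feeds each input row to the script as a tab-separated line on stdin, and the script's tab-separated output columns are bound by the AS dt, uid clause (here the script simply swaps the two columns).

```python
#!/usr/bin/env python
# Hypothetical map_script.py for the TRANSFORM query above.
# Reads "userid \t date" lines from stdin, emits "date \t userid"
# so that AS dt, uid binds the output columns.
import sys

def transform(line):
    userid, date = line.rstrip("\n").split("\t")
    return "%s\t%s" % (date, userid)

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```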
Storm

Storm was developed by BackType, which was acquired by Twitter.

There are lots of tools for batch data processing (Hadoop, Pig, HBase, Hive, …), but none of them are realtime systems, and realtime processing is becoming a real requirement for businesses. Storm provides realtime computation that is:
Scalable
Guarantees no data loss
Extremely robust and fault-tolerant
Programming-language agnostic
Before Storm - Adding a worker
Deploy
Reconfigure/Redeploy

Problems
Scaling is painful
Poor fault-tolerance
Coding is tedious

What we want
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher-level abstraction than message passing
"Just works"!
Storm Cluster
Master node (similar to the Hadoop JobTracker)

Spouts: sources of streams
Bolts: process input streams and produce new output streams
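The spout/bolt idea can be illustrated with a toy Python sketch (this is NOT the real Storm API, just the concept): a spout emits a stream of tuples, and each bolt consumes a stream and produces a new one. All names here are illustrative.

```python
# Toy illustration of Storm's abstractions: sentence spout -> split
# bolt -> word-count bolt. Real Storm runs these distributed and in
# realtime; here they are plain generators chained in one process.
from collections import Counter

def sentence_spout():
    # Spout: the source of the stream, emitting raw sentences.
    for sentence in ["the cow jumped", "the moon"]:
        yield sentence

def split_bolt(stream):
    # Bolt: consumes sentences, emits individual words.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: consumes words, keeps running counts.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # 2
```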