Hadoop Week 5


Connect with us

• 24x7 Support on Skype, Email & Phone
• Skype ID – edureka.hadoop
• Email – hadoop@edureka.in
• Call us – +91 88808 62004
• Venkat – venkat@edureka.in
Course Topics

• Week 1 – Introduction to HDFS
• Week 2 – Setting Up Hadoop Cluster
• Week 3 – Map-Reduce Basics, Types and Formats
• Week 4 – PIG
• Week 5 – HIVE
• Week 6 – HBASE
• Week 7 – ZOOKEEPER
• Week 8 – SQOOP
What Is Pig?

Pig is an open-source, high-level dataflow system. It provides a simple language for queries and
data manipulation, Pig Latin, which is compiled into MapReduce jobs that run on Hadoop.

Why Is It Important?
Companies like Yahoo, Google and Microsoft are collecting enormous data sets in the form of click
streams, search logs, and web crawls.
Some form of ad-hoc processing and analysis of all of this information is required.
Why Should I Go For Pig When There Is MR?
What Is Pig Latin?
• High-level language that completely abstracts the Hadoop system from users

• Data-flow language: a script is an explicit, step-by-step series of transformations (unlike declarative SQL)

• Provides common operations like join, group, sort etc.

• Can use existing user code or libraries for complex, non-regular algorithms

• Operates on files in HDFS

• Developed by Yahoo! for internal use, later contributed to the community and made open source

• Yahoo! runs most of its jobs using Pig


Motivations

• MapReduce requires a Java programmer
  – The solution was to abstract it and create a system where users familiar with scripting languages feel at home

• For anything beyond trivial applications, MapReduce requires multiple stages, leading to long development cycles
  – Pig enables rapid prototyping and increased productivity

• In MapReduce, users have to reinvent common functionality (join, filter, etc.)
  – Pig provides these, as the sketch below shows
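To illustrate, the classic word count, which takes a full Java program in raw MapReduce, is a handful of Pig Latin statements (a minimal sketch; 'input.txt' and 'wordcounts' are placeholder paths):

lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words) AS total;
STORE counts INTO 'wordcounts';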
Where Should I Use Pig?

Pig is a data-flow language.

It sits on top of Hadoop and makes it possible to create complex
jobs to process large volumes of data quickly and efficiently.

Case 1 – Time Sensitive Data Loads

Case 2 – Processing Many Data Sources

Case 3 – Analytic Insight Through Sampling


Used For

• Rapid prototyping of algorithms for processing large datasets
• Log analysis
• Ad-hoc queries across various large datasets
• Analytics (including through sampling)
• PigMix provides a set of performance and scalability benchmarks;
  Pig currently runs at roughly 1.1 times the runtime of equivalent raw MapReduce jobs
How it works
Pig Latin Relational Operators
Pig Is Made Up Of Two Components:
Pig Latin Program
Pig and SQL
PIG and SQL Differences

• PIG is procedural; SQL is declarative.

• Fields within a PIG tuple can be multi-valued, while fields within an SQL record must be
  atomic (contain one single value).

• PIG extracts data from its original data sources directly during execution; SQL needs data
  to be physically loaded into the DB tables.
Pig Latin vs. SQL
Running Pig

• Pig resides on the user's machine and can be independent from the
  Hadoop cluster.
• Pig is written in Java and is portable.
  – It compiles scripts into MapReduce jobs and submits them to the cluster (see below).
• No need to install anything extra on the cluster.
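For example, the same script can be tested locally or submitted to the cluster (script.pig is a placeholder name; the -x flag selects the execution mode):

pig -x local script.pig       # run against the local filesystem, no cluster needed
pig -x mapreduce script.pig   # compile to MapReduce jobs and submit to the cluster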

Pig client
Compilation
Configuration

• Download and un-tar the Pig file
  – tar xzf pig-x.y.z.tar.gz

• Configure the Pig paths
  – export PIG_INSTALL=/<path>/pig-x.y.z
  – export PATH=$PATH:$PIG_INSTALL/bin

• Using the pig.properties file
  – fs.default.name=hdfs://localhost/
  – mapred.job.tracker=localhost:8021
Pig - Basic Program Structure

Script:
Pig can run a script file that contains Pig commands.
Example: pig script.pig runs the commands in
the local file script.pig.

Grunt:
Grunt is an interactive shell for running Pig commands.
It is also possible to run Pig scripts from within Grunt
using run and exec (execute).

Embedded:
You can run Pig programs from Java, much like you can
use JDBC to run SQL programs from Java (see the sketch below).
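A minimal sketch of the embedded mode, assuming pig.jar and the Hadoop configuration are on the classpath (file paths are placeholders):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);  // or ExecType.MAPREDUCE
        pig.registerQuery("A = LOAD 'myfile.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);");
        pig.registerQuery("B = FILTER A BY f1 > 0;");
        pig.store("B", "output");                       // runs the compiled job(s)
    }
}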
Data Types

Scalar Types
– int        Signed 32-bit integer
– long       Signed 64-bit integer
– float      32-bit floating point
– double     64-bit floating point
– chararray  Character array (string) in Unicode UTF-8
– bytearray  Byte array (binary object)
Data Types
Complex Types
– Map – associative array
– Tuple – ordered list of data
  • (1234, Jim Huston, 54)

– Bag – unordered collection of tuples
  • {(1234, Jim Huston, 54), (7634, Harry Slater, 41), (4355, Rod Stewart, 43, Architect)}
  • Tuples in a bag aren't required to have the same schema or even
    have the same number of fields
  • Can represent semi-structured or unstructured data (see the schema sketch below)
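Complex types can also be declared in a load schema; a small sketch ('people.txt' and the field names are placeholders):

A = LOAD 'people.txt' AS (name:chararray,
                          props:map[],
                          jobs:bag{t:tuple(employer:chararray, years:int)});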
Load Data

LOAD
– A = LOAD 'myfile.txt';
– A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
– A = LOAD 'myfile.txt' USING PigStorage('\t');
– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

Given a tab-delimited myfile.txt containing:
1   2   3
4   2   1
8   3   4

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
Running Pig

– pig
– grunt> cust = LOAD 'retail/cust' AS (custid:chararray, firstname:chararray,
  lastname:chararray, age:long, profession:chararray);
– grunt> lmt = LIMIT cust 10;
– grunt> DUMP lmt;
– grunt> groupByProfession = GROUP cust BY profession;
– grunt> countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust) AS count;
– grunt> DUMP countByProfession;
– grunt> STORE countByProfession INTO 'output';
Statements, Operation and Commands
• A Pig Latin program is a collection of statements.

• A statement is either an operation or a command
  – Example of an operation: LOAD 'statement.txt';
  – Example of a command: ls *.txt

• Logical plan / physical plan
  – As each statement is processed, it is added to the logical plan
  – When a statement such as DUMP <relation> is reached, the logical plan
    is compiled into a physical plan and executed
Pig Latin - Relational Operators

• Loading and Storing (LOAD, STORE, DUMP)

• Filtering (FILTER, DISTINCT, FOREACH…GENERATE, STREAM, SAMPLE)

• Grouping and Joining (JOIN, COGROUP, GROUP, CROSS)

• Sorting (ORDER, LIMIT)

• Combining and Splitting (UNION, SPLIT)

Pig Latin – Relations and Schemata
• Result of a relational operator is a relation
• A relation is a set of tuples
• Relations can be named using an alias:

  x = LOAD 'sample.txt' AS (id: int, year: int);   -- x is the alias
  DUMP x;
  (1,1987)                                         -- a tuple

• Structure of a relation is a schema:

  DESCRIBE x;
  x: {id: int, year: int}                          -- the schema
Grouping & Aggregation

• Grouping
  – GROUP alias BY field_alias [, field_alias …]
  – Can group by multiple fields
  – The output has a key field named group
  – grunt> groupByProfession = GROUP cust BY profession;

• GENERATE controls the output
  – grunt> countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);

• Built-in functions
  – AVG, CONCAT, COUNT, DIFF, MAX, MIN, SIZE, SUM, TOKENIZE, IsEmpty (AVG is used in the sketch below)
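For instance, continuing the cust example above, the average age per profession can be computed with the built-in AVG (a sketch):

grunt> avgAge = FOREACH groupByProfession GENERATE group, AVG(cust.age);
grunt> DUMP avgAge;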
Filtering and Ordering

• Filtering
  – Selects or removes tuples based on a boolean expression
  – grunt> teenagers = FILTER cust BY age < 20;
  – grunt> topProfessions = FILTER countByProfession BY count > 1000;

• Ordering
  – Sorting, ascending or descending
  – grunt> sorted = ORDER countByProfession BY count DESC;   (or ASC)
  – Can sort by more than one field
FOREACH & DISTINCT

• FOREACH
  – Iterates through a relation, tuple by tuple
  – grunt> countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);

• DISTINCT
  – Selects only distinct tuples
  – Removes duplicate entries
  – grunt> distinctProfession = DISTINCT groupByProfession;
SAMPLE

• Creates a sample of a large data set

• Example: 1% of the total data set
  – A = LOAD 'data' AS (f1:int, f2:int, f3:int);
  – X = SAMPLE A 0.01;
UDFs
• For logic that cannot be expressed in Pig

• Can be used for column transformation, filtering, ordering,
  custom aggregation

• For example, you may want to write custom logic for an interest
  calculation or a penalty calculation:

• grunt> interest = FOREACH cust GENERATE custid, calculateInterest(custAcc);
Types of Pig Latin Diagnostic Operators:

DESCRIBE - Prints a relation's schema.
EXPLAIN - Prints the logical and physical plans.
ILLUSTRATE - Shows a sample execution of the logical plan, using a generated subset of the input.

Types of Pig Latin UDF Statements:

REGISTER - Registers a JAR file with the Pig runtime.

DEFINE - Creates an alias for a UDF, streaming script, or a command specification (see the sketch below).
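A small sketch of the two statements together; the jar name, UDF class and the custAcc field are hypothetical:

REGISTER myudfs.jar;
DEFINE calcInterest com.example.CalculateInterest();
interest = FOREACH cust GENERATE custid, calcInterest(custAcc);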
Pig Latin – File Loaders

BinStorage - "binary" storage, used internally by Pig
PigStorage - loads and stores data delimited by a character such as tab or comma
TextLoader - loads data line by line (delimited by the newline character)
CSVLoader - loads CSV files
XMLLoader - loads XML files
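Usage sketch (paths are placeholders; CSVLoader lives in the piggybank contrib jar, which must be registered first):

REGISTER piggybank.jar;
raw  = LOAD 'logs/raw.txt' USING TextLoader() AS (line:chararray);
tsv  = LOAD 'logs/data.tsv' USING PigStorage('\t');
csv  = LOAD 'logs/data.csv' USING org.apache.pig.piggybank.storage.CSVLoader();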


Pig Latin - GROUP Operator

Example of GROUP Operator:


A = load 'student' as (name:chararray, age:int, gpa:float);
dump A;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)

X = group A by age;
dump X;
(18,{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)})
Pig Latin – COGROUP Operator

Example of COGROUP Operator:


A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)

X = cogroup A by age, B by age;
dump X;

(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})
JOIN Operator

Example of JOIN Operator:


A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)

X = join A by age, B by age;


dump X;

(joe,18,2.5,joe,18,2.5)
COGROUP vs. JOIN
COGROUP takes advantage of nested data structures: it is a combination of GROUP BY and JOIN.
The user can choose to go through with a cross-product to get a join, or perform aggregation directly on the nested bags (see the sketch below).
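For example, continuing the COGROUP relation X from the previous slide, per-key counts can be computed without materializing the join (a sketch):

counts = FOREACH X GENERATE group, COUNT(A) AS n_a, COUNT(B) AS n_b;
dump counts;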
Parsing
Logical Plan

A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';
Pig Latin - Creating UDF
A Program to create UDF:
import java.io.IOException;
import java.util.Map;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

// Returns true when the first field is an empty bag or map; null for empty input
public class IsEmpty extends FilterFunc {

    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            Object values = input.get(0);
            if (values instanceof DataBag)
                return ((DataBag) values).size() == 0;
            else if (values instanceof Map)
                return ((Map) values).size() == 0;
            else {
                throw new IOException("Cannot test a "
                        + DataType.findTypeName(values) + " for emptiness.");
            }
        } catch (ExecException ee) {
            throw WrappedIOException.wrap("Caught exception processing input row ", ee);
        }
    }
}
Pig Latin - Calling A UDF
How to call a UDF?
register myudfs.jar;

A = LOAD 'student' AS (name: chararray, age: int, gpa: float);
B = LOAD 'voter' AS (name: chararray, age: int, registration: chararray, contributions: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE group,
    flatten((IsEmpty(A) ? null : A)),
    flatten((IsEmpty(B) ? null : B));
dump D;
Example Of Data Analysis Task
Find users who tend to visit “good” pages:
Conceptual Data Flow
Pig Latin Script

A script for Pig Latin:

Visits = load '/data/visits/visit.txt' as (user, url, time);
Visits = foreach Visits generate user, url, time;

Pages = load '/data/pages/page.txt' as (url, pagerank);

VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate group as user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;

store GoodUsers into '/data/good_users/pig';


HIVE
Hive Background
What is Hive?
• Data warehousing package built on top of Hadoop
• Used for data analysis
• Targeted towards users comfortable with SQL
• Its query language, HiveQL, is similar to SQL
• For managing and querying structured data
• Abstracts the complexity of Hadoop
• No need to learn Java and the Hadoop APIs
• Developed by Facebook and contributed to the community
• Facebook analyzes several terabytes of data every day using Hive
What is HIVE?
HIVE Architecture
HIVE Architecture (Contd.)

Components Of HIVE Architecture


Hive Components
Metastore
HIVE Service JVM
Type System
Primitive Types (Contd.)
Complex Types
Complex types can be built up from primitive types and other composite types
using the following three operators: Structs, Maps and Arrays.
Abilities of HIVE Query Language

Hive Query Language provides the basic SQL-like operations


Configuring Hive

• Hive automatically stores and manages data for users
  – <install path>/hive/warehouse

• Configure the paths
  – export HIVE_INSTALL=<hive path>
  – export PATH=$PATH:$HIVE_INSTALL/bin

• Metastore options
  – Hive ships with Derby, a lightweight, embedded SQL database
  – Any other database can be configured instead, e.g. MySQL (see the sketch below)
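A sketch of the hive-site.xml entries for an external MySQL metastore (host, database name, user and password are placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>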
Partitions and Buckets

• Partitioning means dividing a table into coarse-grained parts based
  on the value of a partition column, such as a date. This makes it
  faster to run queries on slices of the data (see the sketch below).
• Buckets give extra structure to the data that may be used for more
  efficient queries.
  a) A join of two tables that are bucketed on the same columns,
     including the join column, can be implemented as a map-side join.
  b) Bucketing by user ID means we can quickly evaluate a user-based
     query by running it on a randomized sample of the total set of
     users.
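A brief sketch of a partitioned table and a partition-pruned query (the names are illustrative; a similar page_view table appears in later slides):

CREATE TABLE page_view (userid INT, url STRING)
PARTITIONED BY (dt STRING);

-- only files under the dt='2008-06-08' partition directory are scanned
SELECT url FROM page_view WHERE dt = '2008-06-08';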
Schema on Read

• Hive does not verify the data when it is loaded,
  but rather when a query is issued.
• Schema on read makes for a very fast initial
  load, since the data does not have to be read,
  parsed and serialized to disk in the database's
  internal format. The load operation is just a
  file copy or move.
Hive Data Models
• Databases
– Namespaces

• Tables
– Schemas in namespaces

• Partitions
  – How data is stored in HDFS
  – Grouping data based on some column
  – Can have one or more partition columns

• Buckets or Clusters
  – Partitions divided further into buckets based on some other column
  – Used for data sampling
Creating a table

CREATE TABLE table_name (id INT, name STRING)
COMMENT 'This is an employee table'
PARTITIONED BY (designation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- fields are delimited by the tab character
STORED AS SEQUENCEFILE;                          -- or TEXTFILE

SEQUENCEFILE indicates that the data is stored in a binary format
(Hadoop SequenceFiles) on HDFS. Note that the partition column
(designation) is declared only in the PARTITIONED BY clause and must
not be repeated in the column list.

The table is created in the warehouse directory and is completely managed by Hive.


External tables
• Creates the table at another HDFS location, not in the warehouse directory
• Not managed by Hive

  – CREATE EXTERNAL TABLE external_table (dummy STRING)
  – LOCATION '/user/notroot/external_table';

  (the HDFS location must be specified)

• Hive does not delete the data (the HDFS files) even when the table is dropped.
  It leaves the data untouched; only the metadata about the table is deleted.
Load Data

• Load the data into the table
  – LOAD DATA LOCAL INPATH '/home/ubuntu/notroot/data/txn.csv'
    OVERWRITE INTO TABLE txnrecords;

• Describing the metadata or schema of the table
  – DESCRIBE txnrecords;
Managing Tables
• Loading data
  – LOAD DATA LOCAL INPATH '/<path>/<filename>' INTO TABLE <table_name>;
  – LOAD DATA LOCAL INPATH '/<path>/<filename>' INTO TABLE <table_name>
    PARTITION (designation='developers');   -- only loads a specific partition

• SHOW TABLES;
• SHOW PARTITIONS <table_name>;
• DESCRIBE <table_name>;
Create database and table
• Create a database
  – CREATE DATABASE retail;

• Use the database
  – USE retail;

• Create a table for storing transactional records
  – CREATE TABLE txnrecords (txnno INT, txndate STRING, custno INT,
    amount DOUBLE, category STRING, product STRING, city STRING,
    state STRING, spendby STRING)
  – ROW FORMAT DELIMITED
  – FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Queries

• Select
  – SELECT COUNT(*) FROM txnrecords;

• Aggregation
  – SELECT COUNT(DISTINCT category) FROM txnrecords;

• Grouping
  – SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
    (extended in the sketch below)
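Building on the grouping query, a sketch that orders the categories by total spend and keeps the top ten (the alias total is illustrative):

SELECT category, SUM(amount) AS total
FROM txnrecords
GROUP BY category
ORDER BY total DESC
LIMIT 10;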
Managing Outputs

• Inserting output into another table
  – INSERT OVERWRITE TABLE results
    SELECT * FROM txnrecords;

• Inserting into a local file
  – INSERT OVERWRITE LOCAL DIRECTORY '/tmp/results'
    SELECT * FROM txnrecords;
JOINS

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
Multiple Insert

FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='US')
SELECT * WHERE pvs.country = 'US'

INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='CA')
SELECT * WHERE pvs.country = 'CA';
UNION
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid, av.date AS date
    FROM action_video av
    WHERE av.date = '2008-06-03'
  UNION ALL
    SELECT ac.uid AS uid, ac.date AS date
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
) actions JOIN users u ON (u.id = actions.uid);
SAMPLING
• Allows users to write queries on samples of the data instead of the whole table

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

• Can only be used on a table created with the CLUSTERED BY clause

CREATE TABLE table_name (id INT, name STRING, age INT)
COMMENT 'This is an employee table'
PARTITIONED BY (designation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
CLUSTERED BY (age) INTO 8 BUCKETS
STORED AS SEQUENCEFILE;   -- or TEXTFILE
Managing Tables

• Altering a table
  ALTER TABLE old_table_name RENAME TO new_table_name;

  ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column',
  c2 STRING DEFAULT 'def val');

• Dropping a table or a partition
  DROP TABLE pv_users;
  ALTER TABLE pv_users DROP PARTITION (ds='2008-06-08');
Browsing Tables And Partitions
Loading Data
Clarifications

Q & A?
Thank You
See You in Class Next Week
