Hadoop Week 5
Skype ID – edureka.hadoop
Email – hadoop@edureka.in
Venkat – venkat@edureka.in
Course Topics
– Week 1 – Introduction to HDFS
– Week 2 – Setting Up Hadoop Cluster
– Week 3 – Map-Reduce Basics, Types and Formats
– Week 4 – PIG
– Week 5 – HIVE
– Week 6 – HBASE
– Week 7 – ZOOKEEPER
– Week 8 – SQOOP
What Is Pig?
Pig is an open-source, high-level dataflow system. It provides a simple language for queries and
data manipulation, Pig Latin, which is compiled into MapReduce jobs that run on Hadoop.
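For instance, a short Pig Latin script might look like this (the file, field, and relation names here are illustrative, not from the course material):

```pig
-- Hypothetical tab-separated search log: user, query, hits
raw  = LOAD 'searchlog.txt' USING PigStorage('\t')
           AS (user:chararray, query:chararray, hits:int);
-- Keep only queries that returned at least one result
good = FILTER raw BY hits > 0;
-- Pig compiles these statements into MapReduce jobs
STORE good INTO 'filtered_log';
```

Nothing executes until a STORE (or DUMP) forces the pipeline to run.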
Why Is It Important?
Companies like Yahoo, Google and Microsoft are collecting enormous data sets in the form of click
streams, search logs, and web crawls.
Some form of ad-hoc processing and analysis of all of this information is required.
Why Should I Go For Pig When There Is MR?
What is Pig Latin
• High Level Language that abstracts Hadoop system completely from users
• Can use existing user code or libraries for complex, non-regular algorithms
Pig Latin vs. SQL
– Pig extracts the data directly from its original sources during execution, whereas SQL requires the data to be physically loaded into database tables first.
Running Pig
• Pig resides on the user’s machine and can be independent from the
Hadoop Cluster.
• Pig is written in Java and is portable.
– Compiles Pig Latin into MapReduce jobs and submits them to the cluster.
• No need to install anything extra on the cluster.
[Diagram: the Pig client performs compilation and configuration before submitting jobs to the cluster.]
Script:
Pig can run a script file that contains Pig commands.
Example: pig script.pig runs the commands in
the local file script.pig.
Grunt:
Grunt is an interactive shell for running Pig commands.
It is also possible to run Pig scripts from within Grunt
using run and exec (execute).
Embedded:
Pig programs can also be embedded in and run from Java, much like
JDBC is used to run SQL programs from Java.
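For example, Grunt's run and exec commands can be used like this (the script name is hypothetical):

```pig
-- exec runs the script in a separate context;
-- run executes it as if its lines were typed at the prompt
grunt> exec myscript.pig
grunt> run myscript.pig
```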
Data Types
Scalar Types
– int – signed 32-bit integer
– long – signed 64-bit integer
– float – 32-bit floating point
– double – 64-bit floating point
– chararray – character array (string) in Unicode UTF-8
– bytearray – byte array (binary object)
Data Types
Complex Types
– map – associative array of key/value pairs
– tuple – ordered list of data
• e.g. (1234, Jim Huston, 54)
– bag – unordered collection of tuples
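A sketch of how the complex types appear in a LOAD schema (the file and field names are illustrative):

```pig
-- Hypothetical input with a tuple, a bag, and a map per record
A = LOAD 'complex.txt'
        AS (f1:tuple(a:int, b:int),    -- ordered list of fields
            f2:bag{t:tuple(x:int)},    -- unordered collection of tuples
            f3:map[]);                 -- associative array with chararray keys
```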
LOAD
– A = LOAD 'myfile.txt';
– A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
– A = LOAD 'myfile.txt' USING PigStorage('\t');
– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
If myfile.txt contains (tab-separated):
1 2 3
4 2 1
8 3 4
then DUMP A; outputs:
(1,2,3)
(4,2,1)
(8,3,4)
Running Pig
– $ pig
– grunt> cust = LOAD 'retail/cust' AS (custid:chararray, firstname:chararray,
lastname:chararray, age:long, profession:chararray);
– grunt> lmt = LIMIT cust 10;
– grunt> DUMP lmt;
– grunt> groupByProfession = GROUP cust BY profession;
– grunt> countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
– grunt> DUMP countByProfession;
– grunt> STORE countByProfession INTO 'output';
Statements, Operations and Commands
• A Pig Latin program is a collection of statements.
• Grouping
– GROUP alias BY field_alias [, field_alias ...]
– Can be grouped by multiple fields (wrap them in parentheses)
– The output has a key field named group
– grunt> groupByProfession = GROUP cust BY profession;
• Built in Functions
– AVG, CONCAT, COUNT, DIFF, MAX, MIN, SIZE, SUM, TOKENIZE, IsEmpty
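As a sketch, a few of these built-ins applied to the cust relation used elsewhere in this deck:

```pig
grunt> byProf = GROUP cust BY profession;
-- The built-ins operate on the bag of tuples in each group
grunt> stats  = FOREACH byProf GENERATE group,
                    COUNT(cust), AVG(cust.age), MAX(cust.age);
grunt> DUMP stats;
```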
Filtering and Ordering
• Filtering
– Selects tuples based on a boolean expression; used to select or remove tuples
– grunt> teenagers = FILTER cust BY age < 20;
– grunt> topProfessions = FILTER countByProfession BY count > 1000;
• Ordering
– Sorting – ascending or descending
– grunt> sorted = ORDER countByProfession BY count ASC; (or DESC)
– Can sort by more than one field
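Sorting on more than one field can be sketched as follows (using the cust relation from the earlier example):

```pig
-- Ascending by profession, then descending by age within each profession
grunt> sorted2 = ORDER cust BY profession ASC, age DESC;
grunt> DUMP sorted2;
```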
FOREACH & DISTINCT
• FOREACH
– Iteration through a list
– grunt> countByProfession = FOREACH groupByProfession GENERATE
group, COUNT(cust);
• DISTINCT
– Select only distinct tuples
– Removes duplicate entries
– grunt> distinctProfession = DISTINCT groupByProfession;
SAMPLE
– X = SAMPLE A 0.01; (selects a random sample of roughly 1% of the tuples in A)
UDFs
• For logic that cannot be done in Pig
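UDFs are usually written in Java, packaged in a jar, and registered in the script; the jar and class names below are hypothetical:

```pig
-- Hypothetical jar and UDF class; REGISTER makes the UDF visible to the script
REGISTER myudfs.jar;
B = FOREACH cust GENERATE myudfs.ToUpper(firstname);
```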
Pig Latin – GROUP Operator
(assume A holds (name, age, gpa) tuples such as (joe,18,2.5), (sam,,3.0), (bob,,3.5))
grunt> X = GROUP A BY age;
grunt> DUMP X;
(18,{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)})
Pig Latin – COGROUP Operator
(assume a second relation B holding the same tuples as A)
grunt> Y = COGROUP A BY age, B BY age;
grunt> DUMP Y;
(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})
JOIN Operator
grunt> J = JOIN A BY age, B BY age;
grunt> DUMP J;
(joe,18,2.5,joe,18,2.5)
(null keys never match, so only the age-18 tuples join)
COGROUP vs. JOIN
COGROUP takes advantage of nested data structure (combination of GROUP BY and JOIN).
Users can choose to follow the COGROUP with a cross-product to perform a join, or to aggregate over the nested bags directly.
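The join-via-COGROUP equivalence can be sketched as follows (relations A and B as in the examples above):

```pig
-- Group both relations on the join key
C = COGROUP A BY age, B BY age;
-- Flattening both bags yields the per-key cross-product, i.e. the join
J = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
```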
Parsing and the Logical Plan
– A submitted Pig Latin script is first parsed, then converted into a logical plan, which is compiled into MapReduce jobs.
Hive Setup
• Configure path
– HIVE_INSTALL=<hivepath>
– PATH=$PATH:$HIVE_INSTALL/bin
• Metastore options
– Hive ships with Derby, a lightweight embedded SQL database
– Any other database, e.g. MySQL, can be configured instead
Partitions and Buckets
• Tables
– Schemas in namespaces
• Partitions
– How data is stored in HDFS
– Grouping data based on some column
– Can have one or more columns
• Buckets or Clusters
– Partitions divided further into buckets based on some other column
– Used for data sampling
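A HiveQL sketch of a partitioned, bucketed table (the table and column names are illustrative):

```sql
-- Hypothetical table: one partition per day, rows bucketed by user id
CREATE TABLE page_views (userid BIGINT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;
```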
Creating a table
• For EXTERNAL tables, Hive does not delete the underlying HDFS files when the table is
dropped; it leaves the data untouched and removes only the metadata. (Dropping a
managed table deletes the data as well.)
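An external table can be sketched as follows (the path and columns are illustrative); DROP TABLE then removes only the metadata:

```sql
-- Hypothetical external table over comma-separated files
CREATE EXTERNAL TABLE txns (txnid STRING, amount DOUBLE, category STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/retail/txns';  -- files here survive a DROP TABLE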
Load Data
• SHOW TABLES;
• SHOW PARTITIONS <table_name>;
• DESCRIBE <table_name>;
Create database and table
• Create database
– CREATE DATABASE retail;
• Use database
– USE retail;
• Select
– SELECT COUNT(*) FROM txnrecords;
• Aggregation
– SELECT COUNT(DISTINCT category) FROM txnrecords;
• Grouping
– SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
Managing Outputs
• Altering a table
ALTER TABLE old_table_name RENAME TO new_table_name;
Q & A..?
Thank You
See You in Class Next Week