
Pig and Pig Latin
Motivation

You're a procedural programmer. You have huge data. You want to analyze it.
Motivation

➢ As a procedural programmer…
▪ You may find writing queries in SQL unnatural and too restrictive
▪ You are more comfortable writing code: a series of statements rather than one long query (part of why MapReduce has been so successful)
Motivation
Data analysis goals
➢ Quick - exploit the parallel processing power of a distributed system
➢ Easy - be able to write a program or query without a huge learning curve; have some common analysis tasks predefined
➢ Flexible - transform a data set into a workable structure without much overhead; perform customized processing
➢ Transparent - have a say in how the data processing is executed on the system
Motivation
➢ Relational Distributed Databases
▪ Parallel database products are expensive
▪ Rigid schemas
▪ Processing requires constructing declarative SQL queries
➢ Map-Reduce
▪ Relies on custom code for even common operations
▪ Requires workarounds for tasks whose data flow differs from the expected Map - Combine - Reduce
Motivation
➢ Sweet spot between Relational Distributed Databases and Map-Reduce: take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming… Pig Latin!
Pig Latin Example
Table urls: (url,category, pagerank)

For each sufficiently large category, find the average pagerank of high-pagerank urls in that category

SQL:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls)>10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Big Picture
[Diagram] A Pig Latin script, together with user-defined functions, goes into Pig, which compiles and optimizes it into Map-Reduce statements; the resulting jobs read the data and write the results.
Data Model
 Atom - simple atomic value (e.g., a number or string)
 Tuple - sequence of fields; each field can be of any type
 Bag - collection of tuples
 Duplicates possible
 Tuples in a bag can have different numbers and types of fields
 Map - collection of key-value pairs
 Key is an atom; the value can be of any type
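A single data item can combine all four kinds of values. For instance (a minimal sketch; the field values are purely illustrative):

('alice', {('lakers', 1), ('iPod', 2)}, ['age' # 20])

Here 'alice' is an atom, the second field is a bag of two tuples, and the third field is a map from the atom 'age' to the atom 20.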
Data Model
 Control over dataflow
Ex 1 (less efficient)
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Ex 2 (more efficient)
highpgr_urls = FILTER urls BY pagerank > 0.8;
spam_urls = FILTER highpgr_urls BY isSpam(url);
 Fully nested
More natural for procedural programmers (target user) than
normalization
 Data is often stored on disk in a nested fashion
 Facilitates ease of writing user-defined functions

 No schema required
Data Model

 User-Defined Functions (UDFs)


 Ex: spam_urls = FILTER urls BY isSpam(url);
 Can be used in many Pig Latin statements
 Useful for custom processing tasks

 Can use non-atomic values for input and output


 Currently must be written in Java

Speaking Pig Latin
 LOAD
 Input is assumed to be a bag (sequence of tuples)
 Can specify a deserializer with USING
 Can provide a schema with AS

newBag = LOAD ‘filename’


<USING functionName() >
<AS (fieldName1, fieldName2,…)>;

Queries = LOAD ‘query_log.txt’


USING myLoad()
AS (userID,queryString, timeStamp)

Speaking Pig Latin
 FOREACH
 Apply some processing to each tuple in a bag
 Each field can be:
 A field name of the bag
 A constant
 A simple expression (e.g., f1+f2)
 A predefined function (e.g., SUM, AVG, COUNT, FLATTEN)
 A UDF (e.g., sumTaxes(gst, pst))

newBag =
FOREACH bagName
GENERATE field1, field2, …;
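For example, reusing the Queries bag loaded earlier (a sketch; expandQuery is a hypothetical UDF that turns a query string into related queries):

expanded_queries = FOREACH Queries GENERATE userID, expandQuery(queryString);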

Speaking Pig Latin
 FILTER
 Select a subset of the tuples in a bag
newBag = FILTER bagName
BY expression;
 Expressions use simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT)
some_apples =
FILTER apples BY colour != ‘red’;
 Can use UDFs
some_apples =
FILTER apples BY NOT isRed(colour);

Speaking Pig Latin
 COGROUP
 Group two datasets together by a common attribute
 Groups data into nested bags
grouped_data = COGROUP results BY queryString,
revenue BY queryString;

Speaking Pig Latin
 Why COGROUP and not JOIN?
url_revenues =
FOREACH grouped_data GENERATE
FLATTEN(distributeRev(results, revenue));

Speaking Pig Latin
 Why COGROUP and not JOIN?
 May want to process nested bags of tuples before taking the
cross product.
 Keeps to the goal of a single high-level data transformation per Pig Latin statement.
 However, JOIN keyword is still available:
JOIN results BY queryString,
revenue BY queryString;
Equivalent
temp = COGROUP results BY queryString,
revenue BY queryString;
join_result = FOREACH temp GENERATE
FLATTEN(results), FLATTEN(revenue);

Speaking Pig Latin
 STORE (& DUMP)
 Output data to a file (or screen)

STORE bagName INTO ‘filename’


<USING serializer()>;

 Other Commands (incomplete)


 UNION - return the union of two or more bags
 CROSS - take the cross product of two or more bags
 ORDER - order tuples by a specified field(s)
 DISTINCT - eliminate duplicate tuples in a bag
 LIMIT - Limit results to a subset
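A brief sketch of a few of these, reusing the good_urls bag from the earlier example (the field name pagerank comes from that example; the limit of 10 is arbitrary):

distinct_urls = DISTINCT good_urls;
ordered_urls = ORDER distinct_urls BY pagerank DESC;
top_urls = LIMIT ordered_urls 10;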

Compilation
 The Pig system does two tasks:
 Builds a Logical Plan from a Pig Latin script
 Supports execution platform independence
 No processing of data is performed at this stage
 Compiles the Logical Plan to a Physical Plan and executes it
 Converts the Logical Plan into a series of Map-Reduce jobs to be executed (in this case) by Hadoop Map-Reduce
Compilation
 Building a Logical Plan
 Verify that the input files and bags referred to are valid
 Create a logical plan for each bag (variable) defined
Compilation
 Building a Logical Plan Example
A = LOAD 'user.dat' AS (name, age, city);
B = GROUP A BY city;
C = FOREACH B GENERATE group AS city, COUNT(A);
D = FILTER C BY city IS 'kitchener' OR city IS 'waterloo';
STORE D INTO 'local_user_count.dat';

[Diagram] The logical plan grows one operator per statement: Load(user.dat) -> Group -> Foreach -> Filter; the Filter (on the group key city) is then pushed ahead of the Group, giving Load(user.dat) -> Filter -> Group -> Foreach.
Compilation
 Building a Physical Plan
 Only happens when output is requested by STORE or DUMP
[Diagram] Load(user.dat) -> Filter -> Group -> Foreach
Compilation
 Building a Physical Plan
 Step 1: Create a map-reduce job for each COGROUP (the GROUP in this example)
[Diagram] Map and Reduce boundaries are placed around the Group operator: Load(user.dat) -> Filter -> (Map | Group | Reduce) -> Foreach
Compilation
 Building a Physical Plan
 Step 2: Push the other commands into the map and reduce functions where possible
 Certain commands may still require their own map-reduce job (e.g., ORDER needs a separate map-reduce job)
[Diagram] Load(user.dat) and Filter are pushed into the Map function; the Foreach following the Group is pushed into the Reduce function
Compilation
 Efficiency in Execution

 Parallelism

 Loading data - Files are loaded from HDFS

 Statements are compiled into map-reduce jobs

Compilation
 Efficiency with Nested Bags
 In many cases, the nested bags created in each tuple of a COGROUP statement never need to physically materialize
 Aggregation is generally performed after a COGROUP, and the statements for that aggregation are pushed into the reduce function
 This applies to algebraic functions (e.g., COUNT, MAX, MIN, SUM, AVG)
Compilation
 Efficiency with Nested Bags
[Diagram] In the compiled plan, Load(user.dat) and Filter run in the Map function; the Group spans the Map/Reduce boundary; the Foreach (containing COUNT) is split so that partial aggregation runs in a Combine step and the final aggregation runs in the Reduce step
Compilation
 Efficiency with Nested Bags
 Why this works:
 COUNT is an algebraic function; it can be structured as a tree of sub-functions, with each leaf working on a subset of the data
[Diagram] Combine: COUNT over each subset of tuples; Reduce: SUM of the partial counts
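As a small worked example (the numbers are illustrative): if a group of 10 tuples is split across three map tasks, the combiners emit partial counts of 4, 3, and 3, and the reducer computes SUM(4, 3, 3) = 10 - the same answer COUNT would give over the whole bag, without ever materializing it.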
Compilation
 Efficiency with Nested Bags
 Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well
 Inefficiencies
 Non-algebraic aggregate functions (e.g., MEDIAN) need the entire bag to materialize; this may cause a very large bag to spill to disk if it doesn't fit in memory
 Every map-reduce job requires data to be written and replicated to HDFS (although this is offset by the parallelism achieved)
Debugging
 How do you verify the semantics of an analysis program?
 Run the program against the whole data set - might take hours!
 Generate a sample dataset
 A random sample may produce empty results for operations like join and filter
 In general, testing with a sample dataset is difficult
 Pig-Pen
 Samples data from the large dataset for the Pig Latin statements
 Applies individual Pig Latin commands against the sample dataset
 In case of an empty result, the Pig system resamples
 Removes redundant samples
Debugging
 Pig-Pen [screenshot]
Debugging
 Pig-Latin command window and command generator [screenshot]
Debugging
 Sandbox dataset (generated automatically!) [screenshot]
Debugging
 Pig-Pen
 Provides sample data that is:
 Real - taken from the actual data
 Concise - as small as possible
 Complete - collectively illustrates the key semantics of each command
 Helps with schema definition
 Facilitates incremental program writing
Conclusion
 Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.
 Pig Latin offers high-level data manipulation in a procedural style.
 Pig-Pen is a debugging environment for Pig Latin commands that generates samples from real data.
More Info
 Pig, http://hadoop.apache.org/pig/
 Hadoop, http://hadoop.apache.org

Anks-Thay!
HIVE – A WAREHOUSING SOLUTION OVER A MAP-REDUCE FRAMEWORK
Agenda

 Why Hive???
 What is Hive?
 Hive Data Model
 Hive Architecture
 HiveQL
 Hive SerDe’s
 Pros and Cons
 Hive v/s Pig
 Graphs
Data Analysts with Hadoop
 Extra overhead
Challenges that data analysts faced:
 Data explosion - TBs of data generated every day
Solution - HDFS to store the data and the Hadoop Map-Reduce framework to parallelize its processing
What is the catch?
- Hadoop Map-Reduce is Java intensive
- Thinking in the Map-Reduce paradigm can get tricky
… Enter Hive!
Hive Key Principles
 HiveQL to MapReduce
[Diagram] A data analyst submits "SELECT COUNT(1) FROM Sales;" to the Hive framework; Hive compiles it into a Map-Reduce job instance over the Sales Hive table, with map tasks emitting (rowcount, 1) pairs and the reduce step producing the total (rowcount, N).
Hive Data Model

Data in Hive is organized into:


 Tables
 Partitions
 Buckets
Hive Data Model Contd.

Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data serialized and stored as files within that directory
- Hive has default serialization built in which supports
compression and lazy deserialization
- Users can specify custom serialization/deserialization schemes (SerDes)
Hive Data Model Contd.
Partitions
- Each table can be broken into partitions
- Partitions determine distribution of data within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
So each partition will be split out into different folders like
Sales/country=US/year=2012/month=12
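A brief sketch of how the partition columns are then used (the staging_sales table is assumed purely for illustration):

-- write one partition explicitly
INSERT OVERWRITE TABLE Sales PARTITION (country='US', year=2012, month=12)
SELECT sale_id, amount FROM staging_sales;

-- a query that filters on the partition columns only reads the matching subdirectory
SELECT SUM(amount) FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;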
Hierarchy of Hive Partitions
[Diagram] Partition subdirectories nest in the declared order (country, then year, then month), with the data files at the leaves, e.g.:
/hivebase/Sales/country=US/year=2012/month=12/<files>
/hivebase/Sales/country=US/year=2012/month=11/<files>
/hivebase/Sales/country=CANADA/year=2015/month=11/<files>
Hive Data Model Contd.

 Buckets
- Data in each partition divided into buckets
- Based on a hash function of the column
- H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in partition directory
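For example, a bucketed version of the Sales table might be declared as follows (a sketch; the bucketing column and the bucket count are illustrative):

CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;
-- each row lands in bucket H(sale_id) mod 32 within its partition directory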
Architecture
 External Interfaces - CLI, WebUI, JDBC, ODBC programming interfaces
 Thrift Server - cross-language service framework
 Metastore - metadata about the Hive tables and partitions
 Driver - the brain of Hive! Compiler, Optimizer and Execution engine
Hive Thrift Server

• Framework for cross language services


• Server written in Java
• Support for clients written in different languages
- JDBC (Java), ODBC (C++), PHP, Perl, Python scripts
Metastore

• System catalog which contains metadata about the Hive tables


• Stored in an RDBMS or the local file system; HDFS is too slow (not optimized for random access)
• Objects of Metastore
➢ Database - namespace of tables
➢ Table - list of columns, types, owner, storage, SerDes
➢ Partition - partition-specific columns, SerDes and storage
Hive Driver

• Driver - Maintains the lifecycle of HiveQL statement


• Query Compiler - compiles HiveQL into a DAG of map-reduce tasks
• Executor - Executes the tasks plan generated by the compiler in proper
dependency order. Interacts with the underlying Hadoop instance
Compiler
 Converts the HiveQL into a plan for execution
 Plans can contain:
- Metadata operations for DDL statements, e.g. CREATE
- HDFS operations, e.g. LOAD
 Semantic Analyzer - checks schema information, performs type checking and implicit type conversion, verifies columns
 Optimizer - finds the best logical plan, e.g. combines multiple joins in a way that reduces the number of map-reduce jobs, prunes columns early to minimize data transfer
 Physical plan generator - creates the DAG of map-reduce jobs
HiveQL
DDL:
CREATE DATABASE
CREATE TABLE
ALTER TABLE
SHOW TABLES
DESCRIBE
DML:
LOAD DATA
INSERT
QUERY:
SELECT
GROUP BY
JOIN
MULTI TABLE INSERT
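A short sketch tying a few of these statements together (the tables, columns, and input path are illustrative; it assumes a users table with user_id and city columns):

CREATE TABLE page_views (user_id INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/logs/page_views.tsv' INTO TABLE page_views;

SELECT u.city, COUNT(1)
FROM page_views pv JOIN users u ON (pv.user_id = u.user_id)
GROUP BY u.city;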
Hive SerDe
 SELECT Query
➢ Hive has built-in SerDes: Avro, ORC, Regex, etc.
➢ Custom SerDes can be used (e.g. for unstructured data like audio/video, or semi-structured XML data)
[Diagram] A Record Reader reads records from the Hive table; the SerDe deserializes each record into a Hive Row Object; an Object Inspector maps the fields for the end-user object.
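For instance, a table can be bound to a built-in SerDe at creation time (a sketch using the Regex SerDe; the table, columns, and regular expression are illustrative, and the SerDe class name assumes the version bundled with Hive):

CREATE TABLE access_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;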
Good Things

 Boon for Data Analysts


 Easy Learning curve
 Completely transparent to underlying Map-Reduce
 Partitions (speed!)
 Flexibility to load data from localFS/HDFS into Hive Tables
Cons and Possible Improvements

 Extend SQL query support (updates, deletes)
 Parallelize firing independent jobs from the work DAG
 Table statistics in the Metastore
 Explore methods for multi-query optimization
 Perform N-way generic joins in a single map-reduce job
 Better debug support in the shell
Hive v/s Pig

Similarities:
➢ Both are high-level languages that work on top of the map-reduce framework
➢ They can coexist since both use the underlying HDFS and map-reduce

Differences:
◆ Language
➢ Pig is procedural (A = load 'mydata'; dump A)
➢ Hive is declarative (select * from A)

◆ Work Type
➢ Pig is more suited for ad hoc analysis (e.g. on-demand analysis of click-stream search logs)
➢ Hive is more of a reporting tool (e.g. weekly BI reporting)
Hive v/s Pig

Differences:

◆ Users
➢ Pig – Researchers, Programmers (build complex data pipelines,
machine learning)
➢ Hive – Business Analysts
◆ Integration
➢ Pig - doesn't have a Thrift server (i.e. no/limited cross-language support)
➢ Hive - Thrift server

◆ User's need
➢ Pig - better dev environments and debuggers expected
➢ Hive - better integration with technologies expected (e.g. JDBC, ODBC)
Head-to-Head
(the bee, the pig, the elephant)

Version: Hadoop – 0.18x, Pig:786346, Hive:786346


Sqoop
❖ Sqoop is an open-source framework provided by Apache.
❖ It is a command-line interface application for transferring data between relational databases and Hadoop.
❖ It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export from the Hadoop file system to relational databases.
❖ When Big Data stores and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with the relational database servers for importing and exporting the Big Data residing in them. Sqoop occupies this place in the Hadoop ecosystem, providing feasible interaction between relational database servers and Hadoop's HDFS.
❖ Sqoop - "SQL to Hadoop and Hadoop to SQL"
Sqoop
Sqoop ("SQL-to-Hadoop") is a straightforward command-line tool. It offers the following capabilities:
1. Imports individual tables or entire databases to files in HDFS
2. Generates Java classes that allow you to interact with your imported data
3. Offers the ability to import from SQL databases straight into your Hive data warehouse
Sqoop
 It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import.
 Using Sqoop, data can be moved into HDFS/Hive/HBase from MySQL/PostgreSQL/Oracle/SQL Server/DB2, and vice versa.
Sqoop Working
Step 1: Sqoop sends a request to the relational DB to return the metadata information about the table (metadata here is the data about the table in the relational DB).
Step 2: From the received information it generates Java classes (this is why you should have Java configured before getting it working - internally Sqoop uses the JDBC API to generate the data access code).
Step 3: Sqoop (as it is written in Java) packages the compiled classes so it can work with the table structure; after compiling, it creates a jar file (the Java packaging standard).
Key Features of Sqoop
There are many salient features of Sqoop, which give us several reasons to learn it.
a. Parallel import/export
When it comes to importing and exporting data, Sqoop uses the YARN framework, which offers fault tolerance on top of parallelism.
b. Connectors for all major RDBMS databases
Sqoop offers connectors for multiple RDBMS databases, covering almost the entire circumference.
c. Import results of SQL query
d. Full load
One of the important features of Sqoop: we can load a whole table with a single command, and we can also load all the tables from a database with a single command.
f. Kerberos security integration
Sqoop supports Kerberos authentication. Kerberos is a computer network authentication protocol that works on the basis of 'tickets' to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
g. Load data directly into Hive/HBase
For analysis, we can load data directly into Apache Hive. We can also dump the data into HBase, which is a NoSQL database.
h. Compression
We can compress the data by using the deflate (gzip) algorithm with the --compress argument, or by specifying the --compression-codec argument.
Sqoop commands
codegen - Generate code to interact with database records
create-hive-table - Import a table definition into Hive
eval - Evaluate a SQL statement and display the results
export - Export an HDFS directory to a database table
help - List available commands
import - Import a table from a database to HDFS
import-all-tables - Import tables from a database to HDFS
list-databases - List available databases on a server
list-tables - List available tables in a database
version - Display version information
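For example, a typical import invocation might look like this (a sketch; the connection string, credentials, table, and target directory are all illustrative):

sqoop import \
  --connect jdbc:mysql://dbhost/sales_db \
  --username analyst -P \
  --table Sales \
  --target-dir /user/hadoop/sales \
  --num-mappers 4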


Flume
➢ Flume is a standard, simple, robust, flexible, and extensible tool for data ingestion from various data producers (web servers) into Hadoop.
➢ Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
➢ Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Data Flow using Flume
Apache Flume - Data Flow
❑ Flume is a framework which is used to move log data into HDFS. Generally, events and log data are generated by log servers, and these servers have Flume agents running on them. These agents receive the data from the data generators.
❑ The data in these agents is collected by an intermediate node known as a Collector. Just like agents, there can be multiple collectors in Flume.
❑ Finally, the data from all these collectors is aggregated and pushed to a centralized store such as HBase or HDFS. [Diagram: data flow in Flume]
Data Flow using Flume
Multi-hop Flow
 Within Flume, there can be multiple agents and before reaching the final destination, an event may
travel through more than one agent. This is known as multi-hop flow.
Fan-out Flow
 The dataflow from one source to multiple channels is known as fan-out flow. It is of two types −
➢ Replicating − The data flow where the data will be replicated in all the configured channels.
➢ Multiplexing − The data flow where the data will be sent to a selected channel which is mentioned
in the header of the event.
Fan-in Flow
 The data flow in which the data will be transferred from many sources to one channel is known as fan-in flow.
Failure Handling
 In Flume, for each event, two transactions take place: one at the sender and one at the receiver. The
sender sends events to the receiver. Soon after receiving the data, the receiver commits its own
transaction and sends a “received” signal to the sender. After receiving the signal, the sender commits
its transaction. (Sender will not commit its transaction till it receives a signal from the receiver.)
Advantages of FLUME

➢ Using Apache Flume we can store the data in to any of the centralized stores
(HBase, HDFS).
➢ When the rate of incoming data exceeds the rate at which data can be written to
the destination, Flume acts as a mediator between data producers and the
centralized stores and provides a steady flow of data between them.
➢ Flume provides the feature of contextual routing.
➢ The transactions in Flume are channel-based where two transactions (one sender
and one receiver) are maintained for each message. It guarantees reliable message
delivery.
➢ Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of FLUME
➢ Flume ingests log data from multiple web servers into a centralized store
(HDFS, HBase) efficiently.
➢ Using Flume, we can get the data from multiple servers immediately into
Hadoop.
➢ Along with the log files, Flume is also used to import huge volumes of
event data produced by social networking sites like Facebook and Twitter,
and e-commerce websites like Amazon and Flipkart.
➢ Flume supports a large set of sources and destinations types.
➢ Flume supports multi-hop flows, fan-in fan-out flows, contextual routing,
etc.
➢ Flume can be scaled horizontally.
Flume Architecture
The Flume agent has 3 components:
~ Source: It accepts the data from the incoming streamline and stores the data in the channel.
~ Channel: In general, the reading speed is faster than the writing speed, so we need some buffer to match the read and write speed difference. This buffer acts as an intermediary storage that temporarily holds the data being transferred and therefore prevents data loss. The channel is this local or temporary storage between the source of the data and the persistent data in HDFS.
~ Sink: The last component, the Sink, collects the data from the channel and commits or writes the data to HDFS permanently.
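As an illustration, a single agent wiring these three components together might be configured like this (a sketch; the agent, source, channel, and sink names, the tailed log file, and the HDFS path are all assumptions):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# source: tail a web server log
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/httpd/access_log
agent1.sources.src1.channels = ch1

# channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# sink: write events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/weblogs
agent1.sinks.sink1.channel = ch1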
REFERENCES

 https://hive.apache.org/
 https://cwiki.apache.org/confluence/display/Hive/Presentations
 https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html
 http://www.qubole.com/blog/big-data/hive-best-practices/
 Hortonworks tutorials (youtube)
 Graph: https://issues.apache.org/jira/secure/attachment/12411185/hive_benchmark_2009-06-18.pdf
Thanks
