Apache Sqoop and Flume

Apache Sqoop and Flume

Dr. S.K. Sudarsanam

VIT Business School


VIT University
This session covers

a) Introduction to Sqoop
b) Sqoop Export
c) Sqoop Import
d) Flume
e) Components of Flume
f) Configure Flume to ingest web log data into HDFS from a local directory
Introduction to Sqoop
Flume and Sqoop are the technologies that enable the transfer of data into and out of Hadoop.
• Sqoop – acronym for SQL-to-Hadoop
• An import/export framework for moving data between Hadoop and relational databases or data warehouses
• Sqoop connectors are available for many DW platforms
Sqoop Export
(Data transfer from HDFS to MySQL)
• Create a database named 'test' using the command 'create database test'
• Check that the database was created using the command 'show databases'
• Select the required database using the command 'use test'
• Create a table using the command 'create table emp'
• Create a data file named 'sqoop-data.txt' in the local file system
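A minimal sketch of the MySQL setup above, assuming the same column layout as the emp table defined later on the import slide; the two data rows are dummy values for illustration only.

-- run inside the mysql client
create database test;
show databases;
use test;
create table emp (emp_id int, emp_name varchar(30), emp_add varchar(30));

-- /usr/local/sqoop-data.txt then holds tab-separated dummy rows matching this schema, e.g.
-- 101<TAB>Anil<TAB>Chennai
-- 102<TAB>Kavya<TAB>Mumbai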
Sqoop Export
(Data transfer from HDFS to MySQL)
• The sqoop-data.txt file follows the same schema as the 'emp' table
• Insert rows into the txt file as tab-separated data
• Push the file 'sqoop-data.txt' into HDFS using the command below
• hadoop fs -put /usr/local/sqoop-data.txt /data/sqoop-data.txt
• Export all rows of 'sqoop-data.txt' into the MySQL table 'emp' with the command below
• sqoop export --connect jdbc:mysql://localhost:3306/test --table emp --export-dir /data/sqoop-data.txt -m 1 --input-fields-terminated-by '\t'
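Spelled out with line breaks, a hedged version of the export step; the username and password options are assumptions (they are not shown on the slide, but Sqoop normally needs database credentials), and the database name follows the 'test' database created earlier.

# copy the local data file into HDFS
hadoop fs -put /usr/local/sqoop-data.txt /data/sqoop-data.txt
# export every row of the HDFS file into the MySQL table 'emp'
sqoop export \
  --connect jdbc:mysql://localhost:3306/test \
  --username root \
  --password admin \
  --table emp \
  --export-dir /data/sqoop-data.txt \
  --input-fields-terminated-by '\t' \
  -m 1
# verify inside MySQL with: select * from emp;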
Sqoop Import
(Importing a fresh table from MySQL into Hive)
• Create a table named emp in MySQL:
create table emp (emp_id int, emp_name varchar(30), emp_add varchar(30));
• Insert dummy data into table emp
• Create an external table in Hive
• Run the command below to import the data into Hive
• sqoop import --connect jdbc:mysql://localhost:3306/sqoopTest --username root --password admin --table emp --split-by emp_id -m 1 --target-dir /data/sqoopTest
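The Hive DDL itself is not shown on the slide; a minimal sketch of an external table over the import directory, assuming the comma-delimited text files that Sqoop writes by default.

-- run in the Hive shell; the location matches --target-dir from the import command
create external table emp (emp_id int, emp_name string, emp_add string)
row format delimited fields terminated by ','
location '/data/sqoopTest';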
Flume
• Used to ingest streaming data (e.g. log files) into HDFS
• Example: an e-commerce web application wants to analyse customer behaviour in a region
• The web log data must be moved into Hadoop HDFS for analysis
• The company can then track what customers search for, e.g. mobiles in the Electronics segment or sports shoes in the Sports segment
• Flume moves the log data generated by the e-commerce web servers into HDFS at high speed
Components of Flume
• Event – a single unit of data (e.g. one log record) carried through the Flume pipeline
• Source – receives events from the data generator (e.g. web server logs) and hands them to one or more channels
• Channel – a buffer that holds events between the source and the sink (e.g. a memory or file channel)
• Sink – takes events from the channel and writes them to the destination, such as HDFS
• Agent – the JVM process that hosts the sources, channels and sinks
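A hedged sketch of the configuration promised in agenda item (f): a single agent watches a local directory of web logs and writes the events into HDFS. The agent name, local directory and HDFS path are illustrative assumptions.

# weblog-agent.conf
agent1.sources  = weblog-source
agent1.channels = mem-channel
agent1.sinks    = hdfs-sink

# spooling-directory source: picks up files dropped into a local directory
agent1.sources.weblog-source.type     = spooldir
agent1.sources.weblog-source.spoolDir = /var/log/weblogs
agent1.sources.weblog-source.channels = mem-channel

# in-memory channel buffering events between source and sink
agent1.channels.mem-channel.type     = memory
agent1.channels.mem-channel.capacity = 10000

# HDFS sink: writes the events into HDFS as plain text
agent1.sinks.hdfs-sink.type          = hdfs
agent1.sinks.hdfs-sink.channel       = mem-channel
agent1.sinks.hdfs-sink.hdfs.path     = /data/weblogs
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream

The agent is then started with:
flume-ng agent --conf conf --conf-file weblog-agent.conf --name agent1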
Apache Pig vs MapReduce

Apache Pig | MapReduce
Data flow language | Data processing paradigm
High-level language | Low level and rigid
Performing a join operation is easier | Difficult to perform a join between datasets
A programmer with basic SQL knowledge can use it easily | Programming in Java or Python is required
The multi-query approach reduces the lines of code | Needs many more lines of code to perform a similar operation
Compilation is not required; on execution the operations are converted into map-reduce jobs | Has a long compilation process
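To illustrate the join row in the table above, a hedged Pig Latin sketch; the file names and schemas are assumptions.

-- load two assumed datasets and join them on a shared key
emp  = LOAD '/pig_data/emp.txt'  USING PigStorage(',') AS (emp_id:int, emp_name:chararray);
dept = LOAD '/pig_data/dept.txt' USING PigStorage(',') AS (emp_id:int, dept_name:chararray);
joined = JOIN emp BY emp_id, dept BY emp_id;
DUMP joined;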
Apache Pig vs SQL

Apache Pig | SQL
Procedural language | Declarative language
Schema is optional | Schema is mandatory
Data model is nested relational | Data model is flat relational
Limited opportunity for query optimization | Plenty of opportunity for query optimization
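To illustrate the procedural/declarative row above, a hedged sketch: Pig expresses the query as a sequence of named steps, where SQL would state the result in a single declarative statement (file name and schema are assumptions).

-- SQL equivalent in one statement: SELECT city, COUNT(*) FROM student GROUP BY city;
student = LOAD '/pig_data/student.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
by_city = GROUP student BY city;
counts  = FOREACH by_city GENERATE group AS city, COUNT(student) AS n;
DUMP counts;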
Additional Features of Pig Latin

• Allows splits in the pipeline (see the sketch after this list).
• Allows developers to store data anywhere in the pipeline.
• Declares execution plans.
• Provides operators to perform ETL (Extract, Transform, and Load) functions.
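A minimal Pig Latin sketch of the first two features, splitting a pipeline into two branches and storing an intermediate result; the relation names, paths and age threshold are assumptions.

-- split one relation into two branches and store an intermediate result mid-pipeline
student = LOAD '/pig_data/student.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);
SPLIT student INTO minors IF age < 18, adults IF age >= 18;
STORE minors INTO '/pig_data/minors_out' USING PigStorage(',');
adults_named = FOREACH adults GENERATE name;
DUMP adults_named;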
Pig Architecture

• Script Parser: Pig scripts are handled by the script parser. The parser performs syntax checking, type checking and other tasks. The output of the parser is a DAG (Directed Acyclic Graph) that represents the Pig Latin statements and logical operators. In the DAG, the logical operations are represented as nodes and the data flows as edges.
• Pig Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out optimizations such as projection and pushdown.
• Pig Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
• Execution Engine: The MapReduce jobs are submitted through YARN containers to the Hadoop engine in sorted order. Finally, these MapReduce jobs are executed on Hadoop, producing the desired results.
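These stages can be inspected with Pig's explain operator, which prints the logical, physical and MapReduce plans for a relation; a minimal Grunt-shell sketch with an assumed relation.

grunt> emp = LOAD '/pig_data/emp.txt' USING PigStorage(',') AS (emp_id:int, emp_name:chararray);
grunt> explain emp;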
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping.
Apache Pig is used
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.
You can run a Pig script from the Grunt shell using the run command.
Given below is the syntax of the run command.
grunt> run [-param param_name = param_value] [-param_file file_name] script
Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.
student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
And assume we have a script file named sample_script.pig in the local file system with the following content.
sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
USING PigStorage(',') as (id:int, name:chararray, city:chararray);
Now, let us run the above script from the Grunt shell using the run command as shown below.
grunt> run /sample_script.pig
You can see the output of the script using the Dump operator as shown below.
grunt> Dump student;
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
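The same script can also be executed in batch mode from the command line instead of the Grunt shell; a minimal sketch, assuming the pig launcher is on the PATH.

# run the script in MapReduce mode (the default); the script above loads its input from HDFS
pig sample_script.pig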
Summarize

a) Introduction to Pig
b) Need for Pig
c) Key Features of Pig
d) Pig Architecture
e) Applications of Pig
f) Execution of a Sample Pig Latin Script
