
Python

www.Bigdatainpractice.com

Introduction to Python and Python Streaming

Introduction to Python
Python Data Types
Working with Files
Conditions, Loops, etc.
Data Structures
Python Streaming on Hadoop


Python Features
1. Open-source, general-purpose programming language.
2. Developed by Guido van Rossum (first released in 1991).
3. Guido wanted to bridge the gap between C and shell scripting.
4. Rapid development (Python is an interpreted language, so there is no separate compilation step).
5. Very handy for professionals who don't have Java, C++, or C skills.
6. A lot of heavy lifting (e.g. working with Hadoop) can be done very easily with Python.
7. Supports object-oriented as well as procedural code.
8. Very high-level dynamic data types such as lists and dictionaries.
9. Extensive standard library and third-party modules (such as SciPy and NumPy).
10. Versions available for most operating systems.

Comparison of Programming Languages (popularity) - chart


Working with Python
1. Working with the Python shell: start the interactive interpreter by typing python at the command line.
2. Environment setup: add the Python installation directory to the system PATH.
3. Python scripting:
   To run a script: python hello.py
   Code blocks are defined by indentation only; there are no braces.
4. Python variables:
   Created when first assigned to.
   Can hold data of any type.
   Variable names can be of any length and are case-sensitive.
   The type is inferred from the assigned value (int, float, str).
5. Working with files:
   Use the with open(...) construct so the file is closed automatically (see the sketch below).
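A minimal sketch of points 4 and 5 above; the file name hello.txt and the values are only examples.

# 4. Variables: created on first assignment; the type comes from the value
count = 10                      # int
price = 19.99                   # float
name = 'Python'                 # str

# 5. Files: 'with open' closes the file automatically, even if an error occurs
with open('hello.txt', 'w') as f:
    f.write('hello from ' + name + '\n')

with open('hello.txt') as f:
    for line in f:
        print(line.strip())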


Working with Python
6. Decision controls and loops:
   for loop (when the number of iterations is known, or to step through a sequence)
   while loop (repeats as long as a condition holds)
7. Data structures: lists, tuples
   Lists store multiple values and are mutable (by convention, often values of the same type).
   Tuples also store multiple values but are immutable (often used to group values of different types).
   Indexing and slicing work the same way for lists, tuples and strings.
8. Data structures: dictionaries, sets
   Dictionaries store key-value pairs; sets store unique, unordered values.
9. Working with the file system (e.g. the os module)
10. Functions (defined with def; see the sketch below)

Python Streaming on Hadoop - MapReduce

******************mapper.py******************
#!/usr/bin/env python
import sys

# Read comma-separated sales records from standard input and emit
# tab-separated (category, sales) pairs for the reducer
for line in sys.stdin:
    line = line.strip()
    memtype, cat, year, month, day, qty, sales = line.split(",")
    print '%s\t%s' % (cat, sales)
*********************************************
******************reducer.py******************
(the reducer script is provided separately; a possible sketch is shown below)
*********************************************
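The reducer script itself is not reproduced on the slide. The sketch below is one plausible implementation, assuming the goal is to total the sales per category from the sorted mapper output; the formatting choices are illustrative only.

**************reducer.py (sketch)**************
#!/usr/bin/env python
import sys

# Hadoop streaming sorts the mapper output by key before it reaches the
# reducer, so all lines for one category arrive together
current_cat = None
total = 0.0

for line in sys.stdin:
    cat, sales = line.strip().split('\t')
    if cat == current_cat:
        total += float(sales)
    else:
        if current_cat is not None:
            print('%s\t%.2f' % (current_cat, total))
        current_cat = cat
        total = float(sales)

# emit the last category
if current_cat is not None:
    print('%s\t%.2f' % (current_cat, total))
*********************************************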
hadoop jar <streaming.jar> \
  -file /user/cloudera/mapper.py -file /user/cloudera/reducer.py \
  -mapper /user/cloudera/mapper.py -reducer /user/cloudera/reducer.py \
  -input /user/cloudera/INPUT1/SalesData.csv \
  -output /user/cloudera/OUT_PY
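Before submitting the job, the mapper and reducer can be tested locally with a shell pipeline (assuming a local copy of SalesData.csv); sort stands in for Hadoop's shuffle-and-sort phase:

cat SalesData.csv | python mapper.py | sort | python reducer.py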


Python Streaming on Hadoop - PIG


******************PIG SCRIPT******************
DEFINE streampy `/usr/bin/python pigPython.py`
    INPUT(stdin USING PigStreaming(','))
    OUTPUT(stdout USING PigStreaming(','))
    SHIP('pigPython.py');

salesData  = LOAD '/user/cloudera/INPUT1/SalesData.csv' USING PigStorage(',')
             AS (member_type:chararray, cat:chararray, year:int, month:int,
                 day:chararray, quantity:int, sales:float);
salesData2 = FILTER salesData BY cat == 'C1';
salesData3 = STREAM salesData2 THROUGH streampy
             AS (member_type:chararray, cat:chararray, year:int, month:int,
                 day:chararray, qty:int, sales:float);
DUMP salesData3;


Python Streaming on Hadoop - PIG


******************pigPython.py******************
#!/usr/bin/python
import sys

# Pig streams each filtered tuple to this script as a comma-separated line
# (PigStreaming(',')); the script prefixes two fields and writes the record
# back to stdout in the same comma-separated format
for line in sys.stdin:
    line = line.strip()
    member_type, cat, year, month, day, qty, sales = line.split(",")
    member_type = 'MEMBER_' + member_type
    cat = 'CAT_' + cat
    print member_type + "," + cat + "," + year + "," + month + "," + day + "," + qty + "," + sales


Python Streaming on Hadoop - HIVE


******************HIVE SCRIPT*****************
CREATE TABLE employee_python (
  empid string,
  name string,
  assist string,
  salary float,
  country string,
  state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

ADD FILE hiveTran.py;

INSERT OVERWRITE TABLE employee_python
SELECT
  TRANSFORM (empid, name, assist, salary, country, state)
  USING 'python hiveTran.py'
  AS (empid, name, assist, salary, country, state)
FROM employee;

Python Streaming on Hadoop - HIVE


******************hiveTran.py******************
#!/usr/bin/env python
import sys

# Hive TRANSFORM passes the selected columns to this script as tab-separated
# lines on stdin; the transformed row is written back to stdout, also
# tab-separated, and loaded into employee_python
for line in sys.stdin:
    line = line.strip()
    empid, name, assist, salary, country, state = line.split("\t")
    name = name.upper()                 # upper-case the name
    salary = float(salary) / 1000       # express salary in thousands
    country = country.upper()
    state = state.upper()
    print '\t'.join([empid, name, assist, str(salary), country, state])

Thank You
