
Python

www.Bigdatainpractice.com

Introduction to Python and Python Streaming

Introduction to Python
Python Data Types
Working with Files
Conditions, Loops, etc.
Data Structures
Python Streaming on Hadoop


Python Features
1. Open-source, general-purpose programming language.
2. Developed by Guido van Rossum (first released in 1991).
3. Guido wanted to bridge the gap between C and shell scripting.
4. Rapid development (Python is an interpreted language, so there is no separate compilation step).
5. Very handy for professionals who don't have Java, C++, or C skills.
6. A lot of heavy lifting (e.g. working with Hadoop) can be done very easily with Python.
7. Supports object-oriented as well as procedural code.
8. Very high-level dynamic data types such as lists and dictionaries.
9. Extensive standard library and third-party modules (such as SciPy and NumPy).
10. Versions available for most operating systems.

Comparison of Programming Languages (popularity) - chart


Working with Python
1. Working with the Python shell: start the interactive interpreter by typing python at the command line.
2. Environment setup: add the Python installation directory to the system PATH.
3. Python scripting:
   To run a script: python hello.py
   Code blocks are defined by indentation only; there are no braces.
4. Python variables:
   Created when first assigned to.
   Can hold data of any type.
   Variable names can be of any length and are case-sensitive.
   The type is inferred from the assigned value (int, float, str).
5. Working with files:
   Use the with open(...) construct so the file is closed automatically (see the sketch below).
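A minimal sketch of points 4 and 5 above; the file name hello.txt and the values are only examples.

# 4. Variables: created on first assignment; the type comes from the value
count = 10                      # int
price = 19.99                   # float
name = 'Python'                 # str

# 5. Files: 'with open' closes the file automatically, even if an error occurs
with open('hello.txt', 'w') as f:
    f.write('hello from ' + name + '\n')

with open('hello.txt') as f:
    for line in f:
        print(line.strip())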


Working with Python
6. Decision controls and loops:
   for loop (when the number of iterations is known, or to step through a sequence)
   while loop (repeats as long as a condition holds)
7. Data structures: lists, tuples
   Lists store multiple values and are mutable (by convention, often values of the same type).
   Tuples also store multiple values but are immutable (often used to group values of different types).
   Indexing and slicing work the same way for lists, tuples and strings.
8. Data structures: dictionaries, sets
   Dictionaries store key-value pairs; sets store unique, unordered values.
9. Working with the file system (e.g. the os module)
10. Functions (defined with def; see the sketch below)

Python Streaming on Hadoop - MapReduce

******************mapper.py******************
#!/usr/bin/env python
import sys

# Read comma-separated sales records from standard input and emit
# tab-separated (category, sales) pairs for the reducer
for line in sys.stdin:
    line = line.strip()
    memtype, cat, year, month, day, qty, sales = line.split(",")
    print '%s\t%s' % (cat, sales)
*********************************************
******************reducer.py******************
(the reducer script is provided separately; a possible sketch is shown below)
*********************************************
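The reducer script itself is not reproduced on the slide. The sketch below is one plausible implementation, assuming the goal is to total the sales per category from the sorted mapper output; the formatting choices are illustrative only.

**************reducer.py (sketch)**************
#!/usr/bin/env python
import sys

# Hadoop streaming sorts the mapper output by key before it reaches the
# reducer, so all lines for one category arrive together
current_cat = None
total = 0.0

for line in sys.stdin:
    cat, sales = line.strip().split('\t')
    if cat == current_cat:
        total += float(sales)
    else:
        if current_cat is not None:
            print('%s\t%.2f' % (current_cat, total))
        current_cat = cat
        total = float(sales)

# emit the last category
if current_cat is not None:
    print('%s\t%.2f' % (current_cat, total))
*********************************************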
hadoop jar <streaming.jar> \
  -file /user/cloudera/mapper.py -file /user/cloudera/reducer.py \
  -mapper /user/cloudera/mapper.py -reducer /user/cloudera/reducer.py \
  -input /user/cloudera/INPUT1/SalesData.csv \
  -output /user/cloudera/OUT_PY
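Before submitting the job, the mapper and reducer can be tested locally with a shell pipeline (assuming a local copy of SalesData.csv); sort stands in for Hadoop's shuffle-and-sort phase:

cat SalesData.csv | python mapper.py | sort | python reducer.py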


Python Streaming on Hadoop - PIG


******************PIG SCRIPT******************
DEFINE streampy `/usr/bin/python pigPython.py`
    INPUT(stdin USING PigStreaming(','))
    OUTPUT(stdout USING PigStreaming(','))
    SHIP('pigPython.py');

salesData  = LOAD '/user/cloudera/INPUT1/SalesData.csv' USING PigStorage(',')
             AS (member_type:chararray, cat:chararray, year:int, month:int,
                 day:chararray, quantity:int, sales:float);
salesData2 = FILTER salesData BY cat == 'C1';
salesData3 = STREAM salesData2 THROUGH streampy
             AS (member_type:chararray, cat:chararray, year:int, month:int,
                 day:chararray, qty:int, sales:float);
DUMP salesData3;


Python Streaming on Hadoop - PIG


******************pigPython.py******************
#!/usr/bin/python
import sys

# Pig streams each filtered tuple to this script as a comma-separated line
# (PigStreaming(',')); the script prefixes two fields and writes the record
# back to stdout in the same comma-separated format
for line in sys.stdin:
    line = line.strip()
    member_type, cat, year, month, day, qty, sales = line.split(",")
    member_type = 'MEMBER_' + member_type
    cat = 'CAT_' + cat
    print member_type + "," + cat + "," + year + "," + month + "," + day + "," + qty + "," + sales


Python Streaming on Hadoop - HIVE


******************HIVE SCRIPT*****************
CREATE TABLE employee_python (
  empid string,
  name string,
  assist string,
  salary float,
  country string,
  state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

ADD FILE hiveTran.py;

INSERT OVERWRITE TABLE employee_python
SELECT
  TRANSFORM (empid, name, assist, salary, country, state)
  USING 'python hiveTran.py'
  AS (empid, name, assist, salary, country, state)
FROM employee;

Python Streaming on Hadoop - HIVE


******************hiveTran.py******************
#!/usr/bin/env python
import sys

# Hive TRANSFORM passes the selected columns to this script as tab-separated
# lines on stdin; the transformed row is written back to stdout, also
# tab-separated, and loaded into employee_python
for line in sys.stdin:
    line = line.strip()
    empid, name, assist, salary, country, state = line.split("\t")
    name = name.upper()                 # upper-case the name
    salary = float(salary) / 1000       # express salary in thousands
    country = country.upper()
    state = state.upper()
    print '\t'.join([empid, name, assist, str(salary), country, state])

Thank You
