
Python Intro

Shankar Venkatagiri
Reference

Python for Data Analysis, 2nd Edition. O’Reilly 2017


Wes McKinney (WMcK) is the creator of Pandas
Written for Python 3.6+
Code for the book: http://github.com/wesm/pydata-book
Book provides an overview of these libraries
Python and Jupyter
NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Statsmodels
Origins

Authored by Guido van Rossum, released in 1991


He developed it further while at Google, from 2005 to 2012
Q: Why did he call it Python?
Advantages of the language
Easy to understand - no *pointers*
Interpreted, not compiled
Used not only for research and prototyping, but also for deployment in production
Facilitates big data processing on a cluster (PySpark)
Structure

Q: How does a typical Python program flow?


You import some libraries to enable normal / special tasks
You read in data from a CSV file / database / . . .
You clean the data - missing values, duplicates, outliers, . . .
You perform some computations and store results in variables
For repeat tasks, you define functions
You visualise the outcomes with some cool graphs
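A minimal sketch of that flow, using only the standard library. The dataset, its column names and the clean helper are made up for illustration; a real project would more likely read a file with Pandas and finish with a Matplotlib plot.

```python
import csv          # step 1: import the libraries you need
import io
import statistics

# Hypothetical data standing in for a CSV file on disk.
RAW = """name,age
Asha,34
Ravi,
Asha,34
Meera,51
"""

def clean(rows):
    """Drop rows with a missing age and exact duplicate rows."""
    seen, out = set(), []
    for row in rows:
        key = (row["name"], row["age"])
        if row["age"] == "" or key in seen:
            continue
        seen.add(key)
        out.append({"name": row["name"], "age": int(row["age"])})
    return out

rows = list(csv.DictReader(io.StringIO(RAW)))           # step 2: read the data in
people = clean(rows)                                    # step 3: clean it
mean_age = statistics.mean(p["age"] for p in people)    # step 4: compute and store
print(f"{len(people)} people, mean age {mean_age}")
# step 6 (visualising) is left out to keep the sketch dependency-free
```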
Essentials

Fact: Writing a program is 80% thinking & 20% typing


Syntax is just nuts and bolts - focus on the big picture
Don’t sweat the small stuff - there’s always Google!
Don’t be afraid to make mistakes
Some tasks are routine (e.g. reading in data). The interesting parts require innovative thinking
No prizes for getting it right the first time!
Real projects involve teams - learn to communicate
Jupyter

The name comes from Julia, Python and R


Supports kernels for 40+ platforms (Python, Spark, R, Scala, … )
Create a directory to contain all your code, then issue this command:
$ docker run -d -p 8888:8888 -e GRANT_SUDO=yes \
    -v <Path to Code Directory>:/home/jovyan/work \
    jupyter/all-spark-notebook start-notebook.sh \
    --NotebookApp.token=''
Open Firefox or Chrome and go to localhost:8888 to list the files in your work directory
Open a new Python 3 notebook and start typing
Basics

Assign variables using =


Multiple assignments in one go!
To work with a set of numbers, use an array
A negative index counts back from the end
To store a collection, we could use a list object [ ]
Generate a sequence with np.arange(n) = [0, 1, …, n-1]
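A short sketch of these basics, assuming NumPy is installed. The variable names and values are made up for illustration.

```python
import numpy as np

x = 5                                  # assign a variable using =
a, b = 1, 2                            # multiple assignments in one go
scores = np.array([10, 20, 30, 40])    # an array for a set of numbers
last = scores[-1]                      # negative index: -1 is the last element
basket = ["apple", 3.5, True]          # a list can hold a mixed collection
seq = np.arange(5)                     # the sequence 0, 1, 2, 3, 4
```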
Outputs
Filename: IntroToPython.ipynb

In a cell, only the last output is printed. Remedy: Secret sauce!

Not printed!
Flow

Code blocks are bracketed by { } in many languages


Not in Python!
We use indentation instead
Loop over items with for
Conditional branching with if-elif-else
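A small sketch of both ideas together. The marks and the grade thresholds are hypothetical; the point is that indentation alone marks where the loop body and each branch begin and end.

```python
marks = [82, 55, 31]
results = []

for m in marks:                  # the indented lines form the loop body
    if m >= 75:
        label = "distinction"
    elif m >= 40:
        label = "pass"
    else:
        label = "fail"
    results.append(label)        # still inside the loop, outside the if

print(results)                   # back at the left margin: runs once, after the loop
```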
Function

Want to reuse logic?


Bundle it into a function using def
day is an argument to the function
Functions are objects in Python
You can pass them as arguments
Task: What if you want to cube instead?
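The slide's own code is not reproduced here, so the following is only a sketch of the same ideas. The function names is_weekend, square, cube and apply_to_all are hypothetical.

```python
def is_weekend(day):
    """Return True if day (a string) names Saturday or Sunday."""
    return day.lower() in ("saturday", "sunday")

def square(x):
    return x ** 2

def cube(x):
    return x ** 3

def apply_to_all(func, values):
    """Call func on every item; func is passed in like any other object."""
    return [func(v) for v in values]

print(is_weekend("Sunday"))
print(apply_to_all(square, [1, 2, 3]))
print(apply_to_all(cube, [1, 2, 3]))   # switching to cubes needs no change to apply_to_all
```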
Loops

Q: Which is quicker?

Two styles of iteration: directly using a for loop, or a Pythonic way using a list comprehension
We use time() to clock the time elapsed
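One way to run the comparison, as a sketch. The task (squaring numbers) and the size n are arbitrary choices; timings vary by machine and run, so no result is shown. For finer measurements, time.perf_counter or the timeit module may be a better fit than time().

```python
from time import time

n = 100_000

start = time()
squares_loop = []
for i in range(n):                           # style 1: an explicit for loop
    squares_loop.append(i * i)
loop_seconds = time() - start

start = time()
squares_comp = [i * i for i in range(n)]     # style 2: a list comprehension
comp_seconds = time() - start

print(f"for loop: {loop_seconds:.4f}s, comprehension: {comp_seconds:.4f}s")
```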
File

File operations are used to bring in data / code from a file


More when we talk about data frames
Exception handling provides a graceful way to handle errors
Essential during file operations - files cannot be left “open”
You can even run the contents of this Python file
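A sketch of file handling with exception handling, using a temporary file so that it runs anywhere. The file name notes.txt and its contents are made up. The with statement is one common way to make sure a file is closed even when an error occurs.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

with open(path, "w") as f:          # the file is closed when the block ends
    f.write("first line\nsecond line\n")

with open(path) as f:
    lines = f.read().splitlines()

try:
    f = open(path + ".missing")     # this file does not exist
except FileNotFoundError as err:
    message = f"could not open file: {err.filename}"
else:
    f.close()                       # only reached if open() succeeded

print(lines)
print(message)
```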
Dictionary

The dictionary data structure is useful to process key-values


Q: Why are key-values useful?
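A brief sketch with made-up data. It shows two typical uses of key-values: looking something up by name, and counting occurrences.

```python
capitals = {"India": "New Delhi", "Portugal": "Lisbon"}
capitals["Japan"] = "Tokyo"                    # add a new key-value pair
lookup = capitals["Portugal"]                  # fetch a value by its key
fallback = capitals.get("Brazil", "unknown")   # default when the key is absent

# Counting: the key is the item, the value is how often it was seen.
answers = ["yes", "no", "yes", "yes"]
counts = {}
for a in answers:
    counts[a] = counts.get(a, 0) + 1

print(lookup, fallback, counts)
```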
Libraries
Filename: NumPy.ipynb

Python has a disaggregated approach, in contrast with R


NumPy - short for Numerical Python
Vectorised, array-oriented computing
Has a C language API - can access legacy C/C++/Fortran
Internally stores data in a contiguous block of memory
Pandas - primarily for tabular data representation
Time series manipulation
Matplotlib - for visualisation
Matrices
randn = normally distributed random numbers
Indexing is straightforward
Boolean indexing, retrieval and substitution in place
Reshape an array to get a matrix, construct its transpose
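A sketch of these operations, assuming NumPy is installed. The array size and the seed are arbitrary choices made only so the run is repeatable.

```python
import numpy as np

np.random.seed(0)                   # arbitrary seed, for repeatable draws
data = np.random.randn(12)          # 12 normally distributed random numbers

first, last = data[0], data[-1]     # plain indexing
negatives = data[data < 0]          # boolean indexing: retrieve a copy of the matches
data[data < 0] = 0                  # substitution in place

matrix = data.reshape(3, 4)         # reshape the 1D array into a 3 x 4 matrix
transposed = matrix.T               # its 4 x 3 transpose
```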
Pandas
Filename: Pandas.ipynb

Designed for tabular, heterogeneous data


Structures: Series (single column) & DataFrame
Series: 1D objects with a default zero-based index
Can relabel & access elements by index, filtering operations
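A short Series sketch, assuming Pandas is installed. The values and labels are made up.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])         # default index is 0, 1, 2, 3
first = s[0]

s.index = ["a", "b", "c", "d"]       # relabel the index
value_b = s["b"]                     # access an element by its label
positives = s[s > 0]                 # filtering keeps the matching labels
```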
Excel

We can read into a data frame from an Excel spreadsheet


Get a dictionary of DataFrame objects - one per sheet
JSON

Can also initialise a dataframe using JSON data


JavaScript Object Notation
Ideal for unstructured data with missing values
Data is read line by line into a data frame
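A sketch of line-by-line JSON loading, assuming Pandas is installed. The records are made up, and an in-memory buffer stands in for a file. The second record has no age field, which shows up as a missing value in the data frame.

```python
import io
import pandas as pd

# Hypothetical data: one JSON object per line.
text = '{"name": "Asha", "age": 34}\n{"name": "Ravi"}\n'

df = pd.read_json(io.StringIO(text), lines=True)   # lines=True reads one record per line
print(df)
```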
Operations

Get the header with list


Delete columns with del
Get top items with head
List data types with dtypes
Convert type of a column
Subset the data
Slice the data
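The operations listed above can be sketched on a small made-up data frame (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "John"],
    "age": ["34", "29", "51", "45"],          # stored as text on purpose
    "city": ["Pune", "Delhi", "Kochi", "Goa"],
    "notes": ["", "", "", ""],
})

header = list(df)                   # column names
del df["notes"]                     # delete a column
top = df.head(2)                    # first two rows
print(df.dtypes)                    # list the data types
df["age"] = df["age"].astype(int)   # convert the type of a column
over_40 = df[df["age"] > 40]        # subset rows with a condition
middle = df.iloc[1:3]               # slice rows by position
```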
Description

Numerical variable

Categorical variable
Database

SQLite is “a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.”
A dataframe can be loaded from a database query
We use Pandas built-in support for SQLite
Study the code for database table creation
Delete the table if you wish
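A sketch of the round trip using an in-memory SQLite database, assuming Pandas is installed. The table name clients and its rows are made up and are not the table used in the notebook.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")       # a throwaway database held in memory

# Create a table and add a couple of rows.
conn.execute("CREATE TABLE clients (name TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [("Asha", 1200.0), ("Ravi", -50.0)],
)
conn.commit()

# Load a data frame from a database query.
df = pd.read_sql_query("SELECT * FROM clients WHERE balance > 0", conn)
print(df)

conn.execute("DROP TABLE clients")       # delete the table if you wish
conn.close()
```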
CSV
Filename: BankMarketing.ipynb

Standard route: Initialising a dataframe from a CSV file


E.g. UCI Repository Bank Marketing Dataset
Dataset has 45,211 instances
16 predictors, 1 response
Phone-based direct marketing campaigns of a Portuguese bank
Required one or more contacts to the same client
Outcome: Product = term deposit subscribed or not

S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing:
An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings
of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121,
Guimarães, Portugal, October, 2011. EUROSIS.
# bank client data:
1 - age
2 - job : type of job ("admin.","unknown","unemployed","management", ...)
3 - marital : marital status ("married","divorced","single")
4 - education ("unknown","secondary","primary","tertiary")
5 - default: has credit in default? ("yes","no")
6 - balance: average yearly balance, in euros
7 - housing: has housing loan? ("yes","no")
8 - loan: has personal loan? ("yes","no")

# related with the last contact of the current campaign:


9 - contact: contact communication type ("unknown","telephone","cellular")
10 - day: last contact day of the month
11 - month: last contact month of year ("jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds

# other attributes:
13 - campaign: number of contacts performed during this campaign, for this client
14 - pdays: days passed after last contact (-1 = client was not previously contacted)
15 - previous: number of contacts performed before this campaign for this client
16 - poutcome: outcome of previous campaign ("unknown","other","failure","success")

Output variable (desired target):


17 - y - has the client subscribed a term deposit? ("yes","no")
read_csv

Whenever you read a dataset in, list out a few rows (head)
Check for any surprises in the data types
Use astype to convert some “object” columns to categorical
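A sketch of this routine on a few made-up rows that use a handful of the columns described earlier. The semicolon separator is an assumption about the file layout, so check it against your copy of the dataset. The name Calls follows the data frame name used on the next slide.

```python
import io
import pandas as pd

# Hypothetical sample rows, not taken from the real file.
SAMPLE = """age;job;marital;balance;y
58;management;married;2143;no
44;technician;single;29;no
33;admin.;married;2;yes
"""

Calls = pd.read_csv(io.StringIO(SAMPLE), sep=";")
print(Calls.head())      # list out a few rows
print(Calls.dtypes)      # text columns are not yet categorical

for col in ["job", "marital", "y"]:
    Calls[col] = Calls[col].astype("category")
```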
Categorical

describe gives helpful statistics


Activity
Derive the 12 job levels with Calls["job"].unique()
Spreadsheet-like pivot tables are also possible
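A sketch on a tiny made-up stand-in for the Calls data frame, so it shows 3 job levels rather than the 12 in the real dataset. The choice of counting balance entries in the pivot table is only for illustration.

```python
import pandas as pd

Calls = pd.DataFrame({
    "job": ["management", "technician", "admin.", "management"],
    "marital": ["married", "single", "married", "single"],
    "balance": [2143, 29, 2, 1500],
    "y": ["no", "no", "yes", "yes"],
})

summary = Calls["job"].describe()     # count, unique, top, freq for a text column
levels = Calls["job"].unique()        # the distinct job levels

# Spreadsheet-like pivot: number of clients per job and outcome.
pivot = Calls.pivot_table(index="job", columns="y", values="balance",
                          aggfunc="count", fill_value=0)
print(pivot)
```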
Numerical

Getting stats on columns is straightforward


round off the numbers to make better sense
Combined

A combined pivot table throws more light…


Matplotlib
Filename: Matplotlib.ipynb

John Hunter enabled a Matlab-like plotting library


In IPython, type: %matplotlib notebook
Directly plotting numerical values does not help
The default histogram needs some jazzing up
Separate colour for each “patch”
For horizontal bars, use barh
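A sketch of both ideas, assuming Matplotlib and NumPy are installed. The data are random stand-ins, and the viridis colour map and bin count are arbitrary choices. The Agg backend is selected only so the sketch runs outside a notebook; inside Jupyter you would drop those two lines and use the magic command instead.

```python
import matplotlib
matplotlib.use("Agg")                  # off-screen backend, not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
ages = np.random.randint(18, 90, size=200)     # hypothetical stand-in for an age column

fig, ax = plt.subplots()
counts, bins, patches = ax.hist(ages, bins=8, edgecolor="white")

# Give each bar ("patch") of the histogram its own colour.
colours = plt.cm.viridis(np.linspace(0, 1, len(patches)))
for patch, colour in zip(patches, colours):
    patch.set_facecolor(colour)
ax.set_xlabel("age")
ax.set_ylabel("count")

# Horizontal bars with barh (made-up counts).
fig2, ax2 = plt.subplots()
bars = ax2.barh(["yes", "no"], [30, 170])
```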

Bar

Choose a sequential colour palette when you wish to compare
seaborn

Don’t like the looks of Matplotlib graphs? Try seaborn


Q: What do you infer from the boxplots?
Advanced

There’s an alternate way: you can use lambda functions aka anonymous functions
Multiple arguments possible

You can attach a conditional


E.g. Print the values only for odd entries
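A sketch of these three points. The names square, add and parity and the values are made up for illustration.

```python
square = lambda x: x ** 2                          # an anonymous function bound to a name
add = lambda x, y: x + y                           # multiple arguments are possible
parity = lambda x: "odd" if x % 2 else "even"      # a conditional inside a lambda

values = [1, 2, 3, 4, 5]
odds = list(filter(lambda v: v % 2 == 1, values))  # keep only the odd entries

for v in odds:
    print(v)
```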
Reference

Advanced Analytics with Spark, 2nd Edition. O’Reilly 2017


Sandy Ryza et al
“I think the best way to teach data science is by example.”
Code samples: https://github.com/sryza/aas
