BDA02 IntroToPython

Python Intro
Shankar Venkatagiri
Reference
Python for Data Analysis, 2nd Edition. O’Reilly 2017

Wes McKinney (WMcK) is the creator of Pandas
Written for Python 3.6+
Code for the book: http://github.com/wesm/pydata-book
Book provides an overview of these libraries
Python and Jupyter
NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Statsmodels
Origins
Authored by Guido van Rossum, released in 1991

He developed it further at Google from 2005 to 2012
Q: Why did he call it Python?
Advantages of the language
Easy to understand - no *pointers*
Interpreted, not compiled
Not only used for research and prototyping, but also for
deployment in production
Facilitates big data processing on a cluster (PySpark)
Structure
Q: How does a typical Python program flow?

You import some libraries to enable standard or specialised tasks
You read in data from a CSV file, a database, . . .
You clean the data - missing values, duplicates, outliers, . . .
You perform some computations and store results in variables
For repeat tasks, you define functions
You visualise the outcomes with some cool graphs
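A minimal sketch of that flow, assuming pandas is installed. The inline data, the column names and the to_fahrenheit helper are all hypothetical stand-ins for a real CSV file and real logic; the plotting step is left out here.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
raw = io.StringIO("city,temp\nPune,31\nPune,31\nDelhi,\nKochi,29\n")

def to_fahrenheit(celsius):
    """Reusable logic goes into a function."""
    return celsius * 9 / 5 + 32

data = pd.read_csv(raw)                        # read the data in
data = data.drop_duplicates().dropna()         # clean: duplicates, missing values
data["temp_f"] = to_fahrenheit(data["temp"])   # compute and store the result
mean_f = data["temp_f"].mean()

print(data)
print(round(mean_f, 1))
```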
Essentials
Fact: Writing a program is 80% thinking & 20% typing

Syntax is just nuts and bolts - focus on the big picture
Don’t sweat the small stuff - there’s always Google!
Don’t be afraid to make mistakes
Some tasks are routine (e.g. reading in data). The interesting
parts require innovative thinking
No prizes for getting it right the first time!
Real projects involve teams - learn to communicate
Jupyter
The name comes from Julia, Python and R

Supports kernels for 40+ platforms (Python, Spark, R, Scala, … )
Create a directory to contain all your code, then issue this command:
$ docker run -d -p 8888:8888 -e GRANT_SUDO=yes \
    -v <Path to Code Directory>:/home/jovyan/work \
    jupyter/all-spark-notebook start-notebook.sh \
    --NotebookApp.token=''
Open Firefox or Chrome and go to localhost:8888 to list the files in your work directory
Open up a new Python 3 notebook and start typing
Basics
Assign variables using =

Multiple assignments in one go!
To work with a set of numbers, use an array
Negative index has meaning
To store a collection, we could use a list object [ ]
Generate a sequence with np.arange(n) = [0, 1, …, n-1]
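A short sketch of these basics, assuming NumPy is installed. The variable names and values are made up for illustration.

```python
import numpy as np

a, b = 3, 4.5                        # multiple assignments in one go
scores = np.array([10, 20, 30, 40])  # an array for a set of numbers
last = scores[-1]                    # a negative index counts from the end
basket = ["apple", 7, 2.5]           # a list can hold a mixed collection
seq = np.arange(5)                   # the sequence 0, 1, 2, 3, 4

print(a, b, last, basket, seq)
```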
Outputs
Filename: IntroToPython.ipynb
In a cell, only the last output is printed. Remedy: Secret sauce!
Not printed!
Flow
Code blocks are bracketed by { } in many languages

Not in Python!
We use Indentation instead
Loop over items with for
Conditional branching with if-elif-else
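A small sketch of indentation-based blocks, with a for loop and an if-elif-else inside it. The temperature values and thresholds are arbitrary.

```python
temperatures = [12, 25, 38]
labels = []

for t in temperatures:        # the indented lines form the loop body
    if t < 15:
        labels.append("cold")
    elif t < 30:
        labels.append("mild")
    else:
        labels.append("hot")

print(labels)
```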
Function
Want to reuse logic?

Bundle it into a function using def
day is an argument to the function
Functions are objects in Python
You can pass them as arguments
Task: What if you want to cube instead?
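One possible sketch of these ideas. The function names (is_weekend, square, cube, apply_to_all) are hypothetical and not taken from the notebook; only the argument name day comes from the slide.

```python
def is_weekend(day):
    """Return True when day names a weekend day."""
    return day.lower() in ("saturday", "sunday")

def square(x):
    return x * x

def cube(x):
    return x ** 3

def apply_to_all(func, values):
    """Functions are objects, so func is passed in like any other value."""
    return [func(v) for v in values]

print(is_weekend("Sunday"))
print(apply_to_all(square, [1, 2, 3]))
print(apply_to_all(cube, [1, 2, 3]))   # swap the function to cube instead
```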
Loops
Q: Which is quicker?
Two styles of iteration: Directly using a for loop

We use time() to clock the time elapsed
Or a Pythonic way using a list comprehension
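A sketch for trying the comparison yourself. The task (squaring numbers) and the size n are arbitrary choices; the timings depend on the machine, so no result is claimed here.

```python
from time import time

n = 200_000

# Style 1: an explicit for loop
start = time()
squares_loop = []
for i in range(n):
    squares_loop.append(i * i)
loop_seconds = time() - start

# Style 2: a list comprehension
start = time()
squares_comp = [i * i for i in range(n)]
comp_seconds = time() - start

print(f"for loop: {loop_seconds:.4f}s, comprehension: {comp_seconds:.4f}s")
```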
File
File operations are used to bring in data / code from a file

More when we talk about data frames
Exception handling provides a graceful way to handle errors
Essential during file operations - files cannot be left “open”
You can even run the contents of this Python file
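A minimal sketch of safe file handling, using a scratch file created on the fly (the file name hello.py and its contents are hypothetical). The with statement is one common way to make sure a file is closed; the notebook may use try/finally instead.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hello.py")   # hypothetical scratch file

# "with" closes the file even if an error occurs inside the block.
with open(path, "w") as f:
    f.write("greeting = 'hello from file'\n")

try:
    with open(path) as f:
        source = f.read()
except FileNotFoundError:
    source = ""
    print("file is missing")

# Run the contents of the Python file; inside Jupyter, %run does a similar job.
namespace = {}
exec(source, namespace)
print(namespace["greeting"])
```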
Dictionary
The dictionary data structure is useful to process key-values

Q: Why are key-values useful?
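A short sketch of dictionary use, with made-up keys and values. The word count at the end hints at one answer to the question: a key gives direct access to its value without searching a list.

```python
capitals = {"India": "New Delhi", "Portugal": "Lisbon"}
capitals["Japan"] = "Tokyo"                   # add a new key-value pair
lisbon = capitals["Portugal"]                 # look up by key
missing = capitals.get("Brazil", "unknown")   # default when the key is absent

for country, city in capitals.items():
    print(country, "->", city)

# Counting words: each word is a key, its count is the value.
counts = {}
for word in "to be or not to be".split():
    counts[word] = counts.get(word, 0) + 1
print(counts)
```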
Libraries
Filename: NumPy.ipynb
Python has a disaggregated approach, in contrast with R

NumPy - short for Numerical Python
Vectorised array oriented computing
Has a C language API - can access legacy C/C++/Fortran
Internally stores data in a contiguous block of memory
Pandas - primarily for tabular data representation
Time series manipulation
Matplotlib - for visualisation
Matrices
randn = normally distributed random numbers
Indexing is straightforward
Boolean indexing, retrieval and substitution in place
Reshape an array to get a matrix, construct its transpose
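A sketch of these array operations, assuming NumPy is installed. The shapes and values are arbitrary.

```python
import numpy as np

noise = np.random.randn(3, 4)   # 3x4 array of normally distributed numbers

arr = np.arange(12)
third = arr[2]                  # plain indexing
big = arr[arr > 6]              # Boolean indexing: retrieve matching entries
arr[arr % 2 == 1] = -1          # Boolean indexing: substitute in place

matrix = arr.reshape(3, 4)      # reshape the 1D array into a 3x4 matrix
transposed = matrix.T           # its 4x3 transpose

print(matrix)
print(transposed)
```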
Pandas
Filename: Pandas.ipynb
Designed for tabular, heterogeneous data

Structures: Series (single column) & DataFrame
Series: 1D objects with a default zero-based index
Can relabel & access elements by index, filtering operations
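A brief sketch of a Series and a DataFrame, assuming pandas is installed. The values and labels are invented.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])        # default zero-based index 0..3
first = s.iloc[0]                   # element at position 0

s.index = ["a", "b", "c", "d"]      # relabel the index
b_value = s["b"]                    # access by label
positive = s[s > 0]                 # filtering keeps the matching labels

frame = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [31, 45]})
print(positive)
print(frame)
```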
Excel
We can read into a data frame from an Excel spreadsheet

Get a dictionary of DataFrame objects - one per sheet
JSON
Can also initiate a dataframe using JSON data

JavaScript Object Notation
Ideal for unstructured data with missing values
Data is read line by line into a data frame
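A sketch of line-by-line JSON loading, assuming the data is in JSON Lines form (one object per line). The records are invented and held in memory rather than in a file; lines=True is the pandas option for this layout.

```python
import io
import pandas as pd

# Hypothetical records; note that some keys are missing in some lines.
json_lines = io.StringIO(
    '{"name": "Asha", "age": 31}\n'
    '{"name": "Ravi"}\n'
    '{"name": "Mei", "age": 27, "city": "Pune"}\n'
)

people = pd.read_json(json_lines, lines=True)   # one row per line
print(people)                                   # missing values show up as NaN
```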
Operations
Get the header with list

Delete columns with del
Get top items with head
List data types with dtypes
Convert type of a column
Subset the data
Slice the data
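The operations above can be sketched on a small, invented data frame (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Mei", "Tom"],
    "age": ["31", "45", "27", "52"],        # stored as strings on purpose
    "city": ["Pune", "Delhi", "Pune", "Kochi"],
    "notes": ["", "", "", ""],
})

header = list(df)                    # column names
del df["notes"]                      # delete a column
top = df.head(2)                     # first two rows
print(df.dtypes)                     # data type of each column
df["age"] = df["age"].astype(int)    # convert the type of a column
over_30 = df[df["age"] > 30]         # subset rows by a condition
middle = df.iloc[1:3]                # slice rows by position
```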
Description
Numerical variable
Categorical variable
Database
SQLite is “a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.”
A dataframe can be loaded from a database query
We use Pandas built-in support for SQLite
Study the code for database table creation
Delete the table if you wish
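A minimal sketch of the round trip, using an in-memory SQLite database so nothing is written to disk. The table name clients and its rows are hypothetical, not the ones in the notebook.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# Create a table and insert a few rows.
conn.execute("CREATE TABLE clients (name TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [("Asha", 1200.0), ("Ravi", -50.0), ("Mei", 300.5)],
)
conn.commit()

# Load a data frame from a query.
in_credit = pd.read_sql_query(
    "SELECT * FROM clients WHERE balance > 0 ORDER BY name", conn
)
print(in_credit)

conn.execute("DROP TABLE clients")   # delete the table if you wish
conn.close()
```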
CSV
Filename: BankMarketing.ipynb
Standard route: Initialising a dataframe from a CSV file

E.g. UCI Repository Bank Marketing Dataset
Dataset has 45,211 instances
16 predictors, 1 response
Phone-based direct marketing campaigns of a Portuguese bank
Required one or more contacts to the same client
Outcome: Product = term deposit subscribed or not
S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing:
An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings
of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121,
Guimarães, Portugal, October, 2011. EUROSIS.
# bank client data:
1 - age
2 - job : type of job ("admin.","unknown","unemployed","management", ...)
3 - marital : marital status ("married","divorced","single")
4 - education ("unknown","secondary","primary","tertiary")
5 - default: has credit in default? ("yes","no")
6 - balance: average yearly balance, in euros
7 - housing: has housing loan? ("yes","no")
8 - loan: has personal loan? ("yes","no")
# related with the last contact of the current campaign:

9 - contact: contact communication type ("unknown","telephone","cellular")
10 - day: last contact day of the month
11 - month: last contact month of year ("jan", "feb", "mar", ..., "nov", "dec")
12 - duration: last contact duration, in seconds
# other attributes:
13 - campaign: number of contacts performed during this campaign, for this client
14 - pdays: days passed after last contact (-1 = client was not previously contacted)
15 - previous: number of contacts performed before this campaign for this client
16 - poutcome: outcome of previous campaign ("unknown","other","failure","success")
Output variable (desired target):

17 - y - has the client subscribed a term deposit? ("yes","no")
read_csv
Whenever you read a dataset in, list out a few rows (head)
Check for any surprises in the data types
Use astype to convert some “object” columns to categorical
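A sketch of these three steps. The rows below are invented and only mimic the layout of the bank data; the semicolon separator is an assumption about how the file is delimited. Calls is the data frame name used in the activity.

```python
import io
import pandas as pd

# Invented rows in a semicolon-separated layout (not real records).
sample = io.StringIO(
    'age;job;marital;balance;y\n'
    '34;"technician";"single";520;"no"\n'
    '51;"management";"married";1875;"yes"\n'
    '42;"admin.";"divorced";-120;"no"\n'
)

Calls = pd.read_csv(sample, sep=";")
print(Calls.head())      # list out a few rows
print(Calls.dtypes)      # check for surprises in the data types

# Convert the text columns to categorical.
for col in ["job", "marital", "y"]:
    Calls[col] = Calls[col].astype("category")
print(Calls.dtypes)
```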
Categorical
describe gives helpful statistics

Activity
Derive the 12 job levels with Calls["job"].unique()
Spreadsheet-like pivot tables are also possible
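A sketch on a tiny invented frame; the real data has 12 job levels, this one has 3. The pivot table counts clients per job and outcome.

```python
import pandas as pd

# Invented rows for illustration only.
Calls = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services", "technician", "admin."],
    "y":   ["no", "yes", "no", "no", "yes", "yes"],
    "age": [30, 41, 35, 52, 28, 47],
})

summary = Calls["job"].describe()   # count, unique, top, freq for a text column
levels = Calls["job"].unique()      # the distinct job levels

# Spreadsheet-like pivot table: number of clients per job and outcome.
table = pd.pivot_table(
    Calls, index="job", columns="y", values="age",
    aggfunc="count", fill_value=0,
)
print(summary)
print(table)
```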
Numerical
Getting stats on columns is straightforward

Round off the numbers to make better sense
Combined
A combined pivot table throws more light…
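A sketch of numerical statistics and a combined pivot table, on invented rows. The choice of marital status against outcome, with mean balance in the cells, is an assumption made for illustration.

```python
import pandas as pd

# Invented rows for illustration only.
Calls = pd.DataFrame({
    "age": [30, 41, 35, 52],
    "balance": [1200.5, -50.25, 300.0, 89.75],
    "marital": ["married", "single", "married", "single"],
    "y": ["no", "yes", "yes", "no"],
})

# Stats on the numerical columns, rounded for readability.
stats = Calls[["age", "balance"]].describe().round(2)
print(stats)

# Combined view: mean balance by marital status and outcome.
combined = pd.pivot_table(
    Calls, index="marital", columns="y", values="balance", aggfunc="mean"
).round(2)
print(combined)
```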

Matplotlib
Filename: Matplotlib.ipynb
John Hunter created Matplotlib, a MATLAB-like plotting library

In IPython, type: %matplotlib notebook
Directly plotting numerical values does not help
The default histogram needs some jazzing up
Separate colour for each “patch”
For horizontal, use barh
Bar
Choose a sequential colour palette when you wish to compare
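A sketch of a coloured histogram and a horizontal bar chart, assuming Matplotlib and NumPy are installed. The ages are random numbers and the job counts are invented; Blues is one of Matplotlib's sequential colour maps. The Agg backend is selected only so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

ages = np.random.default_rng(0).normal(40, 10, 500)   # synthetic ages

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with a separate colour for each patch (bar).
counts, edges, patches = ax1.hist(ages, bins=10, edgecolor="black")
shades = plt.cm.Blues(np.linspace(0.3, 1.0, len(patches)))
for patch, shade in zip(patches, shades):
    patch.set_facecolor(shade)
ax1.set_xlabel("age")
ax1.set_ylabel("count")

# Horizontal bars with a sequential palette to aid comparison.
jobs = ["services", "technician", "admin."]
job_counts = [1, 2, 3]           # invented counts
bars = ax2.barh(jobs, job_counts, color=plt.cm.Blues(np.linspace(0.4, 0.9, 3)))
ax2.set_xlabel("clients")

plt.close(fig)
```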
seaborn
Don’t like the looks of Matplotlib graphs? Try seaborn

Q: What do you infer from the boxplots?
Advanced
There’s an alternate way: you can use lambda functions aka anonymous functions
Multiple arguments possible
You can attach a conditional

E.g. Print the values only for odd entries
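A short sketch of lambda functions; the names and the list of values are hypothetical. The conditional is shown twice: inside a lambda expression, and as a filter that keeps only the odd entries.

```python
square = lambda x: x * x                            # anonymous function bound to a name
add = lambda x, y: x + y                            # multiple arguments
parity = lambda x: "odd" if x % 2 == 1 else "even"  # conditional inside a lambda

values = [1, 2, 3, 4, 5]
odds = list(filter(lambda x: x % 2 == 1, values))   # keep only the odd entries

for v in odds:
    print(v)
```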
Reference
Advanced Analytics with Spark, 2nd Edition O’Reilly 2017

Sandy Ryza et al
“I think the best way to teach data science is by example.”
Code samples: https://github.com/sryza/aas
