
Python Training Module

This material is meant for IBM Academic Initiative use only. NOT FOR RESALE.

© 2019 IBM Corporation



Preface
February 2019

NOTICES

This information was developed for products and services offered in the USA.

IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the products
and services currently available in your area. Any reference to an IBM product,
program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or
service that does not infringe any IBM intellectual property right may be used instead.
However, it is the user’s responsibility to evaluate and verify the operation of any
non-IBM product, program, or service. IBM may have patents or pending patent
applications covering subject matter described in this document. The furnishing of
this document does not grant you any license to these patents. You can send license
inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
United States of America

The following paragraph does not apply to the United Kingdom or any other country
where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS
MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not
allow disclaimer of express or implied warranties in certain transactions, therefore,
this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes
are periodically made to the information herein; these changes will be incorporated in
new editions of the publication. IBM may make improvements and/or changes in
the product(s) and/or the program(s) described in this publication at any time without
notice.


Any references in this information to non-IBM websites are provided for convenience
only and do not in any manner serve as an endorsement of these websites. The
materials at those websites are not part of the materials for this IBM product, and use
of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you. Information concerning non-IBM
products was obtained from the suppliers of those products, their published
announcements, or other publicly available sources. IBM has not tested those
products and cannot confirm the accuracy of performance, compatibility, or any other
claims related to the non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products.

This information contains examples of data reports used in daily business operations.
To illustrate them as completely as possible, the examples include the names of
individuals, companies, brands, and products. All of these names are fictitious and any
similarity to the names and addresses used by an actual business enterprise is entirely
coincidental.

TRADEMARKS

IBM, the IBM logo, ibm.com, and Python are trademarks or registered trademarks of
the International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies. A
current list of IBM trademarks is available on the web at “Copyright and trademark
information” at www.ibm.com/legal/copytrade.html.

Adobe, and the Adobe logo are either registered trademarks or trademarks of Adobe
Systems Incorporated in the United States, and/or other countries.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in
the United States, other countries, or both.

© Copyright International Business Machines Corporation 2019.

This document may not be reproduced in whole or in part without prior permission of
IBM.

US Government Users Restricted Rights – Use, duplication, or disclosure restricted by
GSA ADP Schedule Contract with IBM Corp.


Table of Contents
Introduction to Python
  What is Python?
  Advantages and disadvantages
    Benefits
    Limitations
  Downloading and installing
    Downloading
    Installing
  Python versions
  Running Python scripts
    Executing scripts with Python Launcher:
    Executing scripts without Python Launcher:
  Using the interpreter interactively
  Using variables
    Rules for variable names
    Dynamic typing
    Assigning variables
    Re-assigning variables
    Determining variable type with type()
    Simple exercise
  String types: normal, raw, and Unicode
    Creating a String
  String operators and functions
    Printing a string


    String basics
    String indexing
    String properties
    Basic built-in string methods
  Math operators and functions
  Writing to the screen
  Test Your Knowledge
Deep Dive into Python
  Reading from the keyboard
    raw_input
    input
  Indenting is significant
  Boolean
  The if and elif statements
    The statements can also have multiple branches
  While loops
    break, continue, and pass statements
  Using lists
    Indexing and slicing
    Basic list methods
    Nesting lists
    List comprehensions
  Dictionaries
  Using the ‘for’ statement
    Tuples


  Opening, reading, and writing a text file
  Test Your Knowledge
    Guessing Game Challenge
Python Libraries
  Using Pandas–the Python data analysis library
    How to install Pandas
  Series and Data Frames
    pandas.Series
    Creating an empty series
    Creating a series from dictionary
    pandas.DataFrame
    Creating an empty Data Frame
    Creating a Data Frame from list
    Read a tabular data file in pandas
    Basic Series functionalities
    Basic DataFrame functionalities
    Handling missing value in pandas
  Grouping, aggregating, and applying
    GroupBy object
    Aggregation
    The apply() method
  Merging and joining
    Merge()
    Merge using ‘how’ argument
  Test Your Knowledge


Error Handling
  Dealing with syntax errors
  Exceptions
    Zero Division Error
    Type Division Error
    Name Error
  Handling exceptions with try/except
  Test Your Knowledge
Other Topics
  RE objects
  Pattern matching
  Parsing data
  Test Your Knowledge
Regression (Use Case Study)
  Introduction to regression
  Why use regression analysis?
  Types of regression
  Test Your Knowledge
Other Regression related topics
  Exploratory Data Analysis
  Correlation Matrix
  Visualizations using matplotlib
  Implementing Linear Regression
  Use Case: Churn Analysis
  Implementing Linear Regression (using Scikit Learn)


  Results of Linear Regression
  Test Your Knowledge
Advance
  Machine Learning Algorithms
  Support Vector Machine
  Random Forest
Conclusion
Glossary


Introduction to Python

What is Python?
Created by Guido van Rossum and first released in 1991, Python is an
object-oriented, high-level programming language used for a wide variety
of applications. It was once considered a gap-filler, a way to write
scripts that ‘automate the boring stuff’. It is a great language for beginners
because of its readability and other structural elements designed to make
it easy to understand.
Over the past few years, Python has emerged as a first-class citizen in
modern software development, infrastructure management, and data
analysis.
It is no longer a back-room utility language, but a major force in web
application creation and systems management, and a key driver of the
explosion in big data analytics and machine intelligence.

Advantages and disadvantages


Benefits
• Extensive Libraries
• Extensible
• Embeddable
• Improved Productivity
• IoT Opportunities
• Simple and easy
• Readable
• Object Oriented
• Free and Open Source
• Portable


Limitations
• Speed Limitations
• Weak in Mobile Computing and Browsers
• Design Restrictions
• Underdeveloped Database Access Layers
• Simple
• Run Time Errors

Downloading and installing


Downloading
1. First, click the Python Download link.
2. Next, click the Download Python 3.6.2 button. The file named
python-3.6.2.exe should start downloading into your standard
download folder.
3. Then, move this file to a more permanent location, so that you can
install Python (and reinstall it easily later, if necessary).
4. Finally, follow the Installing instructions to complete the process.

Installing
1. First, double-click the icon labeling the file python-3.6.2.exe.
(An Open File - Security Warning pop-up window will appear.)
2. Then, click Run.
(A Python 3.6.2 (32-bit) Setup pop-up window will appear.)
Ensure that the Install launcher for all users (recommended) and
the Add Python 3.6 to PATH checkboxes at the bottom are
checked.


If the Python Installer finds an earlier version of Python installed on
your computer, the ‘Install Now’ message will instead appear as
‘Upgrade Now’ (and the checkboxes will not appear).
3. Next, highlight the ‘Install Now’ (or ‘Upgrade Now’) message, and
then click it.
(A User Account Control pop-up window will appear, posing the
question ‘Do you want to allow the following program to make
changes to this computer?’)
4. Then, click the Yes button.
(A new Python 3.6.2 (32-bit) Setup pop-up window will appear with a
‘Setup Progress’ message and a progress bar.)
5. Soon, a new Python 3.6.2 (32-bit) Setup pop-up window will
appear with a ‘Setup was successful’ message. Click the Close
button.

Python versions
Python is available in two major versions:
• Python 2.x - The older “legacy” branch. Official support (that is,
official updates) ended on January 1, 2020, although it might
persist unofficially after that.
• Python 3.x - The current and future incarnation of the language. It
has many useful and important features not found in 2.x, such as
better concurrency controls and a more efficient interpreter.


Running Python scripts


Executing scripts with Python Launcher:
The Python launcher for Windows is a utility that aids in the location and
execution of different Python versions. It allows scripts (or the
command-line) to indicate a preference for a specific Python version and
will locate and execute that version.
• From the command-line
Ensure the launcher is on your PATH. Depending on how it was
installed, the launcher may already be there, but it is worth
checking. From a command prompt, execute the following command:
py
This should start an interactive Python session.
• From a script
Create a test Python script called hello.py with the following
contents:
#! python
import sys
sys.stdout.write("hello from Python %s\n" % (sys.version,))
Then, from the directory containing hello.py, execute the command:
py hello.py
• From file associations
The launcher should have been associated with Python files (i.e., .py,
.pyw, .pyc, .pyo files) when it was installed.

Executing scripts without Python Launcher:


• Python scripts (files with the extension .py) will be executed by
python.exe by default.


• This executable opens a terminal, which stays open even if the
program uses a GUI.
• If you do not want this to happen, use the extension .pyw, which will
cause the script to be executed by pythonw.exe by default (both
executables are located in the top-level of your Python installation
directory).
• This suppresses the terminal window on startup.
• You can also make all .py scripts execute with pythonw.exe, setting
this through the usual facilities.
1. Launch a command prompt.
2. Associate the correct file group with .py scripts:
assoc .py=Python.File
3. Redirect all Python files to the new executable:
ftype Python.File=C:\Path\to\pythonw.exe "%1" %*

Using the interpreter interactively


• The interpreter operates somewhat like the Unix shell:
o When called with standard input connected to a tty device, it
reads and executes commands interactively;
o When called with a file name argument or with a file as standard
input, it reads and executes a script from that file.
• When commands are read from a tty, the interpreter is said to be in
interactive mode.
• In this mode, it prompts for the next command with the primary
prompt, usually three greater-than signs (>>>); for continuation
lines it prompts with the secondary prompt, by default three dots
(...).


• The interpreter prints a welcome message stating its version
number and a copyright notice before printing the first prompt.
• Continuation lines are needed when entering a multi-line construct,
such as this if statement:
>>> the_world_is_flat = True
>>> if the_world_is_flat:
...     print("Be careful not to fall off!")
...
Be careful not to fall off!

Using variables
Variable assignment
Rules for variable names
• Do not start the names with a number.
• Do not use spaces in names, use _ instead.
• Do not use any of these symbols in names:
:'",<>/?|\!@#%^&*~-+
• Follow the best practice of writing names in lowercase with
underscores.
• Avoid using Python keywords (such as if and for) and the names of
built-ins (such as list and str).
• Avoid using the single characters l (lowercase letter el), O
(uppercase letter oh), and I (uppercase letter eye) as they can be
confused with 1 and 0.
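
The first few rules above can be checked mechanically. The following is a rough sketch only: the helper is_safe_name and the candidate names are made up for illustration, while str.isidentifier() and keyword.iskeyword() come from the standard library.

```python
import keyword

def is_safe_name(name):
    """Return True if name is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

# Hypothetical candidate names, chosen to break one rule each.
candidates = ["total_price", "2nd_value", "my value", "rate%", "for"]
results = {name: is_safe_name(name) for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {'valid' if ok else 'invalid'}")
```

Note that this check does not catch built-in names such as list or str; those are legal identifiers, which is why the rule above is a recommendation rather than a syntax requirement.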

Dynamic typing
Python uses dynamic typing, meaning we can reassign variables to
different data types. This makes Python very flexible in assigning data
types; it differs from other languages that are statically typed.


Pros:
This is very easy to work with and has a faster development time.
Cons:
It may result in unexpected bugs! So, you need to be aware of type().
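
As a small illustration of dynamic typing (the variable name my_dogs and its values are arbitrary), the same name can point first at an integer and then at a list:

```python
my_dogs = 2
first_type = type(my_dogs).__name__      # 'int'

my_dogs = ["Sammy", "Frankie"]           # same name, now a different type
second_type = type(my_dogs).__name__     # 'list'

print(first_type, second_type)
```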

Assigning variables
Variable assignment follows name = object, where a single equals sign = is
an assignment operator.
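
For instance, a minimal sketch that binds the name a (an arbitrary choice) to the integer 5:

```python
a = 5
print(a)
```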

Here we assigned the integer object 5 to the variable name a. Let us
assign a to something else:
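
A sketch of that step, assuming a previously held 5 and the new value is 10:

```python
a = 5
a = 10      # the name a now refers to 10; the old value is no longer bound
print(a)
```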


We can now use a in place of the number 10:
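
For example, assuming a currently holds 10:

```python
a = 10
total = a + a   # same as 10 + 10
print(total)
```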

Re-assigning variables
Python lets us reassign variables with a reference to the same object.

There is another shortcut way of doing this. Python lets us add, subtract,
multiply, and divide numbers with reassignment using +=, -=, *=, and /=.
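
Both forms can be sketched as follows; the starting value 10 is only an example:

```python
a = 10
a = a + a    # the right side is evaluated first, using the old value of a
step1 = a    # 20

a += 10      # shorthand for a = a + 10
a -= 5       # a = a - 5
a *= 2       # a = a * 2
a /= 4       # a = a / 4; true division gives a float in Python 3
print(step1, a)
```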

Determining variable type with type()


We can check what type of object is assigned to a variable using Python's
built-in type() function. Common data types include:
• int (for integer)


• float (for floating-point number)
• str (for string)
• list
• tuple
• dict (for dictionary)
• set
• bool (for Boolean True/False)
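
A quick way to see these types is to call type() on one sample value of each kind. The sample values below are arbitrary:

```python
samples = [5, 2.5, "hello", [1, 2], (1, 2), {"k": "v"}, {1, 2}, True]

# type(value) returns the class; __name__ gives its short name as a string.
type_names = [type(value).__name__ for value in samples]

for value, name in zip(samples, type_names):
    print(f"{value!r} -> {name}")
```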

Simple exercise
This shows how variables make calculations more readable and easier to
follow.
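
One possible version of such an exercise, with made-up names and figures, is sketched here:

```python
my_income = 100
tax_rate = 0.1

# The variable names document what the calculation means.
my_taxes = my_income * tax_rate
print(my_taxes)
```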

String types: normal, raw, and Unicode


Strings are used in Python to record text information, such as names.
Strings in Python are a sequence, which means Python keeps track of
every element in the string as a sequence. For example, Python


understands the string 'hello' to be a sequence of letters in a specific
order. This means we will be able to use indexing to grab particular letters
(like the first letter, or the last letter).
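
The heading of this section names three kinds of string. As a rough sketch, assuming Python 3 (where every str is already a Unicode string) and a made-up Windows-style path, the difference between a normal and a raw literal looks like this:

```python
normal = "C:\new_folder"     # \n is interpreted as a newline character
raw = r"C:\new_folder"       # r prefix: the backslash is kept literally
accented = "caf\u00e9"       # \u00e9 is the Unicode escape for é

print(len(normal), len(raw), accented)
```

The raw form is mainly useful for file paths and regular expressions, where backslashes are common.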

Creating a String
To create a string in Python we need to use either single quotes or double
quotes. For example:
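
The following sketch uses invented sample strings. Because a broken literal would stop the whole script, compile() is used here (an assumption of this example, not a technique from the module) to show the error safely:

```python
single_word = 'hello'
phrase = 'This is also a string'
double_quoted = "String built with double quotes"

# An apostrophe inside a single-quoted literal ends the string early.
broken_source = "' I'm using single quotes, but this will create an error'"
try:
    compile(broken_source, "<example>", "eval")
    outcome = "compiled"
except SyntaxError:
    outcome = "SyntaxError"

print(outcome)
```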

The error above occurs because the single quote in I'm ended the string
early. You can use combinations of double and single quotes to get
the complete statement.
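
For example (the sentences are placeholders):

```python
# Double quotes outside, so the apostrophe is just another character.
message = "Now I'm ready to use the single quotes inside a string!"

# Single quotes outside, so double quotes can appear inside.
quoted = 'She said "hello" and left.'

print(message)
print(quoted)
```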


String operators and functions

Printing a string
We can use the print() function to print a string.
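
A minimal sketch, assuming Python 3 and arbitrary text:

```python
greeting = 'Hello World'
print(greeting)

# \n inside a string starts a new line in the output.
print('First line\nSecond line')
```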

String basics
We can also use a function called len() to check the length of a string!

Python's built-in len() function counts all of the characters in the string,
including spaces and punctuation.
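
For example, with an arbitrary sample string:

```python
length = len('Hello World')   # 10 letters plus 1 space
print(length)
```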

String indexing
In Python, we use brackets [] after an object to call its index. We should
also note that indexing starts at 0 for Python. Let's create a new object
called s and then walk through a few examples of indexing.


Let’s start String Indexing!
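
A short sketch, assuming s holds the sample text 'Hello World':

```python
s = 'Hello World'

first = s[0]    # 'H' (indexing starts at 0)
second = s[1]   # 'e'
third = s[2]    # 'l'

print(first, second, third)
```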

We can use a : to perform slicing that grabs everything up to a designated
point. For example:
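
A sketch using the same assumed sample string s = 'Hello World':

```python
s = 'Hello World'

from_index_1 = s[1:]   # everything from index 1 to the end: 'ello World'
up_to_3 = s[:3]        # everything before index 3: 'Hel'
everything = s[:]      # a full copy of the string

print(from_index_1, up_to_3, everything, sep=' | ')
```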


Note the above slicing. Here, we're telling Python to grab everything from
index 0 up to 3. It doesn't include the character at index 3. You'll notice
this a lot in Python, where ranges are usually expressed as "up to, but
not including".

We can also use negative indexing to go backwards.
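
For instance, again assuming s = 'Hello World':

```python
s = 'Hello World'

last = s[-1]           # last character: 'd'
all_but_last = s[:-1]  # everything except the last character

print(last, all_but_last)
```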

We can also use index and slice notation to grab elements of a sequence
by a specified step size (the default is 1). For instance, we can use two
colons in a row and then a number specifying the frequency to grab
elements. For example:
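
A sketch with the assumed sample string:

```python
s = 'Hello World'

every_char = s[::1]     # step 1: the whole string
every_second = s[::2]   # step 2: indices 0, 2, 4, ...
reversed_s = s[::-1]    # step -1: walk backwards

print(every_char, every_second, reversed_s, sep=' | ')
```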


String properties
It is important to note that strings have an important property known as
immutability. This means that once a string is created, the elements within
it cannot be changed or replaced. For example:
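
The sketch below tries to overwrite the first character of an assumed sample string and catches the resulting error so the script keeps running:

```python
s = 'Hello World'

try:
    s[0] = 'x'                      # strings do not allow this
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

print(s)                            # the string is unchanged
```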

Notice how the error tells us directly what we can't do: item
assignment!
Something we can do is concatenate strings!
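
For example (sample text is arbitrary); note that this builds a new string and rebinds the name rather than changing the old string in place:

```python
s = 'Hello World'
s = s + ' concatenate me!'
print(s)
```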


We can use the multiplication symbol to create repetition!
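
A minimal sketch with an arbitrary letter and count:

```python
letter = 'z'
repeated = letter * 10
print(repeated)
```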

Basic built-in string methods


We call methods with a period and then the method name. Methods are in
the form:
object.method(parameters)
Here are some examples of built-in methods in strings:
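
The calls below are standard str methods; the sample string is an assumption:

```python
s = 'Hello World'

upper = s.upper()      # 'HELLO WORLD'
lower = s.lower()      # 'hello world'
words = s.split()      # split on whitespace: ['Hello', 'World']
parts = s.split('W')   # split on a chosen character: ['Hello ', 'orld']

print(upper, lower, words, parts)
```

None of these calls changes s itself; each returns a new object.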


Math operators and functions


Arithmetic/Math operators are used with numeric values to perform
common mathematical operations:

Operator   Name             Example
+          Addition         x + y
-          Subtraction      x - y
*          Multiplication   x * y
/          Division         x / y
%          Modulus          x % y
**         Exponentiation   x ** y
//         Floor Division   x // y
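
The operators in the table can be tried with any two numbers; x = 7 and y = 2 below are arbitrary. The built-in functions abs(), pow(), and round() are added here as examples of math functions:

```python
x, y = 7, 2

results = {
    "x + y": x + y,     # 9
    "x - y": x - y,     # 5
    "x * y": x * y,     # 14
    "x / y": x / y,     # 3.5 (true division)
    "x % y": x % y,     # 1 (remainder)
    "x ** y": x ** y,   # 49
    "x // y": x // y,   # 3 (floor division)
}
for expression, value in results.items():
    print(expression, "=", value)

absolute = abs(-7)            # 7
power = pow(7, 2)             # same as 7 ** 2
rounded = round(3.14159, 2)   # 3.14
```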


Writing to the screen


open() returns a file object, and is most commonly used with two
arguments: open(filename, mode).
>>> f = open('workfile', 'w')


• The first argument is a string containing the filename.


• The second argument is another string containing a few characters
describing the way in which the file will be used. mode can be:
o 'r' when the file will only be read,
o 'w' for only writing (an existing file with the same name will be
erased),
o 'a' opens the file for appending; any data written to the file is
automatically added to the end, and
o 'r+' opens the file for both reading and writing.
• The mode argument is optional; 'r' will be assumed if it’s omitted.
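
The modes listed above can be tried out as follows. This is a sketch under a few assumptions: the file is created in a temporary directory so nothing important is overwritten, and the with statement (not covered above) is used so each file is closed automatically.

```python
import os
import tempfile

# Throwaway location for the example file.
path = os.path.join(tempfile.mkdtemp(), "workfile")

with open(path, "w") as f:      # 'w' creates the file or erases an existing one
    f.write("first line\n")

with open(path, "a") as f:      # 'a' adds to the end
    f.write("second line\n")

with open(path) as f:           # mode omitted, so 'r' is assumed
    lines = f.read().splitlines()

with open(path, "r+") as f:     # 'r+' allows both reading and writing
    f.read()                    # move to the end of the file
    f.write("third line\n")

with open(path) as f:
    final_lines = f.read().splitlines()

print(lines)
print(final_lines)
```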

There are three ways of writing values:


1. Expression statements
Expression statements are used (mostly interactively) to compute
and write a value, or (usually) to call a procedure (a function that
returns no meaningful result; in Python, procedures return the value
None).
2. print() function
The print() function outputs whatever you pass to it to your
console window. At first blush, it might appear that print() is
rather useless for programming, but it is actually one of the most
widely used functions in all of Python, because it makes for a great
debugging tool.
3. write() method
The write() method writes a string to the file. In Python 3 it
returns the number of characters written (in Python 2 there is no
return value). Due to buffering, the string may not actually show up
in the file until the flush() or close() method is called.
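
The three ways can be compared side by side. In this sketch, sys.stdout (the file object behind the console) is used for write() so that the output appears on screen; the text values are arbitrary.

```python
import sys

# 1. Expression statement: at the >>> prompt the value would be echoed;
#    in a script it is computed and then discarded.
2 + 3

# 2. print() converts its arguments to text and adds a newline.
print("total:", 2 + 3)
result = print("print() is a procedure")   # so its return value is None

# 3. write() takes exactly one string and adds nothing to it.
text = "written with write()\n"
count = sys.stdout.write(text)
sys.stdout.flush()                          # push any buffered output out now
```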


Test Your Knowledge

1. Answer these three questions without typing code. Then type
code to check your answer.
a. What is the value of the expression 4 * (6 + 5)?
b. What is the value of the expression 4 * 6 + 5?
c. What is the value of the expression 4 + 6 * 5?

2. What is the type of the result of the expression 3 + 1.5 + 4?

3. What would you use to find a number’s square root, as well as its
square?

4. Given the string 'hello', give an index command that returns 'e'.

5. Reverse the string 'hello' using slicing.

6. Given the string 'hello', give two methods of producing the letter
'o' using indexing.


Deep Dive into Python

Reading from the keyboard


There are two functions in Python 2 that you can use to read data from the
user:
• raw_input()
• input()
In Python 3, raw_input() was removed and input() took over its behavior.

raw_input
raw_input (Python 2 only) is used to read text (strings) from the user.
raw_input does not interpret the input. It always returns the input of the
user without changes, that is, raw. This raw input can be changed into the
data type needed for the algorithm. To accomplish this, we can use either a
‘casting’ function or the ‘eval’ function.

input
If the input function is called, the program flow will be stopped until the
user has given an input and has ended the input with the return key. The
text of the optional parameter, that is, the prompt, will be printed on the
screen. In Python 2, the input of the user will be interpreted. For example,
if the user puts in an integer value, the input function returns this integer
value. If the user on the other hand inputs a list, the function will return a
list. In Python 3, input() always returns the text as a string, so it must be
converted explicitly.
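
A Python 3 sketch of reading and converting input follows. The helper names read_int and read_literal are hypothetical, and the reader parameter exists only so the example can be tried without a keyboard. ast.literal_eval is used here as a safer stand-in for eval, since it only accepts plain literals.

```python
import ast

def read_int(prompt, reader=input):
    """Ask for a whole number and cast the text with int()."""
    return int(reader(prompt))

def read_literal(prompt, reader=input):
    """Parse a Python literal such as a list from the typed text."""
    return ast.literal_eval(reader(prompt))

# Simulated keyboard input; in real use, leave reader at its default.
age = read_int("Age? ", reader=lambda prompt: "42")
colours = read_literal("Colours? ", reader=lambda prompt: "['red', 'blue']")

print(age + 1, colours)
```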

Indenting is significant
Python programs get structured through indentation, that is, code blocks
are defined by their indentation. In the case of Python, it is a language
requirement, not a matter of style. This principle makes it easier to read
and understand other people's Python code.
All statements with the same distance to the right belong to the same
block of code, that is, the statements within a block line up vertically. The
block ends at a line less indented or the end of the file. If a block must be
more deeply nested, it is simply indented further to the right.
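
The nesting described above can be seen in a small sketch (the function classify and its data are made up for illustration):

```python
def classify(numbers):
    labels = []                       # block 1: body of the function
    for n in numbers:
        if n % 2 == 0:                # block 2: body of the for loop
            labels.append("even")     # block 3: body of the if
        else:
            labels.append("odd")      # block 3: body of the else
    return labels                     # less indented: the loop has ended

print(classify([1, 2, 3]))
```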


Boolean
Boolean values are the two constant objects False and True. They are
used to represent truth values (other values can also be considered false
or true). In numeric contexts (for example, when used as the argument to
an arithmetic operator), they behave like the integers 0 and 1,
respectively.
The built-in function bool() can be used to cast any value to a Boolean, if
the value can be interpreted as a truth value. They are written as False and
True, respectively.
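
A brief sketch of both points, using arbitrary sample values:

```python
# In arithmetic, True behaves like 1 and False like 0.
total = True + True      # 2
zero = False * 10        # 0

# bool() reports the truth value of any object.
truthy = [bool(1), bool("text"), bool([0])]
falsy = [bool(0), bool(""), bool([]), bool(None)]

print(total, zero, truthy, falsy)
```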
A string in Python can be tested for truth value. The return type will be in
Boolean value (True or False). Let’s make an example, by first creating a
new variable, and giving it a value.


my_string = "Hello World"

my_string.isalnum() #check if all characters are letters or digits
my_string.isalpha() #check if all characters are alphabetic
my_string.isdigit() #check if all characters are digits
my_string.istitle() #check if the string is title-cased
my_string.isupper() #check if all cased characters are upper case
my_string.islower() #check if all cased characters are lower case
my_string.isspace() #check if all characters are whitespace

To see what the return value (True or False) will be, simply print it out.

my_string = "Hello World"

print(my_string.isalnum()) #False
print(my_string.isalpha()) #False
print(my_string.isdigit()) #False
print(my_string.istitle()) #True
print(my_string.isupper()) #False
print(my_string.islower()) #False
print(my_string.isspace()) #False

The if and elif statements


‘if’ statements in Python allow us to tell the computer to perform
alternative actions based on a certain set of results. Verbally, we can
imagine we are telling the computer:
"Hey if this case happens, perform some action."
We can then expand the idea further with ‘elif’ and ‘else’ statements,
which allow us to tell the computer:


"Hey if this case happens, perform some action. Else, if another case
happens, perform some other action. Else, if none of the above cases
happen, perform this action."

Syntax
if case1:
    perform action1
elif case2:
    perform action2
else:
    perform action3
If the condition case1 is True, the statements of the block action1 will be
executed. If not, case2 will be evaluated. If case2 evaluates to True,
action2 will be executed. If case2 is False, the conditions of any
following ‘elif’ clauses will be checked in order, and finally, if none of
them has evaluated to True, the indented block below the ‘else’ keyword
will be executed.


Here are a few examples:

The statements can also have multiple branches
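A minimal sketch of an if/elif/else chain with several branches. The variable loc and the messages are hypothetical values chosen only for illustration.

```python
# Hypothetical example value; change it to see the other branches run.
loc = "Bank"

if loc == "Auto Shop":
    message = "Welcome to the Auto Shop!"
elif loc == "Bank":
    message = "Welcome to the bank!"
elif loc == "Store":
    message = "Welcome to the store!"
else:
    message = "Where are you?"

print(message)  # Welcome to the bank!
```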

Note how the conditions are checked in order until one of them evaluates
to True, which causes the indented code below it to run. We should also
note that we can put in as many ‘elif’ statements as we want before we
close off with an ‘else’.


While loops
The ‘while’ statement in Python is one of the most general ways to perform
iteration. A ‘while’ statement will repeatedly execute a single statement or
group of statements as long as the condition is true. It is called a 'loop'
because the code statements are looped through over and over again until
the condition is no longer met.

Syntax
while test:
    code statements
else:
    final code statements
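The syntax above can be sketched with a counter that runs from 0 to 9. The variable names are illustrative, and the code assumes Python 3, where print() accepts several arguments.

```python
x = 0
printed = []  # records each value so the loop's behaviour can be checked

while x < 10:
    print("x is currently:", x)
    printed.append(x)
    x += 1    # without this update the condition would stay True forever
```

The body runs ten times (for x = 0 to 9) and the loop stops once x reaches 10, because the condition x < 10 is then False.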


Notice how many times the print statements occurred and how the ‘while’
loop kept going as long as the condition was True. The condition became
False once x==10, and it is important to note that at this point the loop
stopped. We can also add an ‘else’ statement to the loop. When the loop
completes without hitting a ‘break’, the ‘else’ block is executed.
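A minimal sketch of a ‘while’ loop with an ‘else’ clause, using illustrative names. The ‘else’ block runs once, after the condition has become False.

```python
x = 0
log = []

while x < 3:
    log.append(x)
    x += 1
else:
    # Executed once, when the condition x < 3 is no longer True.
    log.append("All done!")

print(log)  # [0, 1, 2, 'All done!']
```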


break, continue, and pass statements


We can use ‘break’, ‘continue’, and ‘pass’ statements in our loops to add
additional functionality for various cases. The three statements are
defined as:
• break: Breaks out of the current closest enclosing loop
• continue: Goes to the top of the closest enclosing loop
• pass: Does nothing at all
‘break’ and ‘continue’ statements can appear anywhere inside the loop’s
body, but we will usually put them further nested in conjunction with an ‘if’
statement to perform an action based on some condition.

Syntax:
while test:
    code statement
    if test:
        break
    if test:
        continue
else:
    final code statements


However, here is a word of caution. It is possible to create an infinitely
running loop with ‘while’ statements if the condition never becomes
False.
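The following sketch combines ‘break’, ‘continue’, and ‘pass’ in one loop. The values 1, 3, and 6 are arbitrary choices for illustration. Updating x at the top of the body is what keeps the loop from running forever.

```python
x = 0
seen = []

while x < 10:
    x += 1            # update the loop variable so the loop can end
    if x == 3:
        continue      # skip the rest of this iteration
    if x == 6:
        break         # leave the loop entirely
    if x == 1:
        pass          # placeholder that does nothing
    seen.append(x)
else:
    # Not reached here, because the loop was ended by break.
    seen.append("done")

print(seen)  # [1, 2, 4, 5]
```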

Using lists
Lists can be thought of as the most general version of a sequence in
Python. Unlike strings, they are mutable, meaning the elements inside a
list can be changed. They are constructed with brackets [] and commas
separating every element in the list and can actually hold different object
types.

Just like with strings, the len() function will tell you how many items are
in the list.
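A short sketch of list construction and len(). The list contents are illustrative.

```python
# A list of integers.
my_list = [1, 2, 3]

# A list can hold different object types at the same time.
mixed_list = ["A string", 23, 100.232, "o"]

print(len(my_list))     # 3
print(len(mixed_list))  # 4
```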


Indexing and slicing


Indexing and slicing work just like in strings.

We can also use + to concatenate lists, just like we did for strings.

Note: This doesn't actually change the original list.

You would have to reassign the list to make the change permanent.


We can also use the * for a duplication method similar to strings:
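The indexing, slicing, concatenation, and duplication operations described above can be sketched as follows. The list contents are illustrative.

```python
my_list = ["one", "two", "three", 4, 5]

first = my_list[0]    # indexing: 'one'
tail = my_list[1:]    # slicing: ['two', 'three', 4, 5]
head = my_list[:3]    # slicing: ['one', 'two', 'three']

# + builds a new list; my_list itself is unchanged.
longer = my_list + ["new item"]
print(len(my_list))   # 5

# Reassign to make the change permanent.
my_list = my_list + ["add new item permanently"]
print(len(my_list))   # 6

# * repeats the list, again producing a new list.
doubled = [1, 2] * 2
print(doubled)        # [1, 2, 1, 2]
```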

Basic list methods


Lists are flexible as:
• They have no fixed size (meaning we do not have to specify how big
a list will be).
• They have no fixed type constraint.
Some of the basic list methods used in python are:
• append
• pop
• sort
• reverse
Use the append method to permanently add an item to the end of a list.

Use pop to "pop off" an item from the list. By default, pop takes off the last
index, but we can also specify which index to pop off.

We can also use the sort and reverse methods to reorder a list in place.
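A minimal sketch of the four methods above. The list contents are illustrative. All four methods modify the list in place.

```python
l = [1, 2, 3]

l.append("append me!")  # l is now [1, 2, 3, 'append me!']
popped = l.pop()        # removes and returns the last item
first = l.pop(0)        # removes and returns the item at index 0
print(l)                # [2, 3]

letters = ["a", "e", "x", "b", "c"]
letters.reverse()       # reverses the order in place
print(letters)          # ['c', 'b', 'x', 'e', 'a']
letters.sort()          # sorts in place (alphabetical order for strings)
print(letters)          # ['a', 'b', 'c', 'e', 'x']
```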


Nesting lists
A great feature of Python data structures is that they support nesting. This
means we can have data structures within data structures. For example, a
list inside a list.

We can again use indexing to grab elements, but now there are two levels
for the index: first the lists inside the matrix object, and then the items
inside the selected list.
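A sketch of a nested list and two-level indexing. The name matrix follows the text; the values are illustrative.

```python
lst_1 = [1, 2, 3]
lst_2 = [4, 5, 6]
lst_3 = [7, 8, 9]

# A list of lists forms a simple matrix.
matrix = [lst_1, lst_2, lst_3]

first_row = matrix[0]      # first level: a whole inner list
first_item = matrix[0][0]  # second level: an item inside that list
print(first_row)           # [1, 2, 3]
print(first_item)          # 1
```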


List comprehensions
List comprehensions provide a concise way to create lists. A list
comprehension consists of brackets containing an expression followed by a
‘for’ clause, then zero or more ‘for’ or ‘if’ clauses. The expression can be
anything, meaning you can put all kinds of objects in lists. The result is
always a new list, produced by evaluating the expression in the context of
the ‘for’ and ‘if’ clauses that follow it.
The list comprehension starts with a '[' and ends with a ']' to help you
remember that the result is going to be a list.
The basic syntax is:
[ expression for item in list if conditional ]
This is equivalent to:
result = []
for item in list:
    if conditional:
        result.append(expression)
A list comprehension is used here to grab the first element of every row in
the matrix object.
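A sketch of two list comprehensions. The matrix values are illustrative; the second comprehension adds an ‘if’ clause to filter items.

```python
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Grab the first element of every row in the matrix.
first_col = [row[0] for row in matrix]
print(first_col)  # [1, 4, 7]

# A comprehension with a condition: keep only the even numbers.
evens = [n for n in range(1, 11) if n % 2 == 0]
print(evens)      # [2, 4, 6, 8, 10]
```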


Lists should be used for the following cases:
• Need for an ordered sequence of items, when order is important
• Need for a mutable ordered list of elements. For example, when you
want to append new phone numbers to a list: [number1, number2,
...]
• Need for adding or removing items
• Need for using non-hashable items
• Need for duplicates

Dictionaries
A dictionary is a collection that is unordered, changeable, and indexed. In
Python, dictionaries are written with curly brackets, and they have keys
and values.
A dictionary can be constructed in the following manner:

It is important to note that dictionaries are very flexible in the data types
they can hold.
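A minimal sketch of dictionary construction and lookup by key. The keys and values are illustrative.

```python
# Keys and values are written as key: value pairs inside curly brackets.
my_dict = {"key1": "value1", "key2": "value2"}
print(my_dict["key2"])     # value2

# Values can be of different types, including lists.
flexible = {"k1": 123, "k2": [12, 23, 33], "k3": ["item0", "item1", "item2"]}
print(flexible["k3"][0])   # item0
```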


We can also create keys by assignment. For instance, if we started off with
an empty dictionary, we could continually add to it:

A few dictionary methods include:


• d.keys()
• d.values()
• d.items()
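Creating keys by assignment and the three methods listed above can be sketched as follows. The keys and values are illustrative. On Python 3 the methods return view objects, so the sketch wraps them in list() for display.

```python
d = {}                 # start with an empty dictionary

d["animal"] = "Dog"    # assigning to a new key creates it
d["answer"] = 42

keys = list(d.keys())      # all keys
values = list(d.values())  # all values
items = list(d.items())    # (key, value) tuples
print(keys)
print(values)
print(items)
```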


A dictionary maps a set of objects (keys) to another set of objects (values).

A Python dictionary is a mapping of unique keys to values. Dictionaries are
mutable, which means they can be changed. The values that the keys
point to can be any Python value. In older Python versions, dictionaries
are unordered, so the order in which the keys are added does not
necessarily reflect the order in which they are reported back (from
Python 3.7 onward, dictionaries preserve insertion order).
Use a dictionary when you have a set of unique keys that map to values.

Using the ‘for’ statement


A ‘for’ loop acts as an iterator in Python. It goes through items that are in a
sequence or any other iterable item. Some of the objects that can be
iterated include strings, lists, tuples, and even built-in iterables for
dictionaries, such as keys or values.
for item in object:
    statements to do stuff

Here are some examples:
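A minimal sketch of iterating through a list. The list contents and names are illustrative.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

visited = []
for num in numbers:
    print(num)           # prints each item in turn
    visited.append(num)
```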


To print only the even numbers from that list:

An ‘else’ statement can also be added:
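The sketch below assumes the list holds the integers 1 to 10 and that the ‘else’ belongs to an ‘if’ inside the loop body; both are assumptions, since the original example is not reproduced in the text.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

evens = []
odd_count = 0
for num in numbers:
    if num % 2 == 0:       # remainder 0 means the number is even
        print(num)
        evens.append(num)
    else:
        print("Odd number")
        odd_count += 1
```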


Another common idea during a ‘for’ loop is keeping some sort of running
tally during multiple loops.

Loops can also be used with strings. Strings are a sequence, so when we
iterate through them, we will be accessing each item in that string.

Here is how a ‘for’ loop can be used with a tuple:
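The running tally, the string loop, and the tuple loop described above can be sketched as follows. All values are illustrative.

```python
# Running tally: accumulate a sum across iterations.
list_sum = 0
for num in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]:
    list_sum += num
print(list_sum)          # 55

# Iterating through a string visits one character at a time.
letters = []
for letter in "abc":
    letters.append(letter)
print(letters)           # ['a', 'b', 'c']

# Iterating through a tuple works the same way as for a list.
tup = (1, 2, 3, 4, 5)
doubled = []
for t in tup:
    doubled.append(t * 2)
print(doubled)           # [2, 4, 6, 8, 10]
```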


Tuples
Tuples are very similar to lists. However, unlike lists, they are immutable,
meaning they cannot be changed. You would use tuples to represent things
that should not be changed, such as days of the week, or dates on a
calendar.
A tuple is constructed using () with elements separated by commas.
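A minimal sketch of tuple construction and immutability. The values are illustrative.

```python
t = (1, 2, 3)
print(len(t))   # 3
print(t[0])     # indexing works as for lists: 1

# Tuples are immutable, so item assignment raises a TypeError.
try:
    t[0] = "change"
except TypeError as err:
    error_text = str(err)
    print("Cannot modify a tuple:", error_text)
```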

Tuples have a special quality when it comes to ‘for’ loops. If you are
iterating through a sequence that contains tuples, each item is itself a
tuple, and it can be unpacked directly in the loop header. This is an
example of tuple unpacking: during the ‘for’ loop, we unpack the tuple
inside the sequence so that we can access the individual items inside
that tuple.

Here are some examples for iterating through dictionaries:
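Tuple unpacking and dictionary iteration can be sketched as follows. The data is illustrative. Iterating over d.items() yields (key, value) tuples, which are unpacked in the loop header in the same way.

```python
# Tuple unpacking inside a for loop.
pairs = [(2, 4), (6, 8), (10, 12)]
firsts = []
for (t1, t2) in pairs:
    firsts.append(t1)
print(firsts)            # [2, 6, 10]

d = {"k1": 1, "k2": 2, "k3": 3}

# Iterating over a dictionary directly yields its keys.
keys = []
for key in d:
    keys.append(key)

# items() yields (key, value) pairs that can be unpacked.
total = 0
for key, value in d.items():
    total += value
print(total)             # 6
```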


Opening, reading, and writing a text file


In Python, there is no need to import an external library to read and write
files. Python provides the built-in function open() for creating, writing,
and reading files.
• Use the function open("file name","w+") to create a file. The ‘w’
mode creates the file if it does not exist (and truncates it if it does);
the + additionally allows reading from the file.
• Use the command open("Filename", "a") to append data to an
existing file.
• Use the ‘read’ function to read the ENTIRE content of a file as a
single string.
• Use the ‘readlines’ function to read the content of the file line by
line into a list.


Mode   Description
‘r’    This is the default mode. It opens the file for reading.
‘w’    This mode opens the file for writing.
       If the file does not exist, it creates a new file.
       If the file exists, it truncates the file.
‘x’    This creates a new file. If the file already exists, the
       operation fails.
‘a’    This opens a file in append mode.
       If the file does not exist, it creates a new file.
‘t’    This is the default mode. It opens the file in text mode.
‘b’    This opens the file in binary mode.
‘+’    This opens a file for reading and writing (updating).

Let us look at an example.


def main():
    f = open("newfile.txt", "w+")
    # f = open("newfile.txt", "a+")
    for i in range(10):
        f.write("This is line %d\r\n" % (i + 1))
    f.close()

    # Open the file back and read the contents
    # f = open("newfile.txt", "r")
    # if f.mode == 'r':
    #     contents = f.read()
    #     print(contents)
    # or, readlines reads the individual lines into a list
    # fl = f.readlines()
    # for x in fl:
    #     print(x)

if __name__ == "__main__":
    main()

Here, we declared the variable f to open a file named newfile.txt. ‘open’
takes two arguments: the file that we want to open and a string that
represents the kind of permission or operation we want on the file.
We used the letter w in our argument, which indicates write (creating the
file if it does not exist), and the plus sign, which additionally allows
reading. The other available options besides ‘w’ are ‘r’ for read and ‘a’
for append.
We have used a ‘for’ loop that runs over a range of 10 numbers. The write
function is used to enter data into the file. The text that we write on each
iteration is "This is line", followed by %d (a placeholder that displays an
integer). So basically, we have put in the number of the line that we are
writing, followed by a carriage return and a new line character.
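A hedged sketch of the write, append, and read operations described above, using the ‘with’ statement so that each file is closed automatically. The use of ‘with’ and of a temporary directory (to avoid touching real files) are choices made for this sketch, not something the module prescribes.

```python
import os
import tempfile

# Assumption: a temporary directory stands in for the working directory.
path = os.path.join(tempfile.mkdtemp(), "newfile.txt")

# 'w' creates the file (or truncates an existing one) for writing.
with open(path, "w") as f:
    for i in range(3):
        f.write("This is line %d\n" % (i + 1))

# 'a' appends to the end of the existing file.
with open(path, "a") as f:
    f.write("This is an appended line\n")

# read() returns the entire content as one string.
with open(path, "r") as f:
    contents = f.read()

# readlines() returns a list with one string per line.
with open(path, "r") as f:
    lines = f.readlines()

print(contents)
print(len(lines))  # 4
```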


Test Your Knowledge

1. Use ‘for’, .split(), and ‘if’ to create a statement that will print out
words that start with ‘s’:
st = 'Print only the words that start with s in this sentence'

2. Use a list comprehension to create a list of all numbers between 1
and 50 that are divisible by 3.

3. Go through the string below and if the length of a word is even, print
"even!"
st = 'Print every word in this sentence that has an even number of
letters'

Guessing Game Challenge


Let us use while loops to create a guessing game.
Write a program that picks a random integer from 1 to 100, and has
players guess the number. The rules are:
1. If a player's guess is less than 1 or greater than 100, say "OUT OF
BOUNDS".
2. On a player's first turn, if their guess is
a. Within 10 of the number, return "WARM!".
b. Further than 10 away from the number, return "COLD!".
3. On all subsequent turns, if a guess is
a. Closer to the number than the previous guess, return
"WARMER!".
b. Farther from the number than the previous guess, return
"COLDER!".
4. When the player's guess equals the number, tell them that they have
guessed correctly and how many guesses it took.


Python Libraries

Using Pandas–the Python data analysis library


Pandas is used for data manipulation, analysis, and cleaning. It is well
suited for different kinds of data, such as:
• Tabular data with heterogeneously typed columns
• Ordered and unordered time series data
• Arbitrary matrix data with row and column labels
• Unlabeled data
• Any other form of observational or statistical data sets

How to install Pandas


To install Pandas:
1. First, go to your command line or terminal.
2. Next, type pip install pandas.
(If you have Anaconda installed on your system, just type conda
install pandas.)
3. Once the installation is completed, go to your IDE (Jupyter,
PyCharm, and so on) and import it by typing import pandas as pd.

Series and Data Frames


Pandas has two main data structures, namely Series and Data Frames.
A Series is a one-dimensional, labeled, homogeneous array. It is
immutable in size, but its data values are mutable. A Data Frame is a
general two-dimensional labeled structure that is mutable in size. It has a
tabular structure with potentially heterogeneously typed columns and
mutable data.
The following series is a collection of integers 10, 23, 56, …

The table represents the data of a sales team of an organization, with their
overall performance rating. The data is represented in rows and columns.
Each column represents an attribute and each row represents a person.

Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78

The data types of the four columns can be found in the following table.

Column Type
Name String
Age Integer
Gender String
Rating Float
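As a sketch, the sales-team table above can be built as a DataFrame from a dictionary of columns, and the column types inspected with dtypes. The variable names are illustrative, and the exact dtype reported for the string columns depends on the pandas version.

```python
import pandas as pd

sales_team = pd.DataFrame({
    "Name": ["Steve", "Lia", "Vin", "Katie"],
    "Age": [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

print(sales_team)
print(sales_team.dtypes)  # Age is an integer column, Rating a float column
```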

pandas.Series
Series is a one-dimensional labeled array capable of holding data of any
type (integer, string, float, python objects, and so on). The axis labels are
collectively called index.
A pandas series can be created using the following constructor.
pandas.Series(data, index, dtype, copy)


Parameter   Description
data        data takes various forms such as ndarray, list,
            constants
index       Index values must be unique and hashable, same
            length as data. Default np.arange(n), if no index
            is passed
dtype       dtype is for data type. If none, data type will be
            inferred
copy        Copy data. Default False
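A minimal sketch of the data, index, and dtype parameters described in the table. The labels "a", "b", and "c" are illustrative.

```python
import pandas as pd

# Explicit index labels and an explicit data type.
s = pd.Series([10, 23, 56], index=["a", "b", "c"], dtype="float64")
print(s["b"])    # access by label: 23.0

# Without an index, the labels default to 0, 1, 2, ...
default = pd.Series([10, 23, 56])
print(list(default.index))  # [0, 1, 2]
```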

Creating an empty series

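#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series()
print(s)

The output is (with the pandas version this module was written for;
newer releases report dtype: object for an empty series):

Series([], dtype: float64)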

Creating a series from dictionary


#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
The output is:

a    0.0
b    1.0
c    2.0
dtype: float64

pandas.DataFrame
A data frame is a two-dimensional data structure, that is, data is aligned in
a tabular fashion in rows and columns. A Pandas data frame can be
created using the following constructor.
pandas.DataFrame(data, index, columns, dtype, copy)

Parameter   Description
data        data takes various forms such as ndarray, series,
            map, lists, dict, constants, and also another
            DataFrame.
index       Row labels to use for the resulting frame. Optional;
            defaults to np.arange(n) if no index is passed.
columns     Column labels to use for the resulting frame. Optional;
            defaults to np.arange(n) if no column labels are
            passed.
dtype       Data type of each column
copy        Whether to copy the input data. The default is False.
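A sketch of the index and columns parameters described in the table, using a list of lists as data. The names, ages, and row labels are hypothetical.

```python
import pandas as pd

# Hypothetical rows: each inner list is one row of the frame.
data = [["Alex", 10], ["Bob", 12], ["Clarke", 13]]

df = pd.DataFrame(data, index=["r1", "r2", "r3"], columns=["Name", "Age"])
print(df)

print(df.loc["r2", "Age"])  # look up by row label and column label: 12
```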

Creating an empty Data Frame


#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)

The output is:

Empty DataFrame
Columns: []
Index: []

Creating a Data Frame from list


import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

The output is:

   0
0  1
1  2
2  3
3  4
4  5


Read a tabular data file in pandas

read_table(): reads a dataset from a delimited text file

head(): examines the first 5 rows (by default)
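A minimal sketch of read_table() and head(). To stay self-contained, it reads tab-separated text from an in-memory buffer instead of a file on disk; the column names n and square and the generated values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical tab-separated data with a header row and ten data rows.
raw = "n\tsquare\n" + "".join("%d\t%d\n" % (i, i * i) for i in range(10))

# read_table() expects tab-separated values by default.
df = pd.read_table(io.StringIO(raw))

first_rows = df.head()   # first 5 rows by default
print(first_rows)
print(df.head(3))        # an explicit number of rows can be requested
```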

Basic Series functionalities

S. No.   Attribute or Method   Description
1.       axes                  Returns a list of the row axis labels
2.       dtype                 Returns the dtype of the object
3.       empty                 Returns True if series is empty
4.       ndim                  Returns the number of dimensions of
                               the underlying data, by definition 1
5.       size                  Returns the number of elements in the
                               underlying data
6.       values                Returns the Series as ndarray
7.       head()                Returns the first n rows
8.       tail()                Returns the last n rows


Basic DataFrame functionalities

S. No.   Attribute or Method   Description
1.       T                     Transposes rows and columns
2.       axes                  Returns a list with the row axis labels
                               and column axis labels as the only
                               members
3.       dtypes                Returns the dtypes in this object
4.       empty                 True if the DataFrame is entirely empty
                               (no items), that is, if any of the axes
                               has length 0
5.       ndim                  Number of axes / array dimensions
6.       shape                 Returns a tuple representing the
                               dimensionality of the DataFrame
7.       size                  Number of elements in the DataFrame
8.       values                NumPy representation of the DataFrame
9.       head()                Returns the first n rows
10.      tail()                Returns the last n rows
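Several of the attributes and methods in the table can be sketched on a small frame. The frame reuses two columns of the sales-team example; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [32, 28, 45, 38],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

print(df.shape)      # (4, 2): 4 rows, 2 columns
print(df.ndim)       # 2
print(df.size)       # 8 elements in total
print(df.empty)      # False
print(df.T.shape)    # (2, 4): rows and columns swapped
print(len(df.axes))  # 2: row axis labels and column axis labels

first_two = df.head(2)   # first n rows
last_two = df.tail(2)    # last n rows
```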


The following methods compute descriptive statistics:

S. No.   Attribute or Method   Description
1.       count()               Number of non-null observations
2.       sum()                 Sum of values
3.       mean()                Mean of values
4.       median()              Median of values
5.       mode()                Mode of values
6.       std()                 Standard deviation of the values
7.       min()                 Minimum value
8.       max()                 Maximum value
9.       abs()                 Absolute value
10.      prod()                Product of values
11.      cumsum()              Cumulative Sum
12.      cumprod()             Cumulative Product
13.      describe()            Summary of statistics pertaining to the
                               DataFrame columns
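A short sketch of a few of these methods, using the Age and Rating columns of the sales-team example. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [32, 28, 45, 38],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})

age_sum = df["Age"].sum()        # 143
age_mean = df["Age"].mean()      # 35.75
age_max = df["Age"].max()        # 45
running = df["Age"].cumsum()     # 32, 60, 105, 143

summary = df.describe()          # count, mean, std, min, quartiles, max
print(summary)
```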


Handling missing values in pandas


"NaN" ("Not a Number") is not a string, but a special value: numpy.nan. It
indicates a missing value. read_csv detects missing values (by default)
when reading the file, and replaces them with this special value.
How to handle missing values depends on the dataset as well as the
nature of your analysis. Here are some options:
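Two common options, detecting and dropping missing values, can be sketched as follows. The frame, its column names, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "city": ["Austin", np.nan, "Boston"],
    "sales": [10.0, 20.0, np.nan],
})

missing_mask = df.isnull()              # True where a value is missing
missing_per_column = df.isnull().sum()  # number of missing values per column
complete_rows = df.dropna()             # drop rows with any missing value

print(missing_per_column)
print(complete_rows)
```

Dropping rows is the simplest option, but it discards the non-missing values in those rows as well.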


In the following example, all NaN(missing values) are replaced with the
string ‘VARIOUS’ in order to remove the NaN/ missing values from the
data. Replacing the missing values is a business decision and it depends
on the data you are working on.
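A sketch of this replacement with fillna(). The column name city and its values are hypothetical; only the replacement string ‘VARIOUS’ comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
df = pd.DataFrame({"city": ["Austin", np.nan, "Boston"]})

# Replace every missing value in the column with the string 'VARIOUS'.
df["city"] = df["city"].fillna("VARIOUS")

print(df)
print(df["city"].isnull().sum())  # 0: no missing values remain
```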


Grouping, aggregating, and applying


Groupby essentially splits the data into different groups depending on a
variable of your choice.

The following three steps make clear what the groupby accomplishes:

• The split step involves breaking up and grouping a DataFrame
depending on the value of the specified key.
• The apply step involves computing some function, usually an
aggregate, transformation, or filtering, within the individual groups.
• The combine step merges the results of these operations into an
output array.
For example,
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

df.groupby('key').sum()

The sum() method is just one possibility here; you can apply virtually any
common Pandas or NumPy aggregation function, as well as virtually any
valid DataFrame operation.
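The split, apply, and combine steps can be traced on the small frame from the example above. The variable names summed, maxima, and means are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data": range(6)}, columns=["key", "data"])

# Split by 'key', apply sum() within each group, combine into one frame.
summed = df.groupby("key").sum()
print(summed)
# A: 0 + 3 = 3, B: 1 + 4 = 5, C: 2 + 5 = 7

# Other aggregations are applied in the same way.
maxima = df.groupby("key")["data"].max()
means = df.groupby("key")["data"].mean()
```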


GroupBy object
The GroupBy object is a very flexible abstraction. In many ways, you can
simply treat it as if it is a collection of DataFrames, and it does the difficult
things under the hood.

Notice that what is returned is not a set of DataFrames, but a
DataFrameGroupBy object.
This object is where the magic is: it can be seen as a special view of the
DataFrame, which is poised to dig into the groups but does no actual
computation until the aggregation is applied. This "lazy evaluation"
approach means that common aggregates can be implemented very
efficiently in a way that is almost transparent to the user.
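A sketch that inspects the GroupBy object for the small key/data frame used earlier. Iterating over the object yields (name, group) pairs, which is one way to treat it as a collection of DataFrames.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data": range(6)})

grouped = df.groupby("key")      # no computation has happened yet
print(type(grouped).__name__)    # DataFrameGroupBy

# Each group is an ordinary DataFrame holding the rows for one key.
group_sizes = {name: len(group) for name, group in grouped}
print(group_sizes)

# The computation only runs when an aggregation is applied.
totals = grouped["data"].sum()
```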

Aggregation
An aggregation function returns a single aggregated value for each group.
Once the GroupBy object is created, several aggregation operations can be
performed on the grouped data.
Here is an example of the aggregate or equivalent agg method:


Year
2014 795.25
2015 769.50
2016 725.00
2017 739.00
Name: Points, dtype: float64
Below is an example of the size() function:

        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
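The code that produced the two outputs above is not reproduced in the text. The sketch below uses a hypothetical dataset, chosen so that its yearly means and team counts agree with the values shown; it should be read as one possible reconstruction, not as the original example. The per-column table is produced here with count(), which matches size() when no values are missing.

```python
import pandas as pd

# Hypothetical data, chosen to be consistent with the outputs shown above.
ipl_data = {
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings",
             "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
    "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    "Year": [2014, 2015, 2014, 2015, 2014, 2015,
             2016, 2017, 2016, 2014, 2015, 2017],
    "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)

# Mean of Points for each Year, using agg.
mean_points = df.groupby("Year")["Points"].agg("mean")
print(mean_points)

# Number of rows in each Team group.
team_sizes = df.groupby("Team").size()     # one number per group
team_counts = df.groupby("Team").count()   # one count per column
print(team_counts)
```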

Note: We can also apply multiple aggregation functions at once and
generate a data frame as output.
The table below describes some common aggregation functions.

Function     Description
mean()       Compute mean of groups
sum()        Compute sum of group values
size()       Compute group sizes
count()      Compute count of group
std()        Standard deviation of groups
var()        Compute variance of groups
sem()        Standard error of the mean of groups
describe()   Generate descriptive statistics
first()      Compute first of group values
last()       Compute last of group values
nth()        Take nth value, or a subset if n is a list
min()        Compute min of group values

Aggregation can also be used to apply multiple functions at once.
With a grouped Series, you can pass a list of functions to aggregate with
and get a DataFrame as the output (older pandas versions also accepted a
dict of functions here).

The resulting aggregations are named for the functions themselves. If you
need to rename them, you can add a chained rename operation for a Series
like this:
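A sketch of multiple aggregations followed by a chained rename. The Year and Points values are hypothetical, and the new column names total and average are arbitrary.

```python
import pandas as pd

# Hypothetical data: points scored per year.
df = pd.DataFrame({
    "Year": [2014, 2014, 2015, 2015, 2016],
    "Points": [876, 863, 789, 673, 756],
})

grouped = df.groupby("Year")

# A list of function names yields one output column per function.
stats = grouped["Points"].agg(["sum", "mean"])
print(stats)

# The columns are named after the functions; rename them in a chained call.
renamed = grouped["Points"].agg(["sum", "mean"]).rename(
    columns={"sum": "total", "mean": "average"})
print(renamed)
```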


The apply() method


The apply() method lets you apply an arbitrary function to the group
results. The function should take a DataFrame, and return either a Pandas
object (for example, DataFrame, Series) or a scalar; the combine operation
will be tailored to the type of output returned.

GroupBy.apply(func, *args, **kwargs)

Parameters:
func: function
    A callable that takes a dataframe as its first argument,
    and returns a dataframe, a series, or a scalar. In
    addition, the callable may take positional and keyword
    arguments.
args, kwargs: tuple and dict
    Optional positional and keyword arguments to pass to
    func
Returns:
applied: Series or DataFrame

For example, here is an apply() that normalizes the first column by the
sum of the second:
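The original example is not reproduced in the text, so the following is a sketch under assumptions: the column names data1 and data2, the values of data2, and the function name norm_by_data2 are all hypothetical. The data columns are selected explicitly before apply() so that the function receives only those columns in every pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data1": range(6),
    "data2": [5, 0, 3, 3, 7, 9],   # hypothetical values
})

def norm_by_data2(group):
    # Work on a copy so the caller's data is not modified in place.
    out = group.copy()
    # Divide the first column by the group's sum of the second column.
    out["data1"] = out["data1"] / out["data2"].sum()
    return out

result = df.groupby("key")[["data1", "data2"]].apply(norm_by_data2)
print(result)
```

Within group A, for example, data2 sums to 8, so the data1 values 0 and 3 become 0.0 and 0.375.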


apply() within a GroupBy is quite flexible. The only criterion is that the
function takes a DataFrame and returns a Pandas object or scalar. What
you do in the middle is up to you.

Merging and joining


merge()
Pandas provides a single function, merge(), as the entry point for all
standard database join operations between DataFrame objects.
pd.merge(left, right, how='inner', on=None, left_on=None,
         right_on=None,
         left_index=False, right_index=False, sort=False)

Following is a brief description of the arguments of the merge() function:

• left − A DataFrame object.
• right − Another DataFrame object.
• on − Columns (names) to join on. Must be found in both the left and
right DataFrame objects.
• left_on − Columns from the left DataFrame to use as keys. Can either
be column names or arrays with length equal to the length of the
DataFrame.
• right_on − Columns from the right DataFrame to use as keys. Can
either be column names or arrays with length equal to the length of
the DataFrame.
• left_index − If True, use the index (row labels) from the left
DataFrame as its join key(s). In case of a DataFrame with a
MultiIndex (hierarchical), the number of levels must match the
number of join keys from the right DataFrame.
• right_index − Same usage as left_index for the right DataFrame.
• how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each
option is described later.
• sort − Sort the result DataFrame by the join keys in lexicographical
order. Defaults to False in current pandas releases; leaving it off
avoids the cost of sorting and improves the performance
substantially in many cases.
Let us now create two different DataFrames, left and right, and perform
the merging operations on them.

     Name  id subject_id
0    Alex   1       sub1
1     Amy   2       sub2
2   Allen   3       sub4
3   Alice   4       sub6
4  Ayoung   5       sub5

    Name  id subject_id
0  Billy   1       sub2
1  Brian   2       sub4
2   Bran   3       sub3
3  Bryce   4       sub6
4  Betty   5       sub5
Below is an example of merging two DataFrames on multiple keys:

We are merging the two dataframes “left” and “right” using the keys “id”
and “subject_id”. Only the rows in which both “id” and “subject_id” have
the same values in the two dataframes are merged, giving the output as:
   Name_x  id subject_id Name_y
0   Alice   4       sub6  Bryce
1  Ayoung   5       sub5  Betty
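The two frames and the multi-key merge can be sketched as follows. The data is taken from the tables above; the variable name merged is illustrative.

```python
import pandas as pd

left = pd.DataFrame({
    "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "id": [1, 2, 3, 4, 5],
    "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"],
})
right = pd.DataFrame({
    "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "id": [1, 2, 3, 4, 5],
    "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"],
})

# Rows are kept only where both 'id' and 'subject_id' match.
merged = pd.merge(left, right, on=["id", "subject_id"])
print(merged)
```

The overlapping column Name receives the suffixes _x (from left) and _y (from right) so that both versions can appear in the result.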

Merge using ‘how’ argument


The ‘how’ argument to merge specifies how to determine which keys are
to be included in the resulting table. If a key combination does not appear
in one of the left or right tables, the corresponding values in the joined
table will be NA.
Here is a summary of the ‘how’ options and their SQL equivalent names.

Merge Method   SQL Equivalent     Description
left           LEFT OUTER JOIN    Use keys from left object
right          RIGHT OUTER JOIN   Use keys from right object
outer          FULL OUTER JOIN    Use union of keys
inner          INNER JOIN         Use intersection of keys

Below is an example of Left Join:

We are merging the two dataframes “left” and “right” on “subject_id”,
using the keys from the left dataframe. The values of “subject_id” in the
dataframe “left” therefore form the base of the merge, giving the output
as:


   Name_x  id_x subject_id Name_y  id_y
0    Alex     1       sub1    NaN   NaN
1     Amy     2       sub2  Billy   1.0
2   Allen     3       sub4  Brian   2.0
3   Alice     4       sub6  Bryce   4.0
4  Ayoung     5       sub5  Betty   5.0

Here is an example of Outer Join:

   Name_x  id_x subject_id Name_y  id_y
0    Alex   1.0       sub1    NaN   NaN
1     Amy   2.0       sub2  Billy   1.0
2   Allen   3.0       sub4  Brian   2.0
3   Alice   4.0       sub6  Bryce   4.0
4  Ayoung   5.0       sub5  Betty   5.0
5     NaN   NaN       sub3   Bran   3.0

Here is an example of Inner Join, which keeps only the keys that appear
in both dataframes. Note that the related DataFrame.join method performs
the joining on the index and honors the object on which it is called. So,
a.join(b) is not equal to b.join(a).


   Name_x  id_x subject_id Name_y  id_y
0     Amy     2       sub2  Billy     1
1   Allen     3       sub4  Brian     2
2   Alice     4       sub6  Bryce     4
3  Ayoung     5       sub5  Betty     5
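The left, outer, and inner joins shown above can be sketched with the same two frames, joined on subject_id. The result variable names are illustrative, and the row order of the outer join may differ between pandas versions.

```python
import pandas as pd

left = pd.DataFrame({
    "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "id": [1, 2, 3, 4, 5],
    "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"],
})
right = pd.DataFrame({
    "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "id": [1, 2, 3, 4, 5],
    "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"],
})

# Left join: every key from 'left'; missing matches become NaN.
left_join = pd.merge(left, right, on="subject_id", how="left")

# Outer join: the union of the keys from both frames.
outer_join = pd.merge(left, right, on="subject_id", how="outer")

# Inner join: only the keys present in both frames.
inner_join = pd.merge(left, right, on="subject_id", how="inner")

print(left_join)
print(outer_join)
print(inner_join)
```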


Test Your Knowledge

1. Define Series and DataFrames.

2. Provide the description for the following functions.

a. sem()

b. var()

c. size()

3. What is GroupBy? Describe the various steps involved in it.


Error Handling

Dealing with syntax errors


Syntax errors, also known as parsing errors, are perhaps the most
common kind of complaint we get while we are still learning Python.
Usually the easiest to spot, syntax errors occur when you make a typo.
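A syntax error normally stops a script before any of it runs, so the sketch below passes a faulty line to compile() as a string in order to show the error without halting the example. The faulty line (a ‘while’ statement with its colon missing) and the file name placeholder are illustrative, and the code assumes Python 3.

```python
# The colon after 'while True' is missing, which is a syntax error.
source = "while True print('Hello world')"

caught = None
try:
    compile(source, "<example>", "exec")
except SyntaxError as err:
    caught = err
    print("SyntaxError:", err.msg)
    print("line number:", err.lineno)
```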

Explanation:
• The parser repeats the offending line and displays a little ‘arrow’
pointing at the earliest point in the line where the error was
detected.
• The error is caused by (or at least detected at) the token preceding
the arrow. In the example, the error is detected at the keyword print,
since a colon (':') is missing before it.
• File name and line number are printed so you know where to look in
case the input came from a script.

Exceptions
Even if a statement or expression is syntactically correct, it may cause an
error when an attempt is made to execute it. Errors detected during
execution are called exceptions and are not unconditionally fatal.
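Three syntactically valid expressions that raise exceptions at run time can be sketched as follows. Each one is evaluated inside a ‘try’ block so that the example keeps running; the expressions and the undefined name spam are illustrative.

```python
examples = [
    "10 * (1 / 0)",   # division by zero
    "4 + spam * 3",   # 'spam' has not been defined
    "'2' + 2",        # a string and an integer cannot be added
]

names = []
for expression in examples:
    try:
        eval(expression)
    except Exception as err:
        # The exception type name is what appears in the error message.
        names.append(type(err).__name__)
        print(type(err).__name__ + ":", err)
```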


Explanation:
• The last line of the error message indicates what happened.
Exceptions come in different types, and the type is printed as part of
the message. The types in the example are:
o ZeroDivisionError
o NameError and
o TypeError
• The string printed as the exception type is the name of the built-in
exception that occurred. This is true for all built-in exceptions but
need not be true for user-defined exceptions (although it is a useful
convention).
• Standard exception names are built-in identifiers (not reserved
keywords).
• The rest of the line provides detail based on the type of exception
and what caused it.
• The preceding part of the error message shows the context where
the exception happened, in the form of a stack traceback. In general,
it contains a stack traceback listing source lines; however, it will not
display lines read from standard input.


Examples of some of the most common exceptions:

Zero Division Error

This type of error occurs when syntactically correct Python code attempts
to divide by zero. The last line of the message indicates what type of
exception you ran into.

Type Error

The integer data type is not iterable and trying to iterate over it will
produce a type error.
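A minimal sketch of this case. The integer 5 is an arbitrary example value, and the ‘try’ block is used only so that the message can be printed.

```python
try:
    for digit in 5:      # an integer is not iterable
        print(digit)
except TypeError as err:
    error_text = str(err)
    print("TypeError:", error_text)
```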

Name Error


A Name Error occurs when we try to refer to a variable that has not
been defined.
The below table lists the standard exceptions available in Python.

Exception Name       Description
Exception            Base class for all exceptions
StopIteration        Raised when the next() method of an iterator
                     does not point to any object
SystemExit           Raised by the sys.exit() function.
StandardError        Base class for all built-in exceptions except
                     StopIteration and SystemExit (Python 2 only;
                     removed in Python 3)
ArithmeticError      Base class for all errors that occur for numeric
                     calculation
OverflowError        Raised when a calculation exceeds maximum limit
                     for a numeric type
FloatingPointError   Raised when a floating point calculation fails
ZeroDivisionError    Raised when division or modulo by zero takes
                     place for all numeric types
AssertionError       Raised in case of failure of the assert statement
AttributeError       Raised in case of failure of attribute reference or
                     assignment
RuntimeError         Raised when a generated error does not fall into
                     any category
ImportError          Raised when an import statement fails
KeyboardInterrupt    Raised when the user interrupts program
                     execution, usually by pressing Ctrl+c
LookupError          Base class for all lookup errors
IndexError           Raised when an index is not found in a sequence
KeyError             Raised when the specified key is not found in the
                     dictionary
NameError            Raised when an identifier is not found in the local
                     or global namespace
OSError              Raised for operating system-related errors

Handling exceptions with try/except


The ‘try’ statement works as follows:
• First, the try clause (the statement(s) between the try and except
keywords) is executed.
• If no exception occurs, the ‘except’ clause is skipped and execution
of the try statement is finished.
• If an exception occurs during execution of the try clause, the rest of
the clause is skipped. Then, if its type matches the exception named
after the ‘except’ keyword, the ‘except’ clause is executed and then,
execution continues after the try statement.
• If an exception occurs that does not match the exception named in
the ‘except’ clause, it is passed on to outer try statements. If no
handler is found, it is an unhandled exception and execution stops
with a message as shown above.
A ‘try’ statement may have more than one except clause, to specify
handlers for different exceptions. At most one handler will be executed.
Handlers only handle exceptions that occur in the corresponding try
clause, not in other handlers of the same try statement.
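
The rules above can be illustrated with a minimal sketch. The function name safe_divide and its messages are assumptions made for this example only; the point is that at most one of the two handlers runs for any given call.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None when the division cannot be done."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Runs only when the denominator is zero.
        print("Cannot divide by zero.")
    except TypeError:
        # Runs only when an operand is not a number.
        print("Both arguments must be numbers.")
    return None


print(safe_divide(10, 2))     # no exception: both except clauses are skipped
print(safe_divide(10, 0))     # handled by the first except clause
print(safe_divide(10, "a"))   # handled by the second except clause
```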

The ‘try … except’ statement has an optional else clause, which, when
present, must follow all except clauses. It is useful for code that must be
executed if the ‘try’ clause does not raise an exception. The use of the
‘else’ clause is better than adding additional code to the ‘try’ clause
because it avoids accidentally catching an exception that was not raised
by the code being protected by the ‘try … except’ statement.
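
A small sketch of the else clause, assuming a hypothetical helper named parse_age: only the int() conversion is protected by the try clause, and the follow-up logic sits in else so that any error it raises is not swallowed by the except clause.

```python
def parse_age(text):
    try:
        age = int(text)              # only this line is protected
    except ValueError:
        return "not a number"
    else:
        # Runs only when int() succeeded. Errors raised here are NOT
        # caught by the except clause above.
        return "adult" if age >= 18 else "minor"


print(parse_age("42"))
print(parse_age("7"))
print(parse_age("abc"))
```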

When an exception occurs, it may have an associated value, also known as
the exception’s argument. The presence and type of the argument
depend on the exception type. The ‘except’ clause may specify a variable
after the exception name (or tuple). The variable is bound to an exception
instance with the arguments stored in instance.args. For convenience, the
exception instance defines __str__() so the arguments can be printed
directly without having to reference .args.
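
As a short illustration (the message text and the number 42 are made up for this sketch), the variable named after ‘as’ gives access to the arguments both through .args and through str():

```python
try:
    raise ValueError("bad value", 42)
except ValueError as err:
    captured_args = err.args      # the arguments, stored as a tuple
    captured_text = str(err)      # __str__() formats the same arguments
    print(captured_args)
    print(err)                    # printing the instance uses __str__()
```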

Exception handlers do not just handle exceptions if they occur
immediately in the try clause, but also if they occur inside functions that
are called (even indirectly) in the ‘try’ clause.
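
The following sketch shows this with three hypothetical functions: the error is raised two calls deep, passes through report() (which has no handler), and is caught by the handler in try_report().

```python
def read_ratio(values):
    return values[0] / values[1]     # may raise ZeroDivisionError or IndexError


def report(values):
    return read_ratio(values)        # no handler here; the exception propagates up


def try_report(values):
    try:
        return report(values)
    except (ZeroDivisionError, IndexError) as err:
        return "handled " + type(err).__name__


print(try_report([6, 3]))
print(try_report([1, 0]))
print(try_report([1]))
```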

Test Your Knowledge

1. Syntax error is also known as _______.

2. What are exceptions? Name at least four types of exceptions.

3. Provide the descriptions for the following Python exception:

a. OverflowError

b. RuntimeError

c. StandardError

4. Name the error that is raised when the user interrupts execution by
pressing Ctrl+c.

5. Which error is raised when an identifier is not found in the local or
global namespace?

Other Topics

RE objects
A regular expression (RE) is a special text string that describes a search
pattern. It is extremely useful for extracting information from text such
as code, files, logs, spreadsheets, or even documents.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters such as 'A', 'a', or '0' are the simplest regular
expressions. These characters simply match themselves.
Some characters such as '|' or '(' are special. Special characters either
stand for classes of ordinary characters or affect how the regular
expressions around them are interpreted.
Repetition qualifiers (*, +, ?, {m,n}, and so on) cannot be directly nested.
This avoids ambiguity with the non-greedy modifier suffix ‘?’, and with
other modifiers in other implementations. To apply a second repetition to
an inner repetition, parentheses may be used.
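
A quick sketch of the nesting rule; the patterns are chosen only for illustration. Applying ‘*’ directly to ‘*’ is rejected when the pattern is compiled, while wrapping the inner repetition in parentheses is accepted.

```python
import re

try:
    re.compile(r"a**")            # a repetition applied directly to a repetition
    nested_ok = True
except re.error as err:
    nested_ok = False
    print("rejected:", err)

# Wrapping the inner repetition in a (non-capturing) group makes it legal.
grouped = re.compile(r"(?:ab+)*")
print(grouped.fullmatch("abbab") is not None)
```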

Regular Expression Syntax

import re

In Python, regular expression support is provided by the ‘re’ module. It
is primarily used for string searching and manipulation. It is also
frequently used for web page “scraping” (extracting large amounts of data
from websites).

Regular Expression
In Python, regular expressions (also called REs, regexes, or regex
patterns) are available through the ‘re’ module of the standard library.
A regular expression is built from elements such as identifiers,
modifiers, and white space characters.

Identifiers
  \d = any number (a digit)
  \D = anything but a number (a non-digit)
  \s = space (tab, space, newline, etc.)
  \S = anything but a space
  \w = letters (matches an alphanumeric character, including "_")
  \W = anything but letters (matches a non-alphanumeric character,
       excluding "_")
  .  = any character except for a new line
  \b = word boundary (the empty string at the start or end of a word)
  \. = a literal period

Modifiers
  {m,n} = between m and n repetitions of the preceding item.
          E.g.: \d{1,5} matches a run of one to five digits, such as
          424, 444, 545, etc.
  +     = matches 1 or more
  ?     = matches 0 or 1
  *     = matches 0 or more
  $     = matches the end of a string
  ^     = matches the start of a string
  |     = matches either/or, as in x|y
  []    = range or "variance"
  {x}   = exactly this many repetitions of the preceding item

White space characters
  \n = new line
  \s = space
  \t = tab
  \e = escape (not recognized by Python's re module; use \x1b instead)
  \r = carriage return
  \f = form feed

Escape required (characters that must be escaped to match literally)
  . + * ? [ ] $ ^ ( ) { } | \

Remember the following while working with RE:

Pattern.search(string[, pos[, endpos]])

• Scan through the string looking for the first location where this
regular expression produces a match and return a corresponding
match object.

• Return None if no position in the string matches the pattern. Note
that this is different from finding a zero-length match at some point
in the string.
• The optional second parameter pos gives an index in the string
where the search is to start; it defaults to 0. This is not completely
equivalent to slicing the string. The '^' pattern character matches at
the real beginning of the string and at positions just after a newline,
but not necessarily at the index where the search is to start.
• The optional parameter endpos limits how far the string will be
searched. It will be as if the string is endpos characters long, so only
the characters from pos to endpos - 1 will be searched for a match. If
endpos is less than pos, no match will be found. Otherwise, if rx is a
compiled regular expression object, rx.search(string, 0, 50) is
equivalent to rx.search(string[:50], 0).
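
The points above can be tried out with a small sketch. The sample text and the variable names are assumptions made for this example; the calls are the standard search() method of a compiled pattern.

```python
import re

rx = re.compile(r"\d+")
text = "order 66, aisle 7, shelf 1234"

first = rx.search(text)              # first match anywhere in the string
later = rx.search(text, 8)           # start looking at index 8
clipped = rx.search(text, 25, 27)    # endpos cuts the last number short
nothing = rx.search(text, 0, 5)      # "order" contains no digit, so None

# pos is not the same as slicing: '^' still refers to the real start.
anchored = re.compile(r"^\d+")
with_pos = anchored.search(text, 6)      # None: index 6 is not the string start
with_slice = anchored.search(text[6:])   # matches, because the slice starts at "66"

print(first.group(), later.group(), clipped.group(), nothing)
print(with_pos, with_slice.group())
```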

Pattern matching

Pattern.match(string[, pos[, endpos]])

If zero or more characters at the beginning of the string match this
regular expression, match() returns a corresponding match object. It
returns ‘None’ if the string does not match the pattern. However, note that
this is different from a zero-length match. The optional pos and endpos
parameters have the same meaning as for the search() method.
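
The difference between ‘None’ and a zero-length match can be seen in a short sketch (patterns and strings chosen only for illustration):

```python
import re

zero_or_more = re.compile(r"\d*")    # zero or more digits
one_or_more = re.compile(r"\d+")     # at least one digit

zero_length = zero_or_more.match("abc")   # matches the empty string at index 0
no_match = one_or_more.match("abc")       # nothing matches at the beginning
from_pos = one_or_more.match("abc123", 3) # pos moves the starting point

print(zero_length.span())   # a match object with an empty span
print(no_match)             # None
print(from_pos.group())
```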

• Pattern.split(string, maxsplit=0)
Identical to the split() function, using the compiled pattern

• Pattern.findall(string[, pos[, endpos]])
Similar to the findall() function, using the compiled pattern, but also
accepts optional pos and endpos parameters that limit the search
region like for search()

• Pattern.sub(repl, string, count=0)
Identical to the sub() function, using the compiled pattern

• Pattern.flags
The regex matching flags. This is a combination of the flags given to
compile(), any (?...) inline flags in the pattern, and implicit flags such as
UNICODE if the pattern is a Unicode string.

• Pattern.groups
The number of capturing groups in the pattern

• Pattern.pattern
The pattern string from which the pattern object was compiled
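
A brief sketch exercising the methods and attributes listed above. The e-mail-like pattern and the sample text are made up for this example and are not a robust address matcher.

```python
import re

pattern = re.compile(r"(\w+)@(\w+)\.com", re.IGNORECASE)
text = "Write to ann@example.com or BOB@Sample.com today"

all_pairs = pattern.findall(text)            # one (user, domain) tuple per match
late_pairs = pattern.findall(text, 20)       # search only from index 20 onward
masked = pattern.sub("<hidden>", text, count=1)
parts = re.compile(r"\s*,\s*").split("a , b,c", maxsplit=1)

print(all_pairs)
print(late_pairs)
print(masked)
print(parts)
print(pattern.groups, pattern.pattern)
print(bool(pattern.flags & re.IGNORECASE))   # flag given to compile() is recorded
```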

Finding All Adverbs and Their Positions


If one wants more information about all matches of a pattern than just the
matched text, finditer() is useful because it provides match objects
instead of strings. For example, if a writer wanted to find all of the
adverbs and their positions in some text, they would use finditer() in
the following manner:

>>> import re
>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

search() vs. match()
Python offers two different primitive operations based on regular
expressions:
• re.match() checks for a match only at the beginning of the string.
• re.search() checks for a match anywhere in the string.

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>

Parsing data
Parsing is the process of analyzing a string of symbols, either in natural
language, computer languages, or data structures, conforming to the rules
of a formal grammar. There are fundamentally three ways to parse a
language or document from Python:
• Use an existing library supporting that specific language:
o The first option is the best for well-known and supported
languages such as XML or HTML.
o A good library usually also includes APIs to programmatically
build and modify documents in that language.
o That is typically more than what we get from a basic parser.
o The problem is that such libraries are not very common and they
support only the most common languages.
• Building your own custom parser by hand:
o We may need to pick the second option if we have particular
needs: either the language that we need to parse cannot be
parsed with traditional parser generators, or we have specific
requirements that we cannot satisfy using a typical parser
generator.
o For instance, maybe we need the best possible performance or a
deep integration between different components.
• A tool or library to generate a parser:
o In all other cases, the third option should be the default one
because it is the most flexible and has a shorter development
time.
o For example, ANTLR can be used to build parsers for any
language. That is why, we concentrate on the tools and libraries
that correspond to this option.
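
As a minimal sketch of the first option (an existing library for a well-known language), the standard library module xml.etree.ElementTree can both parse and modify an XML document. The document content below is made up for this example.

```python
import xml.etree.ElementTree as ET

document = "<order id='7'><item qty='2'>pen</item><item qty='1'>ink</item></order>"

root = ET.fromstring(document)     # parse the text into a tree of elements
quantities = {item.text: int(item.get("qty")) for item in root.iter("item")}

# The same library also offers an API to build and modify the document.
ET.SubElement(root, "item", qty="5").text = "paper"
serialised = ET.tostring(root, encoding="unicode")

print(quantities)
print(serialised)
```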

Parsing with Regular Expressions and the Like


Reparse (Regular Expression-based parsers) is a Python library used for
extracting data from natural language. It basically gives a way to
combine regular expressions and hook them up to callback functions in
Python.
The basic idea is that we define regular expressions, the patterns that
combine them, and the functions that are called when an expression or
pattern is found.

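# reparse_functions.py: callbacks that run when a pattern matches
from datetime import time


def color_time(Color=None, Time=None):
    # Color and Time hold the groups captured by the matching expressions
    Color, Hour, Period = Color[0], int(Time[0]), Time[1]
    if Period == 'pm':
        Hour += 12
    Time = time(hour=Hour)
    return Color, Time


# Maps a pattern name to the callback that handles it
functions = {
    'BasicColorTime' : color_time,
}

# Main script: build the parser from the YAML files and the callbacks
from reparse_functions import functions
import reparse

# 'path' must already hold the directory that contains the two YAML files
colortime_parser = reparse.parser(
    parser_type=reparse.basic_parser,
    expressions_yaml_path=path + "expressions.yaml",
    patterns_yaml_path=path + "patterns.yaml",
    functions=functions
)
print(colortime_parser("~ ~ ~ go to the store ~ buy green at 11pm! ~ ~"))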

Test Your Knowledge

1. What is the difference between an ordinary character and a special
character in a regular expression?

2. List the functions of the following:

a. \D
b. ?
c. \s
d. $
e. \W

3. Define pos and explain its function.

4. Find the prepositions and their positions using finditer() for the
following sentence: Ramesh could not go out of his house due to the
cyclone warning in his area.

5. What is parsing in Regular Expression? Explain with an example.

Regression (Use Case Study)

Introduction to regression

Regression analysis is a form of predictive modelling technique which
investigates the relationship between a dependent variable (target) and
one or more independent variables (predictors). This technique is used for
forecasting, time series modeling, and finding cause-and-effect
relationships between variables.
For example, the relationship between rash driving and the number of road
accidents by a driver is best studied through regression. Regression
analysis is an important tool for modeling and analyzing data.
Here, we fit a curve or line to the data points in such a manner that the
distances of the data points from the curve or line are minimized.

Why use regression analysis?


There are multiple benefits of using regression analysis. They are as
follows:
• It indicates the significant relationships between the dependent
variable and the independent variables.
• It indicates the strength of the impact of multiple independent
variables on a dependent variable.
• It allows us to compare the effects of variables measured on
different scales, such as the effect of price changes and the number
of promotional activities.
These benefits help market researchers, data analysts, and data scientists
to evaluate candidate variables, eliminate the weak ones, and select the
best set of variables for building predictive models.

Outliers
Suppose there is an observation in the dataset that has a very high or very
low value compared to the other observations in the data, i.e., it does
not appear to belong to the population. Such an observation is called an
outlier. In simple words, it is an extreme value. An outlier is a problem
because it often distorts the results we get.

Multicollinearity
When the independent variables are highly correlated with each other, the
variables are said to be multicollinear. Many types of regression
techniques assume that multicollinearity is not present in the dataset,
because it causes problems in ranking variables by their importance and
makes it difficult to select the most important independent variable
(factor).

Heteroscedasticity
When the variability of the dependent variable is not equal across values
of an independent variable, it is called heteroscedasticity. For example, as one’s
income increases, the variability of food consumption will increase. A
poorer person will spend a rather constant amount by always eating
inexpensive food. However, a wealthier person may occasionally buy
inexpensive food and at other times eat expensive meals. Those with
higher incomes display a greater variability of food consumption.

Overfitting
When we use unnecessary explanatory variables, it might lead to
overfitting. Overfitting means that our algorithm works well on the training
set but performs poorly on test sets. It is also known as the problem of
high variance.

Underfitting
When our algorithm works so poorly that it is unable to fit even the training
set well, then it is said to underfit the data. It is also known as the problem
of high bias.

Types of regression
There are various kinds of regression techniques available to make
predictions. These techniques are mostly driven by three factors: the
number of independent variables, the type of the dependent variable, and
the shape of the regression line.

Some of the commonly used regression techniques are:


• Linear Regression
o It is the simplest form of regression. It is a technique in which the
dependent variable is continuous in nature.
o The relationship between the dependent variable and
independent variables is assumed to be linear in nature.
• Logistic Regression
o Logistic regression is used to find the probability of
event=Success and event=Failure.
o We should use logistic regression when the dependent variable is
binary (0/1, True/False, Yes/No) in nature.
• Polynomial Regression
o It is a technique to fit a nonlinear equation by taking polynomial
functions of the independent variable.
o A regression equation is a polynomial regression equation if the
power of the independent variable is more than 1.
• Stepwise Regression
o This form of regression is used when we deal with multiple
independent variables.
o It basically fits the regression model by adding or dropping
covariates, one at a time, based on a specified criterion.
• Ridge Regression
o Ridge Regression is a technique used when the data suffers from
multicollinearity (independent variables are highly correlated). In
multicollinearity, even though the least squares estimates (OLS)
are unbiased, their variances are large, which can push the
estimated values far from the true values.
o By adding a degree of bias to the regression estimates, ridge
regression reduces the standard errors.
• Lasso Regression
o Like Ridge Regression, Lasso (Least Absolute Shrinkage and
Selection Operator) penalizes the size of the regression
coefficients; Lasso penalizes their absolute values, whereas Ridge
penalizes their squares.
o In addition, it is capable of reducing the variability and improving
the accuracy of linear regression models.
• ElasticNet Regression
o ElasticNet is a hybrid of Lasso and Ridge Regression techniques.
o ElasticNet is useful when there are multiple features that are
correlated.

Linear Regression
Linear Regression is one of the most widely known modeling techniques.
In this technique, the dependent variable is continuous, the independent
variable(s) can be continuous or discrete, and the regression line is
linear.
Linear Regression establishes a relationship between a dependent variable
(Y) and one or more independent variables (X) using a best-fit straight
line (also known as the regression line). It is represented by the equation
Y = a + b*X + e, where a is the intercept, b is the slope of the line, and
e is the error term. This equation can be used to predict the value of the
target variable based on the given predictor variable(s).

The difference between simple linear regression and multiple linear
regression is that multiple linear regression has more than one
independent variable, whereas simple linear regression has only one.

How to Obtain Best Fit Line (Value of a and b)?

This task can be easily accomplished by the Least Squares Method, the
most common method used for fitting a regression line. It calculates the
best-fit line for the observed data by minimizing the sum of the squares of
the vertical deviations from each data point to the line. Because the
deviations are squared before they are added, positive and negative
values do not cancel out. We can evaluate the model performance using the
R-square metric.
The coefficient of determination, denoted R² or r² and pronounced "R
square", is the proportion of the variance in the dependent variable that is
predictable from the independent variable(s).
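
To make the method concrete, here is a minimal sketch for the single-variable case Y = a + b*X, written in plain Python. The function names and the data points are assumptions made for this example; in practice a library routine would normally be used.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope that minimizes the squared deviations
    a = mean_y - b * mean_x       # the fitted line passes through the means
    return a, b


def r_squared(xs, ys, a, b):
    """Proportion of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up data, roughly y = 2x
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3), round(r_squared(xs, ys, a, b), 4))
```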

Linear Regression: Important points


• There must be a linear relationship between the independent and
dependent variables.
• Multiple regression suffers from multicollinearity, autocorrelation,
and heteroscedasticity.
• Linear Regression is very sensitive to outliers. They can severely
distort the regression line and, eventually, the forecasted values.
• Multicollinearity can increase the variance of the coefficient
estimates and make the estimates very sensitive to minor changes in
the model. The result is that the coefficient estimates are unstable.
• In case of multiple independent variables, we can use forward
selection, backward elimination, or a stepwise approach to select
the most significant independent variables.

Logistic Regression
Logistic regression is used to find the probability of event=Success
and event=Failure. We should use logistic regression when the
dependent variable is binary (0/1, True/False, Yes/No) in nature.
Here the value of Y ranges from 0 to 1, and it can be represented by the
following equation:

logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk

In the above equation, p is the probability of presence of the characteristic
of interest. A question that should be asked here is “why has log been
used in the equation?”. Since we are working here with a binomial
distribution (dependent variable), we need to choose a link function that is
best suited for this distribution, and that is the logit function. In the
equation above, the parameters are chosen to maximize the likelihood of
observing the sample values rather than to minimize the sum of squared
errors (as in ordinary regression).
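
The link between the log-odds and the probability can be sketched in a few lines for a single predictor. The coefficients b0 and b1 below are hypothetical values chosen for illustration; they are not fitted to any data.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def predict_probability(x, b0, b1):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


b0, b1 = -4.0, 0.8        # hypothetical coefficients, not fitted to any data
for x in (0, 5, 10):
    p = predict_probability(x, b0, b1)
    # p always stays between 0 and 1, and logit(p) recovers b0 + b1*x
    print(x, round(p, 3), round(logit(p), 3))
```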

Logistic Regression: Important Points


• It is widely used for classification problems.
• Logistic regression doesn’t require a linear relationship between
the dependent and independent variables. It can handle various types
of relationships because it applies a non-linear log transformation to
the predicted odds ratio.
• To avoid overfitting and underfitting, we should include all significant
variables. A good approach to ensure this is to use a stepwise
method to estimate the logistic regression.
• It requires large sample sizes because maximum likelihood
estimates are less powerful at low sample sizes than ordinary least
squares.
• The independent variables should not be correlated with each other,
that is, there should be no multicollinearity. However, we have the
option to include interaction effects of categorical variables in the
analysis and in the model.
• If the values of the dependent variable are ordinal, then it is called
ordinal logistic regression.
• If the dependent variable is multi-class, then it is known as
multinomial logistic regression.

How to select the right regression model?


Among the multiple types of regression models, it is important to choose
the best-suited technique based on the type of independent and dependent
variables, the dimensionality of the data, and other essential
characteristics of the data. Below are the key factors to consider when
selecting the right regression model:
1. Data exploration is an essential part of building a predictive model.
It should be your first step before selecting the right model: identify
the relationships between variables and their impact.
2. To compare the goodness of fit of different models, we can analyze
different metrics such as the statistical significance of parameters,
R-square, adjusted R-square, AIC, BIC, and the error term.
3. Cross-validation is the best way to evaluate models used for
prediction. Here you divide your data set into two groups (train and
validate). A simple mean squared difference between the observed
and predicted values gives you a measure of the prediction
accuracy.
4. If your data set has multiple confounding variables, you should not
choose an automatic model selection method, because you do not
want to put these in a model at the same time.
5. It will also depend on your objective. It can happen that a less
powerful model is easier to implement than a highly statistically
significant model.
6. Regression regularization methods (Lasso, Ridge, and ElasticNet)
work well in case of high dimensionality and multicollinearity
among the variables in the data set.
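
The train/validate idea from point 3 can be sketched in plain Python as follows. The data points are made up, and the split simply alternates rows to keep the example deterministic; a real split would normally be random and repeated over several folds.

```python
def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Made-up (x, y) pairs, roughly y = 2x
data = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2),
        (5, 9.8), (6, 12.1), (7, 13.9), (8, 16.2)]
train, validate = data[::2], data[1::2]   # alternate rows into the two groups

# Fit y = a + b*x by least squares, using the training group only
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in train)
     / sum((x - mean_x) ** 2 for x, _ in train))
a = mean_y - b * mean_x

# Score the fitted line on rows it has never seen
validation_mse = mean_squared_error(
    [y for _, y in validate],
    [a + b * x for x, _ in validate],
)
print(round(validation_mse, 4))
```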

Test Your Knowledge

1. What is regression analysis and how is it useful?

2. Name and explain briefly any two types of regression techniques.

3. Why is the Least Square Method the correct method for finding the best
fit line?

4. Elaborate the points that must be considered while choosing a
regression technique.

5. Briefly explain the study conducted on the telecom industry using
churn analysis.

Other Regression related topics

Exploratory Data Analysis


In statistics, exploratory data analysis (EDA) is an approach to analyze
data sets and to summarize their main characteristics, often with visual
methods. A statistical model can be used or not, but primarily EDA is for
seeing what the data can tell us beyond the formal modeling or hypothesis
testing task.
You can say that EDA is a statistician’s way of storytelling, where you
explore data, find patterns, and share insights.
EDA is a step in data analysis used for gaining a better understanding of
aspects of the data such as:
• Main features of data variables and relationships that hold between
them
• Identifying which variables are important for our problem

EDA in Python
Multiple libraries are available to perform basic EDA, but we are going to
use pandas and matplotlib.
• Pandas for data manipulation
• Matplotlib for plotting graphs
We shall, now, look at various exploratory data analysis methods:
• Descriptive statistics, which give a brief overview of the dataset
we are dealing with, including some measures and features of the
sample
• Grouping data, that is, basic grouping with groupby
• ANOVA (Analysis of Variance), which is a computational method to
divide the variation in a set of observations into different components
• Correlation and correlation methods

Descriptive Statistics

DF.describe()

Descriptive statistics are a helpful way to understand the characteristics
of your data and to get a quick summary of it. Pandas provides a
convenient method for this, describe(). The describe() function applies
basic statistical computations to the dataset, such as extreme values,
count of data points, standard deviation, and so on. Any missing value or
NaN value is automatically skipped. The describe() function gives a good
picture of the distribution of the data.
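
A minimal sketch of describe(), assuming pandas is installed. The column names and values are made up; the missing price shows that NaN values are left out of the statistics.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.5, None, 11.0, 50.0],   # made-up values; None becomes NaN
    "units": [3, 4, 2, 5, 1],
})

summary = df.describe()   # count, mean, std, min, quartiles, and max per column
print(summary)
print(summary.loc["count", "price"])   # the NaN row is not counted
```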

DF["<category>"].value_counts()

Another useful method is value_counts(), which returns the count of each
category in a series of categorical values.
One more useful tool is the boxplot, which you can use through the
matplotlib module. A boxplot is a pictorial representation of the
distribution of data, which shows extreme values, median, and quartiles.
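
A short sketch of value_counts(), assuming pandas is installed and using a made-up "segment" column:

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["retail", "corporate", "retail", "retail", "corporate", "public"],
})

counts = df["segment"].value_counts()   # sorted from most to least frequent
print(counts)
```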

Feature Engineering
Feature engineering is the process of using domain knowledge of the data
to create features that make machine learning algorithms work. Feature
engineering can be used to increase the predictive power of learning
algorithms by creating features from raw data that will help the learning
process. This can be done by creating additional relevant features from
the existing raw features in the data.
Feature engineering is something that will cost some time to get the hang
of. It’s not always clear what you can do with the raw data so that you can
help the predictive power of the data.
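
As a small illustration of creating additional features from raw ones, the sketch below derives three new fields from made-up order records. The field names are assumptions for this example only.

```python
from datetime import datetime

raw_orders = [   # made-up raw records
    {"ordered_at": "2019-02-01 09:15", "amount": 120.0, "items": 3},
    {"ordered_at": "2019-02-02 22:40", "amount": 35.5, "items": 1},
]


def add_features(order):
    """Return a copy of the record with features derived from the raw fields."""
    when = datetime.strptime(order["ordered_at"], "%Y-%m-%d %H:%M")
    features = dict(order)
    features["hour"] = when.hour
    features["is_weekend"] = when.weekday() >= 5   # Saturday is 5, Sunday is 6
    features["amount_per_item"] = order["amount"] / order["items"]
    return features


engineered = [add_features(order) for order in raw_orders]
for row in engineered:
    print(row)
```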

Correlation Matrix
Correlation is a measure of the relationship between two variables, such
that a change in one is associated with a change in the other. Correlation
is different from causation. For example, sales might increase when the marketing
department spends more on TV advertisements, or a customer's average
purchase amount on an e-commerce website might depend on a number
of factors related to that customer. Often, correlation is the first step to
understanding these relationships and subsequently building better
business and statistical models.

One way to calculate the correlation among variables is to find the
Pearson correlation. Here, we find two parameters, namely the Pearson
coefficient and the p-value. We can say there is a strong correlation
between two variables when the Pearson correlation coefficient is close to
either 1 or -1 and the p-value is less than 0.0001.
There are two key components of a correlation value:
• magnitude – the larger the magnitude (closer to 1 or -1), the
stronger the correlation
• sign – if negative, there is an inverse correlation; if positive, there is a
regular correlation
Pandas provides a convenient one-line method, corr(), for calculating the
correlation between data frame columns. Pandas also supports
highlighting methods for tables, so it is easier to see high and low
correlations. It is important to understand possible correlations in your
data, especially when building a regression model. Strongly correlated
predictors, a phenomenon referred to as multicollinearity, will cause
coefficient estimates to be less reliable.
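
To show what the Pearson coefficient measures, here is a plain-Python sketch with made-up data. It computes only the coefficient; the p-value is left to a statistics library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]    # rises with hours: coefficient close to 1
errors = [9, 8, 6, 5, 2]        # falls with hours: coefficient close to -1

print(round(pearson_r(hours, score), 3))
print(round(pearson_r(hours, errors), 3))
```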

Why is correlation a useful metric?


Following are the reasons:
• Correlation can help in predicting one quantity from another.
• Correlation can (but often does not, as we will see in some examples
below) indicate the presence of a causal relationship.
• Correlation is used as a basic quantity and foundation for many other
modeling techniques.
Pandas can be used to create a correlation matrix to view the correlations
between different variables in a dataframe:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': np.random.randint(0, 50, 1000)})
df['b'] = df['a'] + np.random.normal(0, 10, 1000)        # positively correlated with 'a'
df['c'] = 100 - df['a'] + np.random.normal(0, 5, 1000)   # negatively correlated with 'a'
df['d'] = np.random.randint(0, 50, 1000)                 # not correlated with 'a'
df.corr()

These correlations can be viewed graphically as a scatter matrix:

pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()

Or we can directly plot the correlation matrix:

plt.matshow(df.corr())
plt.xticks(range(len(df.columns)), df.columns)
plt.yticks(range(len(df.columns)), df.columns)
plt.colorbar()
plt.show()

Visualizations using matplotlib


Matplotlib is the most popular data visualization library in Python. It allows
us to create figures and plots, and makes it very easy to produce static
raster or vector files without the need for any GUI (graphical user
interface).

Installing Matplotlib
• Using Anaconda, we can install Matplotlib from the terminal or
command prompt using:
conda install matplotlib

• Without Anaconda, we can install Matplotlib directly from the
terminal using:
pip install matplotlib

Anatomy of a Plot
There are two key components in a Plot, namely Figure and Axes.

• The Figure is the top-level container that acts as the window or page
on which everything is drawn. It can contain multiple independent
figures, multiple Axes, a suptitle (which is a centered title for the
figure), a legend, a color bar, and so on.
• The Axes is the area on which we plot our data and any labels/ticks
associated with it. Each Axes has an X-Axis and a Y-Axis (as in the
image above).

Step 1: Importing Matplotlib using:

import matplotlib.pyplot as plt

Step 2: Displaying plots using:

%matplotlib inline or plt.show()

Two Approaches for creating Plots


1. Functional Approach: Using the basic matplotlib command, a plot can
be easily created. Let’s plot an example using two Numpy arrays x and
y:

Now that we have a plot, let’s go on to name the x-axis and y-axis, and add
a title using .xlabel(), .ylabel(), and .title() using the following:
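
A minimal sketch of the functional approach. The arrays x and y, the label texts, and the off-screen "Agg" backend are assumptions made so that the example runs on its own; in a notebook the backend line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: off-screen backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)   # assumed sample data
y = x ** 2

plt.plot(x, y)               # functional approach: one call draws the line
plt.xlabel("X Label")
plt.ylabel("Y Label")
plt.title("Title")
# In an interactive session, plt.show() displays the figure.
```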

Matplotlib allows us to easily create multiple plots on the same figure
using the .subplot() method. This .subplot() method takes in three
parameters, namely:
1. nrows: The number of rows the Figure should have
2. ncols: The number of columns the Figure should have
3. plot_number: Which refers to a specific plot in the Figure

Using .subplot() we will create two plots on the same canvas:

Notice how the two plots have different colors. This is because we need
to be able to differentiate the plots. This is possible by simply setting
the color attribute to ‘red’ and ‘green’.
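
A sketch of two plots on one canvas with .subplot(). The data and the off-screen "Agg" backend are assumptions made so that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: off-screen backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)   # assumed sample data
y = x ** 2

plt.figure()
plt.subplot(1, 2, 1)                          # 1 row, 2 columns, first plot
left_line, = plt.plot(x, y, color="red")
plt.subplot(1, 2, 2)                          # same grid, second plot
right_line, = plt.plot(y, x, color="green")
```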

2. Object-oriented Interface: This is the best way to create plots. The
idea here is to create Figure objects and call methods on them.
Once a Figure object has been created, we add a set of axes to it using
the .add_axes() method. The add_axes() method takes in a list of four
values (left, bottom, width, and height, which give the position and
size of the axes), each ranging from 0 to 1 as a fraction of the figure.

As you can see, we have a blank set of axes. Now let’s plot our x and y
arrays on it:

We can add x and y labels and a title to our plot much as we did in the
functional approach, with one slight difference: the methods are named
.set_xlabel(), .set_ylabel(), and .set_title(). Let us go ahead and add
labels and a title to our plot:
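
A sketch of labelling axes in the object-oriented style. The label strings are placeholders and the data is assumed.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (assumed)
x = np.linspace(0, 5, 11)
y = x ** 2

fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x, y)
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.set_title("Title")
plt.show()
```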

We noted that a Figure can contain multiple sets of axes. Let’s try to put
two sets of axes on one canvas:
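
A sketch that places two sets of axes on one Figure. The positions are assumptions that put a smaller set inside the larger one.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])   # main axes
ax2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])   # smaller axes drawn inside the main one
plt.show()
```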


Now let’s plot our x and y arrays on the axes we have created:
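
A sketch of plotting on each set of axes independently. The data, positions, and colors are assumed for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (assumed)
x = np.linspace(0, 5, 11)
y = x ** 2

fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])   # main axes
ax2 = fig.add_axes([0.2, 0.5, 0.4, 0.3])   # smaller inner axes

ax1.plot(x, y, color="red")     # y against x on the main axes
ax2.plot(y, x, color="green")   # x against y on the inner axes
plt.show()
```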

Figure size, Aspect ratio, and DPI


Matplotlib allows us to customize the figure size, aspect ratio, and DPI
by specifying the figsize and dpi arguments. figsize is a tuple of the
width and height of the figure (in inches), and dpi is the dots-per-inch
(pixels-per-inch).
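
A sketch of setting the figure size and resolution. The 8 x 4 inch size and 100 dpi are arbitrary values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (assumed)
x = np.linspace(0, 5, 11)
y = x ** 2

# A figure 8 inches wide and 4 inches high, rendered at 100 dots per inch
fig = plt.figure(figsize=(8, 4), dpi=100)
ax = fig.add_axes([0.1, 0.15, 0.85, 0.75])
ax.plot(x, y)
plt.show()
```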


How to Save a Figure


We can use Matplotlib to generate high quality figures and save them in a
number of formats such as png, jpg, svg, pdf, and so on. Using the
.savefig() method, we’ll save the above figure in a file named
my_figure.png.

fig.savefig('my_figure.png')

How to Decorate Figures


Legends
Legends allow us to distinguish between plots. With legends, you can use
label texts to identify or differentiate one plot from another. To
identify the plots, we specify the label="..." attribute for each plot and
then add a legend using .legend():
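
A sketch of labelling two curves and showing a legend. The curves and label texts are assumed for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)   # illustrative data (assumed)

fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x, x ** 2, label="x squared")   # label is the text shown in the legend
ax.plot(x, x ** 3, label="x cubed")
legend = ax.legend()                    # builds the legend from the labels
plt.show()
```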


Plot Appearance
Matplotlib gives us a lot of options for customizing the appearance of our
plots. By now, you should be familiar with changing line color using
color='red' or 'red' like we did in previous examples. Now we want to
change the line width (linewidth or lw) and the line style (linestyle or
ls), and mark out data points using marker.
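
A sketch of the appearance options named above. The specific color, width, style, and marker are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (assumed)
x = np.linspace(0, 5, 11)
y = x ** 2

fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
# lw is short for linewidth and ls is short for linestyle
(line,) = ax.plot(x, y, color="purple", lw=3, ls="--", marker="o")
plt.show()
```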


Special Plot Types


Matplotlib allows us to create different kinds of plots:
Histograms
These help us understand the distribution of a numeric value in a way that
you cannot with the mean or median alone. Using the .hist() method, we can
create a simple histogram:
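
A sketch of a simple histogram. The data is 1000 randomly generated values and the bin count of 20 is arbitrary; both are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (assumed): 1000 draws from a standard normal distribution
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# hist returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(data, bins=20)
plt.show()
```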


Time series (Line Plot)


A time series plot is a chart that shows a trend over a period of time. It
allows you to test various hypotheses under certain conditions, like what
happens on different days of the week or between different times of the day.
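
A sketch of a time series drawn as a line plot. The dates and values are hypothetical and generated only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical daily observations covering two weeks
days = np.arange("2019-01-01", "2019-01-15", dtype="datetime64[D]")
rng = np.random.default_rng(0)
values = 100 + np.cumsum(rng.normal(0, 2, size=days.size))

fig, ax = plt.subplots()
ax.plot(days, values, marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
fig.autofmt_xdate()   # rotate the date labels so they do not overlap
plt.show()
```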

Scatter plots
Scatter plots offer a convenient way to visualize how two numeric values
are related in your data. They help in understanding relationships between
variables. Using the .scatter() method, we can create a scatter plot:
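
A sketch of a scatter plot. The two variables are randomly generated, with y made to depend loosely on x; this data is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (assumed): y rises with x, plus some random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + rng.normal(0, 2, size=50)

points = plt.scatter(x, y)   # one marker per (x, y) pair
plt.show()
```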


Bar graphs
These are convenient for comparing numeric values of several groups.
Using the .bar() method, we can create a bar graph:
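
A sketch of a bar graph. The group names and values are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical groups and the numeric value for each
groups = ["Group A", "Group B", "Group C"]
values = [5, 7, 3]

bars = plt.bar(groups, values)   # one bar per group
plt.show()
```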


Implementing Linear Regression


Linear Regression Using Two-Dimensional Data
Let’s understand Linear Regression using just one dependent variable and
one independent variable.

Take two lists, xs and ys, and plot them using a scatter plot. Assume xs
as the independent variable and ys as the dependent variable.
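
A sketch of this step. The values in xs and ys are hypothetical stand-ins, since the original lists are not reproduced in the text.

```python
import matplotlib.pyplot as plt

# Hypothetical values: ys grows roughly linearly with xs
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

plt.scatter(xs, ys)
plt.xlabel("xs (independent variable)")
plt.ylabel("ys (dependent variable)")
ax = plt.gca()   # handle to the current axes, kept so the plot can be inspected
plt.show()
```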


A linear regression line has the equation y = mx + c, where m is the
coefficient (slope) of the independent variable and c is the intercept.
The mathematical formula to calculate the slope (m) is:
m = (mean(x) * mean(y) - mean(x*y)) / (mean(x)^2 - mean(x^2))
The formula to calculate the intercept (c) is:
c = mean(y) - m * mean(x)

Now, let’s write a function for intercept and slope (coefficient):
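
A minimal sketch of such a function, following the slope and intercept formulas given above. The parameter names and the sample lists are assumptions; the sample points lie exactly on y = 2x + 1 so the result is easy to check.

```python
import numpy as np

def slope_intercept(x_val, y_val):
    """Return the slope m and intercept c of the least-squares line."""
    x = np.array(x_val, dtype=float)
    y = np.array(y_val, dtype=float)
    # m = (mean(x) * mean(y) - mean(x*y)) / (mean(x)^2 - mean(x^2))
    m = (np.mean(x) * np.mean(y) - np.mean(x * y)) / (np.mean(x) ** 2 - np.mean(x * x))
    # c = mean(y) - m * mean(x)
    c = np.mean(y) - m * np.mean(x)
    return m, c

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, c = slope_intercept(xs, ys)
print(m, c)
```

For these points the function should recover a slope of 2 and an intercept of 1.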


To see the slope and intercept for xs and ys, we just need to call the
function slope_intercept:

reg_line is the equation of the regression line

Now, let’s plot a regression line on xs and ys:
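
A sketch of computing reg_line and drawing it over the data. The xs and ys values are hypothetical, and slope_intercept is redefined here in compact form so that the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

def slope_intercept(x_val, y_val):
    """Slope and intercept from the mean-based formulas."""
    x = np.array(x_val, dtype=float)
    y = np.array(y_val, dtype=float)
    m = (np.mean(x) * np.mean(y) - np.mean(x * y)) / (np.mean(x) ** 2 - np.mean(x * x))
    c = np.mean(y) - m * np.mean(x)
    return m, c

# Hypothetical data
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

m, c = slope_intercept(xs, ys)
reg_line = [m * x + c for x in xs]   # predicted y for every x in xs

plt.scatter(xs, ys, color="red")     # the observed points
plt.plot(xs, reg_line)               # the fitted regression line
plt.xlabel("Independent variable")
plt.ylabel("Dependent variable")
plt.show()
```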


Root Mean Squared Error (RMSE)


RMSE is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far data points are from the regression
line. RMSE is a measure of how spread out these residuals are.
If Yi is the actual data point and Ŷi is the value predicted by the
equation of the line, then RMSE is the square root of the mean of the
squared residuals:
RMSE = sqrt(mean((Yi - Ŷi)**2))

Let’s define a function for RMSE:
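
A minimal sketch of an RMSE function, taking RMSE as the square root of the mean squared residual. The function name, parameter names, and sample values are assumptions.

```python
import numpy as np

def rmse(y_actual, y_predicted):
    """Square root of the mean of the squared residuals."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    residuals = y_actual - y_predicted
    return np.sqrt(np.mean(residuals ** 2))

# Hypothetical actual values and predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 8.0]
error = rmse(actual, predicted)
print(error)
```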


Use Case: Churn Analysis


Why Churn Analysis?
The telecom industry is growing rapidly, and each company is keen to
increase its customer base. In a survey, it was found that the cost of
retaining an existing customer is far less than that of acquiring a new
one. However, retaining customers requires customized offers or service
improvements, and that in turn requires knowing in advance which
customers might leave for another company’s service.
Due to massive growth in communication and data storage, a huge amount of
customer data is available, and it can be leveraged to predict potential
churn.

Objective
We will use customer data that closely mimics telecom industry data to
predict the potential churn. We will also try to predict the ARPU (Average
Revenue Per User) using the data.

Methodology
Churn Analysis:
1. Logistic Regression
2. SVM
ARPU Analysis:
1. Linear Regression


Sample Data: Telco Sample Data_V0.4.xlsx


Code: Telco_python_Linear.py
Telco_python_Logit_svm.py

Implementing Linear Regression (using Scikit Learn)


Data Ingestion
Ingest the data from a CSV file or an Excel sheet.
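
A sketch of data ingestion with pandas. The column names and rows are hypothetical, and an in-memory string stands in for a file on disk so that the block runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """customer_id,plan,monthly_minutes,arpu
1,prepaid,120,210.5
2,postpaid,340,455.0
3,postpaid,,390.0
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

For an Excel workbook, pd.read_excel() plays the same role as pd.read_csv() and takes the path to the .xlsx file.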


Data Curation
Curate the data by encoding the categorical values and handling the
missing and null values.
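
A sketch of one possible curation step. Filling gaps with the median or most frequent value and one-hot encoding are assumptions about the treatment, and the columns are hypothetical.

```python
import pandas as pd

# Hypothetical data with one missing category and one missing number
df = pd.DataFrame({
    "plan": ["prepaid", "postpaid", None, "postpaid"],
    "monthly_minutes": [120.0, None, 200.0, 340.0],
    "churn": [0, 1, 0, 1],
})

# Missing numbers: fill with the column median
df["monthly_minutes"] = df["monthly_minutes"].fillna(df["monthly_minutes"].median())
# Missing categories: fill with the most frequent category
df["plan"] = df["plan"].fillna(df["plan"].mode()[0])
# Categorical column: convert to indicator (dummy) columns
df = pd.get_dummies(df, columns=["plan"], drop_first=True)
print(df)
```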

Data Preparation
Prepare the data by splitting it into a training set and a test set.
Sometimes, we also set aside a blind set.
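
A sketch of creating the two sets with scikit-learn's train_test_split. The data, the 25% test share, and the random seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows with 2 features each, and one target per row
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```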


Run the Models and Test the Models


Create and train the model, and then test it using the testing data.
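
A sketch of creating, training, and testing a linear regression model with scikit-learn. Synthetic data stands in for the telecom sample, and the coefficients used to generate it are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): y depends linearly on two features, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)          # train on the training set
r2 = model.score(X_test, y_test)     # R-squared on the unseen test set
predicted = model.predict(X_test)
print(model.coef_, model.intercept_, r2)
```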

Results of Linear Regression


The images given below show the details of the model summary, along with
parameters such as AIC, BIC, and R-Squared values that are used to
analyze the model. This analysis indicates how well the model performs in
prediction.


Data Ingestion
Ingest the data from a CSV file or an Excel sheet.

Data Curation
Curate the data by encoding the categorical values and handling the
missing and null values.


Data Preparation
Prepare the data by splitting it into a training set and a test set.
Sometimes, we also set aside a blind set.

Run the Models and Test the Models


Create and train the model, and then test it using the testing data.
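
The Methodology section lists logistic regression for churn analysis. A sketch of the create, train, and test step under that assumption follows; the data is synthetic and the rule that generates the churn flag is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): two features and a binary churn flag
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # 1 = churn, 0 = no churn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)    # share of correct test predictions
predicted = model.predict(X_test)
print(accuracy)
```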


Test Your Knowledge

1. What are the different exploratory data analysis methods?

2. Why is feature engineering important?

3. Describe how to create plots using the object-oriented interface.

4. Write down the code used for creating the following plot types:
a. Scatter plots
b. Bar graphs

5. What is the method for implementing linear regression using
two-dimensional data?


Advance

Machine Learning Algorithms


Machine Learning, a subfield of Artificial Intelligence (AI), is a
theoretical concept that is realized through various techniques, each
with various implementations. The name is derived from the idea that it
deals with the ‘construction and study of systems that can learn from
data’. Machine Learning Algorithms can be seen as building blocks to make
computers learn to behave more intelligently.
Let us look at an example.
An emergency room in a hospital measures 17 variables, such as blood
pressure and age, for newly admitted patients. Based on these, a decision
is taken on whether to put a new patient in an intensive-care unit (ICU).
Due to the high cost of the ICU, those patients who may survive less than
a month are given higher priority. The problem is to predict high-risk
patients and discriminate them from low-risk patients.
With the help of machine learning, we can predict high-risk patients based
on variables such as age, blood pressure, and so on.
There are three broad types of Machine Learning Algorithms.
• Supervised Learning – This type of algorithm has a target or outcome
variable (or dependent variable) that is to be predicted from a given
set of predictors (independent variables). Using this set of
variables, we generate a function that maps inputs to desired
outputs.
The training process continues until the model achieves a desired
level of accuracy on the training data. Examples of Supervised
Learning include Regression, Decision Tree, Random Forest, KNN,
Logistic Regression, and so on.


• Unsupervised Learning – In this type of algorithm, there is no target
or outcome variable to predict or estimate. It is used for clustering
a population into different groups, which is widely used for
segmenting customers into different groups for specific intervention.
Examples of Unsupervised Learning are the Apriori algorithm and
K-means.

• Reinforcement Learning – Using this type of algorithm, the machine is
trained to make specific decisions. The machine is exposed to an
environment where it trains itself continually using trial and error.
The machine learns from past experience and tries to capture the
best possible knowledge to make accurate business decisions.
An example of Reinforcement Learning is the Markov Decision
Process.
Commonly used Machine Learning Algorithms include:
• Linear Regression
• Logistic Regression
• Decision Tree
• SVM
• Naive Bayes
• kNN
• K-Means
• Random Forest
• Dimensionality Reduction Algorithms
• Gradient Boosting Algorithms including GBM, XGBoost, LightGBM,
and CatBoost


Support Vector Machine


Support Vector Machine is a classification method.
In this algorithm, we plot each data item as a point in n-dimensional
space (where n is the number of features you have), with the value of each
feature being the value of a particular coordinate.
For example, if we only had two features, such as the Height and Hair
length of an individual, we would first plot these two variables in a
two-dimensional space, where each point has two coordinates (the points
that lie closest to the dividing line are known as Support Vectors).

Now, we will find a line that splits the data between the two
differently classified groups of data. This will be the line for which the
distance to the closest point in each of the two groups is as large as
possible.


In the example shown above, the line that splits the data into two
differently classified groups is the black line, since it lies farthest
from the closest points of both groups. This line is our classifier. New
data is then assigned to a class depending on which side of the line it
lands on.
Now let us look at the Python code for this.
# Import the library
from sklearn import svm
# Assumes you have X (predictors) and y (target) for the training data set
# and x_test (predictors) for the test data set
# Create the SVM classification object
model = svm.SVC()  # many options are available; the defaults suit simple classification
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict the output
predicted = model.predict(x_test)
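
A self-contained sketch of the same steps on synthetic data. The two well-separated clusters, the linear kernel, and the test points are assumptions made so that the example runs on its own.

```python
import numpy as np
from sklearn import svm

# Synthetic data (assumed): two well-separated clusters in two dimensions
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

# A linear kernel looks for a straight dividing line, as described above
model = svm.SVC(kernel="linear")
model.fit(X, y)
train_accuracy = model.score(X, y)

# The training points that define the margin
print(model.support_vectors_)

# Classify two new points, one near each cluster
x_test = np.array([[-1.5, -2.5], [2.5, 1.5]])
predicted = model.predict(x_test)
print(train_accuracy, predicted)
```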

Random Forest
Random Forest is a trademark term for an ensemble of decision trees. In
Random Forest, we have a collection of decision trees, hence the term
“Forest”.
To classify a new object based on attributes, each tree gives a
classification and we say the tree “votes” for that class. The forest
chooses the classification having the most votes (over all the trees in the
forest). Each tree is planted and grown as follows:


1. If the number of cases in the training set is N, then a sample of N
cases is taken at random but with replacement. This sample will be
the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that
at each node, m variables are selected at random out of the M and
the best split on these m is used to split the node. The value of m is
held constant while the forest is grown.
3. Each tree is grown to the largest extent possible. There is no
pruning.
Now let us look at the Python code for this.
# Import the library
from sklearn.ensemble import RandomForestClassifier
# Assumes you have X (predictors) and y (target) for the training data set
# and x_test (predictors) for the test data set
# Create the Random Forest object
model = RandomForestClassifier()
# Train the model using the training set
model.fit(X, y)
# Predict the output
predicted = model.predict(x_test)
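
A self-contained sketch of the same steps on synthetic data. The data, the number of trees, and the "sqrt" choice for the number of variables tried at each node are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumed): 4 input variables, of which 2 determine the class
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # m variables tried at each node, with m < M
    bootstrap=True,        # each tree sees a sample drawn with replacement
    random_state=0,
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
predicted = model.predict(X_test)   # combined result of all the trees
print(accuracy)
```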


Conclusion
The purpose of this learning material is to make the reader accustomed to
the Python language. This material has been written to explain the key
concepts of the language. The reader should now have an understanding of
the following:
• Basics of the Python language
• Programming in Python
• Various Python libraries
• Handling various types of errors in Python
• Basics of regression analysis
• Overview of machine learning


Glossary

Key Terms and Descriptions

Adjusted R Square: The adjusted R-squared is a modified version of
R-squared that has been adjusted for the number of predictors in the
model. The adjusted R-squared increases only if the new term improves the
model more than would be expected by chance. It decreases when a predictor
improves the model by less than expected by chance.

API: An Application Programming Interface is a set of subroutine
definitions, communication protocols, and tools for building software. In
general terms, it is a set of clearly defined methods of communication
among various components.

ARPU: Average revenue per user (sometimes known as average revenue per
unit), usually abbreviated to ARPU, is a measure used primarily by
consumer communications, digital media, and networking companies, defined
as the total revenue divided by the number of subscribers. This term is
used by companies that offer subscription services to clients, for
example, telephone carriers, internet service providers, and hosts. It is
a measure of the revenue generated by one customer phone, pager, and so
on, per unit time, typically per year or month. (Reference: Wikipedia)

Binomial distribution: It is a frequency distribution of the possible
number of successful outcomes in a given number of trials, in each of
which there is the same probability of success.


Decision tree: It is a decision support tool that uses a tree-like model
of decisions and their possible consequences, including chance event
outcomes, resource costs, and utility.

Dependent variable: A dependent variable is a characteristic whose value
depends on the values of independent variables. The dependent variable is
what is being measured in an experiment or evaluated in a mathematical
equation. The dependent variable is sometimes called "the outcome
variable."

Formal Modeling: A formal model is a precise statement of components to
be used and the relationships among them. Formal models are usually
stated via mathematics, often equations.

Hypothesis Testing: A hypothesis test is a rule that specifies whether to
accept or reject a claim about a population depending on the evidence
provided by a sample of data. A hypothesis test examines two opposing
hypotheses about a population: the null hypothesis and the alternative
hypothesis.

Independent variable: Independent variables are characteristics that can
be measured directly.

KNN: The k-nearest neighbors algorithm (KNN) is a non-parametric method
used for classification and regression. In both cases, the input consists
of the k closest training examples in the feature space.

Linear Regression: It is a linear approach to modeling the relationship
between a scalar response (or dependent variable) and one or more
explanatory variables (or independent variables).

Link function: The link function provides the relationship between the
linear predictor and the mean of the distribution function. There are
many commonly used link functions, and their choice is informed by
several considerations.
Logistic Regression: Logistic Regression is a predictive analysis, used
to describe data and to explain the relationship between one dependent
binary variable and one or more nominal, ordinal, interval, or ratio-level
independent variables.

Logit function: The logit in logistic regression is a special case of a
link function in a generalized linear model.

NaN Value: NaN, standing for not a number, is a numeric data type value
representing an undefined or unrepresentable value.

P-value: When you perform a hypothesis test in statistics, a p-value
helps you determine the significance of your results. The p-value is a
number between 0 and 1 and is interpreted in the following way: A small
p-value (typically ≤ 0.05) indicates strong evidence against the null
hypothesis, so you reject the null hypothesis.

Pearson Correlation: A Pearson correlation is a number between -1 and 1
that indicates the extent to which two variables are linearly related.
The Pearson correlation is also known as the “product moment correlation
coefficient” (PMCC) or simply “correlation”.

Pearson Value: The Pearson correlation coefficient, r, can take a range
of values from +1 to -1. A value of 0 indicates that there is no
association between the two variables.

Predictor Variable: Predictor Variable refers to one or more variables
that are used to determine (predict) the 'Target Variable'.

R Squared: R-squared is a statistical measure of how close the data is to
the fitted regression line. 0% indicates that the model explains none of
the variability of the response data around its mean. 100% indicates that
the model explains all the variability of the response data around its
mean.
Regression: A technique used to model and analyze the relationships
between variables and, oftentimes, how they contribute and are related to
producing a particular outcome together.

Scikit-learn: Scikit-learn (formerly scikits.learn) is a free software
machine learning library for the Python programming language. It features
various classification, regression, and clustering algorithms including
support vector machines, random forests, gradient boosting, k-means, and
DBSCAN, and is designed to interoperate with the Python numerical and
scientific libraries NumPy and SciPy. (Reference: Wikipedia)

Target Variable: A variable that needs to be predicted is a target
variable.

Tty Device: A tty is a terminal (it stands for teletype - the original
terminals used a line printer for output and a keyboard for input!). A
terminal is basically just a user interface device that uses text for
input and output.

UNIX Shell: A UNIX shell is a command-line interpreter or shell that
provides a command-line user interface for Unix-like operating systems.
