Python Training Module
This material is meant for IBM Academic Initiative use only. NOT FOR RESALE.
Preface
February 2019
NOTICES
This information was developed for products and services offered in the USA.
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the products
and services currently available in your area. Any reference to an IBM product,
program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or
service that does not infringe any IBM intellectual property right may be used instead.
However, it is the user’s responsibility to evaluate and verify the operation of any
non-IBM product, program, or service. IBM may have patents or pending patent
applications covering subject matter described in this document. The furnishing of this
document does not grant you any license to these patents. You can send license
inquiries, in
writing, to:
The following paragraph does not apply to the United Kingdom or any other country
where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS
MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS” WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not
allow disclaimer of express or implied warranties in certain transactions, therefore,
this statement may not apply to you.
Any references in this information to non-IBM websites are provided for convenience
only and do not in any manner serve as an endorsement of these websites. The
materials at those websites are not part of the materials for this IBM product, and use
of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you. Information concerning non-IBM
products was obtained from the suppliers of those products, their published
announcements, or other publicly available sources. IBM has not tested those
products and cannot confirm the accuracy of performance, compatibility, or any other
claims related to the non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products.
This information contains examples of data reports used in daily business operations.
To illustrate them as completely as possible, the examples include the names of
individuals, companies, brands, and products. All of these names are fictitious and any
similarity to the names and addresses used by an actual business enterprise is entirely
coincidental.
TRADEMARKS
IBM, the IBM logo, ibm.com, and Python are trademarks or registered trademarks of
the International Business Machines Corp., registered in many jurisdictions worldwide.
Other product and service names might be trademarks of IBM or other companies. A
current list of IBM trademarks is available on the web at “Copyright and trademark
information” at www.ibm.com/legal/copytrade.html.
Adobe, and the Adobe logo are either registered trademarks or trademarks of Adobe
Systems Incorporated in the United States, and/or other countries.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in
the United States, other countries, or both.
This document may not be reproduced in whole or in part without prior permission of
IBM.
Table of Contents
Introduction to Python
What is Python?
Advantages and disadvantages
Benefits
Limitations
Downloading and installing
Downloading
Installing
Python versions
Running Python scripts
Executing scripts with Python Launcher:
Executing scripts without Python Launcher:
Using the interpreter interactively
Using variables
Rules for variable names
Dynamic typing
Assigning variables
Re-assigning variables
Determining variable type with type()
Simple exercise
String types: normal, raw, and Unicode
Creating a String
String operators and functions
Printing a string
String basics
String indexing
String properties
Basic built-in string methods
Math operators and functions
Writing to the screen
Test Your Knowledge
Deep Dive into Python
Reading from the keyboard
raw_input
input
Indenting is significant
Boolean
The if and elif statements
The statements can also have multiple branches
While loops
break, continue, and pass statements
Using lists
Indexing and slicing
Basic list methods
Nesting lists
List comprehensions
Dictionaries
Using the ‘for’ statement
Tuples
Introduction to Python
What is Python?
Created in 1991 by Guido van Rossum, Python is an object-oriented,
high-level programming language used for a wide variety of applications.
It was originally considered a gap-filler, a way to write
scripts that ‘automate the boring stuff’. It is a great language for beginners
because of its readability and other structural elements designed to make
it easy to understand.
Over the past few years, Python has emerged as a first-class citizen in
modern software development, infrastructure management, and data
analysis.
It is no longer a back-room utility language, but a major force in web
application creation and systems management, and a key driver of the
explosion in big data analytics and machine intelligence.
Limitations
• Speed Limitations
• Weak in Mobile Computing and Browsers
• Design Restrictions
• Underdeveloped Database Access Layers
• Simple
• Run Time Errors
Installing
1. First, double-click the icon labeled python-3.6.2.exe.
(An Open File - Security Warning pop-up window will appear.)
2. Then, click Run.
(A Python 3.6.2 (32-bit) Setup pop-up window will appear.)
Ensure that the Install launcher for all users (recommended) and
the Add Python 3.6 to PATH checkboxes at the bottom are
checked.
Python versions
Python is available in two versions:
• Python 2.x - The older “legacy” branch, will continue to be
supported (that is, receive official updates) through 2020, and it
might persist unofficially after that.
• Python 3.x - The current and future incarnation of the language, has
many useful and important features not found in 2.x, such as better
concurrency controls and a more efficient interpreter.
Using variables
Variable assignment
Rules for variable names
• Do not start the names with a number.
• Do not use spaces in names, use _ instead.
• Do not use any of these symbols in names:
:'",<>/?|\!@#%^&*~-+
• Follow the best practice of writing names in lowercase with
underscores.
• Avoid using Python keywords and names of built-ins, such as list and str.
• Avoid using the single characters l (lowercase letter el), O
(uppercase letter oh), and I (uppercase letter eye) as they can be
confused with 1 and 0.
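The first few rules can be checked mechanically. The sketch below is only an illustration: the helper name is_safe_name is hypothetical, and it does not check the lowercase-with-underscores style or the confusing single letters.

```python
import builtins
import keyword

def is_safe_name(name):
    """Return True if name is a legal identifier that shadows no keyword or built-in."""
    return (
        name.isidentifier()                 # no spaces, symbols, or leading digit
        and not keyword.iskeyword(name)     # e.g. 'for', 'class'
        and not hasattr(builtins, name)     # e.g. 'list', 'str'
    )

for candidate in ["my_total", "2nd_value", "my total", "list", "for"]:
    print(candidate, is_safe_name(candidate))
```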
Dynamic typing
Python uses dynamic typing, meaning we can reassign variables to
different data types. This makes Python very flexible in assigning data
types; it differs from other languages that are statically typed.
Pros:
This is very easy to work with and has a faster development time.
Cons:
It may result in unexpected bugs! So, you need to be aware of type().
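A minimal sketch of dynamic typing, using a made-up variable name: the same name is bound first to an integer and then to a list, and type() reports what it currently holds.

```python
my_value = 2
print(type(my_value))        # <class 'int'>

my_value = ["a", "b"]        # same name, now bound to a list
print(type(my_value))        # <class 'list'>
```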
Assigning variables
Variable assignment follows name = object, where a single equals sign = is
an assignment operator.
Re-assigning variables
Python lets us reassign variables with a reference to the same object.
There is another shortcut way of doing this. Python lets us add, subtract,
multiply, and divide numbers with reassignment using +=, -=, *=, and /=.
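The assignment and reassignment forms described above can be sketched as follows; the variable name and values are arbitrary.

```python
a = 5          # name = object
a = a + 10     # reassign using the current value: 15
a += 10        # shortcut for a = a + 10: 25
a -= 5         # 20
a *= 2         # 40
a /= 4         # true division always yields a float: 10.0
print(a)
```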
• Float
• str (for string)
• List
• Tuple
• dict (for dictionary)
• Set
• bool (for Boolean True/False)
Simple exercise
This shows how variables make calculations more readable and easier to
follow.
Creating a String
To create a string in Python we need to use either single quotes or double
quotes. For example:
The reason for the error above is because the single quote in I'm stopped
the string. You can use combinations of double and single quotes to get
the complete statement.
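The sketch below shows both quote styles with made-up strings. To demonstrate the quoting error without stopping the script, the broken literal is passed as text to the built-in compile() function; that is a choice made for this example only.

```python
greeting = 'hello'
phrase = "I'm using double quotes around a single quote"
print(greeting)
print(phrase)

# Wrapping I'm in single quotes ends the string early, so Python rejects it.
try:
    compile("'I'm broken'", "<example>", "eval")
    quote_error = None
except SyntaxError as err:
    quote_error = err
print(type(quote_error).__name__)
```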
Printing a string
We can use the print() function to print a string.
String basics
We can also use a function called len() to check the length of a string!
Python's built-in len() function counts all of the characters in the string,
including spaces and punctuation.
String indexing
In Python, we use brackets [] after an object to call its index. We should
also note that indexing starts at 0 for Python. Let's create a new object
called s and then walk through a few examples of indexing.
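A short sketch of indexing and slicing; the value 'Hello World' is assumed for illustration.

```python
s = 'Hello World'
print(s[0])     # first character: 'H'
print(s[-1])    # last character: 'd'
print(s[1:])    # from index 1 to the end: 'ello World'
print(s[:3])    # from index 0 up to, but not including, index 3: 'Hel'
```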
Note the above slicing. Here, we're telling Python to grab everything from
0 up to 3. It doesn't include the element at index 3. You'll notice this a lot
in Python, where statements are usually in the context of "up to, but not
including".
We can also use index and slice notation to grab elements of a sequence
by a specified step size (the default is 1). For instance, we can use two
colons in a row and then a number specifying the frequency to grab
elements. For example:
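A sketch of step sizes, again assuming the string 'Hello World':

```python
s = 'Hello World'
print(s[::1])    # step of 1, the whole string: 'Hello World'
print(s[::2])    # every second character: 'HloWrd'
print(s[::-1])   # a negative step walks backwards: 'dlroW olleH'
```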
String properties
It is important to note that strings have an important property known as
immutability. This means that once a string is created, the elements within
it cannot be changed or replaced. For example:
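A minimal sketch with an assumed string value. The failed assignment is wrapped in try/except only so that the script keeps running and can print the error message.

```python
s = 'Hello World'
try:
    s[0] = 'x'                 # strings do not support item assignment
except TypeError as err:
    print(err)

s = s + ' concatenate me!'     # builds a new string and rebinds the name
print(s)
```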
Notice how the error tells us directly what we can't do: item
assignment!
Something we can do is concatenate strings!
3. What would you use to find a number’s square root, as well as its
square?
4. Given the string 'hello', give an index command that returns 'e'.
6. Given the string 'hello', give two methods of producing the letter
'o' using indexing.
raw_input
raw_input is used to read text (strings) from the user. It exists only in
Python 2; in Python 3 it was renamed to input. raw_input does not
interpret the input. It always returns the input of the user without
changes, that is, raw. This raw input can be changed into the data type
needed for the algorithm. To accomplish this, we can use either a ‘casting’
function or the ‘eval’ function.
input
If the input function is called, the program flow will be stopped until the
user has given an input and has ended the input with the return key. The
text of the optional parameter, that is, the prompt, will be printed on the
screen. In Python 2, the input of the user will be interpreted. For example,
if the user puts in an integer value, the input function returns this integer
value. If the user on the other hand inputs a list, the function will return a
list. In Python 3, input does not interpret anything and always returns a
string, as raw_input did in Python 2.
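A minimal sketch of converting typed text, assuming Python 3. To keep the example non-interactive, fixed strings stand in for what the user would type. ast.literal_eval is used here instead of eval because it only accepts Python literals; that substitution is this example's choice, not something the text prescribes.

```python
import ast

# Stand-ins for text a user might type; in a real program each would come
# from input() (Python 3) or raw_input() (Python 2).
typed_age = "42"
typed_list = "[1, 2, 3]"

age = int(typed_age)                      # 'casting' function
numbers = ast.literal_eval(typed_list)    # evaluates Python literals only
print(age + 1, numbers)
```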
Indenting is significant
Python programs get structured through indentation, that is, code blocks
are defined by their indentation. In the case of Python, it is a language
requirement, not a matter of style. This principle makes it easier to read
and understand other people's Python code.
All statements with the same distance to the right belong to the same
block of code, that is, the statements within a block line up vertically. The
block ends at a line less indented or the end of the file. If a block must be
more deeply nested, it is simply indented further to the right.
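The nesting rules above can be sketched with a small, hypothetical function in which each deeper block is indented four more spaces:

```python
def describe(numbers):
    labels = []
    for n in numbers:              # block 1: body of the function
        if n % 2 == 0:             # block 2: body of the for loop
            labels.append("even")  # block 3: body of the if
        else:
            labels.append("odd")
    return labels                  # back at block 1, so the loop has ended

print(describe([1, 2, 3]))
```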
Boolean
Boolean values are the two constant objects False and True. They are
used to represent truth values (other values can also be considered false
or true). In numeric contexts (for example, when used as the argument to
an arithmetic operator), they behave like the integers 0 and 1,
respectively.
The built-in function bool() can be used to cast any value to a Boolean, if
the value can be interpreted as a truth value. The two Boolean values are
written as False and True.
A string in Python can be tested for truth value. The return type will be in
Boolean value (True or False). Let’s make an example, by first creating a
new variable, and giving it a value.
To see what the return value (True or False) will be, simply print it out.
my_string="Hello World"
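A short sketch of testing truth values, reusing my_string from the text; the particular checks are chosen for illustration.

```python
my_string = "Hello World"
print(bool(my_string))       # non-empty string: True
print(bool(""))              # empty string: False
print(my_string.isalpha())   # False, because of the space
print(True + True)           # Booleans act like 1 and 0 in arithmetic: 2
```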
"Hey if this case happens, perform some action. Else, if another case
happens, perform some other action. Else, if none of the above cases
happen, perform this action."
Syntax
if case1:
    perform action1
elif case2:
    perform action2
else:
    perform action3
If the first condition (case1) is True, its block (action1) is executed. If
not, case2 is evaluated. If case2 evaluates to True, action2 is executed; if
case2 is False, the conditions of any following ‘elif’ branches are checked
in order, and finally, if none of them evaluates to True, the indented block
below the else keyword is executed.
Note how the nested if statements are each checked until a True Boolean
causes the nested code below it to run. We should also note that we can put
in as many ‘elif’ statements as we want before we close off with an ‘else’.
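A sketch of a multi-branch test; the function name and the place names are made up for illustration.

```python
def place_message(loc):
    if loc == 'Auto Shop':
        return 'Welcome to the Auto Shop!'
    elif loc == 'Bank':
        return 'Welcome to the bank!'
    elif loc == 'Store':
        return 'Welcome to the store!'
    else:
        return 'Where are you?'

print(place_message('Bank'))
```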
While loops
The ‘while’ statement in Python is one of the most general ways to perform
iteration. A ‘while’ statement will repeatedly execute a single statement or
group of statements as long as the condition is true. The reason that it is
called a 'loop' is because the code statements are looped through over
and over again until the condition is no longer met.
Syntax
while test:
    code statements
else:
    final code statements
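A minimal sketch of this syntax, assuming a counter x that starts at 0 and a loop that runs while x is below 10:

```python
x = 0
while x < 10:
    print('x is currently:', x)
    x += 1
else:
    print('All done! x is now', x)
```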
Notice how many times the print statements occurred and how the ‘while’
loop kept going until its condition was no longer true, which occurred once
x==10. It is important to note that once this occurred, the code stopped.
We can also add an ‘else’ statement to the loop as shown below. When the
loop completes, the ‘else’ statement is read.
Syntax:
while test:
    code statement
    if test:
        break
    if test:
        continue
else:
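A sketch of break and continue inside a while loop; the thresholds 3 and 6 are arbitrary. Note that the ‘else’ block is skipped when the loop ends through break.

```python
seen = []
x = 0
while x < 10:
    x += 1
    if x == 3:
        continue              # skip the rest of this pass, go back to the test
    if x == 6:
        break                 # leave the loop immediately
    seen.append(x)
else:
    seen.append('finished')   # skipped, because the loop ended with break

print(seen)
```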
Using lists
Lists can be thought of as the most general version of a sequence in
Python. Unlike strings, they are mutable, meaning the elements inside a
list can be changed. They are constructed with brackets [] and commas
separating every element in the list and can actually hold different object
types.
Just like strings, the len() function will tell you how many items are in the
sequence of the list.
We can also use + to concatenate lists, just like we did for strings.
You would have to reassign the list to make the change permanent.
• pop
• sort
• reverse
Use the append method to permanently add an item to the end of a list.
Use pop to "pop off" an item from the list. By default, pop takes off the last
index, but we can also specify which index to pop off.
We can also use the sort and reverse methods to change a list in
place.
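The list operations described above can be sketched together; the list contents are made up.

```python
my_list = ['one', 'two', 'three']
print(len(my_list))              # 3

print(my_list + ['four'])        # a new list; my_list itself is unchanged
my_list = my_list + ['four']     # reassign to keep the change

my_list.append('five')           # adds to the end, in place
last = my_list.pop()             # removes and returns the last item: 'five'
first = my_list.pop(0)           # removes and returns index 0: 'one'

letters = ['x', 'a', 'c']
letters.sort()                   # in place: ['a', 'c', 'x']
letters.reverse()                # in place: ['x', 'c', 'a']
print(my_list, letters)
```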
Nesting lists
A great feature of Python data structures is that they support nesting. This
means we can have data structures within data structures. For example, a
list inside a list.
We can again use indexing to grab elements, but now there are two levels
for the index: the items in the matrix object, and then the items inside that
list.
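A sketch of a nested list named matrix, as in the text; the row names and numbers are assumed.

```python
row_1 = [1, 2, 3]
row_2 = [4, 5, 6]
row_3 = [7, 8, 9]
matrix = [row_1, row_2, row_3]   # a list of lists

print(matrix[0])       # first row: [1, 2, 3]
print(matrix[0][0])    # first item of the first row: 1
```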
List comprehensions
List comprehensions provide a concise way to create lists. A comprehension consists of
brackets containing an expression followed by a ‘for’ clause, then zero or
more ‘for’ or ‘if’ clauses. The expressions can be anything, meaning you
can put in all kinds of objects in lists. The result will be a new list, resulting
from evaluating the expression in the context of the ‘for’ and ‘if’ clauses,
which follow it. The list comprehension always returns a result list.
The list comprehension starts with a '[' and ends with a ']' to help you
remember that the result is going to be a list.
The basic syntax is:
[ expression for item in list if conditional ]
This is equivalent to:
for item in list:
    if conditional:
        expression
A list comprehension is used here to grab the first element of every row in
the matrix object.
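A sketch using an assumed 3x3 matrix, followed by a conditional comprehension written both as a comprehension and as the equivalent loop:

```python
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

first_col = [row[0] for row in matrix]
print(first_col)                         # [1, 4, 7]

# With a condition, written both ways for comparison.
evens = [n for n in range(10) if n % 2 == 0]
evens_loop = []
for n in range(10):
    if n % 2 == 0:
        evens_loop.append(n)
print(evens == evens_loop)               # True
```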
Dictionaries
A dictionary is a collection that is unordered, changeable, and indexed. In
Python, dictionaries are written with curly brackets, and they have keys
and values.
A dictionary can be constructed in the following manner:
It is important to note that dictionaries are very flexible in the data types
they can hold.
We can also create keys by assignment. For instance, if we started off with
an empty dictionary, we could continually add to it:
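A minimal sketch of both ways to build a dictionary; the keys and values are made up for illustration.

```python
my_dict = {'key1': 123, 'key2': [12, 23, 33], 'key3': ['item0', 'item1']}
print(my_dict['key2'][0])       # index into the list stored under 'key2': 12

d = {}                          # start empty and add keys by assignment
d['animal'] = 'Dog'
d['answer'] = 42
print(d)
```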
Another common idea during a ‘for’ loop is keeping some sort of running
tally during multiple loops.
Loops can also be used with strings. Strings are a sequence, so when we
iterate through them, we will be accessing each item in that string.
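Both ideas can be sketched briefly; the numbers and the sentence are arbitrary.

```python
list_sum = 0
for num in [1, 2, 3, 4, 5]:
    list_sum += num            # running tally
print(list_sum)                # 15

letters = []
for letter in 'This is a string.':
    letters.append(letter)     # one character per pass
print(len(letters))            # 17
```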
Tuples
Tuples are very similar to lists. However, unlike lists, they are immutable,
meaning they cannot be changed. You would use tuples to present things
that should not be changed, such as days of the week, or dates on a
calendar.
The construction of a tuple uses () with elements separated by commas.
Tuples have a special quality when it comes to ‘for’ loops. If you are
iterating through a sequence that contains tuples, the item can actually be
the tuple itself. This is where tuple unpacking comes in: during the ‘for’
loop, we can unpack each tuple inside the sequence and
access the individual items inside that tuple.
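A sketch of tuple immutability and of unpacking in a ‘for’ loop; the values are made up.

```python
t = (1, 2, 3)
try:
    t[0] = 'change'            # tuples do not support item assignment
except TypeError as err:
    print(err)

pairs = [(2, 4), (6, 8), (10, 12)]
firsts = []
for a, b in pairs:             # each tuple is unpacked into a and b
    firsts.append(a)
print(firsts)                  # [2, 6, 10]
```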
Mode   Description
‘r’    This is the default mode. It opens the file for reading.
‘w’    This mode opens the file for writing. If the file does not exist, it
       creates a new file. If the file exists, it truncates the file.
‘x’    This creates a new file. If the file already exists, the operation fails.
‘a’    This opens a file in append mode. If the file does not exist, it creates
       a new file.
‘t’    This is the default mode. It opens the file in text mode.
‘b’    This opens the file in binary mode.
‘+’    This opens a file for reading and writing (updating).
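A sketch of several modes in use. It writes to a temporary directory so that no real file is touched; the file name example.txt is hypothetical.

```python
import os
import tempfile

# A throwaway path so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as f:        # 'w': create, or truncate if it exists
    f.write("first line\n")
with open(path, "a") as f:        # 'a': append to the end
    f.write("second line\n")
with open(path, "r") as f:        # 'r': read (text mode is the default)
    lines = f.read().splitlines()

try:
    open(path, "x")               # 'x': fails because the file already exists
except FileExistsError:
    print("file already exists")
print(lines)
```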
if __name__ == "__main__":
    main()
1. Use ‘for’, .split(), and ‘if’ to create a statement that will print out
words that start with ‘s’:
st = 'Print only the words that start with s in this sentence'
3. Go through the string below and if the length of a word is even, print
"even!"
st = 'Print every word in this sentence that has an even number of letters'
Python Libraries
The table represents the data of a sales team of an organization, with their
overall performance rating. The data is represented in rows and columns.
Each column represents an attribute and each row represents a person.
The data types of the four columns can be found in the following table.
Column Type
Name String
Age Integer
Gender String
Rating Float
pandas.Series
Series is a one-dimensional labeled array capable of holding data of any
type (integer, string, float, python objects, and so on). The axis labels are
collectively called index.
A pandas series can be created using the following constructor.
pandas.Series(data, index, dtype, copy)
Parameter   Description
data        data takes various forms such as ndarray, list, constants
index       Index values must be unique and hashable, same length as data.
            Default np.arange(n), if no index is passed
dtype       dtype is for data type. If None, data type will be inferred
copy        Copy data. Default False
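As one illustration, a Series can be built from a dictionary, in which case the keys become the index labels. The values below are assumed for the example.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data)        # dict keys become the index labels
print(s)
```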
a 0.0
b 1.0
c 2.0
dtype: float64
pandas.DataFrame
A data frame is a two-dimensional data structure, that is, data is aligned in
a tabular fashion in rows and columns. A Pandas data frame can be
created using the following constructor.
pandas.DataFrame(data, index, columns, dtype, copy)
Parameter   Description
data        data takes various forms such as ndarray, series, map, lists,
            dict, constants, and also another DataFrame.
index       Row labels to use for the resulting frame. Optional; defaults to
            np.arange(n) if no index is passed.
columns     Column labels to use for the resulting frame. Optional; defaults
            to np.arange(n) if no column labels are passed.
dtype       Data type of each column.
copy        Whether to copy the input data. Default is False.
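A minimal sketch of the constructor, building a frame from a dictionary of lists with custom row labels. The names, ages, and ratings are made up.

```python
import pandas as pd

data = {
    'Name': ['Tom', 'James', 'Ricky'],
    'Age': [25, 26, 25],
    'Rating': [4.23, 3.24, 3.98],
}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3'])
print(df)
print(df.dtypes)
```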
In the following example, all NaN (missing values) are replaced with the
string ‘VARIOUS’ in order to remove the NaN/missing values from the
data. Replacing the missing values is a business decision and it depends
on the data you are working on.
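A sketch of this replacement using fillna(); the column names and values are hypothetical stand-ins for the real data set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Artist': ['Ana', np.nan, 'Luis'],
                   'Genre': ['Jazz', 'Rock', np.nan]})
filled = df.fillna('VARIOUS')      # returns a new frame; df is unchanged
print(filled)
```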
df.groupby('key').sum()
The sum() method is just one possibility here; you can apply virtually any
common Pandas or NumPy aggregation function, as well as virtually any
valid DataFrame operation.
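A small sketch of the split-and-sum idea on made-up data, with a second aggregation for comparison:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})
totals = df.groupby('key').sum()           # one row per distinct key
print(totals)
print(df.groupby('key')['data'].max())     # another aggregation
```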
GroupBy object
The GroupBy object is a very flexible abstraction. In many ways, you can
simply treat it as if it is a collection of DataFrames, and it does the difficult
things under the hood.
Aggregation
An aggregation function returns a single aggregated value for each group.
Once the GroupBy object is created, several aggregation operations can be
performed on the grouped data.
Here is an example of the aggregate or equivalent agg method:
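The data set behind this example is not included here, so the sketch below uses illustrative Year and Points values, chosen so that the yearly means agree with the output that follows.

```python
import pandas as pd

df = pd.DataFrame({
    'Year':   [2014, 2015, 2014, 2015, 2014, 2015,
               2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812,
               756, 788, 694, 701, 804, 690],
})
grouped = df.groupby('Year')
mean_points = grouped['Points'].agg('mean')   # same as .aggregate('mean')
print(mean_points)
```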
Year
2014 795.25
2015 769.50
2016 725.00
2017 739.00
Name: Points, dtype: float64
Below is an example of the size() function:
Function     Description
mean()       Compute mean of groups
sum()        Compute sum of group values
size()       Compute group sizes
count()      Compute count of group
std()        Standard deviation of groups
var()        Compute variance of groups
sem()        Standard error of the mean of groups
describe()   Generate descriptive statistics
first()      Compute first of group values
last()       Compute last of group values
nth()        Take nth value, or a subset if n is a list
min()        Compute min of group values
The resulting aggregations are named for the functions themselves. If you
need to rename, then you can add in a chained operation for a Series like
this:
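A sketch of renaming through a chained rename() call; the data and the new column names are made up.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B'], 'Points': [10, 20, 30]})
stats = df.groupby('Team')['Points'].agg(['sum', 'mean'])
renamed = stats.rename(columns={'sum': 'total', 'mean': 'average'})
print(renamed)
```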
For example, here is an apply() that normalizes the first column by the
sum of the second:
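A hedged sketch of such a function on made-up data; the names norm_by_data2, data1, and data2 are assumptions. The data columns are selected explicitly and group_keys=False is passed so that the result keeps the original row labels across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [5, 0, 3, 3, 7, 9]})

def norm_by_data2(group):
    # group is a DataFrame holding the rows of a single key
    group = group.copy()
    group['data1'] = group['data1'] / group['data2'].sum()
    return group

normalized = (df.groupby('key', group_keys=False)[['data1', 'data2']]
                .apply(norm_by_data2)
                .sort_index())
print(normalized)
```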
apply() within a GroupBy is quite flexible. The only criterion is that the
function takes a DataFrame and returns a Pandas object or scalar. What
you do in the middle is up to you.
• on − Columns (names) to join on. Must be found in both the left and
right DataFrame objects.
• left_on − Columns from the left DataFrame to use as keys. Can either
be column names or arrays with length equal to the length of the
DataFrame.
• right_on − Columns from the right DataFrame to use as keys. Can
either be column names or arrays with length equal to the length of
the DataFrame.
• left_index − If True, use the index (row labels) from the left
DataFrame as its join key(s). In case of a DataFrame with a
MultiIndex (hierarchical), the number of levels must match the
number of join keys from the right DataFrame.
• right_index − Same usage as left_index for the right DataFrame
• how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each
method is described later.
• sort − Sort the result DataFrame by the join keys in lexicographical
order. Defaults to True, setting to False will improve the
performance substantially in many cases.
Let us now create two different DataFrames and perform the merging
operations on them.
Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5
Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
Below is an example of merging two DataFrames on multiple keys:
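The sketch below rebuilds the two DataFrames from the tables shown above and merges them with pd.merge(); passing a list to the on parameter is what makes the merge use several keys.

```python
import pandas as pd

left = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'id': [1, 2, 3, 4, 5],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'id': [1, 2, 3, 4, 5],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Rows match only when both id and subject_id agree (inner join by default).
merged = pd.merge(left, right, on=['id', 'subject_id'])
print(merged)
```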
We are merging the two dataframes “left” and “right” using the keys “id”
and “subject_id”. Hence, rows are matched only where both “id” and
“subject_id” have the same values in the two dataframes, giving the
output as:
Name_x id subject_id Name_y
0 Alice 4 sub6 Bryce
1 Ayoung 5 sub5 Betty
in either the left or the right tables, the values in the joined table will be
NA.
Here is a summary of the ‘how’ options and their SQL equivalent names.
We are merging the two dataframes “left” and “right” on the key “subject_id”,
keeping every key from the “left” dataframe (a left join). The values of
“subject_id” in dataframe “left” are therefore the base of the merge, giving the
output as:
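A sketch of a left join with how='left'. To keep it short, the “id” column is left out here, which is a simplification of the tables shown earlier. Keys that exist only in “left” get a missing value on the right-hand side.

```python
import pandas as pd

left = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Every row of left is kept; sub1 has no partner in right.
result = pd.merge(left, right, on='subject_id', how='left')
print(result)
```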
While using Inner Join, joining will be performed on index. Join operation
honors the object on which it is called. So, a.join(b) is not equal to
b.join(a).
a. sem()
b. var()
c. size()
Error Handling
Explanation:
• The parser repeats the offending line and displays a little ‘arrow’
pointing at the earliest point in the line where the error was
detected.
• The error is caused by (or at least detected at) the token preceding
the arrow. In the example, the error is detected at the keyword print,
since a colon (':') is missing before it.
• File name and line number are printed so you know where to look in
case the input came from a script.
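The points above can be inspected from code. The sketch below assumes a statement with a missing colon and hands it to the built-in compile() function as text, so that the SyntaxError can be caught and examined instead of stopping the script.

```python
source = "while True print('Hello world')"

try:
    compile(source, "<example>", "exec")
except SyntaxError as err:
    caught = err
    print(err.filename, err.lineno)   # where to look
    print(err.text)                   # the offending line
    print(err.msg)                    # wording varies between Python versions
```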
Exceptions
Even if a statement or expression is syntactically correct, it may cause an
error when an attempt is made to execute it. Errors detected during
execution are called exceptions and are not unconditionally fatal.
Explanation:
• The last line of the error message indicates what happened.
Exceptions come in different types, and the type is printed as part of
the message. The types in the example are:
o ZeroDivisionError
o NameError and
o TypeError
• The string printed as the exception type is the name of the built-in
exception that occurred. This is true for all built-in exceptions but
need not be true for user-defined exceptions (although it is a useful
convention).
• Standard exception names are built-in identifiers (not reserved
keywords).
• The rest of the line provides detail based on the type of exception
and what caused it.
• The preceding part of the error message shows the context where
the exception happened, in the form of a stack traceback. In general,
it contains a stack traceback listing source lines; however, it will not
display lines read from standard input.
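A sketch that triggers the three exception types named above, plus the non-iterable integer case. The helper try_expression is hypothetical; it evaluates a piece of text and reports the exception name, which matches the first part of the last line of a real traceback.

```python
def try_expression(text):
    """Evaluate text and return the name of the exception it raises, if any."""
    try:
        eval(text, {})
    except Exception as err:
        print(type(err).__name__ + ":", err)
        return type(err).__name__
    return None

expressions = ("10 * (1/0)", "4 + spam*3", "'2' + 2", "list(5)")
results = [try_expression(t) for t in expressions]
```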
The integer data type is not iterable and trying to iterate over it will
produce a type error.
Name Error
Name Error can occur when we try and refer to a variable that has not
been defined.
The below table lists the standard exceptions available in Python.
The ‘try … except’ statement has an optional else clause, which, when
present, must follow all except clauses. It is useful for code that must be
executed if the ‘try’ clause does not raise an exception. The use of the
‘else’ clause is better than adding additional code to the ‘try’ clause
because it avoids accidentally catching an exception that was not raised
by the code being protected by the ‘try … except’ statement.
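A minimal sketch of the else clause; the function safe_divide is made up for illustration.

```python
def safe_divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        return "cannot divide by zero"
    else:
        # Runs only if the try block raised nothing; errors raised here
        # are not caught by the except clause above.
        return "result is {}".format(result)

print(safe_divide(6, 3))
print(safe_divide(1, 0))
```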
a. OverflowError
b. RuntimeError
c. StandardError
4. Name the error that is raised when the user interrupts execution by
pressing Ctrl+c.
Other Topics
RE objects
A regular expression (RE) in a programming language is a special text
string used for describing a search pattern. It is extremely useful for
extracting information from text such as code, files, logs, spreadsheets, or
even documents.
Regular expressions can contain both special and ordinary characters.
Most ordinary characters such as 'A', 'a', or '0' are the simplest regular
expressions. These characters simply match themselves.
Some characters such as '|' or '(' are special. Special characters either
stand for classes of ordinary characters or affect how the regular
expressions around them are interpreted.
Repetition qualifiers (*, +, ?, {m,n}, and so on) cannot be directly nested.
This avoids ambiguity with the non-greedy modifier suffix ‘?’, and with
other modifiers in other implementations. To apply a second repetition to
an inner repetition, parentheses may be used.
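A sketch of this rule with an assumed pattern: applying * directly to a{6} is rejected, while wrapping the inner repetition in a (non-capturing) group is accepted.

```python
import re

try:
    re.compile(r"a{6}*")            # a repetition applied directly to a repetition
    nested_ok = True
except re.error:
    nested_ok = False
print(nested_ok)

pattern = re.compile(r"(?:a{6})*")  # parentheses make the inner repetition a unit
print(bool(pattern.fullmatch("a" * 12)))   # True: two groups of six
print(bool(pattern.fullmatch("a" * 7)))    # False: not a multiple of six
```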
import re
Regular Expression
In Python, regular expressions (also called REs, regexes, or regex
patterns) are made available by importing the ‘re’ module, which is part of
the standard library. Python regular expressions support
various things such as Modifiers, Identifiers, and White space characters.
Identifiers                               Modifiers                              White space characters
\s = space (tab, space, newline, etc.)    ? = matches 0 or 1                     \t = tab
\S = anything but a space                 * = 0 or more                          \e = escape (not supported by Python's re)
\w = letters (matches an alphanumeric     $ = match end of a string              \r = carriage return
     character, including "_")
\W = anything but letters (matches a      ^ = match start of a string            \f = form feed
     non-alphanumeric character,
     excluding "_")
. = any character except for new line     | = matches either x or y              -----------------
\b = word boundary                        [] = range or "variance"               -----------------
\. = a literal period                     {x} = this amount of preceding code    -----------------
• Scan through the string looking for the first location where this
regular expression produces a match and return a corresponding
match object.
Pattern matching
• Pattern.split(string, maxsplit=0)
Identical to the split() function, using the compiled pattern.
• Pattern.flags
The regex matching flags. This is a combination of the flags given to
compile(), any (?...) inline flags in the pattern, and implicit flags such as
UNICODE if the pattern is a Unicode string.
• Pattern.groups
The number of capturing groups in the pattern
• Pattern.pattern
The pattern string from which the pattern object was compiled
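The attributes and the split() method described above can be inspected on any compiled pattern. The sketch below uses two assumed patterns, chosen only for illustration.

```python
import re

# Compile a pattern with two capturing groups and one explicit flag.
pattern = re.compile(r"(\w+)@(\w+)\.com", re.IGNORECASE)

print(pattern.pattern)                      # the original pattern string
print(pattern.groups)                       # number of capturing groups
print(bool(pattern.flags & re.IGNORECASE))  # the flag given to compile() is set

# Pattern.split() works like re.split(), but uses the compiled pattern.
separator = re.compile(r",\s*")
parts = separator.split("a, b,c", maxsplit=1)
print(parts)
```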
>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
... print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly
search() vs. match()
Python offers two different primitive operations based on regular
expressions:
• re.match() checks for a match only at the beginning of the string.
• re.search() checks for a match anywhere in the string.
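The difference between the two operations shows up when the pattern does not occur at the start of the string. A minimal sketch, with an assumed sample string:

```python
import re

text = "say hello"  # assumed sample string

# re.match() only tries at position 0, so it finds nothing here.
at_start = re.match(r"hello", text)

# re.search() scans the whole string and finds the word at index 4.
anywhere = re.search(r"hello", text)

print(at_start)
print(anywhere.start(), anywhere.group(0))
```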
Parsing data
Parsing is the process of analyzing a string of symbols, either in natural
language, computer languages, or data structures, conforming to the rules
of a formal grammar. There are fundamentally three ways to parse a
language or document from Python:
• Use an existing library supporting that specific language:
o The first option is the best for well-known and supported
languages such as XML or HTML.
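As an illustration of the first option, the standard library module xml.etree.ElementTree can parse XML without any hand-written grammar. The XML snippet below is an assumed example, not data from this module.

```python
import xml.etree.ElementTree as ET

# Assumed sample document for illustration.
xml_text = (
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book>"
    "</library>"
)

root = ET.fromstring(xml_text)

# Walk the parsed tree instead of matching the raw text by hand.
titles = [book.findtext("title") for book in root.findall("book")]
ids = [book.get("id") for book in root.iter("book")]

print(titles, ids)
```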
4. Find the prepositions and their position using finditer() for the
following sentence: Ramesh could not go out of his house due to the
cyclone warning in his area.
Introduction to regression
Outliers
Suppose there is an observation in the dataset that has a very high or very
low value as compared to the other observations in the data, i.e., it does
not belong to the population. Such an observation is called an outlier. In
Multicollinearity
When the independent variables are highly correlated with each other, the
variables are said to be multicollinear. Many types of regression
techniques assume that multicollinearity is not present in the dataset,
because it makes it difficult to rank variables by their importance and to
select the most important independent variable (factor).
Heteroscedasticity
When the variability of the dependent variable is not equal across values of
an independent variable, it is called heteroscedasticity. For example, as
one’s income increases, the variability of food consumption will increase. A
poorer person will spend a rather constant amount by always eating
inexpensive food. However, a wealthier person may occasionally buy
inexpensive food and at other times eat expensive meals. Those with
higher incomes display a greater variability of food consumption.
Overfitting
Using unnecessary explanatory variables can lead to overfitting.
Overfitting means that our algorithm works well on the training set but
performs poorly on the test set. It is also known as the problem of high
variance.
Underfitting
When our algorithm works so poorly that it is unable to fit even the training
set well, then it is said to underfit the data. It is also known as the problem
of high bias.
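Both problems can be observed by fitting polynomials of different degrees to the same data. The sketch below is an assumption-laden illustration: the data are synthetic (a noisy quadratic), and the degrees 0, 2, and 9 are arbitrary choices standing for a too-simple, a suitable, and a too-flexible model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: y = x^2 plus a little noise (assumed for illustration)."""
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.1, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (0, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

The degree-0 model cannot follow the curve even on the training set (underfitting). The degree-9 model always fits the training set at least as well as the degree-2 model, so a gap between its training and test error is the sign of overfitting to look for.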
Types of regression
There are various kinds of regression techniques available to make
predictions. These techniques are mostly driven by three metrics (number
of independent variables, type of dependent variables, and shape of
regression line).
Linear Regression
Linear Regression is one of the most widely known modeling techniques.
In this technique, the dependent variable is continuous, the independent
variable(s) can be continuous or discrete, and the regression line is
linear.
Linear Regression establishes a relationship between the dependent variable
(Y) and one or more independent variables (X) using a best fit straight
line (also known as the regression line). It is represented by the equation
Y=a+b*X + e, where a is the intercept, b is the slope of the line, and e is
the error term. This equation can be used to predict the value of the target
variable based on the given predictor variable(s).
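The equation Y=a+b*X can be fitted to data with numpy.polyfit, which performs a least squares fit. The numbers below are assumed sample values, not data from this module.

```python
import numpy as np

# Assumed sample data: Y is roughly 2 + 3*X with small deviations.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# A degree-1 polynomial fit returns the slope b first, then the intercept a.
b, a = np.polyfit(X, Y, 1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")

# Use the fitted line to predict the target for a new predictor value.
x_new = 6.0
y_pred = a + b * x_new
print(f"predicted Y at X = {x_new}: {y_pred:.2f}")
```

The error term e is not estimated as a single number; it is the gap between each observed Y and the value the fitted line gives for the same X.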
Logistic Regression
Logistic regression is used to find the probability of event=Success
and event=Failure. We should use logistic regression when the
dependent variable is binary (0/1, True/False, Yes/No) in nature.
Here the value of Y ranges from 0 to 1, and it can be represented by the
following equation.
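A common way to write this relationship, taken here as an assumption rather than as the exact equation of this module, is the logistic function p = 1 / (1 + e^-(a + b*X)). The coefficients a and b in the sketch are hypothetical values.

```python
import math

def logistic(x, a=0.0, b=1.0):
    """Logistic function: maps any real input to a probability between 0 and 1.

    a (intercept) and b (slope) are hypothetical coefficients for illustration.
    """
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

for x in (-4, 0, 4):
    print(x, round(logistic(x), 3))
```

Whatever the value of X, the output stays between 0 and 1, which is why it can be read as the probability of event=Success.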
the sample values rather than minimizing the sum of squared errors (like
in ordinary regression).
5. It also depends on your objective. A less powerful model may be
easier to implement than a highly statistically significant
model.
6. Regression regularization methods (Lasso, Ridge, and ElasticNet)
work well in cases of high dimensionality and multicollinearity
among the variables in the data set.
3. Why is the Least Square Method the correct method for finding the best
fit line?
EDA in Python
Multiple libraries are available to perform basic EDA, but we are going to
use pandas and matplotlib.
• Pandas for data manipulation
• Matplotlib for plotting graphs
We shall, now, look at various exploratory data analysis methods:
• Descriptive Statistics, which is a way of giving a brief overview of
the dataset we are dealing with, including some measures and
features of the sample
• Grouping data, which is basic grouping with groupby
Descriptive Statistics
DF.describe()
DF["<category>"].value_counts()
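The two calls above, together with a basic groupby, can be tried on a small DataFrame. The column names and values below are assumptions made for illustration.

```python
import pandas as pd

# Assumed sample data with one categorical and one numeric column.
DF = pd.DataFrame({
    "category": ["a", "b", "a", "c", "a", "b"],
    "value": [10, 20, 30, 40, 50, 60],
})

# Descriptive statistics (count, mean, std, min, quartiles, max) of numeric columns.
summary = DF.describe()
print(summary)

# How often each category occurs.
counts = DF["category"].value_counts()
print(counts)

# Basic grouping: mean of 'value' within each category.
group_means = DF.groupby("category")["value"].mean()
print(group_means)
```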
Feature Engineering
Feature engineering is the process of using domain knowledge of the data
to create features that make machine learning algorithms work. Feature
engineering can be used to increase the predictive power of learning
algorithms by creating features from raw data that will help the learning
process. This can be done by creating additional relevant features from
the existing raw features in the data.
Feature engineering takes time to master. It is not always clear how the
raw data can be transformed so that it improves the predictive power of a
model.
Correlation Matrix
Correlation is a statistical relationship between two variables in a
context, such that the two variables tend to change together. Correlation
is different from causation. For example, sales might increase when the
marketing department spends more on TV advertisements, or a customer's
average purchase amount on an e-commerce website might depend on a number
of factors related to that customer. Often, correlation is the first step to
understanding these relationships and subsequently building better
business and statistical models.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': np.random.randint(0, 50, 1000)})
df['b'] = df['a'] + np.random.normal(0, 10, 1000)  # positively correlated with 'a'
df['c'] = 100 - df['a'] + np.random.normal(0, 5, 1000)  # negatively correlated with 'a'
df['d'] = np.random.randint(0, 50, 1000)  # not correlated with 'a'
df.corr()
plt.matshow(df.corr())
plt.xticks(range(len(df.columns)), df.columns)
plt.yticks(range(len(df.columns)), df.columns)
plt.colorbar()
plt.show()
Installing Matplotlib
• Using Anaconda, we can install Matplotlib from terminal or
command prompt using:
conda install matplotlib
Anatomy of a Plot
There are two key components in a Plot, namely Figure and Axes.
• The Figure is the top-level container that acts as the window or page
on which everything is drawn. It can contain multiple independent
figures, multiple Axes, a suptitle (which is a centered title for the
figure), a legend, a color bar, and so on.
• The Axes is the area on which we plot our data and any labels/ticks
associated with it. Each Axes has an X-Axis and a Y-Axis (as in the
image above).
Now that we have a plot, let’s name the x-axis and y-axis and add a title
using .xlabel(), .ylabel(), and .title():
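A minimal sketch of these three calls, using assumed sample data and label texts:

```python
import matplotlib.pyplot as plt

# Assumed sample data for illustration.
x = [1, 2, 3, 4, 5]
y = [value ** 2 for value in x]

plt.plot(x, y)
plt.xlabel("X label")       # name the x-axis
plt.ylabel("Y label")       # name the y-axis
plt.title("Y = X squared")  # add a title above the plot
```

Calling plt.show() afterwards displays the plot in a window or notebook cell.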
Matplotlib allows us to easily create multiple plots on the same figure
using the .subplot() method. This .subplot() method takes in three
parameters, namely:
1. nrows: The number of rows the Figure should have
2. ncols: The number of columns the Figure should have
3. plot_number: Which refers to a specific plot in the Figure
Notice how the two plots have different colors. This is because we need
to be able to differentiate the plots. This is possible by simply setting
the color attribute to ‘red’ and ‘green’.
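The two-plot layout described above can be sketched as follows. The data are assumed sample values; the layout is one row and two columns.

```python
import matplotlib.pyplot as plt

# Assumed sample data for illustration.
x = [1, 2, 3, 4, 5]
y = [value ** 2 for value in x]

plt.figure()

# 1 row, 2 columns, first plot.
plt.subplot(1, 2, 1)
plt.plot(x, y, color="red")

# 1 row, 2 columns, second plot.
plt.subplot(1, 2, 2)
plt.plot(y, x, color="green")
```

Each call to plt.subplot() makes the chosen plot the current one, so the plt.plot() call that follows draws into it.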
As you can see, we have a blank set of axes. Now let’s plot our x and y
arrays on it:
We can further add x and y labels and a title to our plot the same way we
did in the Function approach, but there’s a slight difference here. Using
.set_xlabel(), .set_ylabel(), and .set_title(), let us go ahead and add labels
and a title to our plot:
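In the object-oriented approach the same labels are set through methods of the Axes object. The sketch below assumes sample data and uses plt.subplots() to create the Figure and a single Axes, which may differ from how the Axes was created earlier in this module.

```python
import matplotlib.pyplot as plt

# Assumed sample data for illustration.
x = [1, 2, 3, 4, 5]
y = [value ** 2 for value in x]

# Create a Figure and one Axes, then work with the Axes object directly.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X label")
ax.set_ylabel("Y label")
ax.set_title("Y = X squared")
```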
We noted that a Figure can contain multiple figures. Let’s try to put in two
sets of figures on one canvas:
Now let’s plot our x and y arrays on the axes we have created:
fig.savefig('my_figure.png')
Plot Appearance
Matplotlib gives us a lot of options for customizing the appearance of our
plots. By now, you should be familiar with changing line color using
color='red' or 'red' like we did in previous examples. Now we want to
change the line width using linewidth or lw, the line style using
linestyle or ls, and mark out data points using marker.
Scatter plots
They offer a convenient way to visualize how two numeric values are
related in your data. They help in understanding relationships between
multiple variables. Using the .scatter() method, we can create a scatter plot:
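A minimal scatter plot sketch, with assumed sample values for the two numeric variables:

```python
import matplotlib.pyplot as plt

# Assumed sample data: two numeric variables measured on the same items.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 70, 74, 83]

fig, ax = plt.subplots()
ax.scatter(hours_studied, exam_score)  # one marker per (x, y) pair
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
```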
Bar graphs
These are convenient for comparing numeric values of several groups.
Using .bar() method, we can create a bar graph:
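A minimal bar graph sketch, with assumed group names and values:

```python
import matplotlib.pyplot as plt

# Assumed sample data: one numeric value per group.
groups = ["A", "B", "C"]
values = [12, 30, 21]

fig, ax = plt.subplots()
bars = ax.bar(groups, values)  # one bar per group
ax.set_ylabel("Value")
```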
Plot these lists using a scatter plot. Assume xs as the independent variable
and ys as the dependent variable.
To see the slope and intercept for xs and ys, we just need to call the
function slope_intercept:
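The definition of slope_intercept is not shown here. One possible implementation, given purely as an assumption, uses the closed-form least squares formulas; the lists xs and ys below are also assumed sample values.

```python
def slope_intercept(xs, ys):
    """Least squares slope and intercept of the best fit line (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Assumed sample lists for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope, intercept = slope_intercept(xs, ys)
print(slope, intercept)
```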
Objective
We will use customer data that closely mimics telecom industry data to
predict the potential churn. We will also try to predict the ARPU (Average
Revenue Per User) using the data.
Methodology
Churn Analysis:
1. Logistic Regression
2. SVM
ARPU Analysis:
1. Linear Regression
Data Ingestion
Ingest the data from the CSV or Excel Sheet
Data Curation
Curate the data by properly treating the categorical values and treating the
missing and null values
Data Preparation
Prepare the data by creating a Train-set and Test-set. Sometimes, we
also have a blind-set.
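The three steps above can be sketched with pandas. Everything specific here is an assumption for illustration: the column names, the inline CSV text standing in for a real file, the choice of median and "unknown" as fill values, and the 80/20 split.

```python
import io
import pandas as pd

# Assumed sample data; in practice this would be a CSV or Excel file on disk.
csv_text = """customer_id,plan,monthly_usage,churn
1,prepaid,120.0,0
2,postpaid,,1
3,prepaid,80.0,0
4,,200.0,1
5,postpaid,150.0,0
"""

# Data Ingestion: read the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Data Curation: fill missing values, then encode the categorical column.
df["monthly_usage"] = df["monthly_usage"].fillna(df["monthly_usage"].median())
df["plan"] = df["plan"].fillna("unknown")
df = pd.get_dummies(df, columns=["plan"])

# Data Preparation: hold out 20% of the rows as a Test-set.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

print(len(train), len(test))
```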
4. Write down the codes used for creating the following plot types:
a. Scatter plots
b. Bar graphs
Advance
This machine learns from past experience and tries to capture the
best possible knowledge to make accurate business decisions.
Examples of Reinforcement Learning include Markov Decision
Process.
Commonly used Machine Learning Algorithms include:
• Linear Regression
• Logistic Regression
• Decision Tree
• SVM
• Naive Bayes
• kNN
• K-Means
• Random Forest
• Dimensionality Reduction Algorithms
• Gradient Boosting Algorithms including GBM, XGBoost, LightGBM,
and CatBoost
Now, we will find a line that splits the data between the two differently
classified groups. This will be the line for which the distance to the
closest point in each of the two groups is as large as possible.
In the example shown above, the line which splits the data into two
differently classified groups is the black line, since the two closest points
are the farthest from the line. This line is our classifier. New test data
is then assigned a class according to the side of the line on which it
lands.
Now let us look at a Python code for this.
# Import library
from sklearn import svm
# Assumes you have X (predictors) and y (target) for the training data set
# and x_test (predictors) for the test data set
# Create SVM classification object
model = svm.SVC()  # many options are available; the defaults give a simple classifier
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)
Random Forest
Random Forest is a trademark term for an ensemble of decision trees. In
Random Forest, we have a collection of decision trees, hence the term
“Forest”.
To classify a new object based on attributes, each tree gives a
classification and we say the tree “votes” for that class. The forest
chooses the classification having the most votes (over all the trees in the
forest). Each tree is planted and grown as follows:
Conclusion
The purpose of this learning material is to familiarize the reader with the
Python language. This material has been written to explain the key
concepts of the language. The reader should now have an
understanding of the following:
• Basics of the Python language
• Programming in Python
• Various Python libraries
• Handling various types of errors in Python
• Basics of regression analysis
• Overview of machine learning
Glossary