
Exam Prep Session

chanddra.p@gmail.com
UVBL5MQSJ8

This file is meant for personal use by chanddra.p@gmail.com only.


Sharing or publishing the contents in part or full is liable for legal action.
Agenda

1. Python Practice Questions

2. Model paper - 30 MCQ questions.


Subjective Question: Example 1
Q: Illustrate the use of the break keyword using an example.
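
One possible answer, as a minimal sketch: break exits the nearest enclosing loop immediately, so the remaining items are never visited. The function name first_multiple_of and the sample numbers are illustrative choices, not part of the question.

```python
def first_multiple_of(divisor, numbers):
    """Return the first value in numbers divisible by divisor, or None."""
    found = None
    for value in numbers:
        if value % divisor == 0:
            found = value
            break  # stop looping as soon as a match is found
    return found


print(first_multiple_of(7, [3, 10, 14, 21, 5]))
```

Here the loop stops at 14 and never examines 21 or 5.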

Subjective Question: Example 2
Q: Write two differences between a Series and a DataFrame, with an example.
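
A possible way to illustrate the answer, assuming pandas is installed: a Series is one-dimensional with a single index, while a DataFrame is two-dimensional with row and column labels, and each DataFrame column can hold a different data type. The variable names and sample values below are illustrative.

```python
import pandas as pd

# A Series is one-dimensional: one column of values with a single index.
marks = pd.Series([24, 25, 26], index=["Amita", "Any", "Ravi"], name="Marks")

# A DataFrame is two-dimensional: rows and columns, and each column
# can hold a different data type.
students = pd.DataFrame({"Name": ["Amita", "Any", "Ravi"], "RollNo": [10, 20, 3]})

print(marks.ndim, marks.shape)        # dimensions and shape of the Series
print(students.ndim, students.shape)  # dimensions and shape of the DataFrame
print(type(students["RollNo"]))       # selecting one column yields a Series
```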

Subjective Question: Example 3

Q: What is the Seaborn function for colouring plots?

Ans: color_palette() is a Seaborn function that can be used to give colours to plots and give them additional visual appeal.

Q: What are histograms in Seaborn?

Ans: Histograms show the distribution of data by constructing bins throughout the data's range and then drawing bars to show how many observations fall into each bin.
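
The binning step behind a histogram can be sketched in plain Python. This is only an illustration of the idea of equal-width bins, not how Seaborn implements it; the function name histogram_counts and the sample ages are hypothetical, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins   # assumes high > low
    counts = [0] * bins
    for value in values:
        # The maximum value is placed in the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


ages = [21, 22, 22, 25, 29, 30, 34, 35, 38, 40]
for count in histogram_counts(ages, 4):
    print("#" * count)   # one text "bar" per bin
```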
Subjective Question: Example 4
Q: Define the term kurtosis and discuss the importance of kurtosis.

• Kurtosis refers to the degree of presence of outliers in the distribution.

• It is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.

• A normal distribution has a kurtosis of three. A kurtosis greater than three indicates positive (excess) kurtosis. Kurtosis itself ranges from 1 to infinity.
• A kurtosis less than three indicates negative (excess) kurtosis. Excess kurtosis (kurtosis minus three) ranges from -2 to infinity. A higher kurtosis is often described as a higher peak, although it mainly reflects heavier tails.
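
As a rough sketch, kurtosis can be computed as the fourth central moment divided by the squared variance (the population form, with no bias correction). The function name and both sample data sets are illustrative assumptions.

```python
def kurtosis(values):
    """Population kurtosis: fourth central moment / squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


light_tailed = [1, 2, 3, 4, 5]
heavy_tailed = [1, 2, 3, 4, 50]   # one extreme value (outlier)

for name, data in [("light", light_tailed), ("heavy", heavy_tailed)]:
    k = kurtosis(data)
    print(name, round(k, 2), "excess:", round(k - 3, 2))
```

The data set containing the outlier yields the larger kurtosis, which is the behaviour the definition above describes.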
Subjective Question: Example 5

Q: Discuss the importance of categorical data encoding. Illustrate with an example.

Categorical data is a type of data that is used to group information with similar characteristics, while numerical data is a type of data that expresses information in the form of numbers. Most machine learning algorithms operate on numbers, so categorical values have to be encoded as numbers before they can be used in a model.
Example of categorical data:
– Weather conditions: “sun”, “rain”, “overcast”, “snow”, etc.

– Car manufacturers: “Toyota”, “Ford”, “Mercedes”, etc.

– Exam grades: “A”, “B”, “C”, “D” and “F”

• Categorical variables can be divided into two categories:

Nominal: Labelled/named data with no inherent order. Ex: Gender (Male/Female/Other), Car manufacturer (Toyota/Ford/Mercedes)
Ordinal: Discrete, ordered units. Ex: Ranks (1st/2nd/3rd), Education (School/UG/PG/Doctorate), Age groups (Young/Adult/Old)
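
A minimal sketch of two common encodings in plain Python: a rank mapping for ordinal data and one 0/1 indicator per category (one-hot) for nominal data. The helper name one_hot and the chosen category lists are illustrative assumptions; in practice a library such as pandas or scikit-learn would usually do this.

```python
# Ordinal data: the order matters, so map each level to its rank.
education_levels = ["School", "UG", "PG", "Doctorate"]
ordinal_code = {level: rank for rank, level in enumerate(education_levels)}


# Nominal data: no order, so use one 0/1 indicator per category (one-hot).
def one_hot(value, categories):
    return [1 if value == category else 0 for category in categories]


weather_categories = ["sun", "rain", "overcast", "snow"]

print([ordinal_code[x] for x in ["UG", "Doctorate", "School"]])
print(one_hot("rain", weather_categories))
```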
Subjective Question: Example 6

Q: Define the list data type. Display a few examples.

A list is an ordered sequence of values.

Lists are mutable (changeable).
The values inside a list can be of any type and are called elements or items.
The elements of a list are enclosed within square brackets and separated by commas.

Ex 1 fruits = ['grapes', 'kiwi', 'Mango']

Ex 2 numbers = [76, False, 'hi', 3.14, None]

Ex 3 emptylist = [] # The value [] is an empty list that contains no values, similar to '', the empty string.

print(fruits, numbers, emptylist) → ['grapes', 'kiwi', 'Mango'] [76, False, 'hi', 3.14, None] []
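
To make "mutable" concrete, here is a short sketch that changes a list in place. It reuses the fruits list from Ex 1; the replacement values are illustrative.

```python
fruits = ["grapes", "kiwi", "Mango"]

fruits[1] = "banana"      # replace an element in place
fruits.append("apple")    # add an element at the end
removed = fruits.pop(0)   # remove and return the first element

print(fruits, removed, len(fruits))
```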
Format of Paper

Total Marks: 100

• Subjective Section - 4 questions = 20 Marks.
Subjective Section - Total 6 questions, out of which 4 have to be attempted. These might contain coding questions as well.

• Objective Section - 80 questions = 80 Marks. All have to be attempted.

Any Doubts ?

Thank You

Code Prep Session - Python Language
Anaconda https://www.anaconda.com/products/individual#windows
Anaconda is an open-source distribution that bundles tools such as Jupyter and Spyder, which are used for large-scale data processing, data analytics, and heavy scientific computing. Anaconda works with the R and Python programming languages. Spyder (an application included with Anaconda) is used for Python; OpenCV for Python also works in Spyder. Package versions are managed by the package management system called conda.

Anaconda Navigator

Jupyter Notebook

• Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.


• Jupyter has support for over 40 different programming languages and Python is one of them.
Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook
itself.

Install Jupyter

Launch Jupyter

IDLE (Python Shell) https://www.python.org/downloads/

Google Colab

If you want to create a machine learning model but do not have a computer that can handle the workload, Google Colab is the platform for you. Even if you have a GPU or a good computer, creating a local environment with Anaconda, installing packages, and resolving installation issues can be a hassle.

Colaboratory is a free Jupyter notebook environment provided by Google where you can use free GPUs and TPUs, which can solve all these issues.

Getting Started
To start working with Colab, first log in to your Google account, then go to https://colab.research.google.com.

Home Page

Any Doubts ?

Thank You

Disclaimer: The questions given here are indicative of the format of questions in the MCQ section of the end semester exam, and not exhaustive. The final question paper for your exam will be set by
the University and will have a wide range of questions. Please use this model paper in conjunction with your other course materials to prepare well for your exams.

Sl. No.   Images   Question   A   B   C   D   Correct Answer [1]
1. What is the output?

    def function1(number):
        return number + 25

    function1(5)
    print(number)

A. 25   B. 5   C. 10   D. Name Error
Correct answer: Name Error

2. What is the output for the code?

    def function(name, age=20):
        print(name, age)

    function('Ananda', 25)

A. Ananda 25   B. Ananda 20   C. Error   D. Ananda
Correct answer: Ananda 25

3. What’s the output of the following code snippet? (the code snippet is not included in the text)

Options:
• 123420
• 123
• 420
• SyntaxError, this program will not run
Correct answer: 123

4. Which of the following statements is incorrect about the following code? (the code is not included in the text)

Options:
• person1 and person2 are two different instances of the Student class.
• The __init__ method is used to set initial values for attributes.
• 'self' is not needed in statement -- "def name_print(self):"
• person2 has a different value for 'name' than person1.
Correct answer: 'self' is not needed in statement -- "def name_print(self):"

5. Which keyword is used to create a class in Python?

A. class   B. def   C. self   D. init
Correct answer: class

6. What will be the output?

    class A():
        def display(self):
            print("A display()")

    class B(A):
        pass

    object = B()
    object.display()

Options:
• Invalid syntax for inheritance
• Error because when object is created, argument must be passed
• Nothing is printed
• A display()
Correct answer: A display()


7. What’s the output of the following code snippet? (the code snippet is not included in the text)

Options:
• Walking
• Speaking!
• Lecture!
• AttributeError: 'Teacher' object has no attribute 'walk'
Correct answer: Walking

8. What will be the output?

    class Person:
        def __init__(self, fname, lname, age):
            self.firstname = fname
            self.lastname = lname
            self.age = age

        def Details(self):
            print(self.firstname, self.lastname, self.age)

    class Student(Person):
        pass

    x = Student("Mike", "Olsen", "19")
    x.Details()

A. Mike, "Olsen","19"   B. 19   C. Mike Olsen 19   D. Error
Correct answer: Mike Olsen 19

9. Suppose you have a list as follows:

    Marks = [24,25,26,30]

You need to convert the above list to an ndarray named "marks_array". Which of the following is the correct syntax?

A. marks_array = np.array.Marks
B. marks_array = np.array(Marks)
C. marks_array = np.Marks
D. marks_array = np.Marks(array)
Correct answer: marks_array = np.array(Marks)

10. Suppose you have the following code:

    A = [20, 30, 40, 50]
    B = [10, 10, 10, 10]
    Result = A/B
    Result

What would be the output?

A. [2,3,4,5]   B. [18, 28, 38, 48]   C. [200, 300, 400, 500]   D. error
Correct answer: error

11. Suppose you have the following code:

    A = np.array([20, 30, 40, 50, 60])
    B = np.array([2, 3, 4, 5])
    Result = A+B
    Result

What would be the output?

A. [22,33,44,55,60]   B. [22,33,44,55]   C. [202,303,404,505,60]   D. Error
Correct answer: Error


12. Consider the following:

    arr = np.array([90,12,13,14,15,16])
    arr[0:6:3]

Options:
• array([90 12 13 15 16])
• array([90 13 14 15 16])
• array([90 14])
• array([90 13 15 16])
Correct answer: array([90 14])

13. What is the correct syntax for finding out the element with maximum value in a multi dimensional array named multi_arr?

Options:
• multi_arr(max)
• multi_arr.max()
• multi_arr_max()
• multi_arr.max(element)
Correct answer: multi_arr.max()

14. Consider the following code snippet:

    arr = np.arange(1,26)
    arr_reshaped = arr.reshape(5,5)
    arr_reshaped.ndim

What will be the output?

Options:
• Natural numbers from 1 to 25
• 1
• 2
• 5
Correct answer: 2

15. How many values should there be in "data" for the above code to show no error? (the code is not included in the text)
4   5   3   20


16. What will be the output of the above code? (the code and the options are not included in the text)

17. Consider the following code:

    import pandas as pd
    b = {"Name": ["Amita", "Any", "Ravi"],
         "RollNo": [10, 20, 3]}
    Data = pd.DataFrame(b)

In the given code, how many rows and columns does the dataframe 'Data' have?

Options:
• Row - 2 Column - 2
• Row - 3 Column - 3
• Row - 3 Column - 2
• Row - 2 Column - 3
Correct answer: Row - 3 Column - 2

18. The above dataframe has how many rows and columns?

A. 3 rows, 2 Columns   B. 2 rows, 3 Columns   C. 1 row, 1 Column   D. 3 rows, 3 Columns
Correct answer: 3 rows, 2 Columns

19. What is the after effect of the above code? (the code is not included in the text)

Options:
• Deletes the entire column 'Subject' from dataframe "df1"
• Deletes the entire column 'Score' from dataframe "df1"
• Deletes the row with 'Score' value = 87 from dataframe "df1"
• Deletes the row with 'Score' value = 88 from dataframe "df1"
Correct answer: Deletes the row with 'Score' value = 87 from dataframe "df1"

20. Consider the following lists:

    x = [11,22,33,44,55]
    y = [9,18,27,36,45]

Which of the following is the correct syntax for plotting a scatter plot using the above lists?

Each option begins with "import matplotlib.pyplot as plt", defines x and y as above, and ends with plt.show(). The options differ in the plotting call:
• plt.scatter(x,y)
• plt.barh(x,y)
• plt.bar(x,y)
• no plotting call
Correct answer:

    import matplotlib.pyplot as plt
    x = [11,22,33,44,55]
    y = [9,18,27,36,45]
    plt.scatter(x,y)
    plt.show()

21. Consider the following lists:

    x = [apple, banana, orange, guava, grapes]
    y = [9,18,27,36,45]

Which of the following is the correct syntax for plotting a bar chart using the above lists?

Each option begins with "import matplotlib.pyplot as plt", defines x and y as above, and ends with plt.show(). The options differ in the plotting call:
• plt.histogram(x,y)
• plt.bar(x,y)
• plt.scatter(x,y)
• plt.pie(x,y)
Correct answer:

    import matplotlib.pyplot as plt
    x = [apple, banana, orange, guava, grapes]
    y = [9,18,27,36,45]
    plt.bar(x,y)
    plt.show()

22. Select the suitable plot for the above code. (the code and the options are not included in the text)

23. Consider the following dataset: (Seaborn.png, not included in the text)

Here, if we wish to plot a graph with passengers on the y axis and year on the x axis, with different colours for each month, what would be the correct code?

Each option has the form below and differs only in the value passed to hue:

    import seaborn as sns
    sns.lineplot(data=data, x='year', y='passengers', hue=...)

Options:
• hue = 'year'
• hue = 'month'
• hue = 'data'
• hue = 'passengers'
Correct answer:

    import seaborn as sns
    sns.lineplot(data=data, x='year', y='passengers', hue='month')


24. Which plot does this picture represent? (the picture is not included in the text)

A. boxplot   B. violin plot   C. pairplot   D. hexbin plot
Correct answer: violin plot

25. Johhny is working on a dataset that contains information about all the customers that have purchased any product from his company in the last 12 months. In this dataset he found that there are a number of null values present in each column. He wants to replace these null values with "0". Which of the following would come in handy in this scenario?

Options:
• fillna (method = pad)
• fillna (method = bfill)
• replace
• interpolate
Correct answer: replace

26. Which of the following statements would give an output like this one? (the output is not included in the text)

A. dataframe()   B. dataframe.info()   C. dataframe.describe()   D. dataframe_info()
Correct answer: dataframe.info()

27. Which of the following statements would give an output like this one? (the output is not included in the text)

Options:
• Either dataframe.notnull() or dataframe.isnull()
• dataframe.notnull()
• dataframe.isnull().sum()
• dataframe.notnull().sum()
Correct answer: Either dataframe.notnull() or dataframe.isnull()

28. Which of the following is not shown by the dataframe.info statement?

Options:
• Column name
• Not null count
• Datatype
• Min value of each column
Correct answer: Min value of each column

29. Which one of the following is not a Python keyword?

A. info   B. shape   C. calender   D. Return
Correct answer: calender


30. Which of the following is an arithmetic operator in Python?

A. //   B. |   C. <<   D. >>
Correct answer: //

1. Write a program to replace the consonants in the entered string with #
Sample Example
Input:
Enter a String: Data

Output:
The original String is: Data
The modified String is: #a#a
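
One possible approach to exercise 1, as a sketch: treat every alphabetic character that is not one of a, e, i, o, u as a consonant. The function name mask_consonants is hypothetical, and the input is hard-coded here in place of the input() call so the example runs unattended.

```python
def mask_consonants(text):
    """Replace every consonant letter with '#'; keep vowels and non-letters."""
    vowels = "aeiouAEIOU"
    return "".join(
        "#" if ch.isalpha() and ch not in vowels else ch
        for ch in text
    )


original = "Data"   # in the exercise: input("Enter a String: ")
print("The original String is:", original)
print("The modified String is:", mask_consonants(original))
```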

2. Write a program to check whether the entered number is a palindrome or not

Sample Example
Input :
Enter a Number: 1221

Output :
The given Number 1221 is a palindrome
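
A sketch of one way to approach exercise 2: compare the digit string with its reverse. The function name is_palindrome is hypothetical, and the number is hard-coded in place of the input() call.

```python
def is_palindrome(number):
    """True if the decimal digits of number read the same in both directions."""
    digits = str(number)
    return digits == digits[::-1]


n = 1221   # in the exercise: int(input("Enter a Number: "))
if is_palindrome(n):
    print("The given Number", n, "is a palindrome")
else:
    print("The given Number", n, "is not a palindrome")
```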

3. Consider a number n. Write a program to create a function named printingDict( ) which can print a dictionary where the keys are numbers between 1 and n (both included) and the values are cubes of the keys.

Input Format:
The first line contains the number n.
Output Format
Print the dictionary in one line.

Example :
6
{1: 1, 2: 8, 3: 27, 4: 64, 5: 125, 6: 216}

8
{1: 1, 2: 8, 3: 27, 4: 64, 5: 125, 6: 216, 7: 343, 8: 512}
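
A possible sketch for exercise 3 using a dictionary comprehension. Returning the dictionary in addition to printing it is an extra convenience assumed here so the result can be checked; the exercise only asks for printing.

```python
def printingDict(n):
    """Print and return {k: k**3} for k from 1 to n inclusive."""
    cubes = {k: k ** 3 for k in range(1, n + 1)}
    print(cubes)
    return cubes


printingDict(6)   # in the exercise: printingDict(int(input()))
```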

4. Write a program to find whether a given number is a power of 3 or not.

Input format:
The first line of the input contains the number n for which you have to find whether it is a power of 3 or not.

Output Format:
Print 'YES' or 'NO' accordingly without quotes.

Sample Example 1
Input :
Enter a number: 90

Output :

NO

Sample Example 2
Input :
Enter a number: 216

Output :
YES

5. Write a Python program to create an array of 5 integers and display the array items. Access
individual element through indexes.
Sample Output:
2
4
6
7
9
Access first three items individually
2
4
6

6. Write a Python program to reverse the order of the items in the array.
Sample Output
Original array: array('i', [1, 3, 5, 3, 7, 1, 9, 3])
Reverse the order of the items:
array('i', [3, 9, 1, 7, 3, 5, 3, 1])
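
Exercises 5 and 6 can both be approached with the standard-library array module, as the sample output suggests. A minimal sketch, assuming the type code 'i' (signed int) shown in the sample:

```python
from array import array

numbers = array("i", [1, 3, 5, 3, 7, 1, 9, 3])
print("Original array:", numbers)

numbers.reverse()   # reverses the items in place
print("Reverse the order of the items:")
print(numbers)
print(numbers[0], numbers[1], numbers[2])   # access items by index
```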

7. Write a Python program to calculate the length of a string


8. Write a Python program to get a string from a given string where all occurrences of its first char
have been changed to '$', except the first char itself
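
One way exercise 8 might be solved: keep the first character and replace its later occurrences in the rest of the string. The function name change_char and the sample word are illustrative choices.

```python
def change_char(text):
    """Replace later occurrences of the first character with '$'."""
    if not text:
        return text
    first = text[0]
    return first + text[1:].replace(first, "$")


print(change_char("restart"))
```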

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Introduction to Python

Proprietary content. All rights reserved. Unauthorized use or distribution prohibited.

Unit 1
INTRODUCTION TO PYTHON
Overview:
Python is used for general-purpose programming. It is easy to use and understand, and it runs on a variety of operating systems. It forms a gateway into the realm of programming. Many start-ups and small businesses use Python because it is free of cost and products can be built with less code, which saves development time. The scope of Python is ever increasing, and it becomes important to know the subject when one chooses to delve into web development or data science.

Learning outcomes:
• Knowing about Python
• Features of Python
• Installing Anaconda Navigator
• Launching Jupyter Notebook
• Basics of Jupyter Notebook

What is now among the most in-demand programming languages in fact started as a hobby project by its creator, Guido van Rossum, to keep him occupied during Christmas. Today, almost all big companies use Python for their services in some way or other. Among the renowned ones are Google, Pinterest, Netflix, and Quora.
First things first, what is Python?
Python is a programming language, just like C, C++ and Java. It is a scripting language. It is object-oriented, which means that its paradigm is based on ‘objects’ and ‘classes’. Python is dynamically typed, meaning that the interpreter assigns a variable its type at runtime based on its value, and type checking also happens at runtime.
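
Dynamic typing can be seen in a few lines. In this sketch the same name is bound first to an integer and then to a string, and a type error only surfaces when the offending line actually runs. The variable name and values are illustrative.

```python
value = 42
print(type(value).__name__)    # the name is bound to an int

value = "forty-two"
print(type(value).__name__)    # the same name is now bound to a str

try:
    value + 1                  # type checking happens when this line runs
except TypeError as error:
    print("TypeError:", error)
```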

Features of Python:
Python has various features, the major ones being:

• Easy to understand: Python code is easy to understand because the syntax is uncomplicated and close to English. Python does not use braces to delimit blocks such as function bodies; it uses indentation, which makes the code look clean and neat and therefore readable.
• High-level language: A high-level programming language is one that is user-friendly and resembles natural human language.
• Interpreted language: Python code is executed one line at a time, unlike C++, which is compiled as a whole before it runs. The interpreter produces output line by line, which means that if a line raises an error, execution stops there until the error is resolved.
• Object-oriented: As mentioned before, Python is object-oriented; it treats everything as an object.
• Open-source: Python is an open-source programming language. Its source code is freely available, and it can be used by anybody for any purpose.
• Platform-independent: Python runs on platforms such as Windows, Linux, and Mac. The code for a program is the same on each of these platforms.
• Extensible and embeddable: Code written in other languages can be called from Python, which makes it extensible. The reverse also holds: Python code can be run from within programs written in C++, Java, or other languages, which makes it embeddable.
• Large standard library: Python has a collection of modules that make it easy to code in. Modules are sets of pre-written code, so that one does not have to rewrite commonly used functionality every time. Modules are used by importing them.
A Python file is saved with a .py extension. It is easy, fast and efficient. It has a wide range of applications, some of which are web development, scripting, data science, prototyping, and database programming. Features such as simplicity, ease of use, flexibility, portability, development speed, and programmer friendliness set Python apart from other programming languages.
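
As a small illustration of using the standard library by importing modules, the sketch below imports one whole module and one name from another. The modules chosen (math and statistics) and the sample numbers are just examples.

```python
import math                  # import a whole module
from statistics import mean  # import one name from a module

print(math.sqrt(16))
print(mean([24, 25, 26, 30]))
```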

Python IDEs
• What is an IDE?
An Integrated Development Environment (IDE), in simple terms, allows programmers to combine the various parts of writing a program in a single GUI-based application. An IDE typically consists of a source code editor, build automation tools, and a debugger. Some IDEs, such as Eclipse and Visual Studio, are multi-language. IDEs are easy to set up and make development faster and easier, thus saving effort. IDEs also help correct errors by showing where the code is wrong.
In Python, the most frequently used IDEs include Spyder, Jupyter, PyCharm, IDLE and Atom. For the course, we will be using Jupyter, which is part of the Anaconda distribution.

• Anaconda Distribution:
The Anaconda distribution is a Python and R data science distribution. It is easy to download and is open source. It has over 7500 packages. A package is a collection of modules. All of it is freely available, and Anaconda also provides community support for Python-related queries.
Steps to Install Anaconda:

• Go to the Anaconda website: https://www.anaconda.com/products/individual-d

On the website, click on download for your operating system (i.e., Windows, Mac, Linux).

• The site should give you a prompt to save the file; select the location where you want to place the file.
• Once downloaded, open it. You should see a prompt like this. Click on Next.

• Click on ‘I agree’ and do not change any of the default settings. Click on Next. Specify a destination on the computer. Click Next and the installation should start. Once done, click on Next.

• Click on Next once again, and then click on Finish.

Once you are done installing Anaconda Navigator, you will be redirected to a website. You can look over the website and explore its tutorials.

LAUNCHING A JUPYTER NOTEBOOK

• Go to your search bar and search for the Anaconda Navigator you just installed. It takes a while to open.
• You should see a screen like this:

These are some of the applications that are part of the Anaconda distribution by default.

• Go to Jupyter Notebook and click on Launch.

• You should be redirected to a page in your browser.

The site’s URL is http://localhost:8888/tree

The ‘8888’ part of the URL might change if another notebook server is already running in the background. The files shown on the page are the ones on your computer.

• To open a Jupyter notebook:

Go to New and click on ‘Python 3’ under the Notebook section.

Once you click Python 3, another tab opens and you can see the Notebook window.

Parts of the notebook window:
• ‘Untitled1’ here is the name of the notebook. You can edit it by clicking on it
or by saving the notebook under a new name.
• The checkpoint shows the last time the notebook was saved.
• The Menu bar lies directly below it; it is the pane which begins with File. Its menus
are used to make changes in the way the notebook works.
• The Toolbar lies below the Menu bar. It gives icons to select the most used operations by
simply clicking on them, such as ‘+’ to add a new cell, Run to run the cell, etc.
• The kernel indicator shows the type of kernel the current notebook uses (i.e. here, Python 3).

There are three types of cells in the Jupyter notebook, namely, Code,
Markdown, and Raw cells.
• Code cells are used to write the code of the program. The code has to be properly
indented and syntactically correct.
• Markdown cells are used to document what you write; they hold descriptive text.
• Raw cells are a place where you can write output directly. These cells are not
evaluated by the notebook. They are like comments.
Every cell is a Code cell by default. One can change its type using the drop-down on the
Toolbar.

A Code cell, when executed, displays its output as above. A Markdown cell, when executed,
appears as a note/text.

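SHORTCUTS
Operation                           Shortcut Key
Run                                 Ctrl + Enter
Create a new cell                   A (above) or B (below)
Copy a cell                         C
Paste cell                          V (below) or Shift + V (above)
Delete cell                         Press D twice
Change type of cell to: Code        Y
Change type of cell to: Markdown    M
Change type of cell to: Raw Cell    R
Save (create checkpoint)            Ctrl + S
The single-letter shortcuts work in command mode (press Esc first). Shift + M merges the
selected cells; it does not create a new one.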

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Data Structures in Python

Unit 2
Data Structures in Python
Overview:
Data structures are built-in structures in Python that organise data, which aids its
storage and manipulation. They also determine which operations can be performed on the data. In
easy terms, they are like shelves. Every shelf holds a certain type of data that has a purpose and
is organised in a manner that helps it to be used efficiently. There are 4 basic data
structures in Python, namely Lists, Tuples, Dictionaries and Sets.
These data structures can be differentiated based on their mutability. Mutability is the
ability to change the internal state of an object in Python. Objects whose state can be changed
are called mutable objects; objects whose state cannot be changed are called immutable objects.

Learning Objectives:
1. Variables:
- What are variables?
- Types of variables
2. Operators:
- What are operators?
- Different types of operators- arithmetic, logical, assignment, comparison,
identity, and bitwise.
3. Lists:
- Defining lists
- Adding, deleting, duplicating lists, and other functions of lists.
- Slicing in lists.
4. Tuples:
- Defining tuples.
- Various functions in tuples
- Discussing the difference between tuples and lists.
5. Sets:
- Defining sets.
- Functions of sets.
6. Dictionaries:
- Defining dictionaries
- Functions of dictionaries
7. Converting from one data structure to another.

What are variables?
A variable is a name that refers to a value stored in a reserved memory location. Variables
are used when the values keep on changing or are not constant. The values that are given to the
computer to perform various operations are stored in variables.

Conventionally, names in a programming language follow one of the following cases:
1. snake_case: it is a naming type where every word is separated by an underscore.
Ex- shoe_color, city_location
2. camelCase: naming type where the first word is in lowercase and the initial of every
new word is in uppercase. Ex- shoeColor, cityLocation
3. PascalCase: it is a naming type in which the initial of every word is in uppercase and
all other characters are in lowercase. Ex- ShoeColor, CityLocation.
In Python, snake_case is usually used for variable names. However, it must be noted that it
is a convention, meaning people all over the world use snake_case, but it is not compulsory.
Although one can name a variable in whichever way they see fit, there are some rules
that must be followed while naming variables:
1. Variable names are case sensitive. Therefore, apple and Apple are treated as two
different variables.
2. The name of a variable must start with a letter (a-z in lowercase or uppercase) or an
underscore ( _ ). For ex- alpha, Alpha, _alpha
3. Special characters like +, - , * , /, etc. are not allowed while naming variables.
4. Variable names cannot begin with a number. Ex- 1Alpha will not be a valid variable
name.
5. Python keywords cannot be used as variable names. Keywords include break,
continue, pass, etc.

Keywords are specially reserved words in Python. These keywords have a specific function.
For instance, the ‘continue’ keyword skips the rest of the current iteration of a loop and
moves on to the next one; likewise, ‘break’ is used in situations when the desired output is
obtained and the loop is to be stopped.
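
The naming rules can be checked from within Python itself. The following is a minimal sketch that uses the built-in str.isidentifier() method and the standard keyword module; the helper name is_valid_name is made up for this illustration.

```python
import keyword


def is_valid_name(name):
    """Return True if `name` can be used as a variable name."""
    # isidentifier() checks the character rules (letters, digits, underscore,
    # no leading digit); iskeyword() rules out reserved words.
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["alpha", "Alpha", "_alpha", "1Alpha", "shoe-color", "break"]:
    print(candidate, is_valid_name(candidate))
```

The first three names pass the check; the last three are rejected because of a leading digit, a special character and a keyword respectively.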

Operators:
Operators are symbols that carry out arithmetic and logical calculations. In an expression
such as c + d, the variables and numbers (c and d) are the operands and the symbol (+) is the
operator.
Types of operators:
1. Arithmetic operators: these are used to perform basic mathematical operations.

Operator   Function                                              Example
+          Addition                                              c + d
-          Subtraction                                           c - d
*          Multiplication                                        c * d
/          Division                                              c / d
%          Modulus: gives the remainder of a division            c % d
//         Floor division: gives the quotient rounded down to    c // d
           the nearest whole number
**         Exponentiation: raises to the power                   c ** d

An example of each is given below:
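
The original screenshot is not available, so the following is a minimal sketch with assumed values for c and d.

```python
c = 7
d = 2

print(c + d)    # addition
print(c - d)    # subtraction
print(c * d)    # multiplication
print(c / d)    # division, the result is always a float
print(c % d)    # modulus, the remainder of c divided by d
print(c // d)   # floor division, the quotient rounded down
print(c ** d)   # exponentiation, c raised to the power d
```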


2. Assignment operators: operators used to assign values to variables are called
assignment operators.

Operator   Function                                      Example      Equivalent to
=          Assignment                                    box = 5
-=         Subtraction                                   box -= 2     box = box - 2
+=         Addition                                      box += 2     box = box + 2
*=         Multiplication                                box *= 2     box = box * 2
/=         Division                                      box /= 2     box = box / 2
%=         Modulus: gives the remainder of a division    box %= 2     box = box % 2
//=        Floor division: gives the quotient rounded    box //= 2    box = box // 2
           down to the nearest whole number
**=        Exponentiation: raises to the power           box **= 2    box = box ** 2

An example of each is given below:
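
As a minimal sketch (the starting value and the right-hand numbers are assumptions), the operators can be applied one after another to the same variable:

```python
box = 5      # assignment
box += 2     # same as box = box + 2, box is now 7
box -= 2     # same as box = box - 2, box is now 5
box *= 2     # same as box = box * 2, box is now 10
box //= 3    # same as box = box // 3, box is now 3
box **= 2    # same as box = box ** 2, box is now 9
box %= 4     # same as box = box % 4, box is now 1
box /= 2     # same as box = box / 2, box is now 0.5 (a float)
print(box)
```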


3. Comparison operators: as the name suggests, these operators are used to
compare two variables or values. They return Boolean values.
Operator   Function                    Example
==         Equal to                    c == d
!=         Not equal to                c != d
>          Greater than                c > d
<          Less than                   c < d
>=         Greater than or equal to    c >= d
<=         Less than or equal to       c <= d

An example of each is given below:
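
A minimal sketch with assumed values for c and d; every comparison evaluates to either True or False.

```python
c = 10
d = 20

is_equal = c == d          # False
is_not_equal = c != d      # True
is_greater = c > d         # False
is_less = c < d            # True
is_greater_equal = c >= d  # False
is_less_equal = c <= d     # True

print(is_equal, is_not_equal, is_greater, is_less, is_greater_equal, is_less_equal)
```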

4. Logical operators: they are used to combine conditional statements. They return
Boolean values.

Operator   Function                                               Example
and        If both conditions are true, returns True              c > d and c > e
or         If either one of the conditions is true, returns True  c > d or c > e
not        Reverses the result: returns True if the condition     not (c > d)
           is false

An example of each is given below:
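
A minimal sketch with assumed values for c, d and e:

```python
c = 10
d = 5
e = 20

both_true = c > d and c > e    # False, because c > e is false
either_true = c > d or c > e   # True, because c > d is true
negated = not (c > d)          # False, because c > d is true

print(both_true, either_true, negated)
```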

5. Identity operators: they are used to check whether two names refer to the very same
object or not.

Operator   Function                                        Example
is         If both are the same object, returns True       c is d
is not     If both are different objects, returns True     c is not d

An example of each is given below:
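
A minimal sketch with assumed values. Note the difference from ==: two separate lists with equal contents are equal, but they are not the same object.

```python
c = [1, 2, 3]
d = c            # d refers to the very same list object as c
e = [1, 2, 3]    # e is a separate list with equal contents

print(c is d)        # True: same object
print(c is e)        # False: different objects
print(c is not e)    # True
print(c == e)        # True: the contents are equal
```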

Declaring a variable:

To declare a variable, type a variable name and then assign a value to it. When you print the
variable name, it displays the value stored inside the variable.
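
For instance, as a minimal sketch (the names and values are only illustrative):

```python
shoe_color = "black"   # a variable holding text
shoe_size = 9          # a variable holding a whole number

print(shoe_color)      # displays the value stored in shoe_color
print(shoe_size)
```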

Casting of variables: specifying the type of data a value should be converted to is called
casting. It is done with built-in functions such as int(), float() and str().
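
A minimal sketch of casting with the built-in int(), float() and str() functions; the values are assumptions chosen for illustration.

```python
age = int("25")          # text "25" becomes the integer 25
price = float(7)         # integer 7 becomes the float 7.0
label = str(3.5)         # float 3.5 becomes the text "3.5"
truncated = int(9.99)    # int() drops the fractional part, giving 9

print(age, price, label, truncated)
```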

Datatypes of variables:
Python datatypes are commonly divided into two parts: primitive datatypes and non-primitive
datatypes. Primitive datatypes hold a single value. Examples of primitive datatypes are int,
float, bool and strings. Non-primitive datatypes are collections built out of other values.
Examples of non-primitive datatypes are tuple, list, dictionary and set; these are also
built into Python.
1. Numeric:
They store numeric values. Supported numeric data types in Python:
i. int - integer value datatype. They can be positive, negative or zero. They are whole
numbers.
ii. float - decimal point value datatype. Fractional values are represented by the float type.

2. Boolean:
Boolean type of data stores only two values- True and False. In arithmetic, True
behaves like 1 and False behaves like 0.

3. Strings: store string/text values. They are enclosed within double quotes or single
quotes.
Examples of primitive data type:
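
A minimal sketch (the values are illustrative); the built-in type() function reports the datatype of each value.

```python
quantity = 12          # int
temperature = -3.5     # float
is_open = True         # bool
city = "Mumbai"        # str

for value in (quantity, temperature, is_open, city):
    print(value, type(value))
```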

4. Lists:
Lists are data structures that help to store multiple items in one variable. Lists are mutable
in nature, which means their contents and order can be changed. The elements can be of any
data type, i.e., integer, float, Boolean, etc.
To create lists, we use square brackets. Example:
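
The original screenshot is missing; the sketch below is a reconstruction based on the names the text mentions ("letters", "Alpha", "gamma"). The remaining elements are assumptions.

```python
letters = ["Alpha", "beta", "gamma", "delta"]   # a list of strings
mixed = [1, 2.5, True, "text"]                  # elements can be of different datatypes

print(letters)
print(mixed)
```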

All elements inside the list are indexed. Indexing in python starts from 0. Therefore, “Alpha”
in the above example has the index 0 in variable “letters”, while the position of “gamma” is
2.
To view a particular element from the list use the index number along with variable name
while displaying it.
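
A minimal sketch of indexing, using an assumed list named letters:

```python
letters = ["Alpha", "beta", "gamma", "delta"]

first = letters[0]    # indexing starts from 0
third = letters[2]    # index 2 is the third element
last = letters[-1]    # negative indexes count from the end

print(first, third, last)
```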

Since lists are mutable, we can add, delete, duplicate and change list elements.

#1. To add elements:


The .append() function adds elements to the end of the list.
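
A minimal sketch with an assumed list:

```python
letters = ["Alpha", "beta", "gamma"]
letters.append("delta")   # the new element goes to the end of the list

print(letters)
```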

#2. To add an element at a specific position:


The .insert() function is used to add elements at a specific position in the list.
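
A minimal sketch with an assumed list; the first argument of .insert() is the index and the second is the element.

```python
letters = ["Alpha", "gamma"]
letters.insert(1, "beta")   # place "beta" at index 1, shifting "gamma" to index 2

print(letters)
```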

#3. To delete elements:


The .clear() function removes all elements from the list.

Similarly, the .pop() function deletes elements by position.
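
A minimal sketch of both deletion methods, using an assumed list:

```python
letters = ["Alpha", "beta", "gamma", "delta"]

removed = letters.pop(1)      # removes and returns the element at index 1
last = letters.pop()          # without an index, pop() removes the last element
remaining = list(letters)     # keep a snapshot of what is left

letters.clear()               # removes every element, leaving an empty list

print(removed, last, remaining, letters)
```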

Lists can have duplicate values, since every element is identified by its index. A few other
functions and methods of lists include:
len() – returns the length of the list.
type() – returns the variable datatype.
sort() – sorts the list in place, in ascending order by default.
reverse() – reverses the order of the list in place.
copy() – returns a duplicate (a shallow copy) of the list.
count() – counts how many times a given element occurs in the list, etc.
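
These functions can be tried out as follows; this is a minimal sketch and the list contents are assumptions.

```python
numbers = [3, 1, 2, 3]

length = len(numbers)         # number of elements
kind = type(numbers)          # the datatype, here list
threes = numbers.count(3)     # how many times 3 occurs
backup = numbers.copy()       # an independent copy of the list

numbers.sort()                # sorts numbers itself, ascending
ascending = list(numbers)
numbers.reverse()             # reverses numbers itself
descending = list(numbers)

print(length, kind, threes, backup, ascending, descending)
```

Note that sort() and reverse() change the list itself and return None, which is why the copy taken beforehand still holds the original order.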

Slicing in lists:
If you want to get, say, all the elements beginning from the 3rd until the end, then you can
slice the list.
Using slicing you can specify the index range: the point where you want to begin and the
point where you want to end. The element at the end index itself is not included.
For instance, if you want to display 1 to 10 in a list of 1 to 100 elements, then:

To print all numbers from 90 until the end:
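
Both slices could look like this; a minimal sketch that assumes the list is built with range().

```python
numbers = list(range(1, 101))   # the numbers 1 to 100

first_ten = numbers[0:10]       # indexes 0 to 9, i.e. the values 1 to 10
from_ninety = numbers[89:]      # index 89 holds the value 90; no end index means "until the end"
from_third = numbers[2:]        # everything from the 3rd element onwards

print(first_ten)
print(from_ninety)
```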

5. Tuples:
Tuples are another type of data structure used to store multiple items in one
variable. Unlike lists, tuples are immutable, i.e. they cannot be changed. Tuples are also
ordered. They cannot be shuffled, nor can the positions of items inside a tuple be
changed. Just as we used [] to create lists, here we use (). For instance:
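
A minimal sketch with assumed elements. The attempt to change an element is wrapped in try/except so that the error can be shown without stopping the program.

```python
letters = ("Alpha", "beta", "gamma")   # round brackets create a tuple
print(letters[0])                      # elements are read by index, like lists

try:
    letters[0] = "omega"               # tuples are immutable, so this fails
    changed = True
except TypeError as error:
    changed = False
    print("Cannot change a tuple:", error)
```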

A list can be converted into a tuple. Simply pass the list variable name to tuple().
Example:
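
A minimal sketch with an assumed list:

```python
letters_list = ["Alpha", "beta", "gamma"]
letters_tuple = tuple(letters_list)   # a new tuple with the same elements

print(letters_tuple)
print(type(letters_tuple))
```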

Tuples also have a lot of functions, a few amongst them are:


#1. max() – returns the element that has the maximum value in the tuple.
#2. min() – returns the element that has the minimum value in the tuple.
#3. len() – returns the length of the tuple.
#4. type() – returns the datatype of the variable, in this case, tuple.
Slicing can be used to display part or all of the elements of the tuple.
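
A minimal sketch of these functions on an assumed tuple of marks:

```python
marks = (45, 82, 67, 90)

highest = max(marks)     # largest element
lowest = min(marks)      # smallest element
size = len(marks)        # number of elements
kind = type(marks)       # the datatype, here tuple
middle = marks[1:3]      # slicing a tuple gives another tuple

print(highest, lowest, size, kind, middle)
```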


All of these functions look very similar to those of lists. So how are these two different, you
might ask?
The main difference is that tuples are immutable, unlike lists. Lists can easily be reordered,
while tuples cannot. Tuples also tend to take up less memory than lists do, and they can be
slightly faster to create. And the easiest difference to identify between them is the type of
bracket used. Lists use [] while tuples use ().
One should use tuples when the order of items is fixed and the items are not going to be
changed. Lists, however, due to their ease of editing, can be used
when the elements need to be manipulated.
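
The memory difference can be observed with sys.getsizeof() from the standard library. This is a minimal sketch: the exact byte counts depend on the Python version and platform, so no specific numbers are assumed here, only that the tuple is the smaller of the two in the standard CPython interpreter.

```python
import sys

as_list = [1, 2, 3]
as_tuple = (1, 2, 3)

list_bytes = sys.getsizeof(as_list)     # size of the list object in bytes
tuple_bytes = sys.getsizeof(as_tuple)   # size of the tuple object in bytes

print("list :", list_bytes, "bytes")
print("tuple:", tuple_bytes, "bytes")
```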

6. Sets:
Sets are the third type of data structure in Python. The key feature of sets is that they are
unordered and unindexed. As in mathematics, sets are written inside curly
braces {}. When we say that sets are unordered, we mean that the order in which the elements
of a set are displayed is not fixed and should not be relied upon. Sets
cannot have duplicates, unlike lists and tuples. Since they are unindexed, you cannot use the
indexing or slicing options of Python either.
Sets have a lot of methods.
#1. add() – adds a value to the set.
#2. remove() – removes a specific element from the set.
#3. pop() – removes and returns an arbitrary element from the set.
#4. union() – returns the union, i.e. all elements of the two sets joined together.
#5. intersection() – returns the intersection, i.e. the elements common to both sets.
#6. difference() – returns the difference between two sets, i.e. A.difference(B) returns A-B,
the elements of A that are not in B.
#7. symmetric_difference() – returns elements from both sets minus elements common
between the two. In other words, the union of A and B minus their intersection.
An example of each of the above methods is given here:
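
The original screenshot is not available; the following is a minimal sketch with assumed sets named a, b and s.

```python
s = {1, 2, 3}
s.add(4)                # s now holds 1, 2, 3, 4
s.remove(1)             # s now holds 2, 3, 4
popped = s.pop()        # removes and returns an arbitrary element

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

joined = a.union(b)                   # every element of a and b
common = a.intersection(b)            # elements present in both
only_a = a.difference(b)              # elements of a that are not in b
not_shared = a.symmetric_difference(b)  # elements in exactly one of the two sets

print(joined, common, only_a, not_shared)
```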


Apart from the mentioned methods, sets have numerous other methods like .isdisjoint(),
.issubset(), .issuperset(), .update(), etc.
Sets are helpful when you need to do mathematical operations like combining or
separating items from two different sets. They help remove duplicates from lists and tuples.
Checking whether an item is present is also usually faster in a set than in a list.

7. Dictionary:
A dictionary is a data structure that stores data in the key : value format. Every key
has a value. Just like indexes in a list, keys are used to identify values here. Keys
cannot be duplicated. Dictionaries can be changed, hence they are mutable. From Python 3.7
onwards they also keep their items in the order of insertion. They are written within curly
braces. They can be used for bivariate data.

For example, suppose you have the heights of individuals and the frequency of people of
the same height. To store these values, you use dictionaries. The height (in inches)
forms the key and the frequency of individuals becomes the value.
Note, one key cannot hold two separate values; assigning to an existing key replaces its value.

How is a dictionary different from a list?

A list is a collection of values that can be identified by their indexes. For the same
reason, lists are ordered. However, in the case of a dictionary, you have keys that do the
work of indexes. These keys help in identifying values, so a value is looked up by its
key and not by its position. Another important difference is the use of brackets. While
lists use square brackets [], dictionaries use curly braces {}. Inside a list, every
element is separated by a comma. In dictionaries, a key is written, followed by a colon and
its value; each key:value pair is then separated from the next by a comma.
Declaring a dictionary:
To declare a dictionary, you enter the variable name and enter the elements in key:value
form.
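
A minimal sketch based on the heights example; the dictionary name and the numbers are assumptions made for illustration.

```python
# key = height in inches, value = number of people with that height
heights = {60: 4, 62: 7, 65: 10}

print(heights)
print(type(heights))
```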


To print a specific value, enter the key name after the variable. For example:
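
A minimal sketch, using an assumed dictionary of heights and frequencies:

```python
heights = {60: 4, 62: 7, 65: 10}

frequency = heights[62]   # the key goes inside square brackets
print(frequency)
```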

To add elements to the dictionary:


Enter the dictionary name followed by the key enclosed in square brackets, and then assign
the value of the key with =.
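
A minimal sketch, again with an assumed dictionary:

```python
heights = {60: 4, 62: 7}

heights[65] = 10    # 65 is a new key, so a new key:value pair is added
heights[60] = 5     # 60 already exists, so its value is replaced

print(heights)
```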

To delete elements from the dictionary:
To delete the last inserted item, the .popitem() method is used.
To delete a specific item, the .pop(‘key_name’) method is used.
To delete all items of the dictionary, the .clear() method is used.
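
A minimal sketch of the three deletion methods on an assumed dictionary:

```python
heights = {60: 4, 62: 7, 65: 10}

last_item = heights.popitem()   # removes and returns the last inserted pair as a tuple
value = heights.pop(60)         # removes key 60 and returns its value
left = dict(heights)            # snapshot of what remains

heights.clear()                 # removes every item

print(last_item, value, left, heights)
```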


To determine the number of elements in the dictionary:


We use the len() function to determine the number of items inside a dictionary

Other methods of dictionaries:
.get() – gives the value associated with the key name that is mentioned.
.keys() – returns an object containing all the keys of the dictionary.
.values() – returns an object containing all the values of a dictionary.
.items() – returns an object containing a (key, value) tuple for every item.
.update() – adds the specified key:value items to the dictionary, changing it in place.
.copy() – returns a copy of the dictionary.
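
A minimal sketch of these methods together with len(), on an assumed dictionary:

```python
heights = {60: 4, 62: 7}

found = heights.get(62)            # value stored under key 62
missing = heights.get(70)          # a key that is absent gives None instead of an error
keys = list(heights.keys())        # wrapped in list() for easy printing
values = list(heights.values())
pairs = list(heights.items())

snapshot = heights.copy()          # independent copy taken before the update
heights.update({65: 10})           # changes heights itself and returns None
count = len(heights)               # number of items after the update

print(found, missing, keys, values, pairs, snapshot, count)
```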

Converting a list to set:


A list can be converted to a set simply by using the set() function. Once done, the output
will show the result in curly braces { } instead of square brackets [ ]. Any duplicate
values are dropped in the process.
Converting a set to list:
A set can be converted to a list simply by using the list() function. Once done, the output
will show the result in square brackets [ ] instead of curly braces { }.
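
Both conversions in a minimal sketch with assumed values. Because sets are unordered, sorted() is used at the end to get a predictable order.

```python
colors = ["red", "blue", "red", "green"]

unique_colors = set(colors)            # duplicates are dropped
back_to_list = list(unique_colors)     # a list again, in no particular order
ordered = sorted(unique_colors)        # a list in alphabetical order

print(unique_colors)
print(back_to_list)
print(ordered)
```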

Converting a list to tuple


You can use tuple() to convert a list into a tuple, and list() to convert a tuple back into a
list. There is another approach to get the same output, that is, by using (*list_name, ).
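
A minimal sketch of both approaches with an assumed list:

```python
list_name = [1, 2, 3]

with_function = tuple(list_name)     # using the tuple() function
with_unpacking = (*list_name, )      # unpacking the list inside round brackets
back_again = list(with_function)     # and back to a list

print(with_function, with_unpacking, back_again)
```

The trailing comma in (*list_name, ) is required; it is what makes the brackets a tuple.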

Converting Dictionaries to lists:
The .items() method, wrapped in list(), converts a dictionary to a list of (key, value) tuples.
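
A minimal sketch with an assumed dictionary:

```python
heights = {60: 4, 62: 7, 65: 10}

pairs = list(heights.items())       # list of (key, value) tuples
only_keys = list(heights)           # list() on a dictionary gives its keys
only_values = list(heights.values())

print(pairs, only_keys, only_values)
```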

Similarly, for the most part, you can convert one data structure to another easily. It must also
be noted that sometimes loops are also used to convert an object from one data structure to
another.

In a gist:
Category       Lists          Tuples         Sets              Dictionaries
Mutability     Mutable        Immutable      Mutable           Mutable
Ordered        Ordered        Ordered        Unordered         Insertion-ordered (Python 3.7+)
Index-access   Yes            Yes            No                No (access is by key)
Braces         []             ()             {}                {}
Duplicates     Can contain    Can contain    Cannot contain    Cannot contain
               duplicates     duplicates     duplicates        duplicate keys

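Doubts:
In numeric:
iii. long - a separate type only in Python 2; in Python 3, int covers it and has no fixed
size limit. (A 32 bit signed number ranges between -2147483648 and +2147483647.)
iv. complex - complex numbers having a real part (x) and an imaginary part (y),
represented in Python as x+yj.
Should I write Bitwise operators
Are dictionaries ordered? They keep insertion order as a guaranteed language feature from
Python 3.7 (in CPython 3.6 it was an implementation detail); before that they were unordered.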

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Conditional Statements

Unit 3
Conditional Statements

Overview:
Most times, even the simplest questions we ask have two answers: yes or no, this or that. In
situations like these, conditional statements come in. They are used when there are multiple
solutions to a problem, or when the most appropriate solution has to be picked out of a bunch
of options. Conditional statements are used in every domain to arrive at one suitable outcome.

Outcomes:
1. What are conditional statements?
2. Examples of statements.
3. If statements:
- Meaning of If-Else.
- Syntax of If-else.
- Meaning of If-else if/elif statements.
- Syntax of else if statements.
- Meaning of nested If-else statements.
- Syntax of nested else-if statements.
- pass statement
4. Loops:
- While loop.
- Syntax of while loop.
- range() function
- For loops.
- Syntax of for loops.
- break statement
- continue statement
- nested while loops
- nested for loops
- difference between for loop and while loop.

Conditional Statements:
As the name suggests, conditional statements are those which depend on certain
conditions. If a program has to choose between two or more options, conditions
determine its output. They are used to control the
flow of the program. To see if a condition is true or not,
comparison operators are used. Keywords in conditional statements are: if, else, and elif.
Indentation in Python is important when conditional statements are used, because it defines
which statements belong to each block.

• If-else:
When a question has one outcome if a condition holds and another outcome otherwise, an
if-else statement comes in handy. The first block of code is
executed only when the condition is true; otherwise the second block, contained inside the
‘else:’, is executed. If the condition is false, the interpreter does not enter the first
block of code.
For example, a child is told that he would get a chocolate if he does his homework. Here
the condition is the homework: if it is done, then the boy gets a chocolate, otherwise he
doesn’t. Similarly, in the world of coding, if-else statements help to execute programs where
either-or or this-or-that types of problems exist.
Here are a few more examples of If-else statements in programs:
Here are a few more examples of If-else statements in programs:
1. A person must get grade ‘Pass’ if they score above 45 in a 100 marks test. Otherwise,
they must get a ‘Fail’.
2. If a person’s salary is more than Rs. 200000, then display ‘Pay Tax’; otherwise,
display ‘Not liable to Pay tax’.
3. To check whether a number is positive or negative. If the integer is more than zero, then it
is positive; if it is less than zero, then it is negative.

Syntax of an If-else statement:

if (condition):
    code of statements if the given condition is true
else:
    code of statements if the given condition is false

Example 1
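
The original screenshot is not available. As a stand-in, here is a minimal sketch of the Pass/Fail case from the list above; the variable names and the mark are assumptions.

```python
marks = 62

if marks > 45:
    result = "Pass"     # runs only when the condition is true
else:
    result = "Fail"     # runs only when the condition is false

print(result)
```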

Example 2
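
Again as a stand-in for the missing screenshot, a minimal sketch of the salary case from the list above; the salary figure is an assumption.

```python
salary = 150000

if salary > 200000:
    message = "Pay Tax"
else:
    message = "Not liable to Pay tax"

print(message)
```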


• If-elif conditional statements:

The syntax given above is easy to implement in cases where we have two outcomes, but
what if there are more than two? For instance, suppose we have to divide the grading system of
a subject into more than two parts. Those who score between 80-100 get “Grade A”,
those who score from 60 up to 80 get “Grade B”, those scoring from 40 up to 60 get “Grade C”
and the rest get “Not passed”. For such questions we use something called if-else if
statements. In Python, we call them if-elif statements.
Syntax of if-else if statements:
if (condition1):
    Statements if condition 1 is true.
elif (condition2):
    Statements if condition 2 is true.
elif (condition3):
    Statements if condition 3 is true.

elif (condition N):
    Statements if condition N is true.
else:
    Statements if all the above conditions are false.

Example 1
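
A minimal sketch of the grading system described earlier. The function name grade is made up for this illustration, and treating a score of exactly 80, 60 or 40 as belonging to the higher grade is an assumption.

```python
def grade(score):
    """Return the grade for a score out of 100."""
    if score >= 80:
        return "Grade A"
    elif score >= 60:       # reached only when score is below 80
        return "Grade B"
    elif score >= 40:       # reached only when score is below 60
        return "Grade C"
    else:
        return "Not passed"


for score in (92, 75, 40, 12):
    print(score, grade(score))
```

The conditions are checked from top to bottom and only the first one that is true is executed, which is why each elif does not need to repeat the upper limit.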

Note that the code will run even without the else part of the if-elif statement. You can also
combine two conditions in the same condition statement using logical and identity
operators.

• Nested if-else statements:

A nested if statement is an if statement within an if statement.
Syntax of a nested if statement:
if (condition1):
    if (condition 1.1):
        Set of statements 1.
    else:
        Set of statements 2.
elif (condition2):
    if (condition 2.1):
        Set of statements 3.
    else:
        Set of statements 4.
else:
    if (condition 3.1):
        Set of statements 5.
    else:
        Set of statements 6.
Example 1
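
The original screenshot is missing; the sketch below is an assumed example that follows the same shape. The function name describe is hypothetical.

```python
def describe(number):
    """Describe an integer using nested if-else statements."""
    if number > 0:
        # this inner if is checked only when the outer condition is true
        if number % 2 == 0:
            return "positive and even"
        else:
            return "positive and odd"
    elif number < 0:
        if number % 2 == 0:
            return "negative and even"
        else:
            return "negative and odd"
    else:
        return "zero"


for number in (8, 3, -4, -7, 0):
    print(number, describe(number))
```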

In the example, when the first condition is true, the interpreter enters its block.
Next, the if inside it is checked. In case the if within the bigger if is true, the
statements inside the former are executed; otherwise its else block is executed. The
interpreter enters any block of code if and only if the condition on which its execution
depends is true.

Pass Statement:
The pass statement can be used when you don’t yet have statements to put inside a block of
code. A pass statement does not have any impact on the program. The interpreter just continues
to the next statement when it reads a pass statement.
pass is a keyword in Python.
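
A minimal sketch with an assumed variable:

```python
number = 5

if number > 3:
    pass    # a placeholder; without it, the empty block would be a syntax error
```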

The example shows how pass is used. The program shows no display, meaning it runs
without encountering errors; since the if block contains only pass and no other statements,
the output is empty. It is used to execute nothing.

Loops:
When you have a set of statements that needs to be executed multiple times, typing it every
time will be a task. Let’s say you want to print numbers from 1 to 100; typing all the numbers
will take a lot of human time. But the program is fast, and it doesn’t take as long as humans do.
This is where the concept of loops comes into the picture. A loop usually depends on a
variable that is initialised beforehand, and runs the code the number of times specified. Every
run of the loop is called an iteration. For n iterations, the loop will run n times and the
code within the loop will also run n times, making it possible for us to run multiple
statements again and again. All of it in a jiffy!

• While loop:
In a while loop, the set of statements keeps on repeating as long as the condition is true. As
soon as the condition becomes false, the loop ends and the code written after the loop is
executed. It is used in cases when the number of iterations is unknown.
For a while loop, you first initialise a variable (conventionally i=0). Then the while condition
is written. Inside the loop, the statements that are to be executed in every iteration are
written. One must note that it is important to increment or decrement the initialised
variable, otherwise the loop will become an infinite loop.

Increment means increasing the value of the variable, decrement means decreasing the
value of a variable.

Syntax of a while loop:


i=0 #you can use any variable instead of i
while(Condition1):
    #code
    i+=1

Example 1
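
The original screenshot is missing; this is a reconstruction from the description that accompanies it (i starts at 0, n is 10, and Hello is printed while i <= n).

```python
i = 0
n = 10

while i <= n:
    print("Hello")
    i += 1    # without this line the condition would stay true forever
```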

The above example says that i is initialised with value 0, and n with value 10. The while
condition is then checked. According to the example, if i <= n, which in this case is true, it
enters the loop. It prints Hello for the first time, and goes to the next line. i+=1 means that i
is incremented. Now the value of i changes from 0 to 1. Again, the while condition is checked. If
the condition i <= n is satisfied, then the statements inside the loop are executed again.
Since it is true, Hello is printed the second time and i is incremented again. This continues
until the value of i becomes 11. In that case, the condition i <= n will not be satisfied because
11 is not less than or equal to 10. Thus the loop will end there, after printing Hello 11 times.

Other examples where while loop can be used:


1. To keep printing numbers from 1 to 100.
2. To print prime numbers between 1 and 500.
3. To print multiples of 18 between 200 and 500
4. Printing factorials of numbers in descending order, etc.

range() function:
Before doing for loops, one must know what the range() function does. The range() function is
an in-built Python function. In range(n, m), the values run from n up to m-1; m itself is
not included.
We can also increment the values by a different integer by specifying it as a third argument
in the range. For example, the statement i in range(1,20,2) will run from 1 to 19, incrementing
by 2.
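
A minimal sketch; wrapping range() in list() makes the generated values visible. The first pair of numbers is an assumption chosen for illustration.

```python
simple = list(range(2, 6))        # starts at 2 and stops before 6
stepped = list(range(1, 20, 2))   # starts at 1, stops before 20, goes up in steps of 2
from_zero = list(range(5))        # with one argument, counting starts from 0

print(simple)
print(stepped)
print(from_zero)
```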

• For loop:
Like the while loop, the for loop is used to keep repeating a block of
statements multiple times. The number of iterations in a for loop is known in advance; that is
the advantage of using a for loop over a while loop.
You usually loop over a sequence, a range or a data structure like a list or tuple.
Syntax of a for loop:

i =0
for i in sequence/datastructue :
Block of statements.
….
Statements.
Parts of the code:
1. i = 0 : it is called the iterator variable. It is a variable initialised before the for loop.
2. for i in seq/data_structure : in is a keyword in python. The statement reads as ‘for
loop with iteration variable i in the sequence specified’. Here i begins from the
initialised element position in the sequence and then the interpreter enters the
block of codes within the loop, the iteration variable increments after every loop
until the condition specified is false.
3. Set of codes: as long as the iterative variables satisfy the range/sequence mentioned,
chanddra.p@gmail.com
UVBL5MQSJ8
the set of codes inside the for loop is executed.

Example 1

In the above example, i is initialised as 0. Then the next line containing the for loop is
executed. i takes the first value of the sequence mentioned and the block of code within
the for loop is executed. After each pass, i moves to the next value and the code inside
the loop is executed again, until the last value of the range/sequence has been used. For
the given loop, the block is executed 10 times, for the values 0 to 9. The loop runs
until the (n-1)th number, which in this case is 9.
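
A minimal sketch of such a loop, assuming the example iterates over range(10); the list visited is added here only to record which values the loop produced.

```python
n = 10
visited = []  # illustrative: records every value taken by i

for i in range(n):
    print(i)
    visited.append(i)

print("Number of iterations:", len(visited))
```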

break statement:
The break statement is used to terminate a loop early, before its condition becomes
false or its sequence is exhausted.
Syntax:

for i in range/sequence_name/list_name:
    set of statements
    break

statements

According to the above syntax, the for loop runs as usual, but when it encounters a
break statement it exits the loop without completing the remaining iterations and
starts executing the statements that follow the loop. break is a keyword that interrupts
the flow of statements inside a loop.

Example 1

In the above example, the iteration variable is i; the loop begins at 1 and is set to
repeat up to 20. On each pass, the current value of i is printed. Then the if condition is
checked. If i is divisible by 7, the interpreter enters the if block; otherwise it enters
the else block. Note that the keyword continue is used in the else block to move straight
on to the next iteration without executing any further statements. The loop carries on in
this way until i reaches 7, at which point the if condition is satisfied and the
interpreter enters the if block. First it prints “code ends here”, then it reaches break.
The break statement ends the loop: the interpreter exits the loop even though the
iteration variable is still within the sequence/range limit. It then executes the
statements written after the loop.
Thus, break terminates the loop.
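
The behaviour described above can be sketched as follows. This is a reconstruction from the description rather than the original listing; the list printed and the final message are illustrative additions.

```python
printed = []  # illustrative: records the values printed before the loop stops

for i in range(1, 21):
    print(i)
    printed.append(i)
    if i % 7 == 0:
        print("code ends here")
        break      # leave the loop immediately
    else:
        continue   # go straight to the next iteration

print("statements after the loop run next")
```

Although the range goes up to 20, the loop stops as soon as i is 7, so the values 8 to 20 are never reached.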

Nested loops:
A loop within a loop is called a nested loop.
Example of a nested while loop:


In the above example, i is initialised at 1. For the first while loop, 1 <= 30 is true,
hence the interpreter enters the loop. The condition of the second while loop is then
checked. Since j <= 15, the block of statements inside the inner loop is executed: i+2
and j are printed and then both variables are incremented. Loops of this type, one
within another, are called nested loops.
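
A small sketch of a nested while loop in the same spirit. The bounds are reduced to 3 and 2 (instead of 30 and 15) to keep the output short, and the placement of the increments is an assumption, since the original listing is not reproduced here.

```python
pairs = []  # illustrative: records what is printed

i = 1
while i <= 3:              # outer loop
    j = 1                  # the inner counter restarts on every outer pass
    while j <= 2:          # inner loop
        print(i + 2, j)
        pairs.append((i + 2, j))
        j += 1
    i += 1
```

For every value of i, the inner loop runs to completion before the outer loop moves on.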

Example of a nested for loop:

In the above example, i and j are the iteration variables. They are initialised at 0. The
interpreter enters the outer for loop and executes the code within it. It then enters the
inner loop, whose code is executed once for every value of j. When the values for j are
exhausted, the inner loop terminates, and the outer loop moves on to its next value,
restarting the inner loop each time, until its own values are exhausted.
Loops within loops are called nested loops.
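
A minimal sketch of a nested for loop; the ranges 3 and 2 and the list pairs are assumptions chosen for illustration.

```python
pairs = []  # illustrative: records every (i, j) combination

for i in range(3):         # outer loop
    for j in range(2):     # inner loop runs fully for each value of i
        print(i, j)
        pairs.append((i, j))
```

The inner loop body runs 3 x 2 = 6 times in total.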
Difference between While and For loop:

Factor      While loop                               For loop

Syntax      i = 0  # or any iteration variable       for i in sequence/data structure:
            while condition:                             Block of statements.
                # code                                   ….
                i += 1                                   Statements.

Format      The variable is initialised before       The iteration variable and the
            the loop, the condition is checked       sequence are both written at the
            at the beginning, and the update         beginning of the loop.
            is written inside the loop body.

Condition   The condition is essential: leaving      No separate condition is written;
            it out is an error, and a condition      the loop stops when the sequence
            that never becomes false makes the       is exhausted.
            loop run indefinitely.

Iterations  The number of iterations is usually      The number of iterations is
            unknown in advance.                      known.

Speed       A while loop is typically slower         A for loop is typically faster
            than an equivalent for loop.             than an equivalent while loop.

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Functions in Python

Unit 4
Functions in Python
Overview:
Let’s say you want to invoke/call a piece of code multiple times in your
program. When it is to be called consecutively, we use loops. But what about
instances when you have to invoke the same set of statements at the beginning
of the program and then again at the end? It becomes tedious to keep copying
the same code again and again. It also makes the program longer and harder to
maintain. To solve issues like these, functions are used: they let us reuse the
same piece of code in different parts of the program.
Unit Outcomes:
1. Functions:
- What are functions?
- What are pre-defined functions in Python?
2. User-defined functions:
- What are user defined functions?
- Syntax of a user-defined function.
- Parts of a function.
- return keyword
3. Lambda functions:
- What are lambda functions?
- Why are lambda functions used?
- Syntax of lambda functions.
- Examples.

Functions:
Functions are used when we want to call a particular set of statements in
different parts of a program, for instance when we want to compute the sum of
two numbers more than once in a program. This is easy once the function has
been declared, which is done once; the function is then simply called wherever
it is needed. Functions help make the code reusable. In many OOP languages,
functions that belong to a class are called methods. The body of a function needs
to be properly indented for the proper flow of statements.

Pre-defined/ Built-in functions in Python:


For ease of use, Python has some built-in functions, which means that there are
functions already present in the Python interpreter. These functions are
available to be used readily. Even print() is a function. Python 3 has roughly 70
built-in functions; the exact number varies slightly between versions.
Some of those functions are abs(), which gives the absolute value of the
number entered within the parentheses; range(), which returns a sequence of
integers from a starting point up to, but not including, an end point; and str(),
which converts an object to a string. These are functions we have already used.
There are many more pre-defined functions that haven’t been used in the
course but help make the code concise and direct.
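
A brief sketch of the three built-in functions named above; the variable names and input values are illustrative.

```python
distance = abs(-7.5)          # absolute value of a number
numbers = list(range(2, 6))   # integers from 2 up to, but not including, 6
label = str(42)               # converts the integer 42 to the string "42"

print(distance, numbers, label)
```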

User-defined functions:
While there are a lot of pre-defined functions, there can be many other pieces
of code that a user might want to repeat in a program and that are not
pre-defined. For purposes like these, user-defined functions come in handy.
Through a user-defined function, programmers can write their own code in the
function block and call it anywhere in a program, whenever it has to be
reused.

Syntax of a function:

def functionName(argument1, argument2, …):
    set of statements
    return valueToBeReturned

functionName(value1, value2, …)

Explanation of the syntax/Parts of a function:

- This block is called the function definition. It is the part where you
  write all the features and code of the function.
- ‘def’ is a keyword in Python. It is short for define/definition.
- It is followed by functionName, which is the name of the function that
  is going to be defined by the user.
- functionName is followed by parentheses that contain the arguments.
  Arguments are the variables that receive values when the function is
  called and are used within the function.

  These arguments are also called local variables since they can only be
  used within the block of code of the function. They go unrecognised
  outside the function. However, one must note that it is not compulsory
  to write arguments in a Python function.
- The parentheses are followed by a colon. Inside the indented block, enter
  all the conditions/statements for the task that the function was created
  for.
- return is a keyword in Python. It is used to return a value once the
  function code is executed.
- Once you are done writing the code, come outside the indented block. To
  execute the function, you must call it. To call a function, just enter
  functionName(). If you mention arguments in the function definition, then
  the call must also contain values for those arguments. When the function
  is called, control is passed to the block of code defined under the same
  functionName.
- One must note that the number of values passed when the function is
  called must match the number of arguments in the function definition. If
  there are no arguments, then there will be no values inside the
  parentheses of the call.

Example (without arguments, without return type):

Example (without arguments, with return type):


Example (with arguments, with return type):
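
The three forms named in the headings above can be sketched as follows. The original listings are not reproduced here, so the function names greet, get_pi and add, and the values used, are hypothetical.

```python
# Without arguments, without return: the function only prints
def greet():
    print("Hello from a function")

# Without arguments, with return: the function hands a value back
def get_pi():
    return 3.14159

# With arguments, with return: the result depends on the values passed in
def add(a, b):
    return a + b

greet()
print(get_pi())
print(add(20, 5))
```

A function that has no return statement, such as greet, returns None to its caller.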

return keyword:
return is used inside a function. Once a return statement is executed, the function ends. A
function can have multiple return statements; however, as soon as any one of them is
executed, the function is terminated. return cannot be written outside a function. The
expression contained in a return statement is evaluated first, and its value is then
returned to the caller. When a return statement has no value, ‘None’ is returned.
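
A short sketch of these rules; the function names classify and log_only are hypothetical.

```python
def classify(number):
    # Several return statements: the first one executed ends the function
    if number < 0:
        return "negative"
    if number == 0:
        return "zero"
    return "positive"

def log_only(message):
    print(message)
    return  # a bare return gives None back to the caller

print(classify(-3), classify(0), classify(5))
print(log_only("done"))
```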

lambda function:
A lambda function is a function written in a single line. It does not have a
name and can take any number of arguments, but it can only have one
expression in it. It is called an anonymous function because it is nameless.
One must note that, for a one-line expression, a lambda function and a normal
def function give the same output. def is usually used when the function has
multiple lines of code.

Why use lambda functions?


1. lambda reduces the length of the code when compared to a normal
function that is written using def.
2. It can be written and called immediately, in the same expression. Unlike a
def function, it does not need a separate definition before the call.
3. It is used when a function is required only temporarily or for a short time.
Defining Lambda functions:
Instead of using def, here only the keyword lambda is used.

lambda arguments : expression

A lambda function does not have a return statement. Since it is a single-line
function, the value of its expression is returned automatically when the
function is called.

Example:

In the above example, variable x stores the lambda function, which takes the
arguments a, b, c, and d. The argument list is followed by a colon and the
expression that is to be evaluated.
In order to display the output, we write print(x(10,12,14,16)) for example 1,
which means that we call the function stored in x with temporary values for
a, b, c, d and print the result.
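
A minimal sketch of such a lambda. The expression used in the original example is not reproduced here, so the sum a + b + c + d is an assumption made purely for illustration.

```python
# Assumed expression: the sum of the four arguments
x = lambda a, b, c, d: a + b + c + d

print(x(10, 12, 14, 16))
```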
The example can be written through a normal function.


In the above example, the lambda function is written as a def function. The
output is the same; however, the def version takes more lines of code.
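
The equivalence can be sketched as follows, again assuming for illustration that the expression is the sum of the four arguments; add_four is a hypothetical name.

```python
# def version: needs a name, a body and an explicit return
def add_four(a, b, c, d):
    return a + b + c + d

# lambda version of the same (assumed) expression
x = lambda a, b, c, d: a + b + c + d

print(add_four(10, 12, 14, 16))
print(x(10, 12, 14, 16))
```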
Example: With arguments, without return
1. Calculator, factorial of a number, and reverse of a number
To make a basic calculator using functions:

To make a calculator, we can use four functions or simply one function. One
function can contain all the arithmetic operations, or multiple functions can
each contain a different arithmetic operation and be executed separately.

In the above example, the function calculator contains several arithmetic
operations. The function is then called with the arguments a=20 and b=5. When
executed, it gives the output for each operation. A lambda function is stored in
variable x. Since only one expression can be evaluated in a lambda function,
a combination of operations is used with the local variables a and b. The
operations are evaluated according to operator precedence, and operators of
equal precedence are evaluated from left to right. The result is then printed.
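
A sketch of a single-function calculator with the arguments a=20 and b=5. The exact operations and the lambda expression in the original listing are not reproduced here, so both are assumptions.

```python
def calculator(a, b):
    # With arguments, without return: each result is printed directly
    print("Sum:", a + b)
    print("Difference:", a - b)
    print("Product:", a * b)
    print("Quotient:", a / b)

calculator(20, 5)

# Assumed combined expression: * and / bind tighter than + and -
x = lambda a, b: a + b * a - b / a
print(x(20, 5))
```

With a=20 and b=5 the lambda evaluates 20 + (5 * 20) - (5 / 20), because multiplication and division are carried out before addition and subtraction.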

Another form of writing the same program using functions is given above.
Instead of using one function, numerous small functions have been used
where each function performs an arithmetic operation. Either way the output
is the same.
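
The multi-function form can be sketched as follows; the function names and output labels are hypothetical.

```python
# One small function per arithmetic operation (with arguments, without return)
def add(a, b):
    print("Sum:", a + b)

def subtract(a, b):
    print("Difference:", a - b)

def multiply(a, b):
    print("Product:", a * b)

def divide(a, b):
    print("Quotient:", a / b)

add(20, 5)
subtract(20, 5)
multiply(20, 5)
divide(20, 5)
```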
2. Factorial of a number.
The factorial of a number n is the product of all the integers from n down to 1.
For instance, for number=6, you multiply 6*5*4*3*2*1.
Therefore, for any number, factorial = num*(num-1)*…*3*2*1

In the above example, when the function factorial is called with 8 as the
argument value, the interpreter goes inside the function definition of factorial,
with local variable a. Thus, a=8. Another local variable is assigned inside
factorial: s=1. The loop condition is then checked; since it is fulfilled, i.e.
8 != 0, the loop is entered. s is reassigned as s*a, and then a is decremented.
The loop keeps repeating until the value of a is 0; at that point the condition is
no longer fulfilled, the loop terminates, and the factorial stored in s is displayed.
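
The function described above can be sketched as follows. It is a reconstruction from the description; the output label is an assumption.

```python
def factorial(a):
    s = 1
    while a != 0:
        s = s * a   # multiply the running product by the current value
        a -= 1      # then move one step down towards 0
    print("Factorial:", s)

factorial(8)
```

Note that the condition a != 0 assumes a non-negative whole number; a negative argument would never reach 0.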
Examples: Without arguments, with return
1. Finding the sum of all multiples of 5 between 50 and 200:

In the above example, a function summation is defined that computes the sum
of all multiples of 5. For this, a local variable sum is initialised as 0. A for
loop is run over the numbers from 50 to 200. Whenever the iteration variable i
is exactly divisible by 5, i is added to the previous value of sum; otherwise
the loop continues without any changes. Once the loop is terminated, the
function returns the sum.
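
A sketch of the summation function, assuming both end points (50 and 200) are included. The local variable is named total here rather than sum, to avoid hiding Python's built-in sum() function.

```python
def summation():
    total = 0
    for i in range(50, 201):   # 201 so that 200 itself is included
        if i % 5 == 0:
            total += i
    return total

print(summation())
```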

2. Factorial (without arguments, with return)

The same factorial problem is executed without arguments and with return.
num is used as the local variable instead of entering arguments.
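
A sketch of this variant. The value assigned to num in the original listing is not reproduced here, so 8 is an assumption.

```python
def factorial():
    num = 8        # local variable instead of an argument (assumed value)
    s = 1
    while num != 0:
        s = s * num
        num -= 1
    return s       # the result is returned rather than printed

print(factorial())
```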

Examples: Without arguments, without return


1. Finding if a number is a palindrome or not.

A number is said to be a palindrome if the reverse of the number is the
same as the original number.
For instance, 35653 is a palindrome since the reverse of the number is
35653, which is the original number.
292 is another example of a palindrome.

To find a palindrome we use a function and follow a logical order:

i. Define the function.
ii. Initialise all the local/global variables that we might use.
iii. Write a loop to reverse the number.
iv. End the loop and compare the reversed number to the original number.
v. If the reverse = original number, then we call it a palindrome number.
vi. Call the function.
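
The steps above can be sketched as follows. This is a reconstruction under the assumption that the number is held in a global variable named number; the function name and messages are hypothetical.

```python
number = 12321  # global variable read by the function

def palindrome():
    remaining = number
    reverse = 0
    while remaining > 0:
        # move the last digit of remaining onto the end of reverse
        reverse = reverse * 10 + remaining % 10
        remaining //= 10
    if reverse == number:
        print(number, "is a palindrome")
    else:
        print(number, "is not a palindrome")

palindrome()

number = 12345
palindrome()
```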

Number 1: 12321

Number 2: 12345
Example 2:
Armstrong number (without arguments, without return)
When the sum of the cubes of the digits of a three-digit number is equal to the
original number, it is called an Armstrong number.
For instance, $153 = 1^3 + 5^3 + 3^3 = 1 + 125 + 27 = 153$.
Similarly, 370, 371 and 407 are a few other examples of Armstrong numbers.
Four-digit numbers such as 1634 and 9474 are Armstrong numbers under the
general definition, in which each digit is raised to the power of the number of
digits.
Just like finding the palindrome of a number, to find if a number is Armstrong,
we follow a logical order and use functions here to find it:
i. Initialise global variables.
ii. Define a function (here, without arguments, without return).
iii. Initialise local variables.
iv. Run a loop to find the cube of every digit.
v. Run a loop to find the sum of the cubes.
vi. Check if the sum is equal to the original number.
vii. If the sum and the original number are equal, then it is an Armstrong
number.
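
The steps above can be sketched as follows. This reconstruction uses the cube-based check described in the text, assumes the number is held in a global variable named number, and folds the cubing and summing into a single loop; the function name and messages are hypothetical.

```python
number = 153  # global variable read by the function

def armstrong():
    remaining = number
    total = 0
    while remaining > 0:
        digit = remaining % 10
        total += digit ** 3    # add the cube of the current digit
        remaining //= 10
    if total == number:
        print(number, "is an Armstrong number")
    else:
        print(number, "is not an Armstrong number")

armstrong()

number = 1256
armstrong()
```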

Output:

Number = 153
=> $1^3 + 5^3 + 3^3$
=> 153. Thus, the number is an Armstrong number.

Number = 1256
=> $1^3 + 2^3 + 5^3 + 6^3$
= 350 ≠ 1256. Therefore, the number is not an Armstrong number.

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: OOPs in Python

Unit 5:
OOPs IN PYTHON
Overview:
All programming languages are part of a certain paradigm. One paradigm
might lay emphasis on step-by-step execution, while another takes a different
approach. For instance, C follows Procedural Programming, whereas Java and
Python support Object Oriented Programming. Even though all of them are
programming languages, their way of approaching problems is different.

Learning Outcomes:
- Procedural Programming.
- Object Oriented Programming
- OOPs vs Procedural Programming
- Features of OOPs
- Classes:
   - Syntax of classes
   - Example
- Objects
   - Syntax of objects
   - Example
- Difference between classes and objects
- Mini project.

Procedural Programming:
Procedural Programming is a programming paradigm. It breaks a program down
into parts (procedures) and carries it out step by step according to a fixed
procedure. There is less abstraction between the code and the machine. It
treats data and modules/operations separately. Programming languages like C
use procedural programming.
When you give your machine the procedure and the steps, it executes
programs in a top-down approach. The reason why Python is not an entirely
procedural programming language is that procedural programs don’t protect
their data as well as programs in other paradigms do. Procedural programming
gives more emphasis to how an operation is supposed to be conducted than to
the data that it works on.
For such reasons, Object Oriented Programming is used.

Object Oriented Programming:


Python is a multi-paradigm programming language. It uses classes and objects
in its programs. Object Oriented Programming approaches programming
problems by creating objects. Unlike procedural programming, which is not
modelled on the real world, it helps in modelling real-world entities and
supports concepts like polymorphism, encapsulation, etc.
These concepts are usually used to increase readability and code efficiency.
OOPs concepts also help us in reusing code, thus reducing the size of the
program, and help in finding solutions to problems. Object-oriented code is
also easy to maintain. A lot of applications today are developed using an
object-oriented approach.
When we say real-world entities above, we mean how one thing that is
distinct has its own features and attributes. For instance, a bike is a distinct
entity and its attributes include size, color, mileage, etc.

The major differences between the two approaches are:
1. The way they divide a program. OOPs divides a program into objects.
Procedural programming divides it into functions.
2. OOPs follows a bottom-up approach for its programming problems,
unlike procedural programming, which follows a top-down approach.
3. In OOPs there are access specifiers, which tell us where and in what parts
of a program a function will be accessible. Access specifiers are keywords
like private, public, etc. in languages such as Java and C++; Python has no
such keywords and relies on naming conventions instead. This feature is
not available in procedural programming.
4. OOPs is more secure than procedural programming because it
provides the option of data hiding.
5. OOPs lays emphasis on the data over the procedure. Procedural
programming lays emphasis on the procedure over the data.
6. OOPs models real-world entities, whereas procedural programming
does not.

Object Oriented Programming has 5 main features:


- Classes
- Objects
- Inheritance
- Polymorphism
- Encapsulation
A few other features include method, instance, dynamic binding, message passing and data
abstraction.

Class:
A class is said to be a blueprint for an object. It is a user-defined data type. It
describes a collection of objects of the same kind. Data members and functions
are held within a class.
For example, in a company, we evaluate two departments, HR and Accounting,
and we want to check which department a person works in. We write the
program. If, say, we enter the employee ID of person XYZ, we need to retrieve
the name of the employee and the department in which he works. For one
entry, it is easy to do so, but for multiple entries, in the 1000s, it becomes very
difficult.
In such situations we group entries in something called a class, which is a
feature of OOPs.
In the above example, the class can be employee_details() and it could contain
the employee ID, name, and the department in which he works. Therefore, we
can say that a class is a blueprint because it contains the attributes and
functions of an object. It helps organise the data that defines the capability of
an object, and it helps a programmer to reuse elements while making new
instances of the class.

Syntax of Class:

class employee_details():

To define a class, you simply write class, which is a keyword, and then
className() followed by a colon. Inside the class you can write multiple
functions that are part of the class. According to access specifiers, the
functions within the class are accessible or inaccessible to the rest of the
program. Inside a class, make sure that everything is properly indented, just
like in a conditional statement.

Example of a class:
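
A minimal sketch of a class definition. The original listing is not reproduced here, so the attribute department, its value and the method show_department are assumptions made for illustration.

```python
class employee_details():
    # A string stored directly in the class (assumed value)
    department = "HR"

    # A function that is part of the class
    def show_department(self):
        print("Department:", self.department)

print(employee_details.department)
```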

Objects:
A class definition on its own holds no object data; memory for the data is
allocated only when an object of the class is created. Instantiation is creating
an object of a class. The object, or instance, contains the actual information.

Syntax to create an object

object_name = class_name()

Creating an object of a class:

In the above example, to create an object of a class, simply assign
classname() to the object name. To display the value of the string inside
employee_details, you must simply print objectName.variableName.
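
Creating an object and reading a value from it can be sketched as follows. The attribute name employee_name and its value are assumptions, since the original listing is not reproduced here.

```python
class employee_details():
    employee_name = "XYZ"   # assumed string stored in the class

# object_name = class_name()
obj = employee_details()

# objectName.variableName
print(obj.employee_name)
```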


Difference between an object and class:

Class                                     Object

A class is like a template for object     An object is an instance of a class.
declaration.

Classes are not allocated memory          Objects are allocated memory when
upon creation.                            they are created.

Classes are declared once.                Objects have to be declared
                                          numerous times as and when
                                          needed.

Classes cannot be manipulated.            Objects can be manipulated.

class is a keyword.                       There is no keyword to create
                                          objects.

Syntax:                                   Syntax:
class class_name():                       object_name = class_name()

Class is used to bind all the data into   Objects are like variables of class.
a single unit.

It is a logical entity.                   It is a physical entity.

Example:                                  Example:
employee_details is a class.              employee_id, employee_name,
                                          department.

The above examples are simple and are not usually used in python. For real
world applications, we use the __init__() function.
__init__() function:
All classes have an __init__() function that is executed when an object of the
class is initialised. It is called automatically whenever a class is used to create
a new object.
The __init__() function is similar to a constructor. Constructors in Java are
used to initialise the state of the object. A constructor contains a set of
statements that are executed when the object is instantiated.

Example:

In the above example, the class is called employee_details. An employee’s name
is stored via the object obj. While creating the object, “Yash” is passed as an
argument. This argument is passed to the __init__ method to initialise the
object. ‘self’ represents the instance of the class and binds the attributes to
the given arguments. Similarly, many objects of the employee_details class can
be created by passing different employee names as arguments.
The first argument of every method in a class, including __init__, is always a
reference to the current instance of that class. Conventionally, it is named
‘self’. In __init__, self means the newly created object, and in other methods it
means the instance whose method was called. It represents the object that
carries the attributes defined by the class. One must note that the name self is
not compulsory; the argument can have any name. However, the first argument
of a method is always the reference to the object. Every method that is called
on an object, including __init__, must declare this first argument; if it is left
out, calling the method raises an error.
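
The example described above can be sketched as follows. It is a reconstruction from the description; the attribute name and the method show are assumptions.

```python
class employee_details():
    def __init__(self, name):
        # self is the newly created object; bind the argument to it
        self.name = name

    def show(self):
        # here self is the instance on which show() was called
        print("Employee name:", self.name)

obj = employee_details("Yash")   # __init__ runs automatically
obj.show()
```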

Mini Project:

Example 2: Display 3 book names and their page counts using classes and
objects.

In the above example, we declare a class called book. Inside the class we
define the __init__() function. We define self and the other arguments within
__init__, and then we define the description function, which displays the book
name and the page count of the book. We make 3 objects with their respective
names and page counts. Then we call the description function on each object.
The book name and page count for each book are set via the objects b1, b2, b3.
While creating the objects, (“Gone with the wind”, “1448”), (“Romeo and
Juliet”, “92”), (“Merchant Of Venice”, “173”) are passed as arguments. These
arguments are passed to the __init__ method to initialise each object. ‘self’
represents the instance of the class and binds the attributes to the given
arguments.
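
The mini project described above can be sketched as follows. It is a reconstruction from the description; the attribute names and the wording of the printed line are assumptions.

```python
class book():
    def __init__(self, name, pages):
        self.name = name
        self.pages = pages

    def description(self):
        print(self.name, "has", self.pages, "pages")

# Page counts are passed as strings, as in the description above
b1 = book("Gone with the wind", "1448")
b2 = book("Romeo and Juliet", "92")
b3 = book("Merchant Of Venice", "173")

for b in (b1, b2, b3):
    b.description()
```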

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Inheritance

Unit 6
Inheritance
Overview:
In real life, it is easy to say that a child takes after their parent or guardian because of
genes and their social environment. This is true for all species: an offspring is very similar
to its parent or to whoever it has grown up under. But when it comes to programs, one piece
of code does not automatically take after another, even when we use it multiple times. There
are, however, times when we want one class to have the features of another class. In Python
and other object-oriented programming languages, we write code to do this, so that one class
inherits the qualities or features of another class.

Unit Outcomes:
- Inheritance: Introduction.
- Uses of inheritance.
- Advantages of using inheritance
- Types of Inheritance:
1. Single inheritance:
I. Definition, syntax.
II. Example.
2. Multiple inheritance:
I. Definition. Syntax.
II. Example.
3. Multi-level inheritance:
I. Definition, Syntax.
II. Example
- Difference between different types of inheritance.
- Method overriding:
1. Meaning
2. Example: in Simple Inheritance, in Multiple Inheritance.
- Diamond Problem:
1. What is it?
2. Problem explanation
3. Pseudo-code
4. Code.

Inheritance:
When one class derives the qualities of another class, we call it inheritance. It increases
the reusability of code. You need not write the same code again and again; you can use the
features of an existing class by inheriting from it in the new class.
For example, let’s say you write a class called calculator which does basic arithmetic
operations, and you define another class ‘numbers’ which checks if a number is an Armstrong
number or a Ramanujan number, or computes its factorial. You can use features of the
calculator class, like addition and multiplication, in the numbers class. All you need to do
is make the numbers class inherit the calculator class’s properties.
There are two parts in a program that uses inheritance:
- Parent class: the class whose attributes and methods are used in other classes. It is
also called the base class.
- Child class: the class that inherits the attributes and methods of the parent class. It is
also called the derived class.

One must note that inheritance is transitive in nature. This means that if class X
inherits qualities of class Y, then all subclasses of class X will also contain qualities of class Y.
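The calculator and numbers idea can be sketched as follows. The method names and the third class SpecialNumbers are hypothetical; they are only there to show reuse and the transitive nature of inheritance.

```python
class Calculator:
    def add(self, a, b):
        return a + b

    def multiply(self, a, b):
        return a * b


class Numbers(Calculator):
    def factorial(self, n):
        result = 1
        for i in range(2, n + 1):
            # multiply() is not written here, it is inherited from Calculator
            result = self.multiply(result, i)
        return result


class SpecialNumbers(Numbers):
    # inherits from Numbers and therefore, transitively, from Calculator
    pass


n = SpecialNumbers()
print(n.factorial(5))   # 120
print(n.add(2, 3))      # 5
```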

Uses of Inheritance in a program:


1. It allows code reusability. When you reuse code, the program contains less duplicated
code, which makes it shorter and easier to maintain.
2. Features of a class can be extended by overriding. Overriding in Python allows the
subclass to provide its own implementation of a method that is already defined in one
of its parent classes. When a method in the child class has the same name as a method
in its parent class, the method in the child class is said to override the method in
the parent class. Python matches the two methods by name only; it does not compare
parameters or return types.
3. Method overriding is also called runtime polymorphism. This means that even though
the method names and parameters are the same, the method that is executed depends on
the type of object used.

Advantages of Inheritance:
1. It is used to share existing features of a class.
2. Inheritance can be used to arrange classes and their methods in a hierarchical form.
When more than one class is derived from the same base class, it is called hierarchical
inheritance.
3. Due to reusability of code, code redundancy is also reduced, increasing elegance.
4. It makes the code more flexible. In case you have to change one detail, you only
have to change it in the parent class instead of in the numerous places you would
have had to change, had you not used inheritance.
5. Code length and duplication are reduced.

The syntax is dependent on the type of inheritance you use.


Types of Inheritance: the difference mainly arises from the number of parent classes and the
number of levels of inheritance you use in the program.

1. Single Inheritance:
When the features of the child class are derived from only one class(parent), then it is called
single inheritance.

Syntax of single inheritance:

class parentClassName:
    # statements

class childClassName(parentClassName):
    # statements

objectName = childClassName()

[Figure: Parent Class -> Child Class]

Example:

To see an example of single inheritance, we make one class inherit the properties of another
class. We declare a parent class and then a child class, after which we create an object of
the child class. Once that is done, we can call methods of the parent class via objects of the
child class.
In the given example, we declare a parent class called automobiles. We write a method within
it, taking self as its argument, which tells the types of automobiles there are. Since cars
are a subtype of automobiles, we make another class, cars, which inherits the properties of
the automobiles class. Inside it we define a method that lists the types of cars that exist.
Then we create an object of the child class, i.e. cars. We call the method of the child class
via the object. But because of inheritance, we can also call methods of the parent class via
objects of the child class.
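A minimal sketch of the automobiles and cars example described above. The method names automobile_types and car_types are assumptions; the strings follow the ones used later in this unit.

```python
class Automobiles:
    def automobile_types(self):
        return "Automobile types: Cars, Trucks, Buses, Rickshaws"


class Cars(Automobiles):          # Cars is the child of Automobiles
    def car_types(self):
        return "Car types: Hatchbacks, sedans, SUV, XUV"


obj = Cars()
print(obj.car_types())            # method defined in the child class
print(obj.automobile_types())     # method inherited from the parent class
```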


Using the __init__() function:


Example 2:
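A sketch of this example, reconstructed from the names used in the explanation (Automobiles, Cars, value, value1, disp, types). The bodies of disp and types are assumptions.

```python
class Automobiles:
    def __init__(self):
        self.value = "Automobile types: Cars, Trucks, Buses, Rickshaws"

    def disp(self):
        print(self.value)


class Cars(Automobiles):
    def __init__(self):
        # run the initialiser of the parent class first
        Automobiles.__init__(self)
        self.value1 = "Car types: Hatchbacks, sedans, SUV, XUV"

    def types(self):
        print(self.value1)


obj = Cars()
print(obj.value)    # set by Automobiles.__init__
print(obj.value1)   # set by Cars.__init__
```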

In the above example, we create a class Automobiles, which is the parent class. Inside it we
write the __init__() method and the disp method. We create another class, called Cars, which
is the child class and inherits features from Automobiles. In it we define the __init__()
method and another method called types. Then the object is created with obj=Cars(), which
takes the program into the __init__() method of the Cars class, with self as its argument.
There, Automobiles.__init__(self) redirects the program to the __init__() method of the
Automobiles class, where the attribute value is assigned the automobile types. __init__()
returns nothing, so the program continues. The control returns to the line after
Automobiles.__init__(self) in the Cars class, where the attribute value1 is assigned the
string “Car types: Hatchbacks, sedans, SUV, XUV”. Nothing is displayed so far, because
neither __init__() method prints anything. Once obj=Cars() has been executed, the control
goes to the statement that prints obj.value, which we saw above was assigned
“Automobile types: Cars, Trucks, Buses, Rickshaws”, and similarly obj.value1 is printed.

2. Multiple Inheritance:
When there are two or more parent classes of one derived class, it is called multiple
inheritance. As the name suggests, one class takes after more than one class.
Syntax of multiple inheritance:

class parentClassName1:
    # statements

class parentClassName2:
    # statements

class childClassName(parentClassName1, parentClassName2):
    # statements

objectName = childClassName()

[Figure: Parent Class 1 and Parent Class 2 -> Child Class / Derived Class]


Example:
In multiple inheritance, we declare two or more classes and define methods inside each of
them. These classes act as the parent classes. Then we declare the derived class, which
inherits the features of the aforementioned base classes, and define its own methods within
it. An object of the derived class is then made, and methods of the parent classes can be
called using the same object of the derived class. It enables reusability and makes the code
more efficient.

In the example given below, a class called calc1 is defined with a method that returns the
product of two local variables a and b. Similarly, calc2 is another class defined with a
method that returns the quotient of two local variables. Then a derived class is declared
which inherits the features and methods of both classes calc1 and calc2. The derived class
also has a method which returns the modulus (remainder) of two local variables a and b. Then
an object of the derived class is made. The same object is used to call and print the
methods mult() and divide() of the respective parent classes.
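A minimal sketch of the calculator example described above. The classes are written as Calc1 and Calc2, and the class name Derived, the method name modulus and the values of a and b are assumptions.

```python
class Calc1:
    def mult(self):
        a, b = 12, 4          # local variables
        return a * b


class Calc2:
    def divide(self):
        a, b = 12, 4
        return a / b


class Derived(Calc1, Calc2):  # inherits from both parent classes
    def modulus(self):
        a, b = 12, 4
        return a % b


obj = Derived()
print(obj.mult())      # 48, from Calc1
print(obj.divide())    # 3.0, from Calc2
print(obj.modulus())   # 0, from Derived itself
```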


3. Multi-level Inheritance:
In this case, the features of the parent class and the child class are further inherited by
another class, let’s say a grandchild class. Basically, there is an intermediary class between
the parent class and the derived class.

For instance, if a Daughter inherits the belongings of her Mother, then the GrandDaughter
(let’s say Anna) inherits the belongings of her own mother, i.e. Daughter, and through her
those of Mother as well.

Example:
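A sketch of this example, reconstructed from the names used in the explanation (Mother, Daughter, GrandDaughter, nameMother, nameDaughter, nameGrandDaughter, name()). The argument order, the sample names other than Anna and the printed wording are assumptions.

```python
class Mother:
    def __init__(self, nameMother):
        self.nameMother = nameMother


class Daughter(Mother):
    def __init__(self, nameDaughter, nameMother):
        self.nameDaughter = nameDaughter
        # hand the remaining value up to the parent class
        Mother.__init__(self, nameMother)


class GrandDaughter(Daughter):
    def __init__(self, nameGrandDaughter, nameDaughter, nameMother):
        self.nameGrandDaughter = nameGrandDaughter
        # hand the remaining values up to the intermediary class
        Daughter.__init__(self, nameDaughter, nameMother)

    def name(self):
        print("Grand daughter:", self.nameGrandDaughter)
        print("Daughter:", self.nameDaughter)
        print("Mother:", self.nameMother)


obj = GrandDaughter("Anna", "Maria", "Rosa")
print(obj.nameMother)   # attribute set two levels up
obj.name()
```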

In the above example, we make a class Mother, which acts as the parent class. We make an
intermediary class called Daughter that inherits from the class Mother. Then we make a
third class, the child class, calling it GrandDaughter, which inherits from Daughter. When
the object of the GrandDaughter class is created, values are passed in as arguments and the
flow of statements goes to the __init__() method of that class. There we initialise the
value of nameGrandDaughter from the input and then call the __init__() method of the
Daughter class. The flow of the program moves to the Daughter class, where the value of
nameDaughter is assigned, and from where the code is redirected to the Mother class.
nameMother is initialised with the value passed in when creating the object. After all of
these statements are executed, nameMother is printed via the object of the child class.
Then the name() function is called and the appropriate statements are printed.

[Figure: Parent Class 1 -> Parent Class 2 -> Child Class / Derived Class]

super() function:
While writing a derived class, the base class can be referred to using the built-in function
super(). It returns a temporary proxy object that gives the derived class access to the
methods of its base class. The method to call is looked up dynamically, while the program
runs.
To gain access to the parent class from the child class, we use the super() function. Inside
a method it is normally called without any arguments. The method to be accessed needs to be
specified, as in super().methodName(arguments).

Advantages of using super():


1. A programmer does not need to name the base class to access its methods. This can be
used in both single and multiple inheritance.
2. It supports modularity and allows code reusability.

Requisites to using the super function:


1. The class and the methods referred to through super() must exist.
2. The arguments passed through super() must match the parameters of the called method.
3. Every time the method is overridden, it must include a call to super() if the parent
version is also meant to run.

Using super function in a program:
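A sketch of this example, reconstructed from the names used in the explanation (Details, Name, name, address, age). The age limit in the if statement, the printed messages and the sample values are assumptions.

```python
class Details:
    def __init__(self, name, address, age):
        self.name = name
        self.address = address
        self.age = age


class Name(Details):
    def __init__(self, name, address, age):
        # super() reaches Details without naming it
        super().__init__(name, address, age)
        if self.age >= 18:
            print("Adult")
        else:
            print("Minor")
        print(self.name, self.address)


obj = Name("Anna", "12 Park Street", 21)
```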


In the above example, we create a parent class by the name of Details. Then we create a
child class named Name, which inherits from the parent class. Then an object is created for
the child class, and values are passed in as arguments. The flow of statements goes to the
__init__() method of the child class. It contains a super() call which redirects the flow to
the __init__() method of the Details class, where the values of name, address and age are
assigned respectively. After that method finishes, the flow of the program goes back to the
next line of the Name class, to the if statement. After age is checked, the respective
statements are displayed. Once the execution of the entire block is done, the name and
address are printed.

Differences between single inheritance and multiple inheritance:

Single Inheritance | Multiple Inheritance
When there is only one base class for the derived class to inherit from, it is called single inheritance. | When the derived class inherits from two or more base classes, it is called multiple inheritance.
Derived class uses features of the base class. | Derived class uses features of all the base classes jointly.
Requires less run time since there is less overhead. | Requires more run time because of more overhead.
Simple. | Complex.

Syntax (single inheritance):
class parentClassName:
    # statements

class childClassName(parentClassName):
    # statements

objectName = childClassName()

Syntax (multiple inheritance):
class parentClassName1:
    # statements

class parentClassName2:
    # statements

class childClassName(parentClassName1, parentClassName2):
    # statements

objectName = childClassName()

Method overriding:
Overriding is the ability of a class to change the implementation of a method provided by
one of its base classes.
Inheritance can best be used with method overriding. It is the ability of an object-oriented
language to allow the child class to provide its own implementation of a method that is
also written in its parent class.
A real-life example that helps to easily understand method overriding would be how a
daughter gets the inheritance of her mother. She gets her mother’s land, her workspace,
and her cars. Maybe sometime later, the daughter chooses to discard the cars and buys
other cars, but still uses her mum’s land and workspace. In this case, she can use method
overriding to use her own set of cars while also using the land and workspace from her
inheritance.
Here, in Python, you make a class called Mum and inside it define methods like land,
workspace and cars. You define another class called Daughter, which inherits class Mum’s
features, and define another method within Daughter by the name cars, containing the
different cars she owns.
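The Mum and Daughter description can be sketched as follows. The class and method names come from the text; the returned strings are assumptions.

```python
class Mum:
    def land(self):
        return "Mum's land"

    def workspace(self):
        return "Mum's workspace"

    def cars(self):
        return "Mum's cars"


class Daughter(Mum):
    def cars(self):
        # same name as Mum.cars, so this version overrides it
        return "Daughter's own cars"


d = Daughter()
print(d.land())        # inherited unchanged
print(d.workspace())   # inherited unchanged
print(d.cars())        # overridden in Daughter
```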

Example 2:

In the above example, the program is written just as for single inheritance. The only
difference here is that the method inside the child class has the same name as the method in
the parent class. This is where overriding comes in. Once the object of the child class is
created and the method with that name is called, it is the child class version that runs and
prints the values mentioned within the child class.

Example: Multiple Inheritance:
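A minimal sketch of overriding with two parent classes. All class and method names here (Cars, TwoWheelers, Bikes, info, wheels) are hypothetical; only the idea that information about bikes replaces information about cars comes from the text.

```python
class Cars:
    def info(self):
        return "Information about cars"


class TwoWheelers:
    def wheels(self):
        return 2


class Bikes(Cars, TwoWheelers):
    def info(self):
        # overrides the info() inherited from Cars
        return "Information about bikes"


obj = Bikes()
print(obj.info())     # Information about bikes
print(obj.wheels())   # 2, inherited from TwoWheelers
```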

Similar to single inheritance, in multiple inheritance one of the methods of a parent class
has the same name as a method in the child class. When the object of the child class is made
and that method is called, Python again uses the overriding method, and instead of
information about cars, information about bikes is displayed.

Diamond problem:
In multiple inheritance, the diamond problem can appear even in the simplest of programs.
In easy language, the diamond problem is an ambiguity that arises when two classes U and V
inherit from X, and class Y inherits from both U and V. If there is a method A which is
overridden in class U, in class V, or in both of them, then it is ambiguous which version of
A class Y should inherit. Python resolves this with its method resolution order (MRO), which
searches the parent classes from left to right in the order they are listed.

Simple Pseudo-code for the problem:


Begin
    define ParentClass1():
        # constructors and functions
    define ParentClass2(ParentClass1):
        # constructors and functions
    define ParentClass3(ParentClass1):
        # constructors and functions
    define Child(ParentClass2, ParentClass3):
        # constructors and functions

    Create an object of class Child, which inherits the features of ParentClass1,
    ParentClass2, and ParentClass3.
    Call methods via the object.
    Display required details
End

With such problems, three cases can arise.

1. When the method is overridden in both classes.
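A minimal sketch of this case, using the class names Parent1, Parent2, Parent3 and Child from this unit. The returned strings are assumptions.

```python
class Parent1:
    def method(self):
        return "Parent1.method"


class Parent2(Parent1):
    def method(self):
        return "Parent2.method"


class Parent3(Parent1):
    def method(self):
        return "Parent3.method"


class Child(Parent2, Parent3):
    def method(self):
        return "Child.method"


obj = Child()
print(obj.method())   # Child.method

# the order in which Python searches the classes for a method
print([cls.__name__ for cls in Child.__mro__])
# ['Child', 'Parent2', 'Parent3', 'Parent1', 'object']
```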


In this case, the same method name is common to all the classes. When the object of the
Child class is made and the method is called, the ‘method’ function of the Child class is
executed. It thus overrides the versions in both its parent classes and, subsequently, in
its grandparent class.

2. When every class defines the same method
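A sketch of this case under the same assumptions as before (class names from this unit, returned strings invented). Every class defines method, and the version of a particular class is reached by naming that class.

```python
class Parent1:
    def method(self):
        return "Parent1.method"


class Parent2(Parent1):
    def method(self):
        return "Parent2.method"


class Parent3(Parent1):
    def method(self):
        return "Parent3.method"


class Child(Parent2, Parent3):
    def method(self):
        return "Child.method"


obj = Child()
print(obj.method())          # Child.method
print(Parent1.method(obj))   # Parent1.method
print(Parent2.method(obj))   # Parent2.method
print(Parent3.method(obj))   # Parent3.method
```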

In the above example, the method of the Child class is again the one executed. To show the
output of the same method from any other class, we use ClassName.methodName(objectName),
i.e. in this case Parent1.method(obj), and likewise for Parent2 and Parent3.

You can get the same output without calling each of the classes separately via the object.
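One way to do this, sketched under the same assumptions (returned strings invented), is to let the method of Child call the method of each parent class itself, so that a single call on the object reaches all of them.

```python
class Parent1:
    def method(self):
        return "Parent1.method"


class Parent2(Parent1):
    def method(self):
        return "Parent2.method"


class Parent3(Parent1):
    def method(self):
        return "Parent3.method"


class Child(Parent2, Parent3):
    def method(self):
        # collect the result of every version, own one first
        return [
            "Child.method",
            Parent1.method(self),
            Parent2.method(self),
            Parent3.method(self),
        ]


obj = Child()
for line in obj.method():
    print(line)
```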

You can do this by calling the ‘method’ of the Parent1 class as ClassName.method(self) inside
the ‘method’ of the Child class, and likewise for Parent2 and Parent3 individually. A single
call on the Child object then runs all of them.

3. When the method is overridden in one of the classes


The statement means that when a method name is common to two classes of the diamond
structure, the version nearer to the Child class overrides the other.
In the example given below, class Parent1 has a function ‘method’ and so does Parent3.
Similarly, Parent2 and Child have a function ‘m’. When the object is created for class Child
and both functions are called, the overriding versions run: the ‘method’ of class Parent3
and the ‘m’ of class Child are executed.
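A minimal sketch of this case. The class and function names come from the text; the returned strings are assumptions.

```python
class Parent1:
    def method(self):
        return "Parent1.method"


class Parent2(Parent1):
    def m(self):
        return "Parent2.m"


class Parent3(Parent1):
    def method(self):
        # overrides Parent1.method
        return "Parent3.method"


class Child(Parent2, Parent3):
    def m(self):
        # overrides Parent2.m
        return "Child.m"


obj = Child()
print(obj.method())   # Parent3.method
print(obj.m())        # Child.m
```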

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: NumPy Arrays

Unit 7
NumPy – ARRAYS

Overview:
In the real world, when we solve a math problem, our brain evaluates what
type of problem it is and starts analysing and solving it. In Python, it is
similar, except that there is an entire library for solving mathematical
problems, which helps with complex calculations: NumPy.
This library has an official website, www.numpy.org. It is an open-source
product, and on the website you can get detailed information about everything
related to NumPy.
Unit Outcomes:
1. Modules: Meaning, creating modules, importing modules.
2. Packages in Python.
3. Libraries in Python: The Python Standard Library.
4. NumPy:
i. Introduction
ii. Installing NumPy
iii. Importing NumPy.
iv. NumPy Arrays:
a. Meaning
b. Creating arrays
c. Single dimension, multidimensional arrays.
d. len() function, arange() function.
e. dimension of an array.
f. Shape and size of an array.
g. Creating arrays using tuples and lists.
h. Indexing of an array.
i. Slicing of an array.
j. Merging and Splitting an array.
k. Reshaping arrays.

What is a module?
Python has a huge collection of modules. A module is a file that has Python
statements and definitions. It can include runnable code, classes, variables,
and functions. When similar pieces of code are grouped together, they form a
module. This not only makes the code logically organised but also easier to
understand and use. Modules break the program into small manageable parts and
also provide reusability.
We can either write our own modules (i.e. user-defined) or use predefined
modules. The predefined modules are easy to use: we just need to import
them.
Creating a module in Python:
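A minimal sketch of the module described here: a file saved as add.py that contains one function named sum. The parameter names are assumptions.

```python
# Contents of the file add.py
def sum(a, b):
    """Return the addition of two numbers.

    The name follows the text; inside this module it hides Python's
    built-in sum() function.
    """
    return a + b
```

From another file in the same folder, the function would then be used by writing import add followed by add.sum(2, 3).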


Just as we would define a function in any Python program, we write the
function and save the file. Here, we saved it as add.py.
Now sum is a function that accepts two numbers and returns their sum. It
is in a module named add.
Importing modules:
To import a module, all one needs to do is use the import keyword along with
the name of the module. After that, a function of the module is called as
moduleName.functionName.
Syntax to import:

import moduleName
moduleName.functionName(arguments)
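As a short illustration of this syntax, the standard library module math is used here; it is not part of the original example.

```python
import math

print(math.sqrt(16))       # 4.0
print(math.factorial(5))   # 120
```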

Packages in Python:
Packages in Python are basically directories containing Python files and a file
named __init__.py. A directory on the Python path that contains a file named
__init__.py is treated as a package. A bunch of modules can be part of one
package.
Libraries in Python:
A library is a collection of modules. The Python Standard Library is extensive
and offers a plethora of facilities. Its built-in modules provide access to
system functionality such as file I/O that would otherwise be inaccessible to
Python programmers. It also contains modules that provide standardised
solutions for problems that occur in everyday programming.
Amongst the most important and widely used third-party libraries, which are
installed separately from the standard library, are NumPy, Matplotlib, Pandas,
SciPy, PyTorch, etc.
Libraries also reduce the size of the code, and increase reusability.
NumPy:
NumPy stands for Numerical Python. As the name suggests, it is used for
scientific computation or even basic mathematical problems. It is an
open-source package. It is not part of the Python Standard Library and is
installed separately. It is one of the most widely used open-source packages.
Difficult mathematical problems like linear algebra, discrete mathematics,
etc. can be done using NumPy in Python.
It supports matrices that are large in size and other multidimensional data.
It has inbuilt mathematical functions that help in calculations. Generic data
can be stored as multi-dimensional data.
NumPy arrays are an easy substitute for lists of numbers because they are
convenient, fast and use less memory.
Installing NumPy:
For the Anaconda distribution, NumPy comes pre-installed. One does not have
to go out of their way to install it.
If you use a Python version from python.org, you can use pip to install NumPy
(or conda, if you work in a conda environment).

You must have python installed on your computer and you must have access to
the command prompt.
Know what version of python you’re using.
Bring up a command prompt window and type

$ pip install numpy

This will install NumPy on your current working Python environment.


To use NumPy in a program, you have to import the module, the syntax of
which is:

import numpy as np

NumPy Arrays:
NumPy arrays are values in the form of a grid. All values are of the same type
and are indexed by a tuple of non-negative integers.
The number of dimensions of an array is called the rank of the array.
The tuple of integers that gives the size of the array along each dimension is
called the shape of the array.
How are lists in Python different from NumPy arrays?
NumPy arrays are faster and more efficient than lists, and NumPy provides many
more ways of creating arrays. All the elements in a NumPy array should have
the same data type, unlike lists, where homogeneity isn’t necessary: the
elements of a list can be of different data types. NumPy is used when
mathematical operations are to be performed on the data.
NumPy arrays take up less memory than lists and are convenient to use, and
because they are compact they are also faster. Each array also records the
data type of its elements, which optimises the storage further.
Creating Arrays:
To create an array, we first import the NumPy module and give it an alias.
Conventionally, np is used as the alias. Once you are done importing, you
write variableName=np.array([data]), as shown below.

Similar to one dimensional arrays, you can also make multi-dimensional arrays.
Simply enclose each row in its own square brackets inside the outer brackets.
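A minimal sketch of both kinds of array. The variable name num is the one used in this unit; the name a and all the values are assumptions.

```python
import numpy as np

a = np.array([1, 2, 3, 4])        # one dimensional array

num = np.array([[1, 2, 3],
                [4, 5, 6]])       # two dimensional array: 2 rows, 3 columns

print(a)
print(num)
print(num.shape)                  # (2, 3)
```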
Indexing of elements starts from 0.
n-dimensional arrays are also called ‘ndarray’. In simple words, it means an
array with any number of dimensions. The ndarray class is used to represent
vectors and matrices.
A vector is a single dimensional array. A matrix is a two dimensional array.
Any array with two or more dimensions is called a multidimensional array. Here
we will be working with both single dimension and multidimensional arrays.
As mentioned before, the shape of an array is a tuple of non-negative
integers specifying the size of each dimension. Dimensions are called axes.
For instance, in the example given above, variable num is a two dimensional
array with the first axis having length 2 and the second axis having length 3.
This translates to the matrix having 2 rows and 3 columns.
When the rows entered do not all have the same length, NumPy cannot build a
regular two dimensional array. The rows are then treated as separate lists,
and recent versions of NumPy raise an error unless dtype=object is given.
For example:
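A short sketch with rows of unequal length. The values are assumptions, and dtype=object is passed explicitly so that the rows are stored as separate list objects.

```python
import numpy as np

ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)

print(ragged.shape)   # (2,)  -> a one dimensional array holding two lists
print(ragged[0])      # [1, 2, 3]
print(ragged[1])      # [4, 5]
```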

Finding the length of an array:

Our usual len() function returns the size of the first dimension of the array.
For a one dimensional array this is the number of elements; for a two
dimensional array it is the number of rows.
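A short sketch of len() on a one dimensional and a two dimensional array; the variable names and values are assumptions.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

print(len(a))   # 4, the number of elements
print(len(b))   # 2, the number of rows
```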

However, to find the number of dimensions, the length of each dimension, or
the total number of elements, we use other NumPy attributes.
The np.arange() function is a built-in function of NumPy that returns an array
of the first n integers, where n is the number given in the parentheses. The
numbers start from 0.
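A short sketch of np.arange(); the variable name first_five is an assumption.

```python
import numpy as np

first_five = np.arange(5)
print(first_five)   # [0 1 2 3 4]
```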

For the number of dimensions, the ndim attribute of numpy.ndarray is used,
written as variableName.ndim.
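A short sketch of ndim on a one dimensional and a two dimensional array; the variable names and values are assumptions.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

print(a.ndim)   # 1
print(b.ndim)   # 2
```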

The .ndim attribute returns the number of dimensions of the array, in this
case 1 and 2 respectively.
The number of dimensions is the number of axes of the array, not the number of
elements. How is it different from the len() function? ndim counts the axes,
while len() gives the length of the first axis.
For the shape of the array:
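A short sketch of the shape attribute; the variable names a and b follow the explanation, and the values are assumptions.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

print(a.shape)      # (4,)
print(b.shape)      # (2, 4)
print(b.shape[0])   # 2, the number of rows
print(b.shape[1])   # 4, the number of columns
```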

In the above example, the shape can be explained as the rows and columns of
the matrix. For instance, for variable a, which has only one row with 4
elements, the output shows (4,), but for a 2x4 matrix it shows that the shape
is equal to (2, 4), for two rows and 4 columns.
If you want to get the number of rows or columns, you can get each element of
the tuple. variableName.shape[0] gives the number of rows of a matrix.
variableName.shape[1] gives the number of columns of a matrix.
For the type of the object, type(variableName) is used; for any array it
returns numpy.ndarray. The data type of the elements is given by
variableName.dtype.
Size of an array:

variableName.size gives the total number of elements in an array. For
instance, in a 2x4 matrix, there are 8 elements, thus, the size is 8. In a 3x3
matrix, the size will be 9, etc.
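A short sketch of the size attribute on a 2x4 and a 3x3 array; the variable names and values are assumptions.

```python
import numpy as np

b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])     # 2x4
c = np.zeros((3, 3))             # 3x3 array filled with zeros

print(b.size)   # 8
print(c.size)   # 9
```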
Creating Arrays using lists and tuples:
Let’s say you have a tuple or a list and you want to convert it into an array.
The np.asarray() function of NumPy allows you to do so.
The input can be a list, a tuple, a list of tuples, a tuple of lists, a tuple
of tuples, or an ndarray.
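A short sketch of np.asarray() with a list, a tuple and a list of tuples; the variable names and values are assumptions.

```python
import numpy as np

from_list = np.asarray([1, 2, 3])
from_tuple = np.asarray((4, 5, 6))
from_list_of_tuples = np.asarray([(1, 2), (3, 4)])

print(from_list)                   # [1 2 3]
print(from_tuple)                  # [4 5 6]
print(from_list_of_tuples.shape)   # (2, 2)
```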


Indexing of Arrays:
Like lists and tuples, arrays can also be indexed. To access array elements,
simply type arrayName[indexNumber]. Indexing in arrays begins from 0.
In the example given below, a[0] gives the value of the element at index 0 in
a. Similarly, b[1,3] gives the element at index 3 of the row at index 1, which
translates to position a24 (second row, fourth column) of the 2x4 matrix b.
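A short sketch of indexing, including a negative index. The variable names a and b follow the text; the values are assumptions.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

print(a[0])       # 1, first element of a
print(b[1, 3])    # 8, second row, fourth column
print(b[0, -1])   # 4, last element of the first row
```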

Negative indexing can also be done while using arrays. In the example, b[0,-1]
gives the value of the last element of the row at index 0. It means, it gives
the value of a14 in the 2x4 matrix b.
Slicing of Arrays:
Slicing means selecting the elements from a certain start index up to, but not
including, a fixed end index.
As with an index, to pass a slice we write variableName[start:end].
We can also get alternate elements, or every 3rd element, or every 5th
element, using step. Write variableName[start:end:step] to jump steps or use
only certain elements from an array.
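A short sketch of slicing in one and two dimensions. The variable name b and the slice b[0:, 1:3] follow the explanation; the other names and the values are assumptions.

```python
import numpy as np

a = np.arange(10)                 # [0 1 2 3 4 5 6 7 8 9]
b = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

print(a[2:6])        # [2 3 4 5]
print(a[0:10:3])     # [0 3 6 9], every 3rd element
print(a[3:])         # [3 4 5 6 7 8 9]
print(b[0:, 1:3])    # columns 1 and 2 of every row: [[2 3] [6 7]]
```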


In a one dimensional array, all you need to do is give one slice. In a
two dimensional array, you give a slice for the rows and a slice for the
columns, separated by a comma. In the above example, we have written b[0: ,
1:3], which means that we are asking Python for the columns at index 1 and 2
(the end index 3 is excluded) in all rows, from row 0 until the end.
When we write 3: it means that everything is selected beginning from index 3,
i.e. from the fourth element.
Merging Arrays:
We use the concatenate function to merge arrays.
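A short sketch of np.concatenate() with two two-dimensional arrays; the variable names and values are assumptions.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

rows = np.concatenate((x, y))           # joined along axis 0 (default)
cols = np.concatenate((x, y), axis=1)   # joined along axis 1

print(rows)   # [[1 2] [3 4] [5 6] [7 8]]
print(cols)   # [[1 2 5 6] [3 4 7 8]]
```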

In the above example, there are two arrays, both having dimension 2. They are
concatenated using the np.concatenate() function. Note that both arrays need to
have the same number of dimensions, and the same size along every axis except
the one being joined, to be concatenated.
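A short sketch of merging two 2-d arrays, assuming two hypothetical 2x2 arrays x and y. By default np.concatenate() joins along axis 0; passing axis=1 joins them side by side.

```python
import numpy as np

# Hypothetical 2x2 arrays
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

stacked_rows = np.concatenate((x, y))          # axis=0 by default, shape (4, 2)
stacked_cols = np.concatenate((x, y), axis=1)  # side by side, shape (2, 4)

print(stacked_rows)
print(stacked_cols)
```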
Splitting of arrays:
Let's say you have a 2d array and you want to split it in a number of ways. We
use the np.array_split() function to split an array.
The return value of the array_split() function is a list containing one array
for each split of the array declared previously.
Input:

Output:

In the above example a single dimension array is split into 3 parts. The result is
stored in a variable called split_array, which is a list of all the split arrays
from variable a. Likewise, variable b, which is a two-dimensional array, is
split into 3 arrays.
To display one element of the new array we use indexes.
variableName[indexNumber] gives the value of the element.
Then we split the same 2d array column-wise. To split it column-wise, one has
to use axis=1, which is the axis that runs across the columns. The array has
been split into 2 arrays, column-wise. And to display the 1st element of the new
split array we used variableName[indexNumber]. The output will be displayed in
a vertical order.
Instead of using array_split() you can also use np.hsplit() to split column-wise
(axis=1) and np.vsplit() to split row-wise (axis=0). They give the same output
as array_split() whenever the array can be divided into equal parts; unlike
array_split(), they raise an error when it cannot.
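The splitting steps above can be sketched as follows. The arrays a and b are hypothetical sample data; split_array is the variable name used in the text.

```python
import numpy as np

# Hypothetical sample data
a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.arange(12).reshape(4, 3)   # 4 rows, 3 columns

split_array = np.array_split(a, 3)        # list of 3 arrays (sizes 3, 2, 2)
row_parts = np.array_split(b, 3)          # split along axis 0 (rows)
col_parts = np.array_split(b, 2, axis=1)  # split along axis 1 (columns)

print(split_array[0])   # one piece is accessed by its index
print(col_parts[1])     # the last column, shown vertically

# hsplit/vsplit need an equal division when given a number of sections
three_cols = np.hsplit(b, 3)   # three (4, 1) arrays
two_halves = np.vsplit(b, 2)   # two (2, 3) arrays
```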

Reshaping an array:
Let's say you have a range of numbers that begins at 20 and runs until 40, and
you want to arrange it as a 4x5 matrix. Doing this manually would be a big task.
For such cases we use the reshape function of an array.
The shape of an array is the number of elements in each dimension of an array.
For example:

Input:

Output:


Here, we created a variable to store numbers between 20 and 40. The list is
converted into an array. By default, it is a one dimensional array. This array is
then reshaped using NumPy functions. As a result, it is reshaped into a 4x5
matrix.
Similarly, for numbers between 105 and 150, a variable containing the range is
stored. It is then converted into a one dimensional array. This array is then
reshaped into a 9x5 and a 3x15 matrix.
Making a 3x15 matrix manually would be tedious; this is where the .reshape()
function comes in handy. The new shape must hold exactly the same number of
elements as the original array.
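A minimal sketch of the reshaping described above, assuming the ranges are built with Python's range() so that 20 to 40 gives 20 numbers and 105 to 150 gives 45 numbers. The variable names are illustrative.

```python
import numpy as np

numbers = list(range(20, 40))     # 20, 21, ..., 39 (20 numbers)
arr = np.array(numbers)           # one dimensional by default
matrix = arr.reshape(4, 5)        # 4 rows x 5 columns

big = np.array(range(105, 150))   # 105, ..., 149 (45 numbers)
nine_by_five = big.reshape(9, 5)
three_by_fifteen = big.reshape(3, 15)

print(arr.shape, matrix.shape)
print(matrix)
```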

From 1 Dimension to 3 Dimensions and Unknown Dimension:
Input:

Output:


In the first example, we reshaped the array from a 1 dimensional array to 3
arrays containing 3 arrays, each with 5 elements.
This conversion is called one dimensional to multi dimensional reshaping.

Then in the second example, when we do not have an exact number for one of
the dimensions in the reshape method, we pass -1 as the value and NumPy
calculates this number by itself. Only one dimension can be given as -1.
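Both cases can be sketched with a hypothetical array of 45 elements, which fits the 3 x 3 x 5 shape mentioned above. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(45)              # hypothetical 1-D array with 45 elements

cube = arr.reshape(3, 3, 5)      # 3 arrays, each holding 3 arrays of 5 elements
auto = arr.reshape(5, -1)        # NumPy works out the missing size: 45 / 5 = 9
auto3d = arr.reshape(3, -1, 5)   # missing size is 45 / (3 * 5) = 3

print(cube.shape, auto.shape, auto3d.shape)
```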

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: NumPy - Functions

Unit 8
NumPy- FUNCTIONS

Overview:
While covering the basics was easy, NumPy also has a lot of other functions
that help us write programs with ease and provide efficiency.

Unit Outcomes:
- Operations on arrays:
1. Using one element of an array at a time.
2. Sorting of arrays.
- Functions in NumPy: Max, min, square, add, subtraction, product,
division.
- Linspace.
- Broadcasting arrays.
Apart from indexing, slicing, splitting and merging, there are a lot of operations
that can be performed on arrays.
You can manipulate and use parts of the array for a specific purpose as well.
This can be done using iteration. The procedure is simple: declare the array,
and then run a for loop to access every element inside the array. In case of 2-d
arrays, the loop goes through the rows one after the other, and for an
n-dimensional array each step of the loop returns an (n-1)-dimensional sub-array.
To return scalars from a 2-d array, you will have to run a for loop within a for loop.


In the above example, the nested loops print every element of the array
individually, whereas using only the first for loop returns one row (sub-array)
at a time.

Similarly, loops and functions can be used to manipulate arrays.
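A small sketch of iterating over a 2-d array, using a hypothetical 2x3 array named grid. A single loop yields rows; a nested loop yields scalars.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])   # hypothetical 2-d array

# One loop: each item is a whole row (a 1-d sub-array)
rows = [row.tolist() for row in grid]

# Nested loops: each item is a single scalar
scalars = []
for row in grid:
    for value in row:
        scalars.append(int(value))

print(rows)
print(scalars)
```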

To Search in an Array:
The where() function in NumPy is used to find the positions at which a condition
holds, for example where a certain value is present. It can be used for
one-dimensional or multi-dimensional arrays:

In the above examples, for a single dimensional array, the output shows the
indexes where the number is present, followed by the data type of those indexes
(dtype=int64, a 64 bit integer).
For a 2-d array, the output contains two arrays: the first holds the row indexes
of the matching elements and the second holds the corresponding column indexes.
The searchsorted() method tells us the index where a new element would be
inserted in the array to maintain the order. It is performed on a sorted
array:
In the example below, for a one-dimensional array, we pass the sorted array and
a value, and the searchsorted() method tells us at what position/index the
given element must be inserted. Similarly, for more than one element, the numbers
are written within square brackets. It gives the index positions where the
values can be inserted respectively.
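The two search functions can be sketched as follows. All arrays and the searched values are hypothetical sample data.

```python
import numpy as np

# np.where on a 1-d array: a tuple holding one array of indexes
a = np.array([4, 7, 4, 9, 4])
idx = np.where(a == 4)

# np.where on a 2-d array: one array of row indexes, one of column indexes
m = np.array([[1, 4, 3],
              [4, 5, 4]])
rows, cols = np.where(m == 4)

# np.searchsorted expects an array that is already sorted
sorted_arr = np.array([10, 20, 30, 40])
pos = np.searchsorted(sorted_arr, 25)            # single value
many = np.searchsorted(sorted_arr, [5, 30, 45])  # several values at once

print(idx, rows, cols, pos, many)
```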

Sorting Arrays:
Arrays can be sorted numerically and alphabetically. In Python, it is very easy
to sort an array; one can do it by using the np.sort() function.
The np.sort() function takes four parameters:
1. a: the variable name/the array that has to be sorted
2. axis: the axis along which the array is sorted. Axes are the
directions along columns and rows.
axis=0 sorts down each column, and axis=1 sorts along each row.
By default, the axis is set to -1. If you do not enter the axis parameter, it
will sort the data on the last axis.
3. order: for arrays with named fields, the fields that have to be compared.
4. kind: the sorting algorithm to use, such as 'quicksort' (the default),
'mergesort', 'heapsort' or 'stable'.

Example 1: Sorting a one-dimensional array


In the above example, we simply use one variable to store the 1 dimensional
array and another variable to store the sorted array.

Example 2: Sorting 2D arrays (Column-wise or row-wise)


In the above example of a 2-dimensional array, the axis is set at 0 which means
it sorts the array column-wise in a low to high order. In a 2d array, the axis=0
direction runs vertically down the rows, and axis=1 runs
horizontally across the columns. Thus, the given array is sorted downwards, i.e.
the smallest element of each column is in the first row and every column is in
ascending order.
When the axis is set to 1, the array is sorted row-wise, that is why every row is
displayed in an ascending order.

Example 3: sorting an array in reverse order:


To return an array in descending order, one just has to use the minus sign along
with the np.sort() function, i.e. -np.sort(-variableName). This works for numeric arrays.
Similar is the case for 2d arrays:
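The three sorting examples can be sketched together as follows, including the 2d descending case. The arrays are hypothetical sample data.

```python
import numpy as np

a = np.array([3, 1, 2])               # hypothetical 1-d array
asc = np.sort(a)                      # ascending order
desc = -np.sort(-a)                   # descending order via the minus sign

m = np.array([[9, 2, 7],
              [4, 8, 1]])             # hypothetical 2-d array
by_col = np.sort(m, axis=0)           # each column sorted top to bottom
by_row = np.sort(m, axis=1)           # each row sorted left to right
desc_rows = -np.sort(-m, axis=1)      # each row in descending order

print(asc, desc)
print(by_col)
print(by_row)
print(desc_rows)
```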

NumPy Arithmetic Functions:
NumPy has functions that help in performing basic arithmetic operations. If we
perform operations on two arrays whose shapes are dissimilar and cannot be
broadcast together, we get an error called ValueError.
1. The add function: it gives the same output as the + operator does.
Except, here, we add two arrays together.

2. The subtract function: it gives the output by subtracting the elements of
the second array from the corresponding elements of the first array. The output
obtained by using the - operator is the same.

3. Multiply function: Here, in case of 1 dimensional arrays, the two arrays
are multiplied with each other. For multi-dimensional array, each
element of array1 is multiplied by the corresponding element of array2.


4. The divide function: like multiplication of arrays, division is also done
element by element. The element aij in array1 is divided by the
corresponding element bij in array2.

5. remainder and mod function:
The remainder and mod functions both return the remainder obtained on
dividing one number by the other. In this case, each element in array1 is
divided by the corresponding element in array2 and the remainder is returned.

6. power function: In this case, the first array is treated as the base and
every element in the first array is raised to the corresponding element in
the second array.

7. reciprocal: the reciprocal function returns the reciprocal of all elements
of an array.
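The seven functions listed above can be sketched on small hypothetical arrays. Each call works element by element; float values are used for reciprocal because integer input would give integer results.

```python
import numpy as np

# Hypothetical sample arrays of the same shape
x = np.array([10, 20, 30])
y = np.array([3, 4, 5])

added = np.add(x, y)             # same as x + y
subtracted = np.subtract(x, y)   # same as x - y
multiplied = np.multiply(x, y)   # same as x * y
divided = np.divide(x, y)        # same as x / y
remainders = np.remainder(x, y)  # same as np.mod(x, y) and x % y

base = np.array([2, 3, 4])
exponent = np.array([3, 2, 1])
powers = np.power(base, exponent)   # each base raised to its own exponent

reciprocals = np.reciprocal(np.array([1.0, 2.0, 4.0]))

print(added, subtracted, multiplied, divided)
print(remainders, powers, reciprocals)
```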


Linspace:
If you were asked how many numbers lie between 1.0 and 2.0, you might say
infinite, because the numbers could be anything from 1.001 to 1.999, or values
even closer together on the real number line. This is where linspace is
helpful. In it we define a start point, an end point and the number of samples
we want. It returns an array of these evenly spaced numbers. For instance,
between 1.0 and 2.0 if we mention num=10, it will return 10 numbers from 1.0 to
2.0 at equal intervals. So in easy words, it returns evenly spaced numbers
between a start point and a stop point.
Syntax:

numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)

In the above syntax, the parameters inside the linspace function mean:

Necessary parameters:
1. start: start point
2. stop: end point
Optional parameters:
1. num: number of samples you want between the start point and the
end point. By default num is 50.
2. endpoint: is a Boolean value. If True, the end point is included; if
False, the end point is not included. By default it is set to True.
3. retstep: this is also a Boolean value. If True, the output is returned in
(samples, step) form, which means the output also gives the step (or
the interval).
4. dtype: it is the type of the output array. If not specified, the datatype is inferred.
5. axis: the axis in the result along which the samples are stored. It is relevant only
when start and stop are of an array type.
Example 1: linspace() using necessary parameters:


In the above example, start and stop point are set to 0 and 20. 50 numbers
between 0 and 20 are printed at equal intervals, including 0 and 20.
Example 2: linspace() using optional parameters:

In the above example, the start and end points are 0 and 5 respectively. The
statement means that Python must return 20 numbers from 0 to 5 at
equal intervals where the endpoint must be included and the numbers must be
of float datatype. The retstep is set to True, thus, it also gives the interval at
which the numbers are placed. Here, consecutive numbers are 0.26315 apart
(5/19).
Example 3: linspace() in an array:


In the above example, parameter axis is used. The two sub-arrays are set as
the start and end point. 4 arrays are printed with 3 elements in each array. For
axis=1, column sequence is used to return elements in the range.
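The linspace examples can be sketched as follows. The first two calls use the values quoted in the text; the array-valued start and stop points in the last two calls are hypothetical.

```python
import numpy as np

# Necessary parameters only: 50 samples, both end points included
default = np.linspace(0, 20)

# Optional parameters: 20 samples, end point included, step returned as well
samples, step = np.linspace(0, 5, num=20, endpoint=True, retstep=True, dtype=float)

# endpoint=False leaves the stop value out
no_end = np.linspace(1.0, 2.0, num=10, endpoint=False)

# Array-valued start/stop (hypothetical values); axis chooses where samples go
by_rows = np.linspace([0, 10], [3, 13], num=4, axis=0)   # shape (4, 2)
by_cols = np.linspace([0, 10], [3, 13], num=4, axis=1)   # shape (2, 4)

print(len(default), step)
print(by_rows)
print(by_cols)
```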

Broadcasting Arrays:
Usually, when we perform arithmetic operations on arrays of different
shapes, we get an error. A way to overcome this problem is by stretching the
smaller array to the size and dimension of the bigger array. This is called
broadcasting. Thus, broadcasting allows us to perform various arithmetic
operations on different array sizes. NumPy does not actually duplicate the
smaller array in memory; it reuses the existing data efficiently
to get the same outcome.
Rules for Broadcasting:
1. If the arrays have different dimensions, the one which has fewer
dimensions is padded with ones on its left side.

2. If the shape of two arrays is unequal in any dimension, an array with
shape set as 1 in that dimension is extended to match the size of the
other shape.
3. If in any dimension the sizes are unequal and neither is equal to one, an
error is raised.
4. Arrays can be broadcasted together if and only if they are compatible
with all dimensions.
Algorithm:
def broadcast_shape(shape1, shape2):
    # shape1, shape2: the shapes of array1 and array2, given as tuples
    p = max(len(shape1), len(shape2))
    # left-pad the shorter shape with 1s until both have p dimensions
    shape1 = (1,) * (p - len(shape1)) + tuple(shape1)
    shape2 = (1,) * (p - len(shape2)) + tuple(shape2)
    result = [0] * p
    # compare the sizes from the last dimension back to the first
    for i in range(p - 1, -1, -1):
        array1_dim_i = shape1[i]
        array2_dim_i = shape2[i]
        if (array1_dim_i != 1) and (array2_dim_i != 1) and (array1_dim_i != array2_dim_i):
            raise ValueError("cannot broadcast")
        # the broadcast size is the larger of the two sizes
        result[i] = max(array1_dim_i, array2_dim_i)
    return tuple(result)

Example 1:

In the above example, variable s is not of the size of variable new. Here, s is
stretched to the size of new and is broadcasted. Then, every element is
multiplied by the scalar 2.5. The product of the array and the scalar is
printed.

Example 2:

For the above example, in a single dimension array, broadcasting happened
because of mismatch in array dimension.
Example 3:

Let’s say there is a room with dimensions 8x7x6. The room is being renovated
and reconstructed. The length of the room has to be increased to twice its size
and the breadth to thrice its size. Variable dimension stores the current room
dimensions and variable increase stores the factors by which it needs to be multiplied.

Since dimension is a 1x3 array and increase is a 1x2 array, we first
convert/reshape dimension into a 3x1 vector. It is then broadcasted against
increase to give an output of size 3x2 which is the product of dimension and
increase.
Whenever we have matrices of unequal size, we use broadcasting to reshape
and fit all elements of a matrix to the other matrix so we can perform
mathematical operations.
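A sketch of Example 1 and Example 3 under stated assumptions: the names new, s, dimension and increase come from the text, the value 2.5 and the room sizes 8, 7, 6 are quoted there, and the contents of new and the factors 2 and 3 in increase are inferred or hypothetical.

```python
import numpy as np

# Example 1: a scalar is stretched to the shape of the array
new = np.array([1.0, 2.0, 3.0])   # hypothetical contents
s = 2.5
scaled = new * s

# Example 3: shapes (3,) and (2,) do not broadcast directly
dimension = np.array([8, 7, 6])
increase = np.array([2, 3])
try:
    dimension * increase
    failed = False
except ValueError:
    failed = True

# Reshaping dimension to 3x1 lets it broadcast against increase
product = dimension.reshape(3, 1) * increase   # result has shape (3, 2)

print(scaled)
print(failed)
print(product)
```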

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Pandas – Series and Dictionaries

Unit 9:
Pandas- Series and Dictionaries

Overview:
In today’s day and age, data is gold. We analyse this data, read it
better via visualisations and draw conclusions from it. As people of data
science, we know datasets are huge. To analyse and visualise these
datasets in Python, we need something called Pandas. Like NumPy, Pandas is a
module in Python.

Unit Outcomes:
- Introduction to Pandas
- Series:
a. Creating an empty series
b. Creating a series from an array
c. Creating a series from a list.
d. Converting a dictionary to a series.
e. Accessing elements from a series.

Pandas, like NumPy, is a module in Python. We use it to import datasets on
which we want to perform operations. It is an open-source Python package
that is used for data analysis and machine learning tasks. It is a very popular
data wrangling package.
Instead of manually performing tasks on huge sets of data, pandas does it for
us. Data cleaning, normalisation, merging, splitting, joining, statistical analysis
and data inspection are some of the many functions of pandas.

Calling Pandas:

import pandas as pd
Using the alias pd is optional. Before we use the functions of pandas, we
need to import the module.

Features of Pandas:
- Data handling: Pandas is a fast and efficient library that helps us explore
data. Series and Data Frames help us in handling the data efficiently
while also helping us manipulate it.
- Handles missing data: a dataset is quite often very difficult to read
and interpret, especially crude data. Sometimes data is also missing. The
Pandas library comes in handy here because it has features that
handle the missing values.
- Cleaning of the data: crude data is data on the 1st stage, very raw and
messy. Pandas gives importance to cleaning of the data, and it has features
to do so. It not only makes the data clean but makes it tidy too. For
better and more accurate results, the data needs to be better.
- Input – Output tools: Pandas has built-in tools that help in reading and
writing the data. The data may have to be read from files,
databases, etc. All of this can be done easily via the inbuilt tools.
- Time series: Pandas provides tools like moving window statistics and
frequency conversion.
- Mathematical operations: Pandas helps in performing mathematical
operations on the data as a whole or a part of the data.
- Maintaining uniqueness of the data: The pandas library helps in
reducing redundancy by considering only unique values. It also masks
data that is not needed in the analysis.
- Alignment, indexing, grouping: as the header speaks for itself, Pandas
help

Advantages of using Pandas:


- It has an extensive set of features: a large number of commands are
available that help in analysing the dataset in a better manner. It also
has functions that help segregate the data so that it can be read with
ease.

- It makes data flexible: since data can be customised and segregated, the
data becomes more flexible to work with.
- Handles large data efficiently: datasets can have 1000s of rows or more,
meaning, they can be huge. All of this data can be handled by Pandas in
Python efficiently, while also giving us accurate results for the data.
- Less writing: since most common operations are available as built-in
commands, much less code has to be written by hand.

Series:
A Series is like a column in a table. It is a one-dimensional array that holds
values of any data type. The label of each value of a series is called its index.
For example, in Excel, for the details of a class of students, you write say, Name
and Email-id as heads of two columns. These two columns can be
called two series in Python.
You can name your labels, but if nothing is specified, the series is labelled by
its index number.

Syntax of a series:

import pandas as pd
objectName=pd.Series(["datapoints"]) # the data points can be of any datatype
print(objectName)

Example:

In the above example, a series name has been created containing names of
people and it is labelled by its index number from 0 to 3.

Creating an empty Series:

An empty series is a series with no values.

Syntax:
import pandas as pd
objectName=pd.Series()

Example:


Creating a series from an array:


You can also convert your array into a series.

Syntax:

import pandas as pd
import numpy as np
objectName=np.array(["array elements"])
objectName1=pd.Series(objectName)

Example:

In the above example, two arrays are considered. The first one has numeric
values and the other one has string values. Both are converted into a series
simply by using the pd.Series(objectName) syntax. One can also see the
difference between the arrays and the series.

Creating a series from Lists:


You can convert a list into a series as well.

import pandas as pd
list1=["listValues"]
ser=pd.Series(list1)

Example:

In the above example we write two lists, one with numerical value, and
another one with string values. We can easily convert them into a series, just
like we did in the case of arrays.

Converting dictionaries to series:
Like lists and arrays, dictionaries can also be converted into a series.

In case of dictionaries, the elements are of the form key:value. Here, the key
becomes the label by which the value can be recognized, instead of its indexes.
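The ways of creating a series described in this unit can be sketched together. All values and names below are hypothetical sample data; the empty series is given an explicit dtype, an assumption made here to keep the call unambiguous across pandas versions.

```python
import numpy as np
import pandas as pd

empty = pd.Series(dtype="float64")               # a series with no values

from_array = pd.Series(np.array([10, 20, 30]))   # labelled 0, 1, 2 by default
from_list = pd.Series(["Asha", "Ravi", "Meera"])
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})  # keys become the labels

print(from_array)
print(from_list)
print(from_dict)
```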

Accessing elements from a series:
To access elements of a series, we just enter the index number of the element
in square brackets.
For example, to get the value at the second index of the series we just enter
objectName[position/index].

Example:


In the above example, we find the value at the 2nd index for list1 and the value
at the 4th index for the dictionary named dict1.

Example2:

In the above example, we define the index labels for series y and the data for
both series. To show the index of the series ‘x’, we used the
.index attribute of a pandas series, while y.values displays the values stored
at the different indexes, which are randomly numbered.
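A sketch of accessing series elements, reusing the names list1, dict1 and y from the text with hypothetical contents. For the dictionary-based series, .iloc is used here for position-based access because its labels are strings.

```python
import pandas as pd

list1 = [5, 10, 15, 20, 25]                 # hypothetical values
ser = pd.Series(list1)
third = ser[2]                              # value stored under index 2

dict1 = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50}   # hypothetical values
ser2 = pd.Series(dict1)
by_label = ser2["e"]                        # access by key/label
by_position = ser2.iloc[4]                  # access by position (0-based)

# A series with custom, non-sequential index labels
y = pd.Series([1.5, 2.5, 3.5], index=[7, 3, 9])
labels = list(y.index)
values = y.values.tolist()

print(third, by_label, by_position)
print(labels, values)
```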

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Pandas – Data Frames

Unit 10:
Pandas- Data Frames

Overview:
As data scientists, we deal with loads of data. This data is often in the
form of rows and columns. There can be over 100s and 1000s of rows
which have 100s of attributes in the form of columns. To analyse,
evaluate and find different types of correlations and connections
between two columns or rows in python, data frames are used.

Unit Outcomes:
- Data frames
- Empty data frames
- Creating data frames from dictionaries
- Creating data frames from lists
- zip() function
- Creating data frames from arrays
- loc function
- iloc function
- The describe function
- info function

Data Frames:
Data frames in Python are two-dimensional data structures. They are
mutable. They are said to be two-dimensional because they are divided
into labelled axes, which are called rows and columns. One can also
perform arithmetic operations on these rows and columns. Data frames
are like an Excel sheet. They are used for storing tables. All files with a
.xlsx, .csv, or similar extension can be evaluated in Python by reading
them into data frames.
Creating a data frame is easy, and it uses the following syntax:

pandas.DataFrame(data/filename, index, columns, dtype, copy)

In the above code:


pandas.DataFrame() – it is used to declare that the data we are
going to enter has to be converted into a data frame, which is part of the
pandas library. Hence, we first call the library followed by the function
name.
data/filename – the data we enter in the form of rows and columns is stored
in a variable, or else we use a .csv file or .xlsx file that is read by Python
beforehand (for example with pd.read_csv() or pd.read_excel()).
index – the row labels to be used for the resulting frame. By default it is
set to RangeIndex.
columns – the column labels to be used for the resulting frame. By default it
is set to RangeIndex.
dtype – the datatype to force on the columns. If not given, it is inferred.
copy – whether to copy the input data. By default, it is set to False in older
pandas versions; recent versions default to None and decide based on the
input type.

Creating an Empty data frame:


An empty data frame is the most basic of its type. To create one, the
syntax is:

dataFrameName= pd.DataFrame()

Example:

In the above example, we import the pandas library and create a variable
that will store the data frame. In case of an empty data frame, one simply
has to write pd.DataFrame with the opening and closing parenthesis.
Then, the print command is written to print the entered data frame.
The output returned says that it is an empty data frame. It has no
columns and no indexes, which means no columns and rows,
respectively.

Creating a data frame:
Let's say we use a dictionary to create a data frame.

Here, after we make a dictionary, we convert it into a dataframe using
the pd.DataFrame(dictionaryName) function. We then print the data
frame. On comparing the two, the difference can be seen. Name, Age, and Core
Study Subject have become columns and the data at each index forms a
row. The number of rows is the same as the number of values stored under each
key of the dictionary.
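A minimal sketch of an empty data frame and of a data frame built from a dictionary. The column names Name, Age and Core Study Subject come from the text; the values and the variable name students are hypothetical.

```python
import pandas as pd

empty = pd.DataFrame()     # no columns and no rows

# Hypothetical data: each key becomes a column, each list holds its values
students = {
    "Name": ["Asha", "Ravi", "Meera"],
    "Age": [21, 22, 23],
    "Core Study Subject": ["Maths", "Physics", "Biology"],
}
df = pd.DataFrame(students)

print(empty)
print(df)
```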

Creating data frames from lists:


The simplest way to make a data frame from a list is by creating a list and
then converting it into a data frame.
Example:


In the above example, we create a list with seven names. We call the list
‘names’. We convert this list into a data frame with the column head as
‘Names’. It is stored in the variable ‘dataframe’. dataframe is then
printed.
To make a data frame using two lists:
zip() function:
The zip() function can be used when there are two or more iterables whose
elements are to be clubbed into tuples. Tuples, lists, sets or dictionaries can
be passed to the zip() function. Each tuple in the output of the zip() function
contains one element from each iterable that is passed, taken from the same
position. The function continues to run until the shortest iterable is
exhausted, i.e. when no more complete tuples can be made.

Syntax:

objectName= zip(iterable1, iterable2, …)

Example:
Creating a data frame from two or more lists using the zip() function:


In the above example, we first declared three lists. These lists are stored
in objects ‘names’, ‘age’ and ‘subject’. One must note that zip() returns
an iterator of tuples rather than a list, which is why it is converted to a
list. The object ‘dataframe’ stores the zipped list as a data frame with
column names as given above.
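A sketch of the zip() approach with three lists. The list names follow the text; the values and the column headings are assumptions:

```python
import pandas as pd

names = ["Aaron", "Brandon", "Carl"]
age = [23, 31, 27]
subject = ["Maths", "English", "Science"]

# zip() yields one tuple per position, e.g. ("Aaron", 23, "Maths").
zipped = list(zip(names, age, subject))

dataframe = pd.DataFrame(zipped, columns=["Name", "Age", "Subject"])
print(dataframe)
```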

Creating dataframes using Arrays:


Dataframes can also be made using arrays.
Example:

In the above example, we first import the pandas and numpy libraries.
The object ‘names’ contains an array of names of people, and
‘marksEng’ contains an array of the marks they received in the English
subject. To make them into a data frame, we simply write pd.DataFrame,
zip names and marksEng together and give them column headings. The
data frame is then printed.
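A sketch of the array-based version. The names ‘names’ and ‘marksEng’ follow the text; the values and column headings are placeholders:

```python
import numpy as np
import pandas as pd

names = np.array(["Aaron", "Brandon", "Carl", "Dexter"])
marksEng = np.array([78, 85, 62, 91])

# Zip the two arrays row by row and name the resulting columns.
dataframe = pd.DataFrame(list(zip(names, marksEng)), columns=["Name", "English"])
print(dataframe)
```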

loc function:
The pandas library has a loc indexer that helps us retrieve data from a
series or data frame. It is used to access groups of rows and columns by
their labels, and it is written with square brackets rather than parentheses.
The permitted index labels can be:
- a single label
- a list or an array of labels
- a slice object with labels
- a Boolean array of the same length as the axis that is sliced
- a function with one argument that can be called and returns a valid
output for indexing.
Syntax:

pandas.DataFrame.loc[index label]

To use .loc, let us create a data frame first.
The data frame created here contains 7 names with their respective age,
favourite color and favourite food. 3 arrays have been zipped together to
form one dataframe. The names have been assigned as labels for all rows.
Dataframe:
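A sketch of how such a frame could be built. The labels 'Brandon', 'Carl', 'Dexter', 'Franco' and 'Greg' appear in the text; 'Aaron', 'Ethan', the frame name people and every value are placeholders:

```python
import numpy as np
import pandas as pd

names = ["Aaron", "Brandon", "Carl", "Dexter", "Ethan", "Franco", "Greg"]
age = np.array([23, 31, 27, 35, 29, 41, 38])
color = np.array(["Red", "Blue", "Green", "Black", "White", "Yellow", "Pink"])
food = np.array(["Pizza", "Pasta", "Sushi", "Tacos", "Curry", "Salad", "Burger"])

people = pd.DataFrame(
    list(zip(age, color, food)),      # three arrays zipped row by row
    columns=["Age", "Color", "Food"],
    index=names,                      # the names become the row labels used by .loc
)
print(people)
```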


We use the .loc function to extract parts of the data frame.


1. To access a single label of the data frame:
It permits one to extract all values of the mentioned row.
Syntax:

DataFrame.loc[‘labelName’]

Example:
In the below example, we extract the Age, Color and Food preference of
the row labelled ‘Carl’. A single label is used to retrieve information from
the data frame.
The same can be done with any of the other labels.
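A small sketch of single-label access, on a cut-down frame with placeholder values:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [31, 27, 35],
     "Color": ["Blue", "Green", "Black"],
     "Food": ["Pasta", "Sushi", "Tacos"]},
    index=["Brandon", "Carl", "Dexter"],
)

# A single row label returns that row as a Series of Age, Color and Food.
carl = people.loc["Carl"]
print(carl)
```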

2. To access multiple labels of the data frame:
It permits one to extract all values of the mentioned rows.
Syntax:

DataFrame.loc[ [ ‘labelName1’, ‘labelName2’ ] ]

Example:
In the below example, we extract the Age, Color and Food preference of
the rows labelled ‘Brandon’ and ‘Greg’. Both labels are used to retrieve
their respective information from the data frame.
The same can be done with any of the other labels.
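A sketch of multi-label access; the frame and its values are placeholders:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [31, 27, 38],
     "Color": ["Blue", "Green", "Pink"],
     "Food": ["Pasta", "Sushi", "Burger"]},
    index=["Brandon", "Carl", "Greg"],
)

# A list of labels (note the double brackets) returns a smaller data frame.
selected = people.loc[["Brandon", "Greg"]]
print(selected)
```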


3. To access a single label of the row and column from the data
frame:
It permits one to extract the column value of the mentioned row.
Syntax:

DataFrame.loc[‘labelNameRow’ , ‘labelNameColumn’]

Example:
In the below example, we extract the Color preference of the label named
‘Dexter’. A row label along with its respective column label is used to
retrieve information from the data frame.
It also returns the column header that is mentioned, along with its
datatype.
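A sketch of row-and-column access with placeholder values. Passing a plain row label returns just the cell value, while wrapping the row label in a list returns a Series that also carries the column header and datatype, which seems to be the form of output described above:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [27, 35], "Color": ["Green", "Black"], "Food": ["Sushi", "Tacos"]},
    index=["Carl", "Dexter"],
)

# Row label and column label together pick out a single cell.
dexter_color = people.loc["Dexter", "Color"]

# With the row label inside a list, the result is a Series named 'Color'.
dexter_color_series = people.loc[["Dexter"], "Color"]
print(dexter_color)
print(dexter_color_series)
```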
4. To access a column’s value for multiple rows:
It permits one to extract mentioned column values of their respective
rows.
Syntax:

DataFrame.loc[‘labelNameRow1’: ’labelNameRow2’ , ‘labelNameColumn’]

Example:
In the below example, we extract the Food preference of the rows from
‘Franco’ to ‘Greg’. A slice of row labels along with the column label is
used to retrieve information from the data frame. Unlike ordinary Python
slices, a label slice in loc includes both the start and the end label.
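A sketch of a label slice combined with a column label; the values are placeholders:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [35, 29, 41, 38],
     "Food": ["Tacos", "Curry", "Salad", "Burger"]},
    index=["Dexter", "Ethan", "Franco", "Greg"],
)

# The slice runs from 'Franco' up to and including 'Greg'.
food_slice = people.loc["Franco":"Greg", "Food"]
print(food_slice)
```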

5. To access information from the data frame using Boolean values:


It permits one to extract data via the Boolean values for row labels. ‘True’
meaning the row must be displayed, ‘False’ meaning the row must not be
displayed.
Syntax:

DataFrame.loc[ [ True/False, True/False ] ]

Example:
In the below example, we write True and False for alternate row
labels. It thus displays only the rows that have been assigned True.
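A sketch of Boolean selection on a four-row frame with placeholder values. The mask must contain one Boolean per row:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [23, 31, 27, 35]},
    index=["Aaron", "Brandon", "Carl", "Dexter"],
)

# True keeps the row, False drops it.
mask = [True, False, True, False]
alternate = people.loc[mask]
print(alternate)
```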

iloc function:
Just as loc helps us to access and retrieve data from rows and columns by
label, iloc helps retrieve information based on the integer position of
elements in the data frame.
So how are they different?
In case of loc one has to mention the row label or column label, but in
case of iloc, integer-based position values are mentioned to
retrieve the data of a particular cell of a dataframe.

Syntax:

dataframeName.iloc[indexPosition]

iloc accepts two selectors inside the square brackets: the row selector and
the column selector.
For iloc, the names are stored as an ordinary column of the zipped data
frame instead of being used as row labels.
Data frame:

1. For single row selection:
To select only one row, you mention the index number within the square
brackets of the iloc syntax.

dataframeName.iloc[indexPositionNumber]

You can also use negative indexing, which counts from the last row: -1
refers to the last (nth) row of the data frame.
Example:
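A sketch of single-row selection by position, including a negative index. The frame and its values are placeholders:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Aaron", "Brandon", "Carl", "Dexter"],
    "Age": [23, 31, 27, 35],
    "Color": ["Red", "Blue", "Green", "Black"],
})

first_row = people.iloc[0]    # position 0 is the first row
last_row = people.iloc[-1]    # -1 counts back from the end
print(first_row)
print(last_row)
```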

2. For single column selection:
To select only one column, you mention the index number after a
colon and a comma, within the square brackets of the iloc syntax.
You can also use negative indexing, which counts from the last column: -1
refers to the last (nth) column of the data frame.

dataframeName.iloc[ : , indexPositionNumber]

Example:
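A sketch of single-column selection by position; the frame and its values are placeholders:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Aaron", "Brandon", "Carl"],
    "Age": [23, 31, 27],
    "Color": ["Red", "Blue", "Green"],
})

# ':' keeps every row; the number after the comma picks the column.
second_column = people.iloc[:, 1]
last_column = people.iloc[:, -1]
print(second_column)
print(last_column)
```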


3. To select multiple rows and columns simultaneously:


You can select multiple rows and columns to be displayed, based on their
integer-positions.

dataframeName.iloc[ startingIndexPositionNumber : endingIndexPositionNumber ]


dataframeName.iloc[ : endingIndexPositionNumber ]
dataframeName.iloc[ [rowIndexNumbers], [columnIndexNumbers] ]

The code below shows how you can access multiple specified rows and columns
together.
In the first part, we use [ :4] to print all rows from the beginning until the
(4-1)th index.
Next, we print all rows from the 0th index until the (3-1)th index.
In the last part, we print the rows at index 1, 3 and 5 and their
corresponding values from the columns with index 1 and 2.


One must remember that when one mentions the end index, the rows
are printed until the (end index-1)th index.
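The three selections described above can be sketched as follows, on a six-row frame with placeholder values:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Aaron", "Brandon", "Carl", "Dexter", "Ethan", "Franco"],
    "Age": [23, 31, 27, 35, 29, 41],
    "Color": ["Red", "Blue", "Green", "Black", "White", "Yellow"],
})

first_four = people.iloc[:4]             # positions 0 to 3; the end index is excluded
first_three = people.iloc[0:3]           # positions 0 to 2
picked = people.iloc[[1, 3, 5], [1, 2]]  # rows 1, 3, 5 and columns 1, 2
print(picked)
```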

describe() function:
The describe function is used to calculate statistical data of numerical
values of a dataframe. It can be used to find percentiles, central
tendency, standard deviation, etc. of the numerical values. It returns the
statistical summary of the dataframe, like summary() in R.
Syntax:
DataFrame.describe(percentiles=None, include= None, exclude= None)
Parameters of the syntax:
percentiles: a list of values that must lie between 0 and 1. They are
fractions, that is, if you want the 25th percentile, you write 0.25. These
values cannot exceed 1. Its default value is [0.25,0.5,0.75], which gives
the 25th percentile (1st quartile), 50th percentile (median) and 75th
percentile (3rd quartile).
include: a list of datatypes to include while describing the dataframe. Its
default value is None.
exclude: a list of datatypes to exclude while describing the dataframe. Its
default value is None.
For a series:
In case of a series, we enter the series and just write
seriesName.describe() to get the statistical summary of the numerical
values inside the series.
Example:

The output gives the count of the elements in the series, its mean,
standard deviation, minimum value, its 25th percentile, 50th percentile,
75th percentile and maximum value along with the datatype.
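A sketch of describe() on a numeric series; the five values are placeholders chosen so the summary is easy to check by hand:

```python
import pandas as pd

marks = pd.Series([10, 20, 30, 40, 50])

# Returns count, mean, std, min, the three quartiles and max.
summary = marks.describe()
print(summary)
```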

For a data frame:
In case of a dataframe, we first declare what our dataframe is. We print
dataframeName.columnLabelName.describe() for a statistical summary
of a particular column.
Example:


In the above example, we print the dataframe called marksheet, and then
we find the statistical summary of subject English. One must note, that
you write the name of the column label in the syntax of the describe
function.
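A sketch of describing one column of a data frame. The frame name marksheet and the English column follow the text; the names, marks and the Math column are placeholders. The second call illustrates the percentiles parameter:

```python
import pandas as pd

marksheet = pd.DataFrame({
    "Name": ["Aaron", "Brandon", "Carl", "Dexter"],
    "English": [78, 85, 62, 91],
    "Math": [88, 70, 95, 64],
})
print(marksheet)

# Statistical summary of the English column only.
english_summary = marksheet.English.describe()
print(english_summary)

# Custom percentiles are passed as fractions between 0 and 1.
custom_summary = marksheet["English"].describe(percentiles=[0.1, 0.9])
print(custom_summary)
```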

For categorical values:


To evaluate categorical data from the data frame, we again use
dataframeName.columnLabelName.describe(). Since numerical statistics are
not defined for categorical data, it returns the count, the number of unique
values, the most frequent value (top) and its frequency (freq).
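A sketch of describe() on a column of text values; the frame and its values are placeholders:

```python
import pandas as pd

choices = pd.DataFrame({
    "Subject": ["Maths", "English", "Maths", "Science", "Maths"],
})

# For non-numeric data the summary holds count, unique, top and freq.
summary = choices.Subject.describe()
print(summary)
```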

info() function:
The info function prints a concise summary of the dataframe. It prints
information about the dataframe like its datatype, index types, column
types, etc.
Syntax:
DataFrame.info(verbose=None, buf=None, max_cols=None,
memory_usage=None, null_counts=None)

Parameters of the syntax:

verbose: whether the info() function must print the whole summary or not. It is
a Boolean value, by default it is set to None.
buf: stands for buffer. It decides where to send the output.
max_cols: if the data frame has more columns than the number
mentioned in max_cols, a truncated output is used. It tells us when the
output must shift from verbose to a truncated form.
memory_usage: tells us whether the memory used by the dataframe is
displayed. It is a Boolean value.
null_counts: tells us whether non-null counts must be displayed. It has a
Boolean value. In recent pandas versions this parameter is named
show_counts.

Example:
Input:
The data frame is created. It contains a marksheet with Names and marks
in the English and Math subjects. Then we type the .info() function with
its parameters.

Output:
The output below shows the data frame first, and then the summary of
the data. It shows the range of the index and the number of columns with
their corresponding column labels. It shows the number of non-null elements
under each column label along with the datatype of every column. After
displaying the table of column details, it shows how many columns use each
datatype. Since we have set memory usage to True, we also get the memory
used by the dataframe in the output.
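A sketch of info() on a small marksheet. The column names follow the description above; the values and the use of an in-memory buffer for buf are assumptions made for illustration:

```python
import io

import pandas as pd

marksheet = pd.DataFrame({
    "Name": ["Aaron", "Brandon", "Carl"],
    "English": [78, 85, 62],
    "Math": [88, 70, 95],
})
print(marksheet)

# Prints the full summary, including the memory used by the frame.
marksheet.info(verbose=True, memory_usage=True)

# buf sends the same summary to any writable text buffer instead of the screen.
buffer = io.StringIO()
marksheet.info(buf=buffer)
summary_text = buffer.getvalue()
```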

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Data Visualization - Matplotlib

Unit 11:
Data Visualisation- Matplotlib

Overview:
For data scientists, data visualisation plays an important role in analysing
and interpreting data. This visualisation can be in the form of charts, graphs,
plots, etc. In Python, alongside Pandas and NumPy, there is a library called
matplotlib that helps us make all of these graphical charts and enables easy
interpretation of the data. It can be used to analyse the entire dataset
overall or particular columns and rows specifically.

Unit Outcomes:
- Matplotlib- introduction.
- Need for matplotlib
- Line plots
- Scatterplot
- Bar plot
- Histogram
- Boxplot
- Pie Charts
- Doughnut Charts

Introduction:
Matplotlib is amongst the most used Python packages. It is a low-level
plotting library that serves as a visualization utility in Python.
John D. Hunter created the matplotlib package. It is open source and can be
used freely. Matplotlib is structured so that a graph/visualisation can
be made using only a few lines of code.
To install matplotlib, one can use pip. Enter the following command on the
command prompt.

pip install matplotlib

However, the Anaconda distribution has matplotlib pre-installed, so the above
step is not necessary there.

Advantages/Uses of Matplotlib:
1. It is open source: one does not need a licence to use the matplotlib
package, and it can easily be accessed by everybody.
2. It is extensible and can be customised: since matplotlib has a lot of
graphs and features, it can fit in almost any circumstance.
3. It is portable and cross-platform: code written on Linux runs on
Windows too, which makes sharing code easy.
Matplotlib, being a Python library, can run on any platform that Python supports.

Pyplot:
Pyplot is a submodule of matplotlib. It is a collection of functions that make
matplotlib work like MATLAB, a programming language that helps in
plotting data into graphs, implementing algorithms, creating user interfaces
and doing matrix manipulations.
We use pyplot to make graphical visualisations; each of its functions makes
some change to a figure.
To import pyplot:

import matplotlib.pyplot as plt

Data visualisation in Python:
Line Chart:
As the name suggests, a line chart shows a line that connects all
data points on a graph. It is best suited to time series data, since line
charts are used to track changes over brief or long periods of time.

To make a basic line chart:


In the above example we first created two arrays which act as the points
for the x axis and y axis respectively. We created a line chart by using the
pyplot function plt.plot(). To display the chart, the plt.show() function must be
used, just as we use print() to display strings and numeric data.
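A sketch of a basic line chart with placeholder points. The array names are assumptions, and the Agg backend line is only there so that the sketch also runs on a machine without a display; leave it out to see the chart in a window:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt
import numpy as np

x_points = np.array([1, 2, 3, 4, 5])
y_points = np.array([2, 4, 1, 6, 3])

# plt.plot() joins the points in order with a line.
lines = plt.plot(x_points, y_points)
plt.show()
```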

Scatter plot:
As the name suggests, a scatter plot shows all data points scattered on a graph.
It shows the relationship between two types of data, for instance the height
and weight of students in a class. Let us say there are 50 students, with the
height on the y-axis and the weight on the x-axis. Each student is marked at the
corresponding height-weight point on the graph. Scatterplots are often used to
graphically show the correlation between two variables.
Syntax:

matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None,
vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None,
plotnonfinite=False, data=None)

Here, all the parameters except x and y are optional. By default, all are set to
None except plotnonfinite, which is set to False.
s: the marker size
c: the marker colour.
marker: marker style
cmap: short for colormap. It is only used if c is an array of datatype float.
norm: short for normalize. It is used to scale colour data.
vmin and vmax: used jointly with the default norm to map colour array ‘c’ to
colormap ‘cmap’.
alpha: alpha blending value
linewidths: the linewidth of the marker edge
edgecolors: sets the edge colour of the marker. Can only have values ‘face’,
‘none’ or a colour
plotnonfinite: a Boolean value that sets whether points whose colour value ‘c’ is non-finite (inf, -inf or NaN) are plotted.
For example:
Here, we have combined two scatter plots to compare them together. Blue
coloured dots represent one scatter plot and the brown dots represent
another scatter plot. Basic arguments have also been used in the plot. To
display them individually, one must write the plt.show() command after each
respective plt.scatter() call. In this case random numbers have been
chosen to make the plots.
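A sketch of two scatter plots drawn on the same axes with random data. The seed, sizes, marker and labels are assumptions; the Agg backend line only lets the sketch run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x1, y1 = rng.random(20), rng.random(20)
x2, y2 = rng.random(20), rng.random(20)

# Two calls before a single plt.show() place both sets on one chart.
first = plt.scatter(x1, y1, c="blue", s=40, alpha=0.7, label="first set")
second = plt.scatter(x2, y2, c="brown", s=40, marker="^", label="second set")
plt.legend()
plt.show()
```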

Bar Plot:

Bar plots are used to represent categorical data with rectangular bars of
different heights depending on their respective values.
Syntax:
plt.bar(x, height, width, bottom, align)

Where the parameters:


plt.bar(): calls the bar plot function
x: a sequence of scalars giving the x coordinates of the bars
height: height of the bars
width: the width of the bars
bottom: the y coordinate of the bases of the bars. By default it is set to 0.
align: can be ‘center’ or ‘edge’. By default it is set to ‘center’.
Example:
In the diagram below, we made a bar graph of students and their favourite
subjects. There are 6 subjects, each with the number of students who like that
subject. The number of students is shown on the y-axis and the subjects on the
x-axis.
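A sketch of such a bar graph. The six subjects, the counts and the styling are placeholders; the Agg backend line only lets the sketch run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt

subjects = ["Maths", "English", "Science", "History", "Art", "Music"]
students = [12, 8, 15, 5, 7, 9]

# One bar per subject; the bar height is the number of students.
bars = plt.bar(subjects, students, width=0.6, color="teal", edgecolor="black")
plt.xlabel("Subject")
plt.ylabel("No. of students")
plt.show()
```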

One must remember that attributes like color, linewidth, edgecolor, etc. are
general attributes that can be used as a part of all plots.

Histogram:
A histogram looks like a bar plot, but it is a graphical representation of
numerical data that has been divided into buckets (bins). Histograms are used
to show the frequency distribution of a continuous variable, unlike bar plots,
which are generally used for categorical or discrete data.
Syntax:
plt.hist()

The parameters plt.hist() can have are:


x: array
bins: it is optional and contains an integer. Used to specify the number of
buckets on the plot.
density: optional, contains a Boolean value.
range: contains the lower and upper range of the bins.
histtype: type of histogram, can be ‘bar’, ‘barstacked’, ‘step’ or ‘stepfilled’. By
default it is set to ‘bar’.
align: controls the horizontal alignment of the histogram bars. Can be ‘left’, ‘mid’ or ‘right’.
weights: contains an array of weights having the same dimensions as x.
bottom: location of the baseline of each bin.
rwidth: contains the relative width of the bars as a fraction of the bin width.
log: used to set the histogram axis to a log scale.

Example:

The above example shows a set of numbers divided into buckets and displayed
according to its frequency in each bucket.
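A sketch of a histogram with four buckets. The values, the range and rwidth are placeholders; the Agg backend line only lets the sketch run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 8, 9]

# Four equal-width buckets between 1 and 9: [1,3), [3,5), [5,7), [7,9].
counts, bin_edges, patches = plt.hist(values, bins=4, range=(1, 9), rwidth=0.9)
print(counts)
plt.show()
```

plt.hist() returns the frequency in each bucket together with the bucket edges, so the numbers behind the bars can be checked directly.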

Box Plot:
The main idea of a box plot is to represent the data using a five-number
summary. This includes the minimum of the range, the maximum, the median,
and the first and third quartiles.
A box plot also helps to realise how many outliers our dataset might have. The
lines extending from the boxes (the whiskers) show how the data varies outside
the upper and lower quartiles.
Syntax:

plt.boxplot(data, notch=None, vert=None, patch_artist=None, widths=None)

The parameters include:


data: array to be plotted
notch: accepts Boolean values. If true, draws a notched box; the notch shows the
confidence interval around the median.
vert: accepts Boolean values. If true, displays the boxplot vertically. If false,
displays the boxplot horizontally.
bootstrap: accepts an int and specifies the number of bootstrap resamples used to
compute the confidence intervals around notched boxplots.
patch_artist: if true, the boxes are drawn as patches (which can be filled with
colour) instead of lines.
widths: sets the width of the box.

Example:
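A sketch of a box plot on placeholder data that contains one deliberate outlier. The vert parameter is left at its default (vertical); the Agg backend line only lets the sketch run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # 30 lies far outside the rest

# patch_artist=True draws the box as a fillable patch.
result = plt.boxplot(data, notch=False, patch_artist=True, widths=0.5)
result["boxes"][0].set_facecolor("lightblue")
plt.show()
```

The returned dictionary holds the drawn parts (boxes, medians, whiskers, caps and fliers), so the median line and the outlier points can be inspected or restyled afterwards.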


Pie Charts:
The basic idea of a pie chart is that it looks like a pie: it is circular and
divided into slices, each showing the quantity it represents out of the total.
A circle has 360 degrees, and each slice takes up a share of it in proportion
to its representation/frequency in the dataset.
Syntax:

plt.pie(data, explode=None, labels=None, colors=None, autopct=None, shadow=False)

The parameters used are:

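data: array of values that set the size of each slice
explode: array with one value per slice that sets how far the slice is pulled out from the centre of the pie
labels: list of strings to label each slice.
colors: to provide colours to the slices/wedges
autopct: a format string used to label the slices with their percentage value
shadow: to create a shadow beneath the pie.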

Example:


In the above example, we show a pie chart with 5 slices divided proportionately.
They show the number of people who like a respective dry fruit. Let’s say we
want to highlight the number of people who like raisins; to do so we use the
explode parameter to separate that slice from the rest of the pie.
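A sketch of such a pie chart. Only 'Raisins' is named in the text; the other dry fruits, the counts and the explode distance are placeholders. The Agg backend line only lets the sketch run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt

dry_fruits = ["Almonds", "Cashews", "Raisins", "Walnuts", "Pistachios"]
people = [30, 25, 15, 20, 10]

# One explode value per slice; only 'Raisins' is pulled away from the centre.
explode = [0, 0, 0.15, 0, 0]

plt.pie(people, explode=explode, labels=dry_fruits, autopct="%1.1f%%")
plt.show()
```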

Doughnut Charts:
Doughnut charts are similar to pie charts. They show categorical data
divided into parts of a whole chart. The only difference is that the centre part
of the chart is absent. A doughnut chart uses the length of the arc to represent
the information, unlike a pie chart, which focuses on comparing the proportional
area of the wedges. The centre part can be used to display additional
information about the chart.

Data must be presentable in pie chart form to be made into a doughnut chart.
The latter eliminates the comparison of size/area of each slice and instead
uses the length of the arc to measure the slice, making it easy to read.
To create a doughnut chart:
1. Make a pie chart.
2. Make a circle with suitable dimensions.
3. Place the circle at the centre of the pie chart.
Syntax:

plt.pie() #create a pie chart
objectName=plt.Circle() #create circle
objectName1= plt.gcf() #get current figure
objectName1.gca().add_artist(objectName) #add circle in pie chart
plt.show() #display chart

The parameters for the pie chart remain the same.

Example:

In the above example, we first create a pie chart, then we create a circle which
is white in color and add it to the pie chart. We set the parameters as desired
and add a title to it. We then display the chart.
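A sketch of the doughnut steps. The data, labels, circle radius and title are placeholders; the Agg backend line only lets the sketch run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to open a window
import matplotlib.pyplot as plt

sizes = [30, 25, 15, 20, 10]
labels = ["Almonds", "Cashews", "Raisins", "Walnuts", "Pistachios"]

plt.pie(sizes, labels=labels, autopct="%1.0f%%")        # 1. make a pie chart

centre_circle = plt.Circle((0, 0), 0.6, color="white")  # 2. make a white circle
figure = plt.gcf()                                      # get the current figure
figure.gca().add_artist(centre_circle)                  # 3. place it at the centre

plt.title("Favourite dry fruits")
plt.show()
```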

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Data Visualization - SeaBorn

Unit 12
Data Visualisation – SeaBorn

Overview:
The purpose of matplotlib is known: to make visualisations. Now imagine a
library, an extension of matplotlib, that can be used not only to make
visualisations, but also to help in statistical analysis of time series data,
or even to help picture linear regression models. There is one library that
does all of these things and more; it is called seaborn.

Unit Outcomes:
- Introduction
- Difference between seaborn and matplotlib
- Joint plot
- Scatterplot
- Box plot
- Pair plot
- Hex bin plot
- Violin plot
- Strip plot
- Pie chart
---------------------------------------------------------------------------------------------------------

Introduction:
First things first, what is seaborn? It is an open-source Python library
built on top of matplotlib. Like the latter, it is used for exploratory data
analysis and visualising huge chunks of data. It complements the Pandas library
and works directly with dataframes. Since it is used for
statistical analysis, making graphs and other visualisations is a very
important part of seaborn.
It is advised that one be familiar with NumPy, matplotlib and Pandas to get
familiarised with seaborn easily.
Seaborn helps visualise not only univariate distributions but also bivariate
ones. It can be used to create multi-plot grids since it
favours high-level abstraction. It also draws regression plots with
automatic estimation of the fit. Its API is oriented towards finding the
relationship between two variables.

Difference between matplotlib and seaborn:
While being very similar, matplotlib and seaborn vary in a lot of aspects, in
their functionality, flexibility, visualisations, etc.
Feature: Functionality
Seaborn: It contains plots, patterns and themes for data visualisation. All the data can be combined in one plot. It helps in deciphering the distribution of the data.
Matplotlib: It helps in making basic graphs.

Feature: Plots and graphs usually made
Seaborn: All plots in matplotlib plus heatmap, factor plot, density plot, joint plot, etc.
Matplotlib: Bar graphs, histograms, pie charts, scatter plots, lines, box plot, etc.

Feature: Syntax
Seaborn: Easy syntax as compared to matplotlib.
Matplotlib: Relatively complex syntax.

Feature: Syntax Example
Seaborn: For a boxplot: seaborn.boxplot(x,y,hue,data,ax)
Matplotlib: For a boxplot: matplotlib.pyplot.boxplot(data,labels)

Feature: Flexibility
Seaborn: It avoids plot overlapping by using default themes.
Matplotlib: It is robust and highly customised.

Feature: Libraries used
Seaborn: Seaborn is an extension of matplotlib. It uses matplotlib, Pandas, and NumPy for visualisation.
Matplotlib: It uses NumPy and Pandas for visualisation.

Feature: Arrays/Dataframes
Seaborn: It is organised and more functional. It treats the dataframe as a single thing. Although, parameters are required to call methods like plot() because it is not as stateful.
Matplotlib: It is efficient with arrays and datasets. It contains stateful API for plots, because of which methods like plot() in matplotlib can work without parameters. Objects in matplotlib are the figures and axes.

Feature: Dealing multiple figures
Seaborn: It sets a time for creation of every figure. It can also lead to out of memory issues.
Matplotlib: Multiple figures can be opened and closed simultaneously. Although, figures are closed distinctly.

Feature: Visualisation
Seaborn: It is more congenial to handle Pandas data frames. Basic methods are used to provide graphics.
Matplotlib: It is connected to NumPy and Pandas well. Pyplot in matplotlib provides similar features like MATLAB. Matplotlib also acts as a graphics package for data visualisation.

Importing the seaborn library:


The syntax to import is just like that of other library packages: import the
library and assign an alias to it (optional).

import seaborn as sns

You can apply the default seaborn theme, scaling and color palette. Seaborn
also has 18 built-in datasets. To get a list of these datasets, one just has to
enter:
sns.get_dataset_names()
sns.set() #for the theme and other aesthetics

For the above statement, you will get an output displaying all 18 dataset
names:


Plotting using seaborn:

1. Joint plots:
A plot of two variables with bivariate and univariate graphs. It has three
plots. One displays the bivariate data which shows how the dependent
variable varies with respect to the independent variable. Above the
bivariate graph is another, horizontally placed graph that shows the
distribution of the independent variable. Yet another plot is on the right
margin of the bivariate plot. Its orientation is vertical, and it shows the
distribution of the dependent variable. All in all, one bivariate chart has
a univariate analysis of each of its two variables, above and on the
right margin.
By default, jointplot() creates a scatter plot of the bivariate data and two
histograms, above and on the right margin.

Syntax (with some of the main parameters):

sns.jointplot(x, y, data, kind, color, height, ratio, dropna)

The given parameters are not the only ones, but they are amongst the
important ones, where:
x, y – names of variables in the ‘data’.
data – the dataframe in which x and y are variable names.
kind – the kind of plot to draw, e.g. hex, scatter, reg, etc.
color – the color used for the plot elements.
height – the size of the (square) figure.
ratio – the ratio of the joint axes height to the marginal axes height.
dropna – takes Boolean values. If it is True, missing observations are
removed from ‘x’ and ‘y’.
Returns – a JointGrid object with the plot on it.
Example:
Input:
Here, we consider a built-in dataset called ‘tips’. The dataset has 244 rows
and 7 columns that, in short, record the total bill, the tip, the sex of the
person, etc.
We write a jointplot call that compares ‘total_bill’ with ‘tip’. We set the
kind to ‘reg’ to fit a regression line and give the plot an orange color. The
output chart should show whether the two variables are correlated, so that
we can make judgements accordingly.

Output:
We see a bivariate chart showing the correlation between total_bill and
tip. There is also a regression line that is displayed. Since the datapoints
are closely spaced around the line, we can say that the two variables are
positively correlated. Above the bivariate chart is the histogram of the
independent ‘total_bill’ variable. We can see that the distribution is right
tailed. In case of the dependent ‘tip’ variable, the histogram on the right
margin shows that the distribution is also not normal. A similar analysis
can be done for other pairs of variables using jointplots.


Scatter Plot:
One can also plot a scatter chart with only the bivariate data. A scatter plot
can be used to understand semantic groupings better. It is a 2-dimensional
graph onto which up to three additional variables can be mapped using the
size, hue and style parameters.

Syntax:
seaborn.scatterplot(x, y, hue, style, size, data, palette, markers)

The above-mentioned parameters are the most commonly used ones, and their
role in the plot is:
x, y – the input data variables. They should be numeric.
data – the dataframe where ‘x’ and ‘y’ are column heads.
hue – produces points with different colors. It is a grouping variable.
size – produces points with different sizes. It is a grouping variable.
style – produces points with different markers. It is a grouping variable.
palette – determines the colors used for the different levels of the hue variable.
markers – determines the markers drawn for the different levels of the style variable.
alpha – determines the opacity of the points.
Returns – the Axes object with the plot drawn on it.
Some of the other arguments that can be passed to the scatterplot() method are
hue_order, hue_norm, sizes, size_order, style_order, legend, etc.

Example:
In this case, for ease of availability, we use a built-in dataframe; one can use
any independent dataset as well.
We use the ‘Tips’ dataframe. We provide the following input, where the x
variable is ‘total_bill’ and the y variable is ‘tip’. The dataset is loaded into the
variable named data. The hue is set to ‘smoker’; hue segregates elements with
different colors. Style is also set to ‘smoker’, so it shows elements with
different marker styles. Lastly, the size is set to ‘size’, so each datapoint is
scaled according to the value of its ‘size’ column.
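A minimal sketch of such an input is given below. The made-up stand-in frame is an assumption that keeps the example runnable without downloading ‘tips’; the scatterplot() arguments follow the description above.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in with the same column names as 'tips';
# data = sns.load_dataset("tips") would load the real dataset (needs internet).
data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
    "size": [2, 3, 3, 2, 4, 4, 2, 4],
})

# hue and style both follow 'smoker'; marker size follows the 'size' column.
ax = sns.scatterplot(data=data, x="total_bill", y="tip",
                     hue="smoker", style="smoker", size="size")
```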

We get the following output:
Here, the tip and total bill datapoints are plotted, the data has been
segregated according to the size and whether a person is a smoker or not.

Box Plot:
A box-and-whisker plot is mainly used to compare data across categories. It
shows the distribution of the quantitative data in a manner where the levels of
a categorical variable can be compared. The five main parts of a box-and-whisker
plot are the median, first quartile, third quartile, maximum observed value, and
minimum observed value. Thus, the box shows the quartiles, and the whiskers
show the rest of the distribution. Points that lie beyond the whiskers are
called ‘outliers’.
Syntax:

seaborn.boxplot(x, y, data, hue, palette, color, orient, saturation, width, whis)

These are amongst the main arguments of the boxplot() method. Their use is:
x, y – names of variables from the ‘data’.
data – the dataframe which has x and y as column heads.
hue – name of the grouping variable that segregates data with different colors.
Contains nominal data.
orient – sets the orientation of the box-and-whisker plot. Can be horizontal or
vertical.
palette – the colors used for the different levels of the hue variable.
saturation – the proportion of the original saturation at which to draw colors.
width – sets the width of all elements of one level in case the plot has hue,
otherwise sets the width of the boxes in general.
whis – sets the proportion of the inter-quartile range past the first and third
quartiles to which the whiskers extend. Points beyond this range are called outliers.
Example:
We again consider the ‘Tips’ built-in dataset. Here we compare the total bill of
a smoker and that of a non smoker on Thursday, Friday, Saturday and Sunday.
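The comparison could be drawn with a call like the one sketched here. The frame is a made-up stand-in with the ‘tips’ column names (an assumption so that the code runs offline), and the pastel palette follows the description of the output.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for 'tips'; sns.load_dataset("tips") loads the real data.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29,
                   8.77, 26.88, 15.04, 14.78, 10.27, 35.26],
    "day": ["Thur"] * 3 + ["Fri"] * 3 + ["Sat"] * 3 + ["Sun"] * 3,
    "smoker": ["No", "Yes"] * 6,
})

# One box per day, split by smoker status, drawn in pastel shades.
ax = sns.boxplot(data=tips, x="day", y="total_bill", hue="smoker", palette="pastel")
```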


For the given input we get an appropriate box and whisker plot where the days
are on the x-axis and the total bill on the y-axis. It takes into account whether a
datapoint/person is a smoker or not. The palette is set to pastel shades, so all
different levels are shown in pastel colors. The output:

Pair plot:
The pairplot() method can be used to plot the pairwise bivariate distributions
in a dataframe. The diagonal plots of a pairplot are univariate plots, while the
off-diagonal plots show the relationship for each of the C(n, 2) pairs of
variables in the dataframe, arranged as a matrix of plots.
Syntax:

seaborn.pairplot(data, hue, palette, dropna, kind, height, {x,y}_vars, markers)

These are amongst the most used arguments; however, there are many more. The
function of these arguments is:
data – the dataset where each column is a variable
hue – grouping variable used to plot aspects of the data in different colors
palette – the color palette used for the different levels of the hue variable
dropna – whether to drop missing values from the data before it is plotted.
Boolean type of variable.
kind – can be ‘scatter’, ‘kde’, ‘hist’ or ‘reg’, as per requirements
height – the height of each facet
{x,y}_vars – variables from the dataframe that can separately be used for the
rows and columns of the figure
markers – as for scatterplot, markers in pairplot set the marker used for each
level of the hue variable.

Example:
We use the tips dataset and draw a pairplot with hue ‘smoker’.
Input:

Output: As a result we get the following chart, where multiple charts are
plotted. The numeric variables ‘total_bill’, ‘tip’ and ‘size’ are compared with
one another, and in every plot the datapoints are colored according to the
variable ‘smoker’. A histogram is drawn on the diagonal for each variable, and a
scatterplot is drawn for each pair of variables.

Hexbin plot:
As the name suggests, the hexbin plot divides the data range into an array of
hexagons and colors each hexagon according to the number of observations
that it covers. When a scatter plot is overplotted, i.e. when it has so many
datapoints that they overlap, the hexbin plot comes in handy as it helps in
visualising large datasets without overlapping points. A hexbin plot can be
created by setting the ‘kind’ argument of jointplot to ‘hex’.

Syntax:

sns.jointplot(x, y, data, kind='hex', color, height, ratio, dropna)

Example:
We take the ‘Tips’ dataset into consideration and draw a hexbin plot for
‘total_bill’ and ‘tip’ variables.
Input:

For the given input, we get the following output: the bivariate hexbin plot
shows that more datapoints are concentrated where the total bill amount is
between $10 and $20 and the tip is between $0 and $4. The univariate chart
above shows the histogram for the x-axis data (‘total_bill’), and the histogram
on the right margin of the chart shows the data of the ‘tip’ variable.

Violin plot:
Like box-and-whisker plots, violin plots show the distribution of quantitative
data across categorical variables. It is an aesthetically better way of showing
multiple distributions side by side.
The difference between a violin plot and a boxplot is that the former features
the kernel density estimation of the underlying distribution. The data can be
passed either as a long-form dataframe, with x and y naming its columns, or as
a ‘wide form’ dataframe, in which each numeric column is plotted as its own
violin.

Syntax:

sns.violinplot(x, y, hue, data, color, scale, palette)

These are a few of the many parameters that can be used, where:
x, y – variables that are used for plotting datapoints.
hue – grouping variable used to plot aspects of the data in different colors.
data – the dataset that has variables x and y as column heads.
scale – the method used to scale the width of each violin (renamed density_norm
in recent seaborn releases).
Example:

For the tips dataset, we plot violins on the basis of the day. The input is as
follows:
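A possible form of this input is sketched below on a made-up stand-in frame (an assumption; the real ‘tips’ data would be loaded with sns.load_dataset("tips")).

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for 'tips' with three bills per day.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29,
                   8.77, 26.88, 15.04, 14.78, 10.27, 35.26],
    "day": ["Thur"] * 3 + ["Fri"] * 3 + ["Sat"] * 3 + ["Sun"] * 3,
})

# One violin per day; the width of each violin follows the estimated density.
ax = sns.violinplot(data=tips, x="day", y="total_bill")
```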

Output: in the output, we see 4 violins that are made according to the day and
their respective ‘total_bill’ values.


Strip plot:
A strip plot complements a box-and-whisker plot or violin plot. It is used to
draw a scatter plot in which one variable is categorical.

Syntax:

seaborn.stripplot(x,y,data,hue,palette)

All parameters function in a similar pattern as that of violin plot and boxplot.

Example:
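The original screenshot is missing; a call of roughly this shape would produce such a plot. The stand-in frame is made up, and jitter=True is written out explicitly as an assumption about the original input.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for 'tips'; sns.load_dataset("tips") loads the real data.
tips = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23, 1.71, 5.00],
    "day": ["Thur"] * 3 + ["Fri"] * 3 + ["Sat"] * 3 + ["Sun"] * 3,
})

# Days on the x-axis, tips on the y-axis; jitter spreads points sideways.
ax = sns.stripplot(data=tips, x="day", y="tip", jitter=True)
```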

In the above example, we have taken the ‘Tips’ dataset and we consider the tip
on the y-axis and days on the x-axis. As the name suggests, each category is
plotted as a strip of points. The jitter attribute adds small random
displacements along the horizontal (categorical) axis so that the points
overlap less.

Pie Chart:
We use matplotlib to create a pie chart, but we can use Seaborn palettes to
make it aesthetically better. There is no specific method in seaborn for pie
charts.

Example:
Here, we have randomly assigned values to different subjects and used
seaborn for the color palette of the pie chart.
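A sketch of this idea is shown below. The subject names and values are hypothetical, chosen only for illustration; the pie itself is drawn by matplotlib, and seaborn only supplies the colors through color_palette().

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical subjects and values, used only for illustration.
subjects = ["Maths", "Physics", "Chemistry", "Biology", "English"]
marks = [85, 70, 65, 90, 75]

# Ask seaborn for one pastel color per slice, then draw the pie with matplotlib.
colors = sns.color_palette("pastel", n_colors=len(subjects))
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(marks, labels=subjects, colors=colors,
                                  autopct="%1.1f%%")
ax.set_title("Marks by subject")
```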

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Introduction to EDA

Unit 13
Introduction to Exploratory Data Analysis

Overview:
As data scientists, our main aim is not just to draw graphs based on the
dataset. Although drawing graphs is very important, it is a step that comes
after EDA. EDA helps analyse the data and determine the goals behind the
analysis. Data scientists identify the relationships between factors/variables
in the dataset and then use tools and software like R and Python to explore the
data and make inferences from it.

Unit Outcomes:
- EDA: Understanding
- Data Manipulation
- Skewness
- Kurtosis
- Pair plot analysis
- Categorical Data encoding

Exploratory Data Analysis
Tukey defined data analysis in 1961 as: "Procedures for analyzing data,
techniques for interpreting the results of such procedures, ways of planning
the gathering of data to make its analysis easier, more precise or more
accurate, and all the machinery and results of (mathematical) statistics which
apply to analyzing data." (source: Wikipedia)
Inferences from large sets of data cannot be drawn by just looking them over;
that would be neither accurate nor appropriate. EDA is short for
Exploratory Data Analysis. EDA includes exploring the dataset and finding
patterns and trends, or making observations about outliers. The most
fundamental part of EDA involves creating a statistical summary of the dataset
and using numerous forms of representation to interpret the data better.

Need for EDA:


- To understand the main features of the dataset.
- To understand the correlation between variables and the relationships that
hold between them.
- To decipher which variables might be of the utmost use with respect to our
problem statement.
- To detect mistakes and understand the variation of the data.
- To check assumptions.
- To select appropriate models.

Numerous tools can be used for EDA. In fact, all the libraries that we have
learnt about until now are direct or indirect tools for EDA. Various
graphical techniques include:
- Bar graphs
- Scatter diagrams
- Stem and leaf charts
- Pareto charts
- Boxplots and histogram
- Odds ratio, etc.

Four types of EDA methods are:
- Univariate non-graphical: the goal of a univariate non-graphical EDA is
to make tentative conclusions about the population distribution from the
sample distribution. It also gives importance to outliers of a distribution.
Mean, median, mode, skewness, kurtosis, etc. are all part of
univariate non-graphical EDA.

- Univariate graphical: graphical and non-graphical methods complement
one another. Observations that are made using the univariate non-graphical
method are graphically represented in this method. Histograms,
boxplots, stem and leaf plots, quantile normal plots, etc. form the
graphical method of presentation.

- Multivariate non-graphical: multivariate non-graphical EDA shows the
relationship between two or more variables. It can be shown via cross
tabulation or statistics. Finding the correlation, covariance, etc. is a part
of the multivariate non-graphical method of EDA.

- Multivariate graphical: representing multivariate data pictorially is the
multivariate graphical method of EDA. Grouped bar plots and scatter plots
are multivariate graphical tools for EDA.
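The two non-graphical methods can be illustrated with a few Pandas calls. This is a minimal sketch on a made-up frame whose column names are borrowed from the ‘tips’ dataset; the values are for illustration only.

```python
import pandas as pd

# Made-up values with 'tips'-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88, 15.04, 14.78],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23],
})

# Univariate non-graphical: summary statistics, skewness and kurtosis per column.
summary = tips.describe()
skewness = tips.skew()
kurtosis = tips.kurt()

# Multivariate non-graphical: correlation and covariance between the columns.
correlation = tips.corr()
covariance = tips.cov()

print(summary)
print(correlation)
```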

Data Manipulation:
For any type of data collected, discrepancies, fallacies and other human errors
have to be accounted for. There are many datapoints where the information is
simply not available, or incorrect information is entered intentionally, or
unintentionally. To do true justice to the data collected and make proper and
accurate inferences from them, the data is manipulated.
Data manipulation is basically organising data so that it becomes better to read
and understand and its format becomes more structured and uniform. It is
used to optimise the working of the firm/industry. Business operations can be
better carried out if the data and surveys a company collects are organised.
Some of the major objectives of data manipulation include:
- Consistency: organising data not only helps us read the data better
but also makes its content consistent. One of the main reasons to manipulate
data is to have a unified view of the data.

- Project data: when data is collected over a period of time, it is called
historical data or time series data. Manipulating such data allows
more meaningful and in-depth analysis, helping us project future data.
For instance, stock prices collected over a period of 5 years can be used
to predict the price of the stock next year.

- To get rid of redundant data: sometimes when data is collected, there
are a lot of unnecessary repetitions that can hinder proper
analysis. Data is manipulated to get rid of this redundancy and to delete
useless information.
All in all, data manipulation includes operations such as editing, deleting,
updating, converting and incorporating correct data.
To manipulate the data, one must first have a dataset and then clean it.
While cleaning the data, all redundant information is deleted, the
format of the datapoints is made consistent, and the dataset is made more
efficient and more meaningful than when it was collected.

Data Manipulation Techniques:


Data manipulation is usually done with the help of methods and
functions that are part of the Pandas library. A few of them are:

1. Apply function:
The apply() function in Pandas is used for creating variables and
manipulating a DataFrame. It takes a function as input, passes each
row/column of the DataFrame to that function, and returns the resulting
values. For tabular data, specify the axis along which you want your
function to act, i.e., 1 for rows and 0 for columns.
Example: For the given technique, we use the built-in ‘tips’ dataset.
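Since the discussion of this example refers to the total_bill column, the sketch below uses a made-up frame with ‘tips’-style columns. The new column names vs_mean and tip_pct are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Made-up values with 'tips'-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
})

mean_bill = tips["total_bill"].mean()

# apply() on a single column: the function receives one value at a time.
tips["vs_mean"] = tips["total_bill"].apply(
    lambda bill: "above mean" if bill > mean_bill else "not above mean"
)

# apply() with axis=1: the function receives one row at a time.
tips["tip_pct"] = tips.apply(
    lambda row: 100 * row["tip"] / row["total_bill"], axis=1
)
print(tips)
```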

In the above example, we check whether the total_bill of each person
is less than or greater than the mean. This can further be used to get a
first impression of the skewness of the total_bill column of the data.

2. Sorting a DataFrame:
Data can be sorted based on columns:
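A short sketch with sort_values() is given here on a made-up frame; the choice of columns to sort by is only an illustration.

```python
import pandas as pd

# Made-up values with 'tips'-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "day": ["Sun", "Sat", "Sun", "Sat", "Fri", "Fri"],
})

# Sort by one column, largest bill first.
by_bill = tips.sort_values(by="total_bill", ascending=False)

# Sort by several columns: first by day, then by tip within each day.
by_day_then_tip = tips.sort_values(by=["day", "tip"])
print(by_bill)
```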


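3. Merging DataFrames:
Two dataframes can be combined into one using the Pandas library. The
concat function stacks dataframes along an axis, while the merge function
joins them on common columns or keys; their outputs coincide only in
special cases.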
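The difference between the two functions can be seen in a small sketch. The two frames and the id key column are hypothetical, made up for illustration.

```python
import pandas as pd

# Two hypothetical frames that share an 'id' column.
bills = pd.DataFrame({"id": [1, 2, 3], "total_bill": [16.99, 10.34, 21.01]})
tips = pd.DataFrame({"id": [2, 3, 4], "tip": [1.66, 3.50, 3.31]})

# concat stacks the frames on top of each other; columns missing from one
# frame are filled with NaN.
stacked = pd.concat([bills, tips], ignore_index=True)

# merge joins rows whose key matches; an inner join keeps only ids 2 and 3.
joined = pd.merge(bills, tips, on="id", how="inner")

print(stacked)
print(joined)
```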

Output:

4. Crosstab in Pandas
A cross-tabulation creates a contingency table, i.e. a table of
frequencies, for two or more categorical columns of a DataFrame. For
instance, in the given example, we compare the time and the day on which
most people visit the place in the ‘tips’ dataset.
The output gives us the frequency of visitors on a particular day and
whether they visited the place during lunch or dinner. The margins
attribute adds the row and column totals to the table.
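A sketch of such a cross-tabulation with pd.crosstab() is shown below on a made-up frame with ‘tips’-style columns.

```python
import pandas as pd

# Made-up values with 'tips'-style column names.
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Rows are days, columns are meal times, cells are visit counts.
# margins=True adds an 'All' row and column holding the totals.
table = pd.crosstab(tips["day"], tips["time"], margins=True)
print(table)
```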

5. Pivot table:
In Excel, we often use pivot tables to summarise huge chunks of data.
Numerical data can easily be analysed using pivots. These tables can also
be created in Python using the .pivot_table() function of Pandas.
Example: in the given example, we see the sum of tips and total_bill
collected at a particular time, on each day, by males and females. This
can be done to analyse whose spending is more, and on which day they
are more likely to spend.
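A summary like this could be built as sketched below. The sketch uses pivot_table(), because summing values requires an aggregation, and a made-up frame with ‘tips’-style columns; it is an assumption about the original input, not a copy of it.

```python
import pandas as pd

# Made-up values with 'tips'-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male", "Male", "Female"],
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Sum of tip and total_bill for every (day, time) pair, split by sex.
table = tips.pivot_table(values=["tip", "total_bill"],
                         index=["day", "time"],
                         columns="sex",
                         aggfunc="sum")
print(table)
```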

Other data manipulation techniques:

Imputing missing values:


The fillna() function is used to update datapoints that have missing values.
These datapoints are commonly filled with the mean/median/mode of the given column.
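A minimal sketch of imputation with fillna() follows. The frame and the new column names are hypothetical; the mean or median is used for a numeric column and the mode for a categorical one.

```python
import pandas as pd

# Hypothetical frame with one missing tip and one missing day.
df = pd.DataFrame({
    "tip": [1.0, None, 3.0, 4.0],
    "day": ["Sat", None, "Sat", "Sun"],
})

# Numeric column: fill the gap with the mean or the median of the column.
df["tip_mean"] = df["tip"].fillna(df["tip"].mean())
df["tip_median"] = df["tip"].fillna(df["tip"].median())

# Categorical column: fill the gap with the most frequent value (the mode).
df["day_mode"] = df["day"].fillna(df["day"].mode().iloc[0])
print(df)
```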

Boolean Indexing:
Boolean indexing selects data using the actual values of the data in a
dataframe. We can access the data by applying a Boolean mask built from the
index values or the column values, or by accessing a dataframe that has a
Boolean index. Usually, we create a mask or an index that contains Boolean
values, which can then be used with the .loc[] or .iloc[] indexers (the older
.ix[] indexer has been removed from recent versions of Pandas).
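The masking approach can be sketched as follows on a made-up frame with ‘tips’-style columns. Note that .iloc[] expects a plain Boolean array, so the mask is converted with to_numpy().

```python
import pandas as pd

# Made-up values with 'tips'-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 8.77],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 2.00],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
})

# A Boolean mask: True for every row whose bill is above 20.
mask = tips["total_bill"] > 20

by_label = tips.loc[mask, ["total_bill", "tip"]]   # label-based selection
by_position = tips.iloc[mask.to_numpy()]           # position-based selection

# Masks can be combined with & (and) and | (or).
large_non_smoker = tips[(tips["total_bill"] > 20) & (tips["smoker"] == "No")]
print(by_label)
```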

Skewness:
In statistical analysis, skewness is the degree of asymmetry in a
probability distribution, i.e. the shift of the distribution curve of a given
dataset. The distortion in a set of data that deviates from the symmetric bell
curve, or normal distribution, is called skewness. If the curve is
shifted towards the right or the left, it is said to be skewed. A normal
distribution is not skewed, i.e. its skewness is zero.

In the case of a bell curve, the mean, the median and the mode coincide (the
mode being the observation with the highest frequency), but in the case of a
skewed distribution, the mean, the median and the mode tend to differ from
each other.

A curve whose bulk lies on the right side, with its tail tapering towards the
left, is called a left-tailed curve and is negatively skewed. On the other hand,
when the bulk of the curve lies on the left, with the tail tapering towards the
right, it is a right-tailed curve or a positively skewed curve.
For a positively skewed curve, the mean is typically greater than the median,
and the reverse is true for a negative skew, i.e., the median is typically
greater than the mean.

The skewness can be measured using Pearson’s first and second coefficients. In
Pearson’s first coefficient of skewness, we subtract the mode from the mean and
divide the result by the standard deviation.
The general formula is:
\[ Sk_1 = \frac{\bar{x} - M}{sd} \]
Where $Sk_1$ is Pearson’s first coefficient of skewness,
$\bar{x}$ is the mean,
$M$ is the mode,
$sd$ is the standard deviation.
\[ Sk_2 = \frac{3(\bar{x} - \text{Median})}{sd} \]
Where $Sk_2$ is Pearson’s second coefficient of skewness,
$\bar{x}$ is the mean,
$sd$ is the standard deviation.

When the mode is strong, the first coefficient of skewness is used. However,
when a distribution has multiple modes or a weak mode, then Pearson’s
second coefficient is used.
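Both coefficients can be computed directly with Pandas. The values below are hypothetical, and the standard deviation is the sample standard deviation that Pandas uses by default; the built-in skew() method is printed alongside for comparison and uses a different, moment-based formula.

```python
import pandas as pd

# Hypothetical right-skewed data: most values are small, one is large.
data = pd.Series([2, 3, 3, 4, 5, 7, 9, 15])

mean = data.mean()
median = data.median()
mode = data.mode().iloc[0]   # first mode if there are several
sd = data.std()              # sample standard deviation (ddof=1)

sk1 = (mean - mode) / sd           # Pearson's first coefficient
sk2 = 3 * (mean - median) / sd     # Pearson's second coefficient

print("Sk1:", sk1)
print("Sk2:", sk2)
print("pandas skew():", data.skew())
```

All three measures come out positive here, which matches a right-tailed distribution.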
A typical skewed graph looks like:

Source- Wikipedia

Why is skewness important?


It shows the symmetry of the probability distribution. It is used to judge the
likelihood of events falling in the tails of a probability distribution. It gives the
direction and magnitude of the lack of symmetry. It can be positive, negative,
zero or undefined.
Skewness is of great importance in the field of finance. Modern finance bases a
lot of its theories on the assumption of a normal distribution. In the real world,
this assumption rarely holds, and models that rely on it can consequently
estimate returns and risk incorrectly.
In financial modelling, when observations are not symmetrically spread around
the mean and have a skewed distribution, the resulting risk is called skewness
risk. For this reason, performance predictions should take the skewness of the
distribution into account.
If a dataset is skewed, a model that assumes normality will misjudge the
skewness risk in its predictions. As a result, the greater the skewness, the
less accurate such a financial model tends to be. Thus, it is important to
measure skewness when assessing the performance of investment returns.

Kurtosis:
While skewness measures the asymmetry of the distribution, kurtosis
measures the heaviness of its tails. It is used to identify whether the
distribution contains extreme values. The kurtosis of a normal distribution is
equal to 3. Excess kurtosis is the difference between the kurtosis of a given
distribution and the kurtosis of a normal distribution.
Excess Kurtosis = Kurtosis - 3

Based on its excess kurtosis, a distribution falls into one of three types:
- Mesokurtic: when a distribution has excess kurtosis equal to 0, i.e.,
kurtosis = 3. A normal distribution is mesokurtic.

- Leptokurtic: when a distribution has positive excess kurtosis, i.e.,
kurtosis > 3, it is said to be a leptokurtic distribution. Its tails are
fatter than those of a normal distribution. The heavy tails on either side
indicate the presence of outliers.

- Platykurtic: when a distribution has negative excess kurtosis, i.e.,
kurtosis < 3, it is said to be a platykurtic distribution. Its tails are
thinner than those of a normal distribution, which indicates fewer and less
extreme outliers.
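The three types can be illustrated on simulated data. This is a sketch under stated assumptions: the samples are randomly generated, the tolerance of 0.2 around zero is an arbitrary cut-off chosen for illustration, and the classify helper is hypothetical. The Pandas kurt() method reports excess kurtosis, so a normal sample gives a value close to 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
samples = {
    "normal": pd.Series(rng.normal(size=10_000)),
    "laplace": pd.Series(rng.laplace(size=10_000)),   # heavier tails than normal
    "uniform": pd.Series(rng.uniform(size=10_000)),   # lighter tails than normal
}

def classify(excess, tolerance=0.2):
    """Label a distribution by its excess kurtosis (tolerance is arbitrary)."""
    if excess > tolerance:
        return "leptokurtic"
    if excess < -tolerance:
        return "platykurtic"
    return "mesokurtic"

for name, values in samples.items():
    excess = values.kurt()   # excess kurtosis; add 3 to get the kurtosis
    print(name, "kurtosis:", round(excess + 3, 2), "type:", classify(excess))
```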

Why is kurtosis important?


Like skewness, kurtosis is very important in the field of finance. A platykurtic
distribution of investment returns indicates that there is only a small
probability that the investment will yield extreme results, making it desirable
to investors. A leptokurtic distribution, on the other hand, shows that the
investment is susceptible to extreme values, making the investment risky.

Pair plot Analysis:


Pairplot is amongst the most powerful features of the Seaborn library. For any
dataset and the variables that are compared, you can see its joint distribution
and marginal distribution. Joint distributions are usually in the form of
scatterplots because they represent bivariate data, while the marginal
distributions are shown via histograms.
Pairplots are very valuable since they show the correlations between multiple
variables.
A ‘tips’ dataset pairplot without any other attributes looks like:

The pairplot function creates a grid of axes. By default, each numeric variable
is shared across the y-axes of a single row and the x-axes of a single column.
The diagonal plots are univariate distributions showing the marginal
distribution of the data in each column.

If we want a subset of the variables used we use the following code:

The subset chosen here is ‘total_bill’ and ‘tip’. A pairplot is displayed where the
diagonals show the histogram for total_bill and tip respectively while the
scatterplots show the relationship between the two.

We use the ‘hue’ attribute of pairplots in case of categorical data to
differentiate between levels/categories.

For the ‘Tips’ dataset, the variable ‘day’ has categorical data with values as
days of the week.


To plot a different kind of graph for either the joint distributions or the
marginal distributions, the ‘kind’ and ‘diag_kind’ attributes are used,
respectively.

We can also add categorical data information to it:

Categorical data encoding:
Machine learning is a big part of data science. A machine learning model is a
file that has been trained to recognise certain patterns and trends.
To better understand this, let's say you are told to see and memorise three
objects: an apple, a ball, and a car. Later, if you see any moving vehicle, you
will be able to recognise it as a car. The way the brain gets accustomed to
recognising objects, so can a model. You can train a model by providing it with
images of a car in which each part of the car is tagged, and the model can then
be used in an application to recognise whether a certain object is a car or not.
The accuracy and performance of the model depend on the hyperparameters
and on the way different variables are fed to and processed by the model. One
must note that most of these models only accept numerical values. In case of
categorical data, converting the categories to numbers is important so that the
model can interpret them properly and extract valuable information.
All of this brings us to the next part of the topic: what is categorical data? How
can it be encoded? How can it be used to help make the model better?
chanddra.p@gmail.com
UVBL5MQSJ8
Categorical variables are ones that are in the form of strings or categories and
are finite. For instance, services of a restaurant can be classified as ‘good’,
‘better’, ‘excellent’. All the datapoints can take any of these three values. Or in-
case of grades of a school student which can be ‘A’, ‘B’, ‘C’, ‘D’, etc.

Categorical data is usually of two types:

- Ordinal data: when the categories follow a particular order, they are
called ordinal data. While encoding, information about the order of
the categories must be retained. For instance, consider
the salary of a person applying for a loan. Salaries follow an
ascending order, which helps to determine whether the person will be
able to pay back the loan, and thus whether the person is eligible for it.

- Nominal data: when the categories don’t follow any particular order, they
are called nominal data. While encoding nominal data, one only records whether each
category is present or not. For example, consider the department of the employees of a
company. An employee can be part of any department, and there is no order or
sequence among the departments.

Types of data encoding:

For data encoding, we import the category_encoders library.

- Ordinal encoding:
It is used for data that is ordinal in nature. Each label is assigned
an integer value, and a variable is created to hold the encoded categories.
For instance, in the given case, we transform the days via ordinal
encoding and assign an integer to each day.
category_encoders is not a pre-installed library in Anaconda. One must
install it first, which is what the first line of the code does.
Once the install line is written, run the code, then restart the kernel and execute the
code again to obtain the encoded output.

- One hot encoding:
In case of nominal data, where no order relationship exists, integer encoding
is not appropriate because it would impose an order that does not exist. One hot
encoding is used instead. It creates new binary columns, one for each possible value
in the original data, which show the presence of that value. It works well unless
the variable takes on a large number of distinct values. It is important
that one stays consistent with these values so that it is easy to go back
to the original data.
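
The ordinal encoding described above can be sketched without category_encoders. The snippet below is only a minimal illustration that swaps in plain pandas; the small DataFrame and the chosen order of the days are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data that mimics the 'day' column of the tips dataset
df = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun", "Sat", "Thur"]})

# Assumed order of the categories; ordinal encoding must preserve this order
day_order = {"Thur": 1, "Fri": 2, "Sat": 3, "Sun": 4}

# Replace each label with its integer code
df["day_encoded"] = df["day"].map(day_order)
print(df)
```

Writing the mapping out by hand keeps the order explicit, which is the property that ordinal encoding has to retain.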

Example:
In the given input code we use the tips dataset and encode the categorical
variables ‘sex’ and ‘smoker’. We first import all necessary libraries and
load the dataset. We then look at the data for categorical parameters,
and at the label counts in those parameters. We use the
get_dummies function of the Pandas library for data manipulation. It can
be used to convert categories to indicator variables or dummies.

We get the following output: it shows a 0 when the condition in
question is false, and 1 when it is true.
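
As a rough sketch of this example, the snippet below applies get_dummies to a few hand-written rows. The rows are hypothetical stand-ins for the ‘sex’ and ‘smoker’ columns of the tips dataset, used so that the code runs without downloading anything.

```python
import pandas as pd

# Hypothetical rows that mimic the 'sex' and 'smoker' columns of the tips dataset
df = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "Yes"],
})

# Label counts for each categorical column
print(df["sex"].value_counts())
print(df["smoker"].value_counts())

# One binary indicator column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(df, columns=["sex", "smoker"], dtype=int)
print(dummies)
```

Each original column is replaced by one indicator column per category, for example ‘sex_Female’ and ‘sex_Male’.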

Supplementary Learning Material

Program: MCA
Specialization: Data Science
Semester: 2
Course Name: Python for Data Science
Course Code: 21VMT5S204
Unit Name: Outlier detection and Treatment

Unit 14:
Outlier Detection and Treatment

Overview:
An outlier is, by definition, ‘a person or thing situated away or detached from the
main body or system.’ The term is self-explanatory. In real life, we know when
something is different from the usual and is therefore not a reliable basis for drawing
conclusions. For instance, if 90 out of 100 students receive marks
between 60 and 80, then the other 10 students, who receive marks anywhere
between 0 and 45 or above 90, are called outliers because those datapoints are
far away from the main group of data.
Unit Outcomes:
- Introduction: What are outliers?
- Detecting outliers
- Visualisation of outliers
- Missing value detection and treatment
- Outlier treatment
- Project

Introduction:
When we analyse the data, or check the distribution that a given variable
follows, we often find that the data is skewed or does not follow normality
because of a few datapoints that are far from all the other
datapoints. These are called outliers. As an example, consider a
class of students pursuing their Masters in which 44 students are
aged between 21 and 24 and a 45th student is aged, say, 30. If
you drew the age distribution of all the students, the data would
be positively skewed because of the student aged 30. Had we drawn the
distribution with only the 44 students aged 21 to 24, it
would be much closer to normal. Here the 45th student is an outlier. An
outlier can arise for various reasons; in this case it is variability with
respect to age, since there is a considerable gap between 24 and 30.
These outliers cause problems in statistical analysis. They are abnormally
distant from the other values in the sample. As Data Scientists, it is at our
discretion to decide what is normal and what is abnormal based on the data.
Once that is decided, the abnormal datapoints are singled out; these datapoints
are called outliers.

Examples of outliers:
- When we talk about the revenue of a company, we measure it by the
sales of each type of product and its cost. Suppose product A yields 1200 lakhs,
product B yields 1400 lakhs, product C yields 1100 lakhs, product D
yields 1300 lakhs, and the best selling product, product E, yields 2100
lakhs. While evaluating the company’s performance, or the prospective
growth trends or patterns it must follow, product E will be considered
an outlier. If it is included, the calculations will reflect the general
increase in sales rather than the objective of increasing sales of the
other 4 products. This distortion might hinder the company’s actual
growth.

- Another example would be the average height of the people in a
country. Since the population size is huge, a sample might be selected
on the basis of some prerequisites. During evaluation, the heights that
are closer to the mean will be considered normal, whereas the heights
that lie far from the mean will be considered unusual, thus becoming
outliers in that sample.

Detecting Outliers:
1. The simplest way to detect outliers is by sorting the data. It is effective
as it will highlight all the unusual datapoints. Sorting in ascending or
descending order shows the highest and lowest values and will give you
a gist of how spread out the datapoints are with respect to the given
variable/parameter. However, this method doesn’t quantify the degree
of unusualness of an outlier.

2. Graphing your data is another easy way of identifying outliers.
Histograms, scatterplots and boxplots are amongst the most widely used plots
to identify outliers in the data.
Boxplots display dots/asterisks on the graph wherever the outliers lie.
Histograms show isolated bars on the far right or left, showing where most
of the data is and where the outlying data is.
The scatterplot shows all datapoints for bivariate data. The
datapoints that are normally located will be situated close to the
regression line, but the outliers will be far from the regression line since
they deviate more from it.

3. Using z-scores, one can identify outliers in the data. For a normal distribution,
the unusualness of an outlier can be quantified using its z-score, which
shows the number of standard deviations by which each
datapoint lies above or below the mean. A positive z-score ‘x’ indicates that an observation is ‘x’
standard deviations above the mean. A negative z-score ‘-x’ indicates that an
observation is ‘x’ standard deviations below the mean. A z-score of 0
represents no deviation from the mean. The farther the z-score is from 0, the more
unusual the datapoint is said to be. A standard cut-off is a z-score of
±3: observations with z-scores beyond ±3 are treated as extreme.
The z-score approach assumes that the observations are independent of each other and
that they are normally distributed. It is also assumed that the sample
size is greater than or equal to 30.

4. Creating outlier fences using the interquartile range (IQR) can also help
detect outliers. One can use the interquartile range, the quartile
values and an adjustment factor to find boundaries that separate
major and minor outliers from the rest of the data. This shows the unusualness of the
datapoints as compared to the overall distribution of datapoints. Major
outliers are extreme, minor ones are mild.
The IQR is the range between the 25th and 75th percentiles, i.e. the middle
50 percent of the data. The 1st quartile, the 3rd quartile and the IQR are used
to calculate the four outlier fences: lower outer, lower inner,
upper inner, and upper outer. Percentiles are more robust to extreme values than
measures such as the mean and standard deviation.

5. Outliers can also be found using hypothesis testing. We consider a
hypothesis test called Grubbs’ test for outliers, or the maximum
normed residual test. It is used to find a single outlier in a normally
distributed data set. It checks whether the minimum value or the maximum
value is an outlier. It assumes that the data is normally
distributed. To check whether the data is normally distributed, one can
perform normality tests like the Shapiro-Wilk test.
H0: All values are drawn from a single population that follows the same
normal distribution.
Against
H1: One value in the sample was not drawn from the same normally
distributed population as the other values.
The null hypothesis is rejected if the p-value is less than the level of significance,
and it can then be concluded that the value is an outlier.
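
The z-score method from point 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the marks are hypothetical (30 evenly spaced values between 60 and 80 plus one extreme value), and the cut-off of 3 is the conventional one mentioned above.

```python
import numpy as np

# Hypothetical sample: 30 ordinary marks between 60 and 80, plus one extreme value
marks = np.append(np.linspace(60, 80, 30), 120.0)

# z-score: number of standard deviations by which each value lies from the mean
z_scores = (marks - marks.mean()) / marks.std(ddof=1)

# Flag the values whose z-score lies beyond the cut-off of 3 in either direction
outliers = marks[np.abs(z_scores) > 3]
print(outliers)
```

Note that the extreme value itself inflates the mean and the standard deviation, which is one reason the method is usually applied only when the sample is reasonably large.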

Visualisation of Outliers:
As mentioned before, outliers can be detected by visualising them. They are
mostly visualised using histograms, scatterplots and boxplots.
For bivariate data we use scatterplots since they show the relation between two
variables. Let’s consider the ‘tips’ dataset. We check whether total_bill and tip
are correlated. Here, it is observable that the datapoints that are far off from
the regression line are ones with an unusual combination of total_bill and tip values. In this case,
the circled points are clearly the outliers, which means that they are unusual compared with
the rest of the datapoints.

For a single variable, we can use histograms or boxplots to evaluate whether a
dataset has outliers. For instance, in the same tips dataset, we take
the ‘total_bill’ variable and make a boxplot to see how many
outliers it has. According to the boxplot, the median total_bill value is
about 18, while there are multiple outliers that go up to about 50. All the outliers in the
boxplot are displayed as rings above or below the boxplot fences, depending
on where the outliers lie. In this case, most lie in the 40+ region.

Single-variable outlier detection can also be done using histograms. In case of
histograms, isolated bars on the far left or right indicate outliers. In general, it can be
said that outliers are points that fall more than 1.5 times the interquartile range
above the third quartile or below the first quartile.
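
The 1.5 times IQR rule can be written out directly. The sketch below uses hypothetical bill amounts rather than the real tips data, with two deliberately extreme values at the end.

```python
import pandas as pd

# Hypothetical bill amounts; the last two values are deliberately extreme
bills = pd.Series([10.3, 12.5, 13.4, 15.0, 16.9, 17.8,
                   18.4, 20.1, 21.7, 24.6, 44.3, 50.8])

q1 = bills.quantile(0.25)   # first quartile
q3 = bills.quantile(0.75)   # third quartile
iqr = q3 - q1               # interquartile range

# Fences placed 1.5 * IQR beyond the quartiles (the usual boxplot rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```

These fences are the same boundaries beyond which a boxplot draws individual points.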

In the tips dataset, when we use .hist() instead of .boxplot() we obtain a
histogram which shows the outliers at the extreme right of the data. The
distribution peaks between 10 and 20.

Missing Value Detection and Treatment


Missing values are entries for which there is no data; they appear as NaN values in the dataset.
This issue usually arises when no information is provided during the primary
data collection process. People can choose not to answer questions, resulting
in the cell/datapoint being blank. Such values arise when the data never existed,
or when it existed but was not collected. Missing values in pandas are called NA (not
available) values.
None is a singleton object in Python that can be used for missing data.
NaN stands for Not a Number; it is a special floating-point value recognised by all
systems that use the IEEE floating-point representation.
In pandas, both can be used interchangeably to indicate that the value of
a datapoint is missing. Multiple functions can be used to recognise and
replace these missing values.
The first step, however, is to find out whether there are null values in the
dataset. For this, we check the data using functions like isnull() and
notnull(). Both functions return Boolean values: isnull() returns True
where a value is null and False otherwise, while notnull() returns the opposite.
The .any() function checks whether any value is True over an axis.
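
A minimal sketch of these checks is given below. The DataFrame is hypothetical and only mimics a few columns of the kind discussed here; it contains both None and NaN to show that pandas treats the two alike.

```python
import numpy as np
import pandas as pd

# Hypothetical data; None and np.nan are both treated as missing by pandas
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "deck": ["C", None, None, "E"],
    "fare": [7.25, 71.28, 8.05, 26.55],
})

print(df.isnull())          # True where a value is missing
print(df.notnull())         # True where a value is present
print(df.isnull().any())    # per column: is at least one value missing?
print(df.isnull().sum())    # per column: how many values are missing?
```

Chaining .any() or .sum() after isnull() turns the cell-by-cell table into a per-column summary, which is usually easier to read.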

In the given example, we check if there are null values detected in any variable
of the ‘Titanic’ dataset. According to the output, there are missing values in the
age, embarked, deck, and embark_town variables of the dataset, while all
other variables don’t contain null values.

Figure 1

For column-wise evaluation, or to know which values in a particular column
are null:
In the given dataset, we check if there are null values in the ‘age’ variable of
‘titanic’ and then we print all the rows of ‘titanic’ in which ‘age’ is null.

The .isnull() function checks whether the values are null. The .notnull() function does the exact
opposite and checks whether the values are not null.
In the following example we check if values in the variable ‘deck’ are null or
not. If they are not null, the output prints the respective rows. According
to the data, there are 203 rows across 15 columns that have non-null values in
their ‘deck’ variable.
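
The column-wise checks above amount to Boolean filtering. The sketch below uses a small hypothetical DataFrame named titanic in place of the real dataset, so the row and column counts differ from the ones quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic a few columns of the titanic dataset
titanic = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [None, "C", None, "E"],
    "survived": [0, 1, 1, 0],
})

# Rows in which 'age' is missing
missing_age = titanic[titanic["age"].isnull()]

# Rows in which 'deck' is present
known_deck = titanic[titanic["deck"].notnull()]

print(missing_age)
print(known_deck.shape)   # (number of rows, number of columns)
```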

Once we have found out whether our dataset has null values or not, we move
on to decide what can be done with these rows or those particular datapoints:

1. Ignoring missing values: this can be done when the samples without
missing data are sufficient for analysis even if incomplete cases are not
considered.
2. Deleting rows or columns: the next easiest way to deal with null values is deleting a
particular column if it has more than 70-75 percent missing values, or deleting a
row if it has a null value for a particular characteristic. One has to be
cautious that deletion does not introduce bias into the data. One must also
remember that deletion causes loss of information, which might affect
the anticipated results during prediction. It does not work well if the
proportion of missing values is high. When only a few values are missing,
however, the process is robust and can be highly accurate.

We can use the .dropna() function to drop or delete all NA values. For
instance, in case of the titanic dataset, we first check the number of null
values in each specified variable (figure 1). Then we use the .dropna()
function and check again whether there are null values in the dataset. We can
notice that there are no null values, which means all rows with null
values have been dropped/deleted.

3. Assigning a unique category: for categorical data, there is a finite
number of possible values. In this case, we can assign another
class for missing values. This adds more information to the dataset and
will affect the variance. One hot encoding will help convert it from
categorical data to numeric data. Adding a new category eliminates the loss of
data, but it adds little variance, and adding
another feature might result in poor performance.
Similar to the .dropna() function is the .fillna() function, which enables us
to fill NA values with a value of the user’s choice.
In case of the titanic dataset, we can replace all NaN values of the
variable ‘deck’ with ‘C’. As a result, we see that all NA values are filled
with ‘C’, and when checking whether any variable has null values,
‘deck’ gives False instead of True.

4. Using algorithms that support missing values: K-nearest neighbour (KNN)
imputation estimates and replaces missing data by using the k-nearest
neighbour algorithm. For quantitative
attributes, the estimate can be the mean of the k nearest neighbours; for
qualitative attributes, it can be the most frequent value among the k nearest
neighbours. It works on the principle of a distance measure.
Another algorithm is RandomForest. It works on non-linear data as well
as categorical data. It also adapts to high variance and
bias, based on which it gives better results. It neglects correlation in the
data, and it can be very time consuming.
5. Replacing with Mean/Median/Mode: for numeric data, the
mean/median/mode can be used as a substitute for missing values. It
gives an approximation and, consequently, can add variance. Loss of data
is avoided in such a method, giving better results in comparison with
deletion of rows. Replacing values with these three measures
is also called leaking the data while training. It is a better approach
when the dataset is small and linear, but it works poorly compared
to the multiple imputations method.
Let’s take an example of numeric data in the ‘titanic’ dataset. We
observe that the age is numeric data and also has missing values. We
can use the mean of the age and replace all missing values with the
mean.

After using the .replace() function, the same output changes:

The .interpolate() function is also used to fill in the missing values of a
variable of a dataset.
For example, in the ‘titanic’ dataset, we use the linear method to estimate
the missing values. This method ignores the index and treats all values as
equally spaced. In the given dataset, we interpolate the ‘age’ variable
and fill missing values in the backward direction, with the maximum number of
consecutive NA values that can be filled set to 1. The direction can be
forward, backward, or both. However, the default is forward.

6. Predicting missing values: machine learning algorithms can be used to
predict the nulls using values that are not missing. This method has good
accuracy unless a high variance is expected from the missing value. It
gives us unbiased estimates for model parameters. Bias can arise not
only from the data being large but also when the conditioning set used
for a categorical variable is incomplete.
One can use the scikit-learn library to perform linear regression and KNN
imputation to find missing values.
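
The pandas-based treatments from the list above (deleting, filling with a chosen category, filling with the mean, and interpolating) can be sketched together. The DataFrame is a hypothetical stand-in for the ‘age’ and ‘deck’ columns of the titanic dataset, and .fillna() is used for the mean replacement here as one possible alternative to .replace().

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the 'age' and 'deck' columns of the titanic dataset
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, np.nan, np.nan, 40.0],
    "deck": ["C", None, "E", "C", None, "B"],
})

# Deleting: drop every row that contains at least one missing value
dropped = df.dropna()

# Chosen category: fill the missing 'deck' values with 'C'
deck_filled = df["deck"].fillna("C")

# Mean replacement: fill the missing 'age' values with the column mean
age_mean_filled = df["age"].fillna(df["age"].mean())

# Linear interpolation, working backwards and filling at most
# 1 consecutive missing value in each gap
age_interpolated = df["age"].interpolate(
    method="linear", limit=1, limit_direction="backward"
)

print(dropped)
print(deck_filled.tolist())
print(age_mean_filled.tolist())
print(age_interpolated.tolist())
```

With limit=1 and a backward direction, only the missing value closest to the next known value in each gap is filled, so longer gaps stay partly empty.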

Outlier treatment:
One way to handle outliers is trimming, i.e. removing the
outliers, although it is not always good practice because information is lost.
One can copy the normal values to another array and discard the remaining outliers.
Flooring and capping based on quantiles floors the values that lie
below the 10th percentile and caps the values that lie above the 90th percentile.
The values that lie above the 90th percentile are replaced by the value
of the 90th percentile, and the values that lie below the 10th percentile are
replaced by the value of the 10th percentile.
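
Flooring and capping can be sketched with quantile() and clip(). The values below are hypothetical and were chosen so that one observation is very small and one is very large.

```python
import pandas as pd

# Hypothetical values with one very small and one very large observation
values = pd.Series([1, 12, 13, 14, 15, 16, 17, 18, 19, 20, 95], dtype=float)

# The 10th and 90th percentile values act as the floor and the cap
floor = values.quantile(0.10)
cap = values.quantile(0.90)

# Values below the floor become the floor; values above the cap become the cap
capped = values.clip(lower=floor, upper=cap)
print(floor, cap)
print(capped.tolist())
```

Unlike trimming, this keeps every row, so the size of the dataset does not change.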

Project:
Let us consider the Boston Housing Prices dataset and perform exploratory
data analysis on the same.
You can easily find the data on the web, or from Kaggle-
https://www.kaggle.com/puxama/bostoncsv
It is a dataset that contains 15 columns in total with 506 data entries. It was
collected by the U.S Census Service. It shows the housing situation in the area
of Boston, Massachusetts. The columns are numeric in nature and discuss
various factors that might affect the housing prices and purchases in the city.
Description of each column (Source: Kaggle-
https://www.kaggle.com/prasadperera/the-boston-housing-dataset):

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's

Moving further, to perform Exploratory Data Analysis on any dataset, one must
follow the basic steps:
1. Decide the objective of the analysis. EDA is making sense of all the
numbers in the data. For instance, in the considered dataset, the median
value of owner-occupied homes will not make any sense if we do not
know what we are using that value for. Owners who occupy their houses are said
to be more invested in community concerns and are known for proper
maintenance. The value can be used to determine if the neighborhood will be
looked after regularly. This can be a vital factor while deciding whether
one wants to live in the locality or not.
2. Next is importing the required libraries and loading the data into a
dataframe.
3. Get maximum information from the dataset. Check the variables that
you have and the type of data that you are considering. Find the
structure of the entire dataset.
4. Once you have a fair idea about the type of data the dataset has and
what the same is trying to convey, you can move towards data cleaning.
Cleaning the data includes dealing with datapoints without entries, or
inconsistent data. It revolves around removing redundancy, unwanted
noise and data.
Get rid of unnecessary data columns. Remove the columns according to
the ease of use. Check the data for duplicate rows, missing rows, and
null values.
Data must be cleaned so that it is consistent and
has enough numerical datapoints to be evaluated with ease.
5. Detect outliers in the data. According to the data, one must decide if
they want to keep the outliers or delete outlier row data points
completely.
6. Outlier detection is followed by Data Visualization. Simple plots can be
made for univariate data, and scatter-plots, histograms, jointplots
etc. can be used to understand bivariate data.
7. After interpreting these graphs, charts and plots, answer the objective
questions that are decided in point 1.
8. Make inferences and conclusions based on various answers.

EDA on the Housing Price dataset:


1. For the given dataset we decide the objective, which is to see if
numerous factors have an effect on the purchase of a home in
Boston.

2. We import the required libraries and the dataset. Here, we have
imported all the libraries that we have covered so far, which are
Pandas, Numpy, Seaborn, and matplotlib.
We also import the csv file. Enter the path followed by the
filename. One must note that when you ‘copy address as text’ for
the pathname, you might get an error:
“ (unicode error) 'unicodeescape' codec can't decode
bytes in position 2-3: truncated \UXXXXXXXX escape ”
This happens because the copied path uses backward slashes, and Python
reads \U as the start of an escape sequence.
To resolve this error, type the pathname with forward slashes
instead of backward ones, or prefix the string with r to make it a raw string.

3. We look at the type of data that it has. It can be noticed that all
columns are of numeric type; some are integer values, but most
others are floating point type values.

4. We now move towards data cleaning. We check if there are null
values or missing values in any of the dataframe columns. It can be
noticed that none of the columns have null values, which implies
that there are no rows/data entries with null values that need to be
treated. The data is already clean.

5. We detect outliers in the data. Check for variables in which
outliers would result in an inappropriate output. Let’s check the per
capita crime rate, the average number of rooms per house and the
median value of owner-occupied homes.
We construct boxplots to check if they contain outliers. It is
observed that these columns have outliers.

The important question here is whether the outliers must
be deleted or whether they should be part of the visualisation.
For ease of visualisation, we can keep the ‘medv’ and ‘rm’ values
as they are, and delete outliers exceeding 35 in the ‘crim’ rate since there are
not a lot of datapoints beyond that mark.


6. Once we’re done treating outliers, we move further to the stage of
data visualisation.

We check whether the average number of rooms per dwelling has
an impact on the purchase of homes.
We create a jointplot for the same.

The jointplot evidently shows that there is a correlation between the number
of rooms in each house and the houses that are purchased, despite a few
outliers that would say otherwise.

Similarly, comparisons can be made between a few other variables,
like ‘lstat’ and ‘medv’.


We make a pairplot for the given dataset.
One can get a better view and understanding of the data and
draw proper conclusions or make corrections as they see
fit.
Input:

Output:

For all the variables, the pairplot will make bivariate charts. This
will make it easier for the analyst to figure out which variables are
correlated instead of trying to find the correlation between
pairs of variables one at a time.

7. We finally conclude what we found via visualisations. It can be
noticed that the ‘rad’ variable takes only a few discrete values and so behaves
like a categorical variable, while the rest are numeric. There is a relation
between ‘lstat’ and ‘medv’:
when ‘lstat’ is on the lower end, ‘medv’ is on the higher end and
vice-versa, indicating that there is an inverse relationship between
the two. Similar observations can be made and further analysis of
the data can be performed.
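
The non-plotting steps of this walkthrough can be sketched as follows. This is only an illustration under stated assumptions: the five rows are made-up values that stand in for the real CSV so that the code runs on its own, the file path in the comment is hypothetical, and only four of the columns are included.

```python
import io
import pandas as pd

# A real file would be loaded with a raw string or with forward slashes, e.g.
#   pd.read_csv(r"C:\path\to\boston.csv")   (hypothetical path)
# Here a few made-up rows stand in for the CSV so that the sketch is self-contained.
csv_text = """crim,rm,lstat,medv
0.02,6.5,5.0,30.1
0.30,6.0,9.5,22.4
4.10,5.8,14.2,18.9
38.40,5.3,30.6,8.8
0.09,7.1,4.1,35.2
"""
boston = pd.read_csv(io.StringIO(csv_text))

print(boston.dtypes)            # step 3: type of each column
print(boston.isnull().sum())    # step 4: missing values per column

# Step 5: drop the rows whose crime rate exceeds 35
boston_trimmed = boston[boston["crim"] <= 35]

# Step 7: a negative correlation indicates an inverse relationship
corr = boston_trimmed["lstat"].corr(boston_trimmed["medv"])
print(corr)
```

On the real dataset the same calls apply unchanged once the DataFrame is loaded from the CSV file; the numbers printed here say nothing about the actual data.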
