
MGM’s

Jawaharlal Nehru Engineering College Aurangabad


MGM University, Aurangabad

Department of Computer Science & Engineering

LAB MANUAL

Program (UG/PG) : UG

Year : Second Year

Semester : IV

Course Code : 20UCC401B

Course Title : Engineering Statistics (Tutorial)

Prepared By : Mr. Nitin V Tawar

Department of Computer Science & Engineering


2021-22

FOREWORD

It is our great pleasure to present this laboratory manual for Second Year
engineering students for the subject of Engineering Statistics.

As a student, many of you may have several questions in your mind regarding the subject, and this manual
attempts to answer exactly those questions.

As you may be aware, MGM has already been awarded ISO 9001:2015 and ISO 14001:2015 certification,
and it is our endeavour to technically equip our students by taking advantage of the procedural aspects of
ISO certification.

Faculty members are also advised that covering these aspects at the initial stage itself will greatly relieve
them in future, as much of the load will be taken care of by the enthusiastic energies of the students once
they are conceptually clear.

Dr. H. H. Shinde
Principal

LABORATORY MANUAL CONTENTS

This manual is intended for the Second Year students of Computer Science and Engineering in the subject
of Engineering Statistics. This manual contains tutorial sessions related to Engineering Statistics, covering
various aspects related to the subject to enhance understanding.

Students are advised to thoroughly go through this manual, and not only the topics mentioned in the
syllabus, as practical aspects are the key to understanding and conceptual visualization of the theoretical
aspects covered in the books.

Good Luck for your enjoyable tutorial sessions.

Prof. Vijaya B. Musande


Subject Teacher HOD

LIST OF EXPERIMENTS

Course Code: 20UCC401B


Course Title: Engineering Statistics

Sr. No. — Name of the Experiment

1. Applications of Statistics in real life and introduction to useful Python libraries.

2. Write a Python program to find basic descriptive statistics using measures of central tendency and measures of dispersion.

3. Write a Python program to find basic descriptive statistics on a real-life Titanic dataset.

4. Write a Python program to understand and demonstrate the use of various types of probabilities including Addition and Multiplication rule of Probability, Conditional probability.

5. Write a Python program for Bayes Theorem and demonstrate a real life example using a sample dataset.

6. Write a Python program to understand various probability distributions with the help of a sample dataset.

7. Write a program for various types of correlation. Plot the correlation plot on a dataset and visualize, giving an overview of relationships among data, on iris data.

8. Write a Python program for linear regression.

9. Write a Python program for multiple regression.

10. Write a Python program for logistic regression.

DOs and DON’Ts in Laboratory:
1. Make entry in the Log Book as soon as you enter the Laboratory.
2. All the students should sit according to their roll numbers starting from their left
to right.
3. All the students are supposed to enter the terminal number in the log book.
4. Do not change the terminal on which you are working.
5. All the students are expected to get at least the algorithm of the program/concept
to be implemented.
6. Strictly observe the instructions given by the teacher/Lab Instructor.
7. Do not disturb machine Hardware / Software Setup.

Instruction for Laboratory Teachers:


1. Submission related to whatever lab work has been completed should be done
during the next lab session along with signing the index.
2. The promptness of submission should be encouraged by way of marking and
evaluation patterns that will benefit the sincere students.
3. Continuous assessment in the prescribed format must be followed.

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 1

Aim: Applications of Statistics in real life and introduction to useful Python libraries.

Theory:

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of
masses of numerical data.

Statistics refers both to numerical data and to the field of math that deals with the collection, tabulation and
interpretation of numerical data. An example of statistics is a report of numbers saying how many followers of
each game there are in a particular country. Statistics makes a set of data more easily understandable. It is a
branch of mathematics that analyses data and then uses it to solve different types of problems related to the data.

Consider a real-life example of statistics: to learn the mean of the marks obtained by the students in a class
whose strength is 60, we compute the average value, and that average value is the statistic of the marks obtained.

Suppose you want to find out how many people are employed in a town. The town is settled with 10 lakh
people, so we will take a sample of 1,000 people for analysis, and based on that sample, we will build the
data, which is the statistic.

Statistics is a set of equations that allows us to solve complex problems. These statistical problems in real
life are usually based on facts and data.

Types of Statistics-

1. Descriptive Statistics
In descriptive statistics, the information is summarized from the given observations. The
summarization is done from a sample of the population using parameters such as the mean or
standard deviation. Thus, it gives a numerical or graphical synopsis of the information and is used only for
summarizing the observed data. Descriptive statistics are applied to data which are already known.

2. Inferential Statistics
Inferential Statistics makes inferences and predictions about extensive data by considering a sample data
from the original data. It uses probability to reach conclusions.
The best real-world example of “Inferential Statistics” is, predicting the amount of rainfall we get in
the next month by Weather Forecast.

What is Statistical Problem?

The statistical problems in real life consist of sampling, inferential statistics, probability, and estimation, enabling a
team to develop effective projects in a problem-solving frame. For instance, car manufacturers looking to paint
their cars might involve a wide range of people, including supervisors, painters, and paint representatives, to
collect the data necessary for the whole process and make it successful.

The current situation of covid-19 is the best example of statistical problems where we need to determine-

• Corona positive cases.
• Recovered people after treatment.
• The number of people who recovered at home.
• People who got vaccinated or not.
• Which vaccinations are best?
• The number of dead people in each village, city, state, and country.
• Side effects of various vaccines.

What is the use of statistics in real life?

Statistics is used for the graphical representation of the collected data. Statistics can compare information
through the median, mean, and mode. Therefore, statistics concepts can easily be applied to real life, such as for
calculating the time needed to get ready for office, how much money is required to travel to work in a month,
counting a gym diet for a week, in education, and much more.

Besides this, statistics can be utilized for managing daily routines so that you can work efficiently.

Real Life Applications of Statistics-


1. Government-

The importance of statistics in government lies in making judgments about health, populations,
education, and much more. Statistics may help the government check which vaccine is effective against the
novel coronavirus for its citizens, and what the progress reports after vaccination show, i.e., whether the vaccines
are useful or not. For instance, the government can analyze in which areas vaccination is done and know where
they need to target, or where cases are increasing day by day, with the help of polls. The governments of different
nations are using statistical data to vaccinate their people. As per the data, the availability of vaccines can also be
monitored.

2. Weather Forecast

Statistics play a crucial role in weather forecasting. The computers used in weather forecasting are based on a set
of statistical functions. All these functions compare the current weather conditions with the pre-recorded
seasons and conditions. For weather forecasting, predictors use the concepts of probability and statistics. They
employ several concepts and tools to predict forecasts with maximum accuracy.

3. Prediction

The data help us make predictions about something that is going to happen in the future. Based on what we
face in our daily lives, we make predictions. How accurate these predictions are will depend on many factors. When
we make a prediction, we take into account the external or internal factors that may affect our future. Doctors,
engineers, artists, and practitioners all use statistics to make predictions about future events.

For example, doctors use statistics to understand the future course of a disease. They can predict the magnitude of
the flu in each winter season through the use of data.

Engineers use statistics to estimate the success of their ongoing project, and they also use the data to evaluate
how long it will take to complete a project.

For example, now when all the schools and colleges are closed, students are studying via online mode, and it is
difficult for the children to study online. In this case, teachers are working hard and trying to teach the children
efficiently, and with the help of statistics, they can analyze the performance of the students and can guide them
properly.

4. Financial Market

The financial market completely relies on statistical data. All stock prices are calculated with the help of
statistics. Statistics also help the investor decide whether to invest in a particular stock, and help
corporations manage their finances for long-term business.

For instance, suppose you want to buy the shares of a company, but you have no idea whether the company is
good or bad. Different types of statistical calculations, such as the price-to-book ratio, price-to-sales ratio, and
price-to-earnings-growth ratio, will help you to invest in the right stocks.

5. Business Statistics

Each large organization uses business statistics and utilizes various data analysis tools. For instance,
approximating the probability and seeing where sales can be headed in the future. Several tools are used for
business statistics, which are built on the basis of the mean, median, and mode, the bell curve, and bar graphs,
and basic probability. These can be employed for research problems related to employees, products, customer
service, and much more.

Besides this, statistics are widely used for consumer goods products, because consumer goods are daily-use
products. Businesses use statistics to check which consumer goods are available in a store and which are not.
They also use stats to find out which stores need the consumer goods and when to ship the products. Proper
statistical decisions are helping businesses make massive revenue on consumer goods.

6. Machine Learning

Statistics are utilized for quantifying the uncertainty of the estimated skills within the machine learning
models. These uncertainties are defined with the help of confidence intervals and tolerance intervals.

Statistics can be used for machine learning in various ways, such as for:

• Problem Framing
• Understanding the data
• Data Cleaning
• Selection of data
• Data Preparation
• Model Evaluation
• Model Configuration
• Selection of Model
• Model Presentation
• Model Prediction

Role of Python
Data science and machine learning have become essential in many fields of science and technology. A
necessary aspect of working with data is the ability to describe, summarize, and represent data
visually. Python statistics libraries are comprehensive, popular, and widely used tools that will assist you in
working with data.

Choosing Python Statistics Libraries


There are many Python statistics libraries popular for statistical analysis as follows-

• Python’s statistics is a built-in Python library for descriptive statistics. You can use it if your datasets
are not too large or if you can’t rely on importing other libraries (see the short sketch after this list).

• NumPy is a third-party library for numerical computing, optimized for working with single- and multi-
dimensional arrays. Its primary type is the array type called ndarray. This library contains
many routines for statistical analysis.

• SciPy is a third-party library for scientific computing based on NumPy. It offers additional
functionality compared to NumPy, including scipy.stats for statistical analysis.

• Pandas is a third-party library for numerical computing based on NumPy. It excels in handling labeled
one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.

• Matplotlib is a third-party library for data visualization. It works well in combination with NumPy,
SciPy, and Pandas.
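Of these, only the statistics module ships with Python itself. Below is a minimal sketch of its basic descriptive functions; the sample speed values are illustrative, not taken from any dataset:

# a minimal sketch of Python's built-in statistics module
import statistics

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

print(statistics.mean(speed))    # arithmetic mean
print(statistics.median(speed))  # middle value of the sorted data
print(statistics.mode(speed))    # most frequent value
print(statistics.stdev(speed))   # sample standard deviation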

numpy Module

NumPy stands for Numerical Python. It is a Python library used for working with arrays. In Python, we
use lists for the purpose of arrays, but they are slow to process. The NumPy array is a powerful N-dimensional array
object used in statistical analysis, data visualization, linear algebra, Fourier transforms, and random
number generation. It provides an array object much faster than traditional Python lists.

NumPy is a Python library that supports multi-dimensional arrays & matrices and offers a wide range of
mathematical functions to operate on the NumPy arrays & matrices.

It is one of the most fundamental libraries for scientific computation.

Example:
# importing numpy module
import numpy as np

# creating a list
list1 = [1, 2, 3, 4]

# creating a numpy array from the list
sample_array = np.array(list1)

print("List in python : ", list1)

print("Numpy Array in python :", sample_array)

# print shape of the array
print("Shape of the array :", sample_array.shape)

# display data type
print("Data type of the array :", sample_array.dtype)

# create a range of equally spaced values
a = np.arange(1, 20, 2)
print(a)

pandas Module

Pandas is an open-source library that is built on top of the NumPy library. It is a Python package that offers
various data structures and operations for manipulating numerical data and time series. It is mainly popular because
it makes importing and analyzing data much easier. Pandas is fast and offers high performance & productivity for users.

Example-
import pandas as pd

# Create a Dictionary of series
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

# Create a DataFrame
df = pd.DataFrame(d)

print("Mean Values in the Distribution")
print(df.mean(numeric_only=True))

print("Median Values in the Distribution")
print(df.median(numeric_only=True))
MATPLOTLIB
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib consists of
several plots like line, bar, scatter, histogram etc. We can also plot various functions using matplotlib which
makes it easier for us to analyze the functions by looking at the graphs. Plots help to understand trends,
patterns, and to make correlations. They’re typically instruments for reasoning about quantitative
information.

matplotlib can plot a huge amount of data on a single plot and make it look simple and compact.
Matplotlib is a Python package for 2D plotting and the matplotlib.pyplot sub-module contains many plotting
functions to create various kinds of plots.

PYPLOT:

Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported under the plt alias:

import matplotlib.pyplot as plt

Now the Pyplot package can be referred to as plt.

Basic Plotting

Procedure

The general procedure to create a 2D line plot is:

1. Create a sequence of x values.

2. Create a sequence of y values.

3. Enter plt.plot(x,y,[fmt],**kwargs) where [fmt] is an (optional) format string and **kwargs are (optional)
keyword arguments specifying line properties of the plot.

4. Use pyplot functions to add features to the figure such as a title, legend, grid lines, etc.

5. Enter plt.show() to display the resulting figure.

1. Draw a line in a diagram

# importing matplotlib module

from matplotlib import pyplot as plt

# x-axis values

xpoints = [5, 2, 9, 4, 7]

# Y-axis values

ypoints = [10, 5, 8, 4, 2]

# Function to plot

plt.plot(xpoints, ypoints)

# function to show the plot

plt.show()

2. Draw two points in the diagram, one at position (1, 3) and one at position (8, 10)

# importing matplotlib module

from matplotlib import pyplot as plt

# x-axis values of the two points

xpoints = (1, 8)

# y-axis values of the two points

ypoints = (3, 10)

# plot the points only (the 'o' format string draws markers, no line)

plt.plot(xpoints, ypoints, 'o')

# function to show the plot

plt.show()

3. Draw a line in a diagram from position (1, 3) to (2, 8) then to (6, 1) and finally to position (8, 10)

# importing matplotlib module

from matplotlib import pyplot as plt

# x-axis values

xpoints = (1, 2, 6, 8)

# y-axis values

ypoints = (3, 8, 1, 10)

# plot a line through the points (no format string, so a line is drawn)

plt.plot(xpoints, ypoints)

# function to show the plot

plt.show()

MARKERS:

You can use the keyword argument marker to emphasize each point with a specified marker:

Mark each point with a circle.

import matplotlib.pyplot as plt


ypoints =(3, 8, 1, 10)

plt.plot(ypoints, marker = 'o')


plt.show()

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 2

Aim: Understanding and demonstrating measures of central tendency & dispersion using Python libraries and
with the help of a Loan dataset.

Theory:

Descriptive Statistics is the building block of data science. Advanced analytics is often incomplete without
analyzing descriptive statistics of the key metrics. In simple terms, descriptive statistics can be defined as the
measures that summarize a given data, and these measures can be broken down further into the measures of
central tendency and the measures of dispersion.

Measures of central tendency include mean, median, and the mode, while the measures of variability include
standard deviation, variance, and the inter-quartile range. In this guide, you will learn how to compute these
measures of descriptive statistics and use them to interpret the data.

The mean, median, and mode are three metrics that are commonly used to describe the center of a
dataset.

Here’s a quick definition of each metric:

• Mean: The average value in a dataset.

• Median: The middle value in a dataset.

• Mode: The most frequently occurring value(s) in a dataset.

Real life Use Cases of Measures of Central Tendency

Individuals and companies use these metrics all the time in different fields to gain a better understanding of
datasets. The following examples explain how the mean, median, and mode are used in different real life
scenarios.

Example 1: Mean, Median, & Mode in Healthcare

The mean, median, and mode are widely used by insurance analysts and actuaries in the healthcare industry.

For example:

• Mean: Insurance analysts often calculate the mean age of the individuals they provide insurance for so
they can know the average age of their customers.

• Median: Actuaries often calculate the median amount spent on healthcare each year by individuals so
they can know how much insurance they need to be able to provide to individuals.

• Mode: Actuaries also calculate the mode of their customers’ ages (the most commonly occurring age) so they
can know which age group uses their insurance the most.

Example 2: Mean, Median, & Mode in Real Estate

The mean, median, and mode are also used often by real estate agents.

For example:

• Mean: Real estate agents calculate the mean price of houses in a particular area so they can inform their
clients of what they can expect to spend on a house.

• Median: Real estate agents also calculate the median price of houses to gain a better idea of the
“typical” home price, since the median is less influenced by outliers (like multi-million dollar homes)
compared to the mean.

• Mode: Real estate agents also calculate the mode of the number of bedrooms per house so they can
inform their clients on how many bedrooms they can expect to have in houses in a particular area.

Mean, Median, and Mode

What can we learn from looking at a group of numbers?

In Machine Learning (and in mathematics) there are often three values that interest us:

Mean - The average value

Median - The midpoint value

Mode - The most common value

Example: We have registered the speed of 13 cars:

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

What is the average, the middle, or the most common speed value?

Mean:-

The mean value is the average value.

To calculate the mean, find the sum of all values, and divide the sum by the number of values:

(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
The NumPy module has a method for this.

Example

Use the NumPy mean() method to find the average speed:

import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)

print(x)

Median:-

The median value is the value in the middle, after you have sorted all the values:

77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111

It is important that the numbers are sorted before you can find the median.

The NumPy module has a method for this:

Example

Use the NumPy median() method to find the middle value:

import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

If there are two numbers in the middle, divide the sum of those numbers by two.

77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103

(86 + 87) / 2 = 86.5

Example

Using the NumPy module:

import numpy

speed = [99,86,87,88,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

Mode:-

The Mode value is the value that appears the most number of times:

99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 = 86

The SciPy module has a method for this.

Example

Use the SciPy mode() method to find the number that appears the most:

from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = stats.mode(speed)

print(x)

Standard Deviation & Variance

What is Standard Deviation?

Standard deviation is a number that describes how spread out the values are.

A low standard deviation means that most of the numbers are close to the mean (average) value.

A high standard deviation means that the values are spread out over a wider range.

Example: This time we have registered the speed of 7 cars:

speed = [86,87,88,86,87,85,86]

The standard deviation is:

0.9

Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.

Let us do the same with a selection of numbers with a wider range:

speed = [32,111,138,28,59,77,97]

The standard deviation is:

37.85

Meaning that most of the values are within the range of 37.85 from the mean value, which is 77.4.

As you can see, a higher standard deviation indicates that the values are spread out over a wider range.

The NumPy module has a method to calculate the standard deviation:

Example:-

Use the NumPy std() method to find the standard deviation:

import numpy

speed = [86,87,88,86,87,85,86]

x = numpy.std(speed)

print(x)

Example

import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)

Variance:-

Variance is another number that indicates how spread out the values are.

In fact, if you take the square root of the variance, you get the standard deviation!

Or the other way around, if you multiply the standard deviation by itself, you get the variance!

To calculate the variance you have to do as follows:

1. Find the mean:

(32+111+138+28+59+77+97) / 7 = 77.4

2. For each value: find the difference from the mean:

32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = - 0.4
97 - 77.4 = 19.6

3. For each difference: find the square value:

(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16

4. The variance is the average of these squared differences:

(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.2

Luckily, NumPy has a method to calculate the variance:

Example:-

Use the NumPy var() method to find the variance:

import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.var(speed)

print(x)

Standard Deviation

As we have learned, the formula to find the standard deviation is the square root of the variance:

√1432.25 = 37.85

Or, as in the example from before, use NumPy to calculate the standard deviation:

Example
Use the NumPy std() method to find the standard deviation:

import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)

Symbols

Standard Deviation is often represented by the symbol Sigma: σ

Variance is often represented by the symbol Sigma Squared: σ²

Graphical representation of descriptive statistics using Box-Plot

A statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population. A
graph can be a more effective way of presenting data than a mass of numbers because we can see where data
clusters and where there are only a few data values. Newspapers and the Internet use graphs to show trends and
to enable readers to compare facts and figures quickly. Statisticians often graph data first to get a picture of the
data. Then, more formal tools may be applied.

Some of the types of graphs that are used to summarize and organize data are the scatter plot, the bar graph,
the histogram, the frequency polygon (a type of broken line graph), the pie chart, and the box plot.

BAR GRAPH

If you want to display relationships between data in categories, you can make a bar graph.

PIE CHART

A pie chart would show you how categories in your data relate to the whole set.

SCATTER PLOT

Scatter plots are a good way to display data points.
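Since the sample figures for these plots are not reproduced here, the following is a small sketch of how each of the three can be drawn with matplotlib.pyplot; the data values are illustrative:

# importing matplotlib module
from matplotlib import pyplot as plt

categories = ['A', 'B', 'C', 'D']  # illustrative category labels
values = [23, 45, 12, 30]          # illustrative counts

# bar graph: relationships between data in categories
plt.bar(categories, values)
plt.show()

# pie chart: how categories relate to the whole set
plt.pie(values, labels=categories)
plt.show()

# scatter plot: display individual data points
x = [5, 7, 8, 7, 2, 17, 2, 9]
y = [99, 86, 87, 88, 111, 86, 103, 87]
plt.scatter(x, y)
plt.show()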

HISTOGRAM

To create a histogram, the first step is to create bins of the ranges, then distribute the whole range of values
into a series of intervals, and count the values which fall into each of the intervals. Bins are clearly identified
as consecutive, non-overlapping intervals of variables. The matplotlib.pyplot.hist() function is used to compute
and create a histogram of x.

Example Code:

from matplotlib import pyplot as plt

import numpy as np

# Creating dataset

a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# Creating histogram

fig, ax = plt.subplots(figsize =(10, 7))

ax.hist(a, bins = [0, 25, 50, 75, 100])

# Show plot

plt.show()

BOX PLOT

A box plot is a visual representation of groups of numerical data through their quartiles. A box plot is
also used to detect outliers in a data set. It captures the summary of the data efficiently with a
simple box and whiskers and allows us to compare easily across groups. A box plot summarizes sample data
using the 25th, 50th and 75th percentiles. These percentiles are also known as the lower quartile, median and
upper quartile.
A box plot consists of 5 things.
• Minimum
• First Quartile or 25%
• Median (Second Quartile) or 50%
• Third Quartile or 75%
• Maximum

A box plot is also known as a whisker plot. In the box plot, a box is created from the first quartile to the third
quartile; a vertical line also goes through the box at the median. Here the x-axis denotes the data to
be plotted while the y-axis shows the frequency distribution.
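The five numbers themselves can be computed before drawing the plot. A small sketch using NumPy's percentile function, reusing the illustrative array from the histogram example above:

import numpy as np

a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

# the five numbers a box plot is built from
print("Minimum :", a.min())
print("First Quartile (25%) :", np.percentile(a, 25))
print("Median (50%) :", np.percentile(a, 50))
print("Third Quartile (75%) :", np.percentile(a, 75))
print("Maximum :", a.max())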

Draw the box plot with Pandas & Seaborn:

Box Plot in Seaborn is used to draw a box plot to show distributions with respect to categories. The
seaborn.boxplot() function is used for this. Use the "orient" parameter for the orientation of each numeric variable.

#Example Code

import seaborn as sb

import pandas as pd

import matplotlib.pyplot as plt

# Load data from a CSV file into a Pandas DataFrame:


df = pd.read_csv("Cricketers.csv")

# plotting box plot

# using the orient parameter for orientation of each numeric variable

sb.boxplot( data = df, orient="h")

# display

plt.show()

OUTPUT:

REAL LIFE APPLICATION WITH LOAN DATASET:

We will be using fictitious data of loan applicants containing 13 variables, including Loan_ID, Age, Gender,
Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount,
Loan_Amount_Term, Credit_History, Property_Area, Loan_Status

#Python Program

import pandas as pd

import numpy as np

import statistics as st

# Load the data

df = pd.read_csv("loan_data_set.csv")

print(df.shape)
print(df.info())

#Find mean

#below function gives the mean of each numerical column in the dataset

#we can infer the average age of the applicants, the average annual income and the average tenure of loans

print(df.mean(numeric_only=True))

#to find mean of a specific column

print(df["Age"].mean())

#find median

print(df.median(numeric_only=True))

#From the output, we can infer that the median age of the applicants, the median annual income, and the
median tenure of loans

#to find median of a specific column

print(df["Age"].median())

#Mode- only central tendency measure that can be used with categorical variables,

#It gives the mode of each numerical as well as categorical column

print(df.mode())

#The output above shows that most of the applicants are married, as depicted by the 'Marital_status' value of
#"Yes". Similar interpretation could be done for the other categorical variables. For numerical variables, the
#mode value represents the value that occurs most frequently.

#Standard deviation

print(df.std(numeric_only=True))

#It gives the standard deviation of all the columns

#In the output, the standard deviation of the variable 'Income' is much higher than that of the variable
#'Dependents'.

#Variance – gives variance of each numerical column

print(df.var(numeric_only=True))

#to get all the measures of all the numerical columns in one step, we can use describe function which
#summarises everything including measures of central tendency and dispersion

print(df.describe())

# to get all the measures of all the columns (including categorical) in one step, we can use the describe function
# which summarizes everything including measures of central tendency and dispersion

print(df.describe(include='all'))

#to print box plot for each numerical column

import seaborn as sb

import matplotlib.pyplot as plt

# plotting box plot

# using the orient parameter for orientation of each numeric variable

sb.boxplot( data = df, orient="h")

# display

plt.show()

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 3

Aim: Python program for demonstrating measures of central tendency & dispersion on Titanic Dataset

Theory:

Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during
her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and
crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for
the passengers and crew. Although there was some element of luck involved in surviving the sinking, some
groups of people were more likely to survive than others, such as women, children, and the upper-class.

We are using the Titanic data set, stored as CSV. The data consists of the following data columns:

PassengerId: Id of every passenger.

Survived: This feature has values 0 and 1: 0 for not survived and 1 for survived.

Pclass: There are 3 classes: Class 1, Class 2 and Class 3.

Name: Name of passenger.

Sex: Gender of passenger.

Age: Age of passenger.

SibSp: Indicates whether the passenger has siblings or a spouse aboard.

Parch: Indicates whether the passenger is alone or has family aboard.

Ticket: Ticket number of passenger.

Fare: Indicating the fare.

Cabin: The cabin of passenger.

Embarked: The embarked category

We are going to study the data set to understand various descriptive properties of the columns such as Age. We
are going to calculate the summary statistics for the data set which will contain the measures of central
tendency and the measures of dispersion. Different statistics are available and can be applied to columns with
numerical data. Operations in general exclude missing data and operate across rows by default.

PYTHON PROGRAM

import pandas as pd

import numpy as np

import seaborn as sb

import matplotlib.pyplot as plt

#Loading the data set

titanic = pd.read_csv("titanic.csv")

#print top 5 rows of the dataset

print(titanic.head())

#Average (mean) age of Titanic passengers

avg_age= titanic["Age"].mean()

print(avg_age)

#Median age and median ticket fare price of Titanic passengers

print(titanic[["Age", "Fare"]].median())

#All statistical measures for multiple numerical columns with one function

print(titanic[["Age", "Fare"]].describe())

#aggregating statistics grouped by category

#Gender wise average age of Titanic passengers

print(titanic[["Gender", "Age"]].groupby("Gender").mean())

#Gender wise average of all numerical columns

print(titanic.groupby("Gender").mean(numeric_only=True))

# mean ticket fare price for each of the Gender and cabin class combinations

print(titanic.groupby(["Gender", "Pclass"])["Fare"].mean())

# Count number of records by category

#Number of passenger in each cabin class

print(titanic["Pclass"].value_counts())

#to view top 5 rows of the data

titanic.head()

# the size of this data frame(no of rows & columns)

print(titanic.shape)

#to check the data types of each column

print(titanic.dtypes)

#in the output the int64 and float64 are numeric types and object is the string type.

#to get detailed info about the dataset

print(titanic.info())

#to get description/summary of the columns in the dataset

print(titanic.describe())

#to check for the total number of null values in the dataset

print(titanic.isnull().sum())

#Above code displays missing values in each column; this is important to check, as missing values may create
#problems in analysis and calculations

# the values of Survived column are either 0 or 1,

#where 0 represents that the passenger is not survived while 1 says that they are survived.

#Now in order to find out the number of each of the two, we are going to employ the groupby() method

survived_count = titanic.groupby('Survived')['Survived'].count()

print(survived_count)

#Above code Group the data frame by values in Survived column, and count the number of #occurrences of
each group

# Based on the output, we can see that there are 549 people who did not survive.

# find out the number of survived persons based on their Gender

survived_Gender = titanic.groupby('Gender')['Survived'].sum()

print(survived_Gender)

# find out the distribution of ticket classes

pclass_count = titanic.groupby('Pclass')['Pclass'].count()
print(pclass_count)

# find out the distribution of Gender (how many male, female & children)

Gender_count = titanic.groupby('Gender')['Gender'].count()

print(Gender_count)

#find out the age distribution

# But Age column contains 177 missing values out of 891 data in total.

#We need to remove those NaNs (Null values) first.

ages = titanic[titanic['Age'].notnull()]['Age'].values

#It retrieves all non-NaN age values and then store the result to ages Numpy array

ages_hist = np.histogram(ages, bins=[0,10,20,30,40,50,60,70,80,90])

print(ages_hist)

ages_hist_labels = ['0–10', '11–20', '21–30', '31–40', '41–50', '51–60', '61–70', '71–80', '81–90']

plt.figure(figsize=(7,7))

plt.title('Age distribution')

plt.bar(ages_hist_labels, ages_hist[0])

plt.xlabel('Age')

plt.ylabel('No of passenger')

for i, bin in zip(ages_hist[0], range(9)):
    plt.text(bin, i+3, str(int(i)), fontsize=12, horizontalalignment='center', verticalalignment='center')

plt.show()

#find out fare distribution

print(titanic['Fare'].describe())

#to print box plot for each numerical column

# plotting box plot

# using the orient parameter for orientation of each numeric variable

sb.boxplot( data = titanic, orient="h")

# display
plt.show()

OUTPUT

ALL COMMANDS OUTPUT WITH GRAPH

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 4

Aim: Understand and demonstrate the use of various types of probabilities including Addition and
Multiplication rule of Probability, Conditional probability using Python.

Theory:

Introduction

What is probability?
The likelihood of the occurrence of any event can be called Probability.

Application of probability
Some of the applications of probability are predicting the outcome of:

• Flipping a coin.
• Choosing a card from the deck.
• Throwing a die.
• Pulling a green candy from a bag of red candies.
• Winning a lottery with a 1 in many millions chance.

Probability theory is widely used in areas of study such as statistics, finance, gambling, artificial
intelligence, machine learning, computer science, game theory, and philosophy.

Event
An event is an outcome of the experiment.
Getting a head or a tail after tossing the coin can be considered an event in our experiment.
Sample Space
It is the set of all the possible outcomes.
In our case of tossing a coin, the number of possible outcomes can be, S= {H,T}
Probability
It is a value that denotes the chances of occurrence of some event.
Let “n” be the total number of possible outcomes, and E be an event. The probability of occurrence of that
event is

P(E) = (number of outcomes favourable to E) / n

Notice that by its definition, the numerator will always be less than or equal to the denominator. So,
P(E) ≤ 1
Example: Find the probability of getting an even number when a die is rolled.
Solution:
Sample space, S = {1, 2, 3, 4, 5, 6}
Let E be an event of getting an even number.
E = {2, 4, 6}
P(E) = 3/6 = 1/2
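The same result can be verified in Python by enumerating the sample space; a minimal sketch:

# probability of an even number when a die is rolled
sample_space = {1, 2, 3, 4, 5, 6}
event = {x for x in sample_space if x % 2 == 0}  # E = {2, 4, 6}

p_event = len(event) / len(sample_space)
print(p_event)  # 0.5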

Addition Theorem

Statement 1: If A and B are two mutually exclusive events, then

P(A ∪ B) = P(A) + P(B)

Example: Two dice are tossed once. Find the probability of getting an even number on first dice or a total of
8.

Solution:

Sample space={(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),

(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),

(3,1),(3,2),(3,3),(3,4),(3,5),(3,6),

(4,1),(4,2),(4,3),(4,4),(4,5),(4,6),

(5,1),(5,2),(5,3),(5,4),(5,5),(5,6),

(6,1),(6,2),(6,3),(6,4),(6,5),(6,6)}

An even number can be got on a die in 3 ways, because any one of 2, 4, 6 can come. The other die can have
any number. This can happen in 6 ways. Hence, the total is 18.

∴ P(an even number on first die) = 18/36 = 1/2

A total of 8 can be obtained in the following cases: {(2,6),(3,5),(4,4),(5,3),(6,2)}

P(a total of 8) = 5/36

∴ Total Probability = 18/36 + 5/36 = 23/36

Statement 2: If A and B are two events that are not mutually exclusive, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Example: Two dice are tossed once. Find the probability of getting an even number on first dice or a total of
8.

Solution: P(even number on 1st die or a total of 8) = P(even number on 1st die) + P(total of 8) − P(even number
on 1st die and a total of 8)

∴ Now, P(even number on 1st die) = 18/36 = 1/2

Ordered pairs showing a total of 8 = {(6, 2), (5, 3), (4, 4), (3, 5), (2, 6)} = 5

∴ Probability; P(total of 8) = 5/36

P(even number on 1st die and a total of 8) = 3/36, from the pairs {(2, 6), (4, 4), (6, 2)}

∴ Required Probability = 18/36 + 5/36 − 3/36 = 20/36 = 5/9
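Both forms of the addition rule can be checked in Python by enumerating the 36 outcomes of two dice; a short sketch:

from itertools import product

# all 36 outcomes of tossing two dice
sample_space = list(product(range(1, 7), repeat=2))
n = len(sample_space)

A = [p for p in sample_space if p[0] % 2 == 0]     # even number on first die
B = [p for p in sample_space if p[0] + p[1] == 8]  # a total of 8
A_and_B = [p for p in A if p in B]                 # both at once

# addition rule for events that are not mutually exclusive
p_union = len(A)/n + len(B)/n - len(A_and_B)/n
print(p_union)  # 20/36 = 0.5555...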

Multiplication Theorem

Statement: If A and B are two independent events, then the probability that both will occur is equal to the
product of their individual probabilities.

P(A ∩ B) = P(A) × P(B)

Example: A bag contains 5 green and 7 red balls. Two balls are drawn. Find the probability that one is green
and the other is red.

Solution: P(A) = P(a green ball) = 5/12

P(B) = P(a red ball) = 7/12

By Multiplication Theorem

P(A) and P(B) = P(A) × P(B) = 5/12 × 7/12 = 35/144
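A quick sketch of the same calculation in Python, treating the two draws as independent exactly as the worked example does:

# bag with 5 green and 7 red balls
p_green = 5 / 12
p_red = 7 / 12

# multiplication theorem for independent events
p_green_and_red = p_green * p_red
print(p_green_and_red)  # 35/144 ≈ 0.243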

Conditional Probability

Conditional probabilities arise naturally in the investigation of experiments where an outcome of a trial may
affect the outcomes of the subsequent trials.

We try to calculate the probability of the second event (event B) given that the first event (event A) has already
happened. If the probability of the event changes when we take the first event into consideration, we can safely
say that the probability of event B is dependent on the occurrence of event A.

Let’s think of cases where this happens:

• Drawing a second ace from a deck given we got the first ace
• Finding the probability of having a disease given you were tested positive
• Finding the probability of liking Harry Potter given we know the person likes fiction

Here we can define 2 events:

• Event A is the event whose probability we’re trying to calculate.

• Event B is the condition that we know, or the event that has already happened.

We can write the conditional probability as P(A|B), the probability of the occurrence of event A given that B
has already happened.

Suppose you draw two cards from a deck and you win if you get a jack followed by an ace (without

replacement). What is the probability of winning, given we know that you got a jack in the first turn?

Let event A be getting a jack in the first turn

Let event B be getting an ace in the second turn.

We need to find P(B|A)

P(A) = 4/52

P(B) = 4/51 {no replacement}

P(A and B) = 4/52*4/51= 0.006
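From these values, P(B|A) follows directly as P(A and B) / P(A); a minimal sketch:

# a jack followed by an ace, drawn without replacement
p_a = 4 / 52                     # P(A): jack on the first draw
p_a_and_b = (4 / 52) * (4 / 51)  # P(A and B): jack, then ace

# conditional probability of the ace given the jack is drawn
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)  # 4/51 ≈ 0.078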

Here we are determining the probabilities when we know some conditions, instead of calculating random
probabilities: we knew that a jack was drawn in the first turn.

Let’s take another example.

Suppose you have a jar containing 6 marbles – 3 black and 3 white. What is the probability of getting a black
marble on the second draw, given the first one was black too?

P(A) = getting a black marble in the first turn

P(B) = getting a black marble in the second turn

P(A) = 3/6

P(B) = 2/5

P(A and B) = 1/2 × 2/5 = 1/5

Real Life Example:

A research group collected the yearly data of road accidents with respect to the conditions of

following and not following the traffic rules in an accident-prone area. They are interested in calculating the

probability of an accident given that a person followed the traffic rules. The table of the data is given as follows:

Condition      Follow Traffic Rule    Does not follow Traffic Rule
Accident       50                     500
No Accident    2000                   5000

Solution:

Now here our equation becomes:

P(Accident | A person follow Traffic Rule) = P(Accident and follow Traffic Rule) / P(Follow Traffic Rule)

Let us solve this problem using a Python program-

P_Accident_who_follow_Traffic_Rule =50

P_who_follow_Traffic_Rule=50+2000

Conditional_Probability=(P_Accident_who_follow_Traffic_Rule/P_who_follow_Traffic_Rule)

print(Conditional_Probability)

Output:
0.02439024

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 5

Aim: Write a Python program for Bayes Theorem and demonstrate a real life example using a sample dataset.

Theory:

Bayes Theorem provides a principled way for calculating a conditional probability.

Although it is a powerful tool in the field of probability, Bayes Theorem is also widely used in the field of
machine learning in developing models for classification predictive modeling problems such as the Bayes
Optimal Classifier and Naive Bayes.

The Bayes theorem describes the probability of an event based on the prior knowledge of the conditions that
might be related to the event. If we know the conditional probability P(A|B), we can use the bayes rule to find
out the reverse probabilities P(B|A).

We can do this as follows-

P(B|A) = P(A|B) × P(B) / P(A)

The above statement is the general representation of the Bayes rule.

EXAMPLE:

Diagnostic Test: Consider a human population that may or may not have cancer (Cancer is True or False) and
a medical test that returns positive or negative for detecting cancer (Test is Positive or Negative), e.g. like a
mammogram for detecting breast cancer.

Problem Statement: If a randomly selected patient has the test and it comes back positive, what is the
probability that the patient has cancer?

Manual Calculation
Medical diagnostic tests are not perfect; they have error. Sometimes a patient will have cancer, but the test will
not detect it. This capability of the test to detect cancer is referred to as the sensitivity, or the true positive rate.

In this case, we will use a sensitivity value for the test. The test is good, but not great, with a true positive rate
or sensitivity of 85%. That is, of all the people who have cancer and are tested, 85% of them will get a positive
result from the test.

P(Test=Positive | Cancer=True) = 0.85


Given this information, our intuition would suggest that there is an 85% probability that the patient has cancer.
We can assume the probability of breast cancer is low, and use a base rate value of one person in 5,000, or
(0.0002) 0.02%.

P(Cancer=True) = 0.02%

We can correctly calculate the probability of a patient having cancer given a positive test result using Bayes
Theorem. Let’s map our scenario onto the equation:

P(A|B) = P(B|A) * P(A) / P(B)


P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / P(Test=Positive)
We know the probability of the test being positive given that the patient has cancer is 85%, and we know the
base rate or the prior probability of a given patient having cancer is 0.02%; we can use these values in the above
equation:

P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / P(Test=Positive)

We don’t know P(Test=Positive), it’s not given directly. Instead, we can estimate it using:

P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)


P(Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) + P(Test=Positive|Cancer=False) *
P(Cancer=False)

Firstly, we can calculate P(Cancer=False) as the complement of P(Cancer=True), which we already know

P(Cancer=False) = 1 – P(Cancer=True)
= 1 – 0.0002
= 0.9998
We can plug in our known values as follows:

P(Test=Positive) = 0.85 * 0.0002 + P(Test=Positive|Cancer=False) * 0.9998

We still do not know the probability of a positive test result given no cancer.

This requires additional information. Specifically, we need to know how good the test is at correctly
identifying people that do not have cancer. That is, returning a negative result (Test=Negative) when the patient
does not have cancer (Cancer=False), called the true negative rate or the specificity.

We will use a specificity value of 95%.

P(Test=Negative | Cancer=False) = 0.95


With this final piece of information, we can calculate the false positive or false alarm rate as the complement
of the true negative rate.

P(Test=Positive|Cancer=False) = 1 – P(Test=Negative | Cancer=False)


= 1 – 0.95
= 0.05
We can plug this value into our calculation of P(Test=Positive) as follows:

P(Test=Positive) = 0.85 * 0.0002 + 0.05 * 0.9998


P(Test=Positive) = 0.00017 + 0.04999
P(Test=Positive) = 0.05016

Excellent, so the probability of the test returning a positive result, regardless of whether the person has cancer
or not is about 5%.

We now have enough information to calculate Bayes Theorem and estimate the probability of a randomly
selected person having cancer if they get a positive test result.

P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / P(Test=Positive)


P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / 0.05016
P(Cancer=True | Test=Positive) = 0.00017 / 0.05016
P(Cancer=True | Test=Positive) = 0.003389154704944

The calculation suggests that even if this test informs the patient that they have cancer, there is only a 0.339%
chance that they actually have cancer.

Python Code
# calculate the probability of cancer patient and diagnostic test

# calculate P(A|B) given P(A), P(B|A), P(B|not A)


# P(A)
p_a = 0.0002
# P(B|A)
p_b_given_a = 0.85
# P(B|not A)
p_b_given_not_a = 0.05
# calculate P(A|B)
# calculate P(not A)
not_a = 1 - p_a
# calculate P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * not_a
# calculate P(A|B)
p_a_given_b = (p_b_given_a * p_a) / p_b
result=p_a_given_b
# summarize
print('P(A|B) = %.3f%%' % (result * 100))

Running the example calculates the probability that a patient has cancer given the test returns a positive result,
matching our manual calculation.

P(A|B)=0.339%

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 5 Part 2

Aim: Write a Python program for Naïve Bayes Theorem and demonstrate a real life example using a sample
dataset.

Theory:

Introduction

Here’s a situation you’ve got into in your data science project:

You are working on a classification problem and have generated your set of hypotheses, created features and
discussed the importance of variables. Within an hour, stakeholders want to see the first cut of the model.

What will you do? You have hundreds of thousands of data points and quite a few variables in your training
data set. In such a situation, if I were in your place, I would have used ‘Naive Bayes‘, which can be extremely
fast relative to other classification algorithms. It works on Bayes theorem of probability to predict the class of
unknown data sets.

What is Naive Bayes algorithm?

It is a classification technique based on Bayes’ Theorem with an assumption of independence among


predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even
if these features depend on each other or upon the existence of the other features, all of these properties
independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity,
Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at
the equation below:

P(c|x) = P(x|c) × P(c) / P(x)
Above,

• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given class.
• P(x) is the prior probability of predictor.

Game Prediction Using Bayes' Theorem


Let us predict the future with some weather data.

Here we have our data, which comprises the day, outlook, humidity, and wind conditions. The final column is
'Play,' i.e., can we play outside, which we have to predict.

• First, we will create a frequency table using each attribute of the dataset.

• For each frequency table, we will generate a likelihood table.

• Likelihood of ‘Yes’ given ‘Sunny’ is:

o P(c|x) = P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny) = (0.3 × 0.71) / 0.36 = 0.591

• Similarly, the likelihood of ‘No’ given ‘Sunny’ is:

o P(c|x) = P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny) = (0.4 × 0.36) / 0.36 = 0.40

• Now, in the same way, we need to create the Likelihood Table for other attributes as well.

Suppose we have a Day with the following values :

• Outlook = Rain
• Humidity = High
• Wind = Weak
• Play = ?

So, with the data, we have to predict whether we can play on that day or not.

• Likelihood of ‘Yes’ on that Day = P(Outlook = Rain|Yes) × P(Humidity = High|Yes) × P(Wind = Weak|Yes) × P(Yes)

= 2/9 * 3/9 * 6/9 * 9/14 = 0.0199

• Likelihood of ‘No’ on that Day = P(Outlook = Rain|No) × P(Humidity = High|No) × P(Wind = Weak|No) × P(No)

= 2/5 * 4/5 * 2/5 * 5/14 = 0.0166

Now, when we normalize the values, we get:

• P(Yes) = 0.0199 / (0.0199 + 0.0166) = 0.55

• P(No) = 0.0166 / (0.0199 + 0.0166) = 0.45

Our model predicts that there is a 55% chance there will be a game tomorrow.
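A short sketch of this normalization step in Python, using the likelihood values computed above:

# likelihoods for the day (Outlook=Rain, Humidity=High, Wind=Weak),
# taken from the worked example above
likelihood_yes = 0.0199
likelihood_no = 0.0166

# normalize so the two posteriors sum to 1
p_yes = likelihood_yes / (likelihood_yes + likelihood_no)
p_no = likelihood_no / (likelihood_yes + likelihood_no)
print(round(p_yes, 2), round(p_no, 2))  # 0.55 0.45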

Naive Bayes Example by Hand

Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’. These are the 3 possible classes
of the Y variable.

We have data for the following X variables, all of which are binary (1 or 0).

• Long
• Sweet
• Yellow

The first few rows of the training dataset look like this:

Fruit     Long (x1)   Sweet (x2)   Yellow (x3)
Orange    0           1            0
Banana    1           0            1
Banana    1           1            1
Other     1           1            0
..        ..          ..           ..

For the sake of computing the probabilities, let’s aggregate the training data to form a counts table like this.

So the objective of the classifier is to predict if a given fruit is a ‘Banana’ or ‘Orange’ or ‘Other’ when only the
3 features (long, sweet and yellow) are known.

Let’s say you are given a fruit that is: Long, Sweet and Yellow, can you predict what fruit it is?

This is the same as predicting the Y when only the X variables in the testing data are known. Let’s solve it by hand
using Naive Bayes.

The idea is to compute the 3 probabilities, that is the probability of the fruit being a banana, orange or other.
Whichever fruit type gets the highest probability wins.

All the information to calculate these probabilities is present in the above tabulation.

Step 1: Compute the ‘Prior’ probabilities for each of the class of fruits.

That is, the proportion of each fruit class out of all the fruits from the population. You can provide the ‘Priors’
from prior information about the population. Otherwise, it can be computed from the training data.

For this case, let’s compute from the training data. Out of 1000 records in training data, you have 500 Bananas,
300 Oranges and 200 Others. So the respective priors are 0.5, 0.3 and 0.2.

P(Y=Banana) = 500 / 1000 = 0.50

P(Y=Orange) = 300 / 1000 = 0.30

P(Y=Other) = 200 / 1000 = 0.20

Step 2: Compute the probability of evidence that goes in the denominator.

This is nothing but the product of P of Xs for all X. This is an optional step because the denominator is the
same for all the classes and so will not affect the probabilities.

P(x1=Long) = 500 / 1000 = 0.50

P(x2=Sweet) = 650 / 1000 = 0.65

P(x3=Yellow) = 800 / 1000 = 0.80

Step 3: Compute the probability of likelihood of evidences that goes in the numerator.

It is the product of conditional probabilities of the 3 features. If you refer back to the formula, it says P(X1
|Y=k). Here X1 is ‘Long’ and k is ‘Banana’. That means the probability the fruit is ‘Long’ given that it is a
Banana. In the above table, you have 500 Bananas. Out of that 400 is long. So, P(Long | Banana) = 400/500 =
0.8.

Here, I have done it for Banana alone.

Probability of Likelihood for Banana

P(x1=Long | Y=Banana) = 400 / 500 = 0.80

P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70

P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90

So, the overall probability of Likelihood of evidence for Banana = 0.8 * 0.7 * 0.9 = 0.504

Step 4: Substitute all the 3 equations into the Naive Bayes formula, to get the probability that it is a
banana.

P(Banana | Long, Sweet, Yellow) = 0.504 × 0.50 / (0.50 × 0.65 × 0.80) = 0.252 / 0.26 ≈ 0.97

Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’. The denominator is the same for
all 3 cases, so it’s optional to compute.

Clearly, Banana gets the highest probability, so that will be our predicted class.
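The hand calculation can be reproduced in a few lines of Python from the counts quoted above; a sketch computing only Banana’s posterior, since only Banana’s likelihood counts are given in the text:

# prior and evidence probabilities from the counts given above
p_banana = 500 / 1000                        # P(Y=Banana)
p_x = (500/1000) * (650/1000) * (800/1000)   # P(Long) * P(Sweet) * P(Yellow)

# likelihoods of each feature given Banana
p_long_given_banana = 400 / 500
p_sweet_given_banana = 350 / 500
p_yellow_given_banana = 450 / 500

likelihood = p_long_given_banana * p_sweet_given_banana * p_yellow_given_banana  # 0.504

# Naive Bayes posterior for Banana given Long, Sweet and Yellow
p_banana_given_x = likelihood * p_banana / p_x
print(p_banana_given_x)  # ≈ 0.97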

Step-by-Step Implementation of Naive Bayes

For sake of demonstration, let’s use the standard iris dataset to predict the Species of flower using 4 different
features: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
# Import packages
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import numpy as np
import pandas as pd
from sklearn import metrics
# Import data
training = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/iris_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/iris_test.csv')
# Create the X, Y, Training and Test
xtrain = training.drop('Species', axis=1)
ytrain = training.loc[:, 'Species']
xtest = test.drop('Species', axis=1)
ytest = test.loc[:, 'Species']
# Init the Gaussian Classifier
model = GaussianNB()
# Train the model
model.fit(xtrain, ytrain)
# Predict Output
pred = model.predict(xtest)
print(metrics.classification_report(ytest, pred))
print(metrics.confusion_matrix(ytest, pred))

Understanding the Classification report through sklearn

A classification report is used to measure the quality of predictions from a classification algorithm: how many
predictions are true and how many are false. More specifically, True Positives, False Positives, True
Negatives and False Negatives are used to compute the metrics of a classification report, as shown below.

The report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics
are calculated by using true and false positives, true and false negatives. Positive and negative in this case are
generic names for the predicted classes. There are four ways to check if the predictions are right or wrong:

1. TN / True Negative: when a case was negative and predicted negative


2. TP / True Positive: when a case was positive and predicted positive
3. FN / False Negative: when a case was positive but predicted negative
4. FP / False Positive: when a case was negative but predicted positive

Precision – What percent of your positive predictions were correct?


Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it
is defined as the ratio of true positives to the sum of true and false positives.

TP – True Positives
FP – False Positives

Precision – Accuracy of positive predictions.


Precision = TP/(TP + FP)

from sklearn.metrics import precision_score

print("Precision score: {}".format(precision_score(y_true,y_pred)))

Recall – What percent of the positive cases did you catch?


Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true
positives to the sum of true positives and false negatives.

FN – False Negatives

Recall: Fraction of positives that were correctly identified.


Recall = TP/(TP+FN)

from sklearn.metrics import recall_score

print("Recall score: {}".format(recall_score(y_true,y_pred)))

F1 score – How well does the classifier balance precision and recall?


The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst
is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into
their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models,
not global accuracy.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

from sklearn.metrics import f1_score

print("F1 Score: {}".format(f1_score(y_true,y_pred)))


Support
Support is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the
training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the
need for stratified sampling or rebalancing. Support doesn’t change between models but instead diagnoses the
evaluation process.

Accuracy
Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions
our model got right. Formally, accuracy has the following definition:

Accuracy = Number of correct predictions / Total number of predictions

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.

Let's try calculating accuracy for the following model that classified 100 tumors as malignant (the positive
class) or benign (the negative class):

True Positive (TP):
 Reality: Malignant
 ML model predicted: Malignant
 Number of TP results: 1

False Positive (FP):
 Reality: Benign
 ML model predicted: Malignant
 Number of FP results: 1

False Negative (FN):
 Reality: Malignant
 ML model predicted: Benign
 Number of FN results: 8

True Negative (TN):
 Reality: Benign
 ML model predicted: Benign
 Number of TN results: 90

Accuracy comes out to (1 + 90) / 100 = 0.91, or 91% (91 correct predictions out of 100 total examples). That
means our tumor classifier is doing a great job of identifying malignancies, right?

Actually, let's do a closer analysis of positives and negatives to gain more insight into our model's
performance.

Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP) and 9 are malignant (1 TP and 8 FNs).

Of the 91 benign tumors, the model correctly identifies 90 as benign. That's good. However, of the 9 malignant
tumors, the model only correctly identifies 1 as malignant—a terrible outcome, as 8 out of 9 malignancies go
undiagnosed!

While 91% accuracy may seem good at first glance, another tumor-classifier model that always predicts benign
would achieve the exact same accuracy (91/100 correct predictions) on our examples. In other words, our
model is no better than one that has zero predictive ability to distinguish malignant tumors from benign
tumors.

Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one,
where there is a significant disparity between the number of positive and negative labels.
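This pitfall is easy to reproduce. A minimal sketch, assuming label 1 = malignant: a model that always predicts
benign reaches 91% accuracy on these counts, yet catches no malignancies at all:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 9 malignant (1) and 91 benign (0) cases, as in the tumor example
y_true = np.array([1]*9 + [0]*91)
# a "classifier" that always predicts benign
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.91 -- looks impressive...
print(recall_score(y_true, y_pred))    # 0.0  -- but no malignancy is caught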

Confusion Matrix

The confusion matrix is a much more detailed representation of what's going on with your labels. Consider an
example matrix: there were 71 points in the first class (label 0). Out of these, your model was successful in
identifying 54 of those correctly in label 0, but 17 were marked as label 4. Similarly, look at the second row.
There were 43 points in class 1, but 36 of them were marked correctly. Your classifier predicted 1 in class 3
and 6 in class 4.

Now you can see the pattern this follows. An ideal classifier with 100% accuracy would produce a pure
diagonal matrix which would have all the points predicted in their correct class.
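A minimal sketch with hypothetical labels for a 3-class problem; each row is an actual class and each column a
predicted class, so a perfect classifier would leave only the diagonal populated:

from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

print(confusion_matrix(y_true, y_pred))
# [[2 1 0]
#  [0 2 0]
#  [1 0 2]]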

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 6

Aim: Write a Python program to understand various probability distributions.

Theory:

Definition:

A probability distribution is a statistical function that describes all the possible values and probabilities for a
random variable within a given range. This range will be bound by the minimum and maximum possible
values, but where the possible value would be plotted on the probability distribution will be determined by a
number of factors. The mean (average), standard deviation, skewness, and kurtosis of the distribution are
among these factors.

Binomial Distribution

Suppose that you won the toss in a cricket match today, and this indicates a successful event. You toss again,
but you lose this time. If you win a toss today, this does not necessitate that you will win the toss tomorrow.
Let’s assign a random variable, say X, to the number of times you won the toss. What can be the possible value
of X? It can be any number depending on the number of times you tossed a coin.

There are only two possible outcomes: head denoting success and tail denoting failure. Therefore, the
probability of getting a head = 0.5 and the probability of failure can be easily computed as: q = 1 - p = 0.5.

A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose, and
where the probability of success (and hence of failure) is the same for all the trials, is called a Binomial
Distribution.

The outcomes need not be equally likely. For instance, if the probability of success in an experiment is 0.2,
then the probability of failure can be easily computed as q = 1 – 0.2 = 0.8.

Each trial is independent since the outcome of the previous toss doesn’t determine or affect the outcome of the

current toss. An experiment with only two possible outcomes repeated n number of times is called binomial.

The parameters of a binomial distribution are n and p where n is the total number of trials and p is the

probability of success in each trial.

On the basis of the above explanation, the properties of a Binomial Distribution are

1. Each trial is independent.


2. There are only two possible outcomes in a trial- either a success or a failure.
3. A total number of n identical trials are conducted.
4. The probability of success and failure is same for all trials. (Trials are identical.)

The mathematical representation of the binomial distribution is given by:

P(X = x) = nCx · p^x · q^(n-x),  for x = 0, 1, 2, …, n

The mean and variance of a binomial distribution are given by:

Mean -> µ = n*p

Variance -> Var(X) = n*p*q

Example: If a coin is tossed 6 times, find the probability of:

(a) Exactly 2 heads

(b) At least 4 heads.

Solution:

(a) The repeated tossing of the coin is an example of a Bernoulli trial. According to the problem:

Number of trials: n=6

Probability of head: p= 1/2 and hence the probability of tail, q =1/2

For exactly two heads:


x=2

P(x=2) = 6C2 p2 q6-2 = 6!/(2! 4!) × (½)2 × (½)4

P(x=2) = 15/64 = 0.2343

(b) For at least four heads,

x ≥ 4, P(x ≥ 4) = P(x = 4) + P(x = 5) + P(x = 6)

Hence,

P(x = 4) = 6C4 p4 q6-4 = 6!/(4! 2!) × (½)4 × (½)2 = 15/64

P(x = 5) = 6C5 p5 q6-5 = 6!/(5! 1!) × (½)5 × (½)1 = 6/64

P(x = 6) = 6C6 p6 q6-6 = (½)6 = 1/64

Therefore,

P(x ≥ 4) = 15/64 + 6/64 + 1/64 = 22/64 = 11/32 = 0.34375

This can be achieved through programming by calculating the binomial probability for each possible number of
successes. For 6 tosses of a coin, the number of successes (here, success = getting a head) can be 0, 1, 2, 3, …
up to 6.

An example illustrating the distribution:

Consider a random experiment of tossing a biased coin 6 times where the probability of getting a head is 0.6. If

‘getting a head’ is considered as ‘success’ then, the binomial distribution table will contain the probability

of r successes for each possible value of r.

r 0 1 2 3 4 5 6

P(r) 0.004096 0.036864 0.138240 0.276480 0.311040 0.186624 0.046656

Using Python to obtain the distribution:

Now, we will use Python to analyse the distribution(using SciPy).

SciPy:

SciPy is an Open Source Python library, used in mathematics, engineering, scientific and technical computing.

Installation :

pip install scipy

The scipy.stats module contains various functions for statistical calculations and tests. The stats() function of

the scipy.stats.binom module can be used to calculate a binomial distribution using the values of n and p.

Syntax : scipy.stats.binom.stats(n, p)

It returns a tuple containing the mean and variance of the distribution in that order.

scipy.stats.binom.pmf() function is used to obtain the probability mass function for a certain value of r, n and

p. We can obtain the distribution by passing all possible values of r(0 to n).

Syntax : scipy.stats.binom.pmf(r, n, p)

Calculating distribution table:

Approach :

 Define n and p.

 Define a list of values of r from 0 to n.

 Get mean and variance.

 For each r, calculate the pmf and store in a list.

Python Program

from scipy.stats import binom

# setting the values


# of n and p

n=6

p = 0.5 #probability of getting a head in one toss of a coin

# defining the list of x values

x_values = list(range(n + 1))

# obtaining the mean and variance

mean, var = binom.stats(n, p)

# list of pmf values

dist = [binom.pmf(x, n, p) for x in x_values ]

# printing the table

print("x\tp(x)")

for i in range(n + 1):
    print(str(x_values[i]), "\t", str(dist[i]))

# printing mean and variance

print("mean = ",str(mean))

print("variance = ",str(var))

OUTPUT:

x p(x)

0 0.015625

1 0.09375000000000003

2 0.23437500000000003

3 0.31249999999999983

4 0.234375

5 0.09375000000000003

6 0.015625

mean = 3.0

variance = 1.5

After getting the distribution, we can use it to solve the question as solved above manually.

i) Probability of getting exactly two heads = P(x=2)

From the distribution obtained in the output, we can directly tell the answer:

P(x=2) = 0.2343, which is the same as the answer we calculated manually.

ii) For at least four heads,

x ≥ 4, P(x ≥ 4) = P(x = 4) + P(x = 5) + P(x = 6)

We can add the values from the program output for x = 4, x = 5 and x = 6 to get this answer: 0.234375 +
0.09375 + 0.015625 = 0.34375.
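The same tail probability can also be read directly from SciPy without summing terms by hand; a minimal
sketch using the CDF and the equivalent survival function:

from scipy.stats import binom

# P(x >= 4) = 1 - P(x <= 3) for n = 6, p = 0.5
print(1 - binom.cdf(3, 6, 0.5))  # 0.34375
print(binom.sf(3, 6, 0.5))       # same value via the survival function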

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 7

Aim: Write a program for various types of correlation. Plot the correlation plot on dataset and visualize giving
an overview of relationships among data on iris data.

Theory:

Correlation is used to test relationships between quantitative variables or categorical variables. In other words,
it’s a measure of how things are related. The study of how variables are correlated is called correlation
analysis.

Some examples of data that have a high correlation:

 Your caloric intake and your weight.

 Your eye color and your relatives’ eye colors.

 The amount of time you study and your GPA.

Some examples of data that have a low correlation (or none at all):

 A dog’s name and the type of dog biscuit they prefer.

 The cost of a car wash and how long it takes to buy a soda inside the station.

Correlations are useful because if you can find out what relationship variables have, you can make predictions
about future behavior. Knowing what the future holds is very important in the social sciences like government
and healthcare. Businesses also use these statistics for budgets and business plans.

The Correlation Coefficient

A correlation coefficient is a way to put a value to the relationship. Correlation coefficients have a value of
between -1 and 1. A “0” means there is no relationship between the variables at all, while -1 or 1 means that
there is a perfect negative or positive correlation (negative or positive correlation here refers to the type of
graph the relationship will produce).

Graphs showing a correlation of -1, 0 and +1

Applications

Essentially, correlation analysis is used for spotting patterns within datasets. A positive correlation result
means that both variables increase in relation to each other, while a negative correlation means that as one
variable decreases, the other increases.

Types of Correlation Coefficients

There are three main ways of ranking statistical correlation: Spearman, Kendall, and Pearson. Each coefficient
represents the end result as ‘r’. Spearman’s Rank and Pearson’s Coefficient are the two most widely used
analytical formulae, and the choice between them depends on the type of data.

When to Use

The two methods outlined above are to be used according to whether there are parameters associated with the
data gathered. The two terms to watch out for are:

Parametric: (Pearson’s Coefficient) Where the data must be handled in relation to the parameters of
populations or probability distributions. Typically used with quantitative data already set out within said
parameters.

Nonparametric: (Spearman’s Rank) Where no assumptions can be made about the probability distribution.
Typically used with qualitative data, but can be used with quantitative data if Spearman’s Rank proves
inadequate.

In cases when both are applicable, statisticians recommend using the parametric method, such as Pearson’s
Coefficient, because it tends to be more precise. But that does not mean the nonparametric methods should be
discounted when there isn’t enough data or when a more robust result is needed.

Pearson correlation coefficient


Pearson correlation coefficient or Pearson’s correlation coefficient or Pearson’s r is defined in statistics as the
measurement of the strength of the relationship between two variables and their association with each other. In
simple words, Pearson’s correlation coefficient calculates the effect of change in one variable when the other
variable changes.

For example: Up till a certain age, (in most cases) a child’s height will keep increasing as his/her age
increases. Of course, his/her growth depends upon various factors like genes, location, diet, lifestyle, etc.

This approach is based on covariance and thus is the best method to measure the relationship between two
variables.

Pearson correlation coefficient formula


The correlation coefficient formula finds out the relation between the variables. It returns values between -1
and 1.

Pearson correlation coefficient formula:

r = [N(Σxy) − (Σx)(Σy)] / √{[NΣx² − (Σx)²] × [NΣy² − (Σy)²]}

Where:

N = the number of pairs of scores

Σxy = the sum of the products of paired scores

Σx = the sum of x scores

Σy = the sum of y scores

Σx2 = the sum of squared x scores

Σy2 = the sum of squared y scores

Pearson correlation coefficient calculator


Here is a step by step guide to calculating Pearson’s correlation coefficient:

Step one: Create a Pearson correlation coefficient table. Make a data chart including both variables. Label
these variables ‘x’ and ‘y.’ Add three additional columns – (xy), (x²), and (y²).

Step two: Use basic multiplication to complete the table.

Step three: Add up all the columns from bottom to top.

Step four: Use the correlation formula to plug in the values.

If the result is negative, there is a negative correlation relationship between the two variables. If the result is
positive, there is a positive correlation relationship between the variables. Results can also define the strength
of a linear relationship i.e., strong positive relationship, strong negative relationship, medium positive
relationship, and so on.
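These steps can be checked in Python with scipy.stats.pearsonr, which returns both r and the p-value; a minimal
sketch on hypothetical paired scores:

from scipy.stats import pearsonr

# hypothetical paired scores for variables x and y
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

r, p_value = pearsonr(x, y)
print("Pearson r =", round(r, 4))  # about 0.53, a medium positive relationship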

Correlation Matrix

A matrix is an array of numbers arranged in rows and columns.

A correlation matrix is simply a table showing the correlation coefficients between variables.

A correlation matrix is closely related to the covariance matrix (also known as the auto-covariance matrix,
dispersion matrix, variance matrix, or variance-covariance matrix): it is the covariance matrix of the
standardized variables. It is a matrix in which the (i, j) position defines the correlation between the ith and jth
parameter of the given data-set.

When the data points follow a roughly straight-line trend, the variables are said to have an approximately
linear relationship. In some cases, the data points fall close to a straight line, but more often there is quite a bit
of variability of the points around the straight-line trend. A summary measure called the correlation describes
the strength of the linear association. Correlation summarizes the strength and direction of the linear (straight-
line) association between two quantitative variables. Denoted by r, it takes values between -1 and +1. A
positive value for r indicates a positive association, and a negative value for r indicates a negative association.

The closer r is to 1, the closer the data points fall to a straight line and the stronger the linear association. The
closer r is to 0, the weaker the linear association.

In a correlation matrix, the variables are represented in the first row and in the first column. The matrix shown
in the program output below uses data from the full health data set.

Observations:

 We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of
0.89. This makes sense, as the longer we train, the more calories we burn.

 We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage
(correlation coefficient of 0.02).

 Can we conclude that Average_Pulse does not affect Calorie_Burnage? No.

#PYTHON PROGRAM TO FIND CORRELATION MATRIX FOR GIVEN EXAMPLE

import pandas as pd

full_health_data = pd.read_csv("data.csv", header=0, sep=",")

Corr_Matrix = round(full_health_data.corr(),2)

print(Corr_Matrix)

OUTPUT:

Duration Average_Pulse Max_Pulse Calorie_Burnage

Duration 1.00 -0.17 0.00 0.89

Average_Pulse -0.17 1.00 0.79 0.02

Max_Pulse 0.00 0.79 1.00 0.20

Calorie_Burnage 0.89 0.02 0.20 1.00

EXAMPLE 2: Let’s take another example and use House Price dataset for this.

#Load Libraries

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from scipy.stats import norm

#Loading data

data = pd.read_csv("House Price.csv")

print(data.shape)

# ‘Sales Price’ Description

print(data['SalePrice'].describe())

#Histogram of Salesprice

plt.figure(figsize = (9, 5))

data['SalePrice'].plot(kind ="hist")

plt.show()

# Correlation Matrix

corrmat = data.corr()

print(corrmat)
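Since the aim also asks to visualize the relationships, the matrix can be rendered as a heatmap with seaborn
(already imported above); a minimal sketch:

# Correlation heatmap; annot=True prints r in each cell (omit for large matrices)
plt.figure(figsize=(9, 6))
sns.heatmap(corrmat, annot=True, cmap='coolwarm')
plt.show()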

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 8

Aim: Program to implement simple linear regression using Python

Theory:

What is Regression Analysis?

Regression analysis is a form of predictive modeling technique which investigates the relationship between a
dependent (target) variable and one or more independent (predictor) variables. This technique is used for
forecasting, time series modeling and finding the causal effect relationship between variables. For example, the
relationship between rash driving and the number of road accidents by a driver is best studied through regression.

Regression analysis is an important tool for modeling and analyzing data. Here, we fit a curve / line to the data
points in such a manner that the sum of the squared distances of the data points from the curve or line is
minimized.

Why do we use Regression Analysis?

As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:

Let’s say, you want to estimate growth in sales of a company based on current economic conditions. You have
the recent company data which indicates that the growth in sales is around two and a half times the growth in
the economy. Using this insight, we can predict future sales of the company based on current & past
information.

There are multiple benefits of using regression analysis. They are as follows:

1. It indicates the significant relationships between dependent variable and independent variable.
2. It indicates the strength of impact of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as the
effect of price changes and the number of promotional activities. These benefits help market researchers / data
analysts / data scientists to eliminate and evaluate the best set of variables to be used for building predictive
models.

Simple Linear Regression (Line of Best Fit/Linear Regression/Least Square Regression)


A linear regression is one of the easiest statistical models in machine learning. It is used to show the linear
relationship between a dependent variable and one or more independent variables.

Linear regression assumes a linear or straight line relationship between the input variables (X) and the single
output variable (y). More specifically, that output (y) can be calculated from a linear combination of the input
variables (X). When there is a single input variable, the method is referred to as a simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the
model to make predictions on new data.

Imagine you have some points, and want to have a line that best fits them like this:

We can place the line "by eye": try to have the line as close as possible to all points, and a similar number of
points above and below the line.

But for better accuracy let's see how to calculate the line using Least Squares Regression.

The Line

Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line :

y = mx + b

Where:

 y = how far up
 x = how far along
 m = Slope or Gradient (how steep the line is)
 b = the Y Intercept (where the line crosses the Y axis)

Steps

To find the line of best fit for N points:

Step 1: For each (x,y) point calculate x² and xy

Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy

Step 3: Calculate Slope m:

m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)

(N is the number of points.)

Step 4: Calculate Intercept b:

b = (Σy − m Σx) / N

Step 5: Assemble the equation of a line

y = mx + b

Example

Let's have an example to see how to do it!

Example: Sam found how many hours of sunshine vs how many ice creams were sold at the shop from
Monday to Friday:
"x" "y"
Hours of Sunshine Ice Creams Sold

2 4

3 5

5 7

7 10

9 15

Let us find the best m (slope) and b (y-intercept) that suits that data

y = mx + b

Step 1: For each (x,y) calculate x2 and xy:

x y x2 xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135

Step 2: Sum x, y, x2 and xy (gives us Σx, Σy, Σx2 and Σxy):

x y x2 xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx: 26 Σy: 41 Σx2: 168 Σxy: 263

Also N (number of data values) = 5

Step 3: Calculate Slope m:

m = (5 × 263 − 26 × 41) / (5 × 168 − 26²)

= (1315 − 1066) / (840 − 676)

= 249 / 164 = 1.5183...

Step 4: Calculate Intercept b:

b = (Σy − m Σx) / N

= (41 − 1.5183 × 26) / 5

= 0.3049...

Step 5: Assemble the equation of a line:

y = mx + b

y = 1.518x + 0.305

Let's see how it works out:


x y y = 1.518x + 0.305 error
2 4 3.34 −0.66
3 5 4.86 −0.14
5 7 7.89 0.89
7 10 10.93 0.93
9 15 13.97 −1.03

Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:

Nice fit!

Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above
equation to estimate that he will sell

y = 1.518 x 8 + 0.305 = 12.45 Ice Creams

Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
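As a quick check, NumPy's polyfit reproduces the same least-squares coefficients for Sam's data; a minimal
sketch:

import numpy as np

x = np.array([2, 3, 5, 7, 9])    # hours of sunshine
y = np.array([4, 5, 7, 10, 15])  # ice creams sold

m, b = np.polyfit(x, y, 1)  # degree-1 (straight line) least squares fit
print(m, b)                 # about 1.5183 and 0.3049
print(m * 8 + b)            # about 12.45 ice creams for 8 hours of sun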

How does it work?

It works by making the total of the square of the errors as small as possible (that is why it is called "least
squares"):

The straight line minimizes the sum of squared errors

So, when we square each of those errors and add them all up, the total is as small as possible.

You can imagine (but not accurately) each data point connected to a straight bar by springs:

Steps in Linear Regression


Step 1: Importing all the required libraries

Step 2: Reading the dataset

Step 3: Exploring the data scatter

Step 4: Data cleaning

Step 5: Training our model

Step 6: Exploring & predicting our results

Application: some of the most popular applications of Linear regression algorithm are in financial portfolio
prediction, salary forecasting, real estate predictions and in traffic in arriving at ETAs.

First Program of linear regression using basic statistical formula

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted dependent vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Second Program of linear regression using statsmodels
Dataset
Travel Petrol
20 1
45 3
56 5
34 2
28 1.6
49 3.7

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

# read the dataset (Travel vs Petrol) from an Excel file
tb1 = pd.read_excel('regr.xlsx')

# scatter plot of the raw data
tb1.plot('Travel', 'Petrol', style='o')
plt.ylabel('Petrol')
plt.show()

t = tb1['Travel']
c = tb1['Petrol']

# add the intercept term and fit an ordinary least squares model
t1 = sm.add_constant(t)
model1 = sm.OLS(c, t1)
result1 = model1.fit()
print(result1.summary())

coeff = result1.params
print("Coefficient 1:", coeff[0])  # intercept
print("Coefficient:", coeff[1])    # slope

# predict manually using the fitted coefficients
y = []
for x in tb1['Travel']:
    y.append(coeff[0] + coeff[1]*x)

print('Predicted values', '\tActual Values')
for i in range(len(y)):
    print(y[i], "\t", c[i])

print('Predicted Values using predefined method')
pred = result1.predict(t1)
print(pred)

Third Program of linear regression using sklearn

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# create and fit the model in one step
model = LinearRegression().fit(x, y)
print('intercept:', model.intercept_)
print('slope:', model.coef_)

# predict using the fitted model
y_pred = model.predict(x)
print('predicted dependent:', y_pred, sep='\n')

# equivalently, apply the fitted equation directly
y_pred = model.intercept_ + model.coef_ * x
print('predicted dependent:', y_pred, sep='\n')
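As an optional extra step, the coefficient of determination R² of the fit can be read straight off the fitted model;
a minimal sketch:

# R^2 of the fit on the training data
r_sq = model.score(x, y)
print('coefficient of determination:', r_sq)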

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 9

Aim: Program to implement multivariate/multiple linear regression using Python

Theory:

Multivariate/Multiple Linear Regression

Simple linear regression uses a linear function to predict the value of a target variable y, the function
containing only one independent variable x₁.

y = b₀ + b₁x₁

After fitting the linear equation to observed data, we can obtain the values of the parameters b₀ and b₁ that best
fit the data, minimizing the square error.

For example, to predict the weight based on the height of a person, we can use the above equation.

Here one independent variable is used to predict the weight of the person, Weight = f(Height), creating a
model. We can also create a model that predicts the weight using both height and gender as independent
variables. And this is done with the help of multiple linear regression.

Multiple linear regression uses a linear function to predict the value of a target variable y, the function
containing n independent variables x = [x₁, x₂, x₃, …, xₙ].

y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + … + bₙxₙ

We obtain the values of the parameters bᵢ using the same technique as in simple linear regression (least square
error). After fitting the model, we can use the equation to predict the value of the target variable y. In our case,
we use height and gender to predict the weight of a person: Weight = f(Height, Gender).

Categorical variables in multiple linear regression

There are two types of variables used in statistics: numerical and categorical variables.

 Numerical variables represent values that can be measured and sorted in ascending and descending
order such as the height of a person.

 Categorical variables are values that can be sorted in groups or categories such as the gender of a
person.

Multiple linear regression accepts not only numerical variables, but also categorical ones. To include a
categorical variable in a regression model, the variable has to be encoded as a binary variable (dummy
variable). In Pandas, we can easily convert a categorical variable into a dummy variable using the
pandas.get_dummies function. This function returns a dummy-coded data where 1 represents the presence of
the categorical variable and 0 the absence.
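A minimal sketch of this dummy coding, using a small hypothetical Gender column:

import pandas as pd

# hypothetical categorical column
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

# one 0/1 column per category (dtype=int forces 0/1; recent pandas
# versions otherwise return True/False)
print(pd.get_dummies(df['Gender'], dtype=int))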

For the two variable case, the least-squares slopes are:

b₁ = [(Σx₂²)(Σx₁y) − (Σx₁x₂)(Σx₂y)] / [(Σx₁²)(Σx₂²) − (Σx₁x₂)²]

and

b₂ = [(Σx₁²)(Σx₂y) − (Σx₁x₂)(Σx₁y)] / [(Σx₁²)(Σx₂²) − (Σx₁x₂)²]

where all sums of squares and cross-products are taken about the means (deviation scores).

At this point, you should notice that all the terms from the one variable case appear in the two variable case. In
the two variable case, the other X variable also appears in the equation. For example, X₂ appears in the
equation for b₁. Note that terms corresponding to the variance of both X variables occur in the slopes. Also
note that a term corresponding to the covariance of X₁ and X₂ (sum of deviation cross-products) also appears
in the formula for the slope.

The equation for the intercept a with two independent variables is:

a = M_Y − b₁M_X1 − b₂M_X2

where M denotes the mean of each variable. This equation is a straight-forward generalization of the case for
one independent variable.

A Numerical Example

Suppose we want to predict job performance of Chevy mechanics based on mechanical aptitude test scores and
test scores from personality test that measures conscientiousness. (In practice, we would need many more
people, but I wanted to fit this on a PowerPoint slide.)

Y = Job Perf, X1 = Mech Apt, X2 = Consc

Y X1 X2 X1*Y X2*Y X1*X2
1 40 25 40 25 1000
2 45 20 90 40 900
1 38 30 38 30 1140
3 50 30 150 90 1500
2 48 28 96 56 1344
3 55 30 165 90 1650
3 53 34 159 102 1802
4 55 36 220 144 1980
4 58 32 232 128 1856
3 40 34 120 102 1360
5 55 38 275 190 2090
3 48 28 144 84 1344
3 45 30 135 90 1350
2 55 36 110 72 1980
4 60 34 240 136 2040
5 60 38 300 190 2280
5 60 42 300 210 2520
5 65 38 325 190 2470
4 50 34 200 136 1700
3 58 38 174 114 2204
Y X1 X2 X1*Y X2*Y X1*X2
65 1038 655 3513 2219 34510 Sum
20 20 20 20 20 20 N
3.25 51.9 32.75 175.65 110.95 1725.5 M
1.25 7.58 5.24 84.33 54.73 474.60 SD
29.75 1091.8 521.75 USS

We can collect the data into a matrix like this:

      Y       X1      X2
Y     29.75   139.5   90.25
X1    0.77    1091.8  515.5
X2    0.72    0.68    521.75

The numbers in the table above correspond to sums of squares, cross-products, and correlations: the diagonal
entries are the sums of squares of Y, X1 and X2, the entries above the diagonal are the sums of cross-products,
and the entries below the diagonal are the corresponding correlations.

We can now compute the regression coefficients:

b1 = [(521.75)(139.5) − (515.5)(90.25)] / [(1091.8)(521.75) − (515.5)²] = 26260.25 / 303906.4 ≈ 0.086

b2 = [(1091.8)(90.25) − (515.5)(139.5)] / [(1091.8)(521.75) − (515.5)²] = 26622.7 / 303906.4 ≈ 0.088

To find the intercept, we have:

a = M_Y − b1·M_X1 − b2·M_X2 = 3.25 − 0.086(51.9) − 0.088(32.75) ≈ −4.10

Therefore, our regression equation is:

Y' = -4.10 + .09X1 + .09X2, or

Job Perf' = -4.10 + .09MechApt + .09Conscientiousness.

This technique is used when there is more than one predictor variable in a multivariate regression model, and
the model is then called a multivariate multiple regression. Termed one of the simplest supervised machine
learning algorithms by researchers, this regression algorithm is used to predict the response variable for a set
of explanatory variables. This regression technique can be implemented efficiently with the help of matrix
operations; in Python, it can be implemented via the “numpy” library, which contains definitions and
operations for matrix objects.

Application: Industry application of Multivariate Regression algorithm is seen heavily in the retail sector
where customers make a choice on a number of variables such as brand, price and product. The multivariate
analysis helps decision makers to find the best combination of factors to increase footfalls in the store.

Multiple Regression Program using statsmodels


Dataset
Gender Height Weight
Male 67.84867 176.1726
Male 65.93178 174.4853
Male 70.96655 193.9065
Male 67.58075 186.9916
Male 68.58927 173.5958
Male 73.09287 193.9442
Male 68.86006 177.1311
Male 68.97342 159.2852
Male 67.01379 199.1954
Male 71.55772 185.9059
Male 70.35188 198.903
Female 58.91073 102.0883
Female 65.23001 141.3058
Female 63.369 131.0414
Female 64.48 128.1715
Female 61.7931 129.7814
Female 65.96802 156.8021
Female 62.85038 114.969
Female 65.65216 165.083
Female 61.89023 111.6762

Steps:

1. Install all the libraries by using pip command for each module one by one in Windows command prompt as
follows:

a) pip install pandas


b) pip install statsmodels
c) pip install scikit-learn
d) pip install scipy
e) pip install seaborn
f) pip install numpy
g) pip install matplotlib

2. Save the weight-height data set in an excel file in Python Folder where your program is saved.

3. Open Python IDLE Window -> New File -> Type the following program

4. Run the module or press F5

Program:

import pandas as pd

import statsmodels.api as sm

from statsmodels.formula.api import ols

from sklearn.linear_model import LinearRegression

from scipy import stats

import seaborn as sns

import numpy as np

import matplotlib.pyplot as plt


#Read the excel data set file

tb1=pd.read_excel('weight-height.xlsx')

print(tb1)

#convert the attribute Gender into dummy binary attributes; it creates two columns, one for male and another
for female

#Male attributes contains 1 for male and 0 for female

#Female attribute contains 0 for male and 1 for female

# we can use any one of them

dummy=pd.get_dummies(tb1['Gender'])

#combine the gender attribute in the dataset tb1

step_1=pd.concat([tb1,dummy],axis=1)
step_1.drop(['Gender','Male'],inplace=True,axis=1)

print(step_1)

#perform multiple linear regression using OLS(Ordinary Least Square) Regression and print the result

result=sm.OLS(step_1['Weight'],sm.add_constant(step_1[['Height','Female']])).fit()

print(result.summary())

Output:

After fitting the linear equation, we obtain the following multiple linear regression model, using which
we can predict values of weight for any combination of gender and height:

Weight = -153.3956 + 4.8725*Height - 24.0633*Gender

Here Gender takes two values- 1 for female and 0 for male.
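A brief usage sketch of the reported equation, predicting the weight of a hypothetical 68-inch male and a
hypothetical 63-inch female:

# predicted weight from the reported coefficients
def predict_weight(height, female):
    return -153.3956 + 4.8725*height - 24.0633*female

print(predict_weight(68, 0))  # about 177.9 (male)
print(predict_weight(63, 1))  # about 129.5 (female)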

MGM University
Jawaharlal Nehru Engineering College, Aurangabad

Course Name: Engineering Statistics Class: S.Y. B. Tech. (CSE)


Course Co-ordinator: Prof. Nitin V Tawar Semester: IV

Tutorial No. 10

Aim: Implementation of Logistic Regression using Python

Theory:

In regression applications, the dependent variable, i.e. the result variable, may only assume discrete values. For
instance, a bank might like to develop an estimated regression equation for predicting whether a person will be
approved for a credit card or not.

The dependent variable can be coded as y = 1 if the bank approves the request for a credit card and y = 0 if the
bank rejects the request.

Using logistic regression, we can estimate the probability that the bank will approve the request for a credit
card given a particular set of values for the chosen independent variables.

Logistic Regression Equation

If the two values of the dependent variable y are coded as 0 or 1, the value of E(y) in the equation given below
provides the probability that y = 1 given a particular set of values for the independent variables x1, x2, x3,
x4, ….., xp:

E(y) = exp(b0 + b1x1 + b2x2 + … + bpxp) / (1 + exp(b0 + b1x1 + b2x2 + … + bpxp))

Because of the interpretation of E(y) as a probability, the logistic regression equation is written as follows:

ŷ = exp(b0 + b1x1 + b2x2 + … + bpxp) / (1 + exp(b0 + b1x1 + b2x2 + … + bpxp))

Here ŷ provides an estimate of the probability that y = 1 given a particular set of values for the independent
variables.

Example:

Let us consider an application of logistic regression involving a direct mail promotion being used by Simmons
Stores. Simmons owns and operates a national chain of women’s apparel stores. Five thousand copies of an
expensive four-color sales catalog have been printed, and each catalog includes a coupon that provides a $50
discount on purchases of $200 or more. The catalogs are expensive and Simmons would like to send them to
only those customers who have the highest probability of using the coupon.

Management thinks that annual spending at Simmons Stores and whether a customer has a Simmons credit
card are two variables that might be helpful in predicting whether a customer who receives the catalog will use
the coupon.

Simmons conducted a pilot study using a random sample of 50 Simmons credit card customers and 50 other
customers who do not have a Simmons credit card. Simmons sent the catalog to each of the 100 customers
selected. At the end of a test period, Simmons noted whether the customer used the coupon or not.

Following is the data (10 customer out of 100)

Customer Spending Card Coupon


1 2.291 1 0
2 3.215 1 0
3 2.135 1 0
4 3.924 0 0
5 2.528 1 0
6 2.473 0 1
7 2.384 0 0
8 7.076 0 0
9 1.182 1 1
10 3.345 0 0

The amount each customer spent last year at Simmons is shown in thousands of dollars and the credit card
information has been coded as 1 if the customer has a Simmons credit card and 0 if not.

In the Coupon column, a 1 is recorded if the sampled customer used the coupon and 0 if not.

Once the model has been fitted, we can compute ŷ (the estimated probability that y = 1) for any set of
independent variables.

Example: for a customer who spent $2000 last year (x1 = 2) and does not have a Simmons card (x2 = 0), the
fitted model gives

P(y = 1 | x1 = 2, x2 = 0) = 0.1880

Steps:

1. Install all the libraries (if not installed) by using pip command for each module one by one in Windows
command prompt as follows:

a) pip install pandas


b) pip install statsmodels

2. Save the above given tabular data in an excel file in Python Folder & save as Credit.xlsx where your
program is saved.

3. Open Python IDLE Window -> New File -> Type the following program

4. Run the module or press F5

Program:

import pandas as pd

import statsmodels.api as sm

#Read the excel data set file

tb1=pd.read_excel('Credit.xlsx')

print(tb1)

x=tb1[['Card','Spending']]

y=tb1['Coupon']

#perform logistic regression

x1=sm.add_constant(x)

logit_model=sm.Logit(y,x1)

result=logit_model.fit()

print(result.summary2())
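# (optional extra step, a minimal sketch: the fitted Logit result exposes
# predict(), which returns the estimated probability y-hat for each customer)
probs = result.predict(x1)
print(probs)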

Output:
