LAB MANUAL
Program (UG/PG) : UG
Semester : IV
FOREWORD
It is our great pleasure to present this laboratory manual for Second Year
engineering students for the subject of Engineering Statistics.
As students, many of you may have questions in mind regarding the subject, and
this manual attempts to answer them.
As you may be aware, MGM has already been awarded ISO 9001:2015 and
14001:2015 certification, and it is our endeavour to technically equip our
students by taking advantage of the procedural aspects of ISO certification.
Faculty members are also advised that covering these aspects at the initial stage
itself will greatly relieve them in future, as much of the load will be taken care of
by the enthusiastic energies of the students once they are conceptually clear.
Dr. H. H. Shinde
Principal
LABORATORY MANUAL CONTENTS
This manual is intended for Second Year students of Computer Science and
Engineering in the subject of Engineering Statistics. It contains tutorial sessions
covering various aspects of the subject to enhance understanding.
Students are advised to go through this manual thoroughly rather than only the topics
mentioned in the syllabus, as practical aspects are the key to understanding and
conceptual visualization of the theoretical aspects covered in the books.
LIST OF EXPERIMENTS
DOs and DON’Ts in Laboratory:
1. Make entry in the Log Book as soon as you enter the Laboratory.
2. All the students should sit according to their roll numbers, starting from left
to right.
3. All the students are supposed to enter the terminal number in the log book.
4. Do not change the terminal on which you are working.
5. All the students are expected to prepare at least the algorithm of the program/concept
to be implemented.
6. Strictly observe the instructions given by the teacher/Lab Instructor.
7. Do not disturb machine Hardware / Software Setup.
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Tutorial No. 1
Aim: Applications of Statistics in real life and introduction to useful Python libraries.
Theory:
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of
masses of numerical data.
Statistics is also defined as numerical data itself: the field deals with the collection, tabulation, and
interpretation of numerical data. An example of statistics is a report of numbers saying how many followers
each game has in a particular country. Statistics makes a set of data more easily understandable; it analyses
data and then uses it to solve different types of problems related to the data.
Consider a real-life example of statistics. To learn the mean of the marks obtained by each student in a class
whose strength is 60, we compute the average value; that average is a statistic of the marks obtained.
Suppose you want to find out how many people are employed in a town of 10 lakh people. Surveying everyone
is impractical, so we take a sample of 1000 people, and based on that sample we estimate the figure for the
whole town; the estimate is the statistic.
Statistics is a set of methods that allows us to solve complex problems. These statistical problems in real
life are usually based on facts and data.
Types of Statistics-
1. Descriptive Statistics
In descriptive statistics, the given observations are summarized. A sample of the
population is summarized using parameters such as the mean or the standard deviation. Descriptive
statistics thus gives a numerical and graphical summary of information and is used only for summing up
the data at hand. Descriptive statistics are applied to data which is already known.
2. Inferential Statistics
Inferential Statistics makes inferences and predictions about extensive data by considering a sample data
from the original data. It uses probability to reach conclusions.
The best real-world example of “Inferential Statistics” is, predicting the amount of rainfall we get in
the next month by Weather Forecast.
The statistical problems in real life consist of sampling, inferential statistics, probability, estimating, enabling a
team to develop effective projects in a problem-solving frame. For instance, car manufacturers looking to paint
the cars might include a wide range of people that include supervisors, painters, paint representatives, or the
same professionals to collect the data, which is necessary for the whole process and make it successful.
The current situation of COVID-19 is a good example of a statistical problem, where we need to determine case trends, vaccination coverage, and vaccine effectiveness from collected data.
Statistics is used for the graphical representation of collected data, and can compare information
through the median, mean, and mode. Therefore, statistics concepts can easily be applied to real life, such as
calculating the time to get ready for office, how much money is required to travel to work in a month, the diet
count for a week of gymming, uses in education, and much more.
Besides this, statistics can be utilized for managing daily routines so that you can work efficiently.
1. Government
Statistics is important in government, which uses it to make judgments about health, populations,
education, and much more. It may help the government check which vaccine is effective against the novel
coronavirus for its citizens, and what the progress reports after vaccination are, i.e., whether vaccines are useful or
not. For instance, with the help of polls the government can analyze in which areas vaccination is done and know
where it needs to target, or where cases are increasing day by day. The governments of different nations
are using statistical data to vaccinate their people. As per the data, the availability of vaccines can also be
monitored.
2. Weather Forecast
Statistics plays a crucial role in weather forecasting. The computer models used in weather forecasting are based on a set
of statistical functions, which compare the current weather conditions with pre-recorded
seasons and conditions. For weather forecasting, forecasters use the concepts of probability and statistics. They
employ several concepts and tools to achieve maximum forecasting accuracy.
3. Prediction
Data help us make predictions about what is going to happen in the future. Based on what we
face in our daily lives, we make predictions. How accurate a prediction will be depends on many factors. When
we make a prediction, we take into account the external or internal factors that may affect our future. Doctors,
engineers, artists, and practitioners all use statistics to make predictions about future events.
For example, doctors use statistics to understand the future course of a disease. They can predict the magnitude of
the flu in each winter season through the use of data.
Engineers use statistics to estimate the success of their ongoing projects, and they also use data to evaluate
how long a project will take to complete.
For example, now that all the schools and colleges are closed, students are studying in online mode, and it is
difficult for children to study online. In this case, teachers are working hard to teach the children
efficiently, and with the help of statistics they can analyze the performance of the students and guide them
properly.
4. Financial Market
The financial market relies completely on statistical data. All stock prices are calculated with the help of
statistics. Statistics also helps investors decide whether to invest in a particular stock, and helps
corporations manage their finances for long-term business.
For instance, suppose you want to buy the shares of a company but have no idea whether the company is
good or bad. Different types of statistical calculations, such as the price-to-book ratio, price-to-sales ratio, and
price-to-earnings-growth ratio, will help you invest in the right stocks.
5. Business Statistics
Each large organization uses business statistics and utilizes various data analysis tools. For instance,
approximating the probability and seeing where sales can be headed in the future. Several tools are used for
business statistics, which are built on the basis of the mean, median, and mode, the bell curve, and bar graphs,
and basic probability. These can be employed for research problems related to employees, products, customer
service, and much more.
Besides this, statistics is widely used for consumer goods, since consumer goods are products used
daily. Businesses use statistics to determine which consumer goods are available in a store and which are not.
They also use statistics to find out which store needs which consumer goods and when to ship the products. Proper
statistical decisions help businesses make substantial revenue on consumer goods.
6. Machine Learning
Statistics are utilized for quantifying the uncertainty of the estimated skills within the machine learning
models. These uncertainties are defined with the help of confidence intervals and tolerance intervals.
Statistics can be used for machine learning in various ways, such as for:
Problem Framing
Understanding the data
Data Cleaning
Selection of data
Data Preparation
Model Evaluation
Model Configuration
Selection of Model
Model Presentation
Model Prediction
Role of Python
Data science and machine learning have become essential in many fields of science and technology. A
necessary aspect of working with data is the ability to describe, summarize, and represent data
visually. Python statistics libraries are comprehensive, popular, and widely used tools that will assist you in
working with data.
Python’s statistics is a built-in Python library for descriptive statistics. You can use it if your datasets
are not too large or if you can’t rely on importing other libraries.
NumPy is a third-party library for numerical computing, optimized for working with single- and multi-
dimensional arrays. Its primary type is the array type called ndarray. This library contains
many routines for statistical analysis.
SciPy is a third-party library for scientific computing based on NumPy. It offers additional
functionality compared to NumPy, including scipy.stats for statistical analysis.
Pandas is a third-party library for numerical computing based on NumPy. It excels in handling labeled
one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.
Matplotlib is a third-party library for data visualization. It works well in combination with NumPy,
SciPy, and Pandas.
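As a quick, minimal sketch of the built-in statistics module described above (the data values here are arbitrary):

```python
import statistics

# arbitrary sample data
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

print(statistics.mean(speed))    # arithmetic mean of the values
print(statistics.median(speed))  # middle value after sorting
print(statistics.mode(speed))    # most frequently occurring value
```

No external libraries are needed, which makes this module convenient for small datasets.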
numpy Module
NumPy stands for Numerical Python. It is a Python library used for working with arrays. In Python, lists can
serve the purpose of arrays, but they are slow to process. The NumPy array is a powerful N-dimensional array
object used in statistical analysis, data visualization, linear algebra, Fourier transforms, and random
number generation. It provides an array object that is much faster than traditional Python lists.
NumPy supports multi-dimensional arrays and matrices and offers a wide range of
mathematical functions that operate on NumPy arrays and matrices.
Example:
# importing numpy module
import numpy as np
# creating a list (named list1 so it does not shadow the built-in name 'list')
list1 = [1, 2, 3, 4]
# creating a numpy array from the list
sample_array = np.array(list1)
pandas Module
Pandas is an open-source library built on top of the NumPy library. It is a Python package that offers
various data structures and operations for manipulating numerical data and time series. It is popular mainly
because it makes importing and analyzing data much easier. Pandas is fast and offers high performance and
productivity for users.
Example-
import pandas as pd
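A minimal sketch of the kind of labeled data Pandas handles; the column names and values here are made up for illustration:

```python
import pandas as pd

# a small DataFrame built from a dict of columns (hypothetical data)
df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"],
                   "marks": [72, 85, 90]})

print(df["marks"].mean())  # mean of a single labeled column
print(df.describe())       # summary statistics of the numeric columns
```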
matplotlib Module
Matplotlib can plot a huge amount of data on a single plot and make it look simple and compact.
Matplotlib is a Python package for 2D plotting, and the matplotlib.pyplot sub-module contains many plotting
functions to create various kinds of plots.
PYPLOT:
Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported under the plt alias:
import matplotlib.pyplot as plt
Basic Plotting
Procedure
1. Import matplotlib.pyplot as plt.
2. Prepare the data to plot, e.g. two sequences x and y of equal length.
3. Enter plt.plot(x, y, [fmt], **kwargs), where [fmt] is an (optional) format string and **kwargs are (optional)
keyword arguments specifying line properties of the plot.
4. Use pyplot functions to add features to the figure such as a title, legend, grid lines, etc.
1. Plot the following x and y values connected by a line
import matplotlib.pyplot as plt
# x-axis values
xpoints = [5, 2, 9, 4, 7]
# Y-axis values
ypoints = [10, 5, 8, 4, 2]
# Function to plot
plt.plot(xpoints, ypoints)
plt.show()
2. Draw two points in the diagram, one at position (1, 3) and one at position (8, 10)
import matplotlib.pyplot as plt
# x-axis values
xpoints = (1, 8)
# Y-axis values
ypoints = (3, 10)
# Plot only the points, using the 'o' format string
plt.plot(xpoints, ypoints, 'o')
plt.show()
3. Draw a line in a diagram from position (1, 3) to (2, 8), then to (6, 1), and finally to position (8, 10)
import matplotlib.pyplot as plt
# x-axis values
xpoints = (1, 2, 6, 8)
# Y-axis values
ypoints = (3, 8, 1, 10)
# Function to plot
plt.plot(xpoints, ypoints)
plt.show()
MARKERS:
You can use the keyword argument marker to emphasize each point with a specified marker:
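For example, a minimal sketch that marks each data point with a circle (the data values are arbitrary):

```python
import matplotlib.pyplot as plt

ypoints = [3, 8, 1, 10]
# marker='o' draws a circle at every data point
line, = plt.plot(ypoints, marker='o')
plt.show()
```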
Tutorial No. 2
Aim: Understanding and demonstrating measures of central tendency & dispersion using Python library and
with the help of Loan dataset.
Theory:
Descriptive Statistics is the building block of data science. Advanced analytics is often incomplete without
analyzing descriptive statistics of the key metrics. In simple terms, descriptive statistics can be defined as the
measures that summarize a given data, and these measures can be broken down further into the measures of
central tendency and the measures of dispersion.
Measures of central tendency include mean, median, and the mode, while the measures of variability include
standard deviation, variance, and the inter-quartile range. In this guide, you will learn how to compute these
measures of descriptive statistics and use them to interpret the data.
The mean, median, and mode are three metrics that are commonly used to describe the center of a
dataset.
Individuals and companies use these metrics all the time in different fields to gain a better understanding of
datasets. The following examples explain how the mean, median, and mode are used in different real life
scenarios.
The mean, median, and mode are widely used by insurance analysts and actuaries in the healthcare industry.
For example:
Mean: Insurance analysts often calculate the mean age of the individuals they insure so
they know the average age of their customers.
Median: Actuaries often calculate the median amount spent on healthcare each year by individuals so
they know how much insurance they need to be able to provide to individuals.
Mode: Actuaries also calculate the mode of their customers' ages (the most commonly occurring age) so they
know which age group uses their insurance the most.
The mean, median, and mode are also used often by real estate agents.
For example:
Mean: Real estate agents calculate the mean price of houses in a particular area so they can inform their
clients of what they can expect to spend on a house.
Median: Real estate agents also calculate the median price of houses to gain a better idea of the
“typical” home price, since the median is less influenced by outliers (like multi-million dollar homes)
compared to the mean.
Mode: Real estate agents also calculate the mode of the number of bedrooms per house so they can
inform their clients on how many bedrooms they can expect to have in houses in a particular area.
In Machine Learning (and in mathematics) there are often three values that interest us:
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
What is the average, the middle, or the most common speed value?
Mean:-
To calculate the mean, find the sum of all values, and divide the sum by the number of values:
(99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
The NumPy module has a method for this. Learn about the NumPy module in our NumPy Tutorial.
Example
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)
print(x)
Median:-
The median value is the value in the middle, after you have sorted all the values:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
It is important that the numbers are sorted before you can find the median.
Example
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
If there are two numbers in the middle, divide the sum of those numbers by two:
77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103
Here the two middle values are 86 and 87, so the median is (86 + 87) / 2 = 86.5.
Example
import numpy
speed = [99,86,87,88,86,103,87,94,78,77,85,86]
x = numpy.median(speed)
print(x)
Mode:-
The Mode value is the value that appears the most number of times:
99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 → mode = 86 (it occurs three times)
The SciPy module has a method for this. Learn about the SciPy module in our SciPy Tutorial.
Example
Use the SciPy mode() method to find the number that appears the most:
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)
Standard deviation is a number that describes how spread out the values are.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
speed = [86,87,88,86,87,85,86]
0.9
Meaning that most of the values are within the range of 0.9 from the mean value, which is 86.4.
Let us do the same with a selection of numbers with a wider range:
speed = [32,111,138,28,59,77,97]
37.85
Meaning that most of the values are within the range of 37.85 from the mean value, which is 77.4.
As you can see, a higher standard deviation indicates that the values are spread out over a wider range.
Example:-
import numpy
speed = [86,87,88,86,87,85,86]
x = numpy.std(speed)
print(x)
Example
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
Variance:-
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation!
Or the other way around, if you multiply the standard deviation by itself, you get the variance!
To calculate the variance, first find the mean:
(32+111+138+28+59+77+97) / 7 = 77.4
Then, for each value, find the difference from the mean:
32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6
Square each difference:
(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16
The variance is the average of these squared differences:
(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 = 1432.2
Example:-
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)
Standard Deviation
As we have learned, the standard deviation is the square root of the variance:
√1432.2 ≈ 37.85
Or, as in the example from before, use NumPy to calculate the standard deviation:
Example
Use the NumPy std() method to find the standard deviation:
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.std(speed)
print(x)
A statistical graph is a tool that helps you learn about the shape or distribution of a sample or a population. A
graph can be a more effective way of presenting data than a mass of numbers because we can see where data
clusters and where there are only a few data values. Newspapers and the Internet use graphs to show trends and
to enable readers to compare facts and figures quickly. Statisticians often graph data first to get a picture of the
data. Then, more formal tools may be applied.
Some of the types of graphs that are used to summarize and organize data are the scatter plot, the bar graph,
the histogram, the frequency polygon (a type of broken line graph), the pie chart, and the box plot.
BAR GRAPH
If you want to display relationships between data in categories, you can make a bar graph.
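A minimal bar-graph sketch; the category names and counts below are made up for illustration:

```python
import matplotlib.pyplot as plt

games = ['Cricket', 'Football', 'Hockey']  # hypothetical categories
followers = [45, 30, 25]                   # hypothetical counts per category
bars = plt.bar(games, followers)
plt.title('Followers per game')
plt.show()
```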
PIE CHART
A pie chart would show you how categories in your data relate to the whole set.
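A minimal pie-chart sketch; the labels and shares below are made up for illustration:

```python
import matplotlib.pyplot as plt

labels = ['Rent', 'Food', 'Travel', 'Other']  # hypothetical categories
shares = [40, 30, 15, 15]                     # hypothetical shares of the whole
# autopct prints each slice's percentage of the whole set
plt.pie(shares, labels=labels, autopct='%1.0f%%')
plt.show()
```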
SCATTER PLOT
A scatter plot displays the relationship between two numerical variables, drawing each observation as a single point.
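A minimal scatter-plot sketch; the paired values below are made up for illustration:

```python
import matplotlib.pyplot as plt

# hypothetical paired observations
x = [5, 7, 8, 7, 2, 17, 2, 9]
y = [99, 86, 87, 88, 111, 86, 103, 87]
plt.scatter(x, y)  # one point per (x, y) pair
plt.show()
```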
HISTOGRAM
To create a histogram, the first step is to create bins for the ranges: divide the whole range of values
into a series of intervals, and count the values that fall into each interval. Bins are defined
as consecutive, non-overlapping intervals of a variable. The matplotlib.pyplot.hist() function is used to compute
and create a histogram of x.
Example Code:
import numpy as np
import matplotlib.pyplot as plt
# Creating dataset (illustrative sample values; the original data was lost in extraction)
data = np.array([5, 12, 15, 18, 21, 22, 24, 25, 26, 27])
# Creating histogram
plt.hist(data)
# Show plot
plt.show()
BOX PLOT
A Box Plot is a visual representation depicting groups of numerical data through their quartiles.
A boxplot is also used to detect outliers in a data set. It captures the summary of the data efficiently with a
simple box and whiskers and allows us to compare easily across groups. A boxplot summarizes sample data
using the 25th, 50th, and 75th percentiles. These percentiles are also known as the lower quartile, median, and
upper quartile.
A box plot consists of 5 things.
Minimum
First Quartile or 25%
Median (Second Quartile) or 50%
Third Quartile or 75%
Maximum
A Box Plot is also known as a Whisker plot. In the box plot, a box is created from the first quartile to the third
quartile, and a vertical line goes through the box at the median. Here the x-axis denotes the data to
be plotted while the y-axis shows the frequency distribution.
Box Plot in Seaborn is used to draw a box plot to show distributions with respect to categories. The
seaborn.boxplot() function is used for this. Use the "orient" parameter to set the orientation of each numeric variable.
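The three quartiles a box plot is built from can be computed directly with NumPy; a minimal sketch with made-up data:

```python
import numpy as np

data = [7, 8, 9, 10, 12, 13, 14, 15, 21]  # made-up sample
# 25th, 50th and 75th percentiles: lower quartile, median, upper quartile
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)
```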
#Example Code
import seaborn as sb
import pandas as pd
import matplotlib.pyplot as plt
# illustrative data (the original snippet's data and plotting call were lost in extraction)
df = pd.DataFrame({"values": [7, 15, 13, 8, 21, 14, 10, 9, 12]})
sb.boxplot(data=df, y="values")
# display
plt.show()
OUTPUT:
We will be using fictitious data of loan applicants containing the following variables: Loan_ID, Age, Gender,
Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount,
Loan_Amount_Term, Credit_History, Property_Area, and Loan_Status.
#Python Program
import pandas as pd
import numpy as np
import statistics as st
df = pd.read_csv("loan_data_set.csv")
print(df.shape)
print(df.info())
#Find mean
#the function below gives the mean of each numerical column in the dataset
#we can infer the average age of the applicants, the average annual income, and the average tenure of loans
print(df.mean(numeric_only=True))
print(df["Age"].mean())
#find median
print(df.median(numeric_only=True))
#From the output, we can infer the median age of the applicants, the median annual income, and the
#median tenure of loans
print(df["Age"].median())
#Mode- only central tendency measure that can be used with categorical variables,
print(df.mode())
#The output above shows that most of the applicants are married, as depicted by the 'Marital_status' value of
#"Yes". A similar interpretation can be made for the other categorical variables. For numerical variables, the
#mode value represents the value that occurs most frequently.
#Standard deviation
print(df.std(numeric_only=True))
#In the output, the standard deviation of the variable 'Income' is much higher than that of the variable
#'Dependents'.
print(df.var(numeric_only=True))
#to get all the measures of all the numerical columns in one step, we can use the describe function, which
#summarises everything including measures of central tendency and dispersion
print(df.describe())
#to get all the measures of all the columns (including categorical) in one step, pass include='all', which
#summarizes everything including measures of central tendency and dispersion
print(df.describe(include='all'))
import seaborn as sb
import matplotlib.pyplot as plt
# draw a box plot of applicant income (illustrative; the original plotting call was lost in extraction)
sb.boxplot(x=df["ApplicantIncome"])
# display
plt.show()
Tutorial No. 3
Aim: Python program for demonstrating measures of central tendency & dispersion on Titanic Dataset
Theory:
Introduction
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during
her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and
crew. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for
the passengers and crew. Although there was some element of luck involved in surviving the sinking, some
groups of people were more likely to survive than others, such as women, children, and the upper-class.
We are using the Titanic data set, stored as CSV. The data consists of the following data columns:
Survived: This feature has the values 0 and 1: 0 for not survived and 1 for survived.
We are going to study the data set to understand various descriptive properties of the columns such as Age. We
are going to calculate the summary statistics for the data set which will contain the measures of central
tendency and the measures of dispersion. Different statistics are available and can be applied to columns with
numerical data. Operations in general exclude missing data and operate across rows by default.
PYTHON PROGRAM
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
titanic = pd.read_csv("titanic.csv")
print(titanic.head())
avg_age= titanic["Age"].mean()
print(avg_age)
print(titanic[["Age", "Fare"]].median())
#All statistical measures for multiple numerical columns with one function
print(titanic[["Age", "Fare"]].describe())
print(titanic[["Gender", "Age"]].groupby("Gender").mean())
#numeric_only=True restricts the aggregation to numeric columns (required in recent pandas versions)
print(titanic.groupby("Gender").mean(numeric_only=True))
# mean ticket fare price for each of the Gender and cabin class combinations
print(titanic.groupby(["Gender", "Pclass"])["Fare"].mean())
print(titanic["Pclass"].value_counts())
titanic.head()
print(titanic.shape)
print(titanic.dtypes)
#in the output the int64 and float64 are numeric types and object is the string type.
print(titanic.info())
print(titanic.describe())
#to check for the total number of null values in the dataset
print(titanic.isnull().sum())
#The above code displays the missing values in each column; this is important to check, as missing values may
#create problems in analysis and calculations
#where 0 represents that the passenger did not survive while 1 says that they survived.
#Now, in order to find out the number of passengers in each of the two groups, we employ the groupby() method
survived_count = titanic.groupby('Survived')['Survived'].count()
print(survived_count)
#The above code groups the data frame by values in the Survived column and counts the number of
#occurrences of each group
#Based on the output, we can see that there are 549 people who did not survive.
survived_Gender = titanic.groupby('Gender')['Survived'].sum()
print(survived_Gender)
pclass_count = titanic.groupby('Pclass')['Pclass'].count()
print(pclass_count)
# find out the distribution of Gender (how many male, female & children)
Gender_count = titanic.groupby('Gender')['Gender'].count()
print(Gender_count)
# But the Age column contains 177 missing values out of 891 rows in total.
ages = titanic[titanic['Age'].notnull()]['Age'].values
#It retrieves all non-NaN age values and stores the result in the ages NumPy array
# bin the ages into nine 10-year intervals (this line was missing in the original)
ages_hist = np.histogram(ages, bins=9, range=(0, 90))
print(ages_hist)
ages_hist_labels = ['0–10', '11–20', '21–30', '31–40', '41–50', '51–60', '61–70', '71–80', '81–90']
plt.figure(figsize=(7,7))
plt.title('Age distribution')
plt.bar(ages_hist_labels, ages_hist[0])
plt.xlabel('Age')
plt.ylabel('No of passenger')
plt.show()
print(titanic['Fare'].describe())
# display a box plot of the ticket fare (illustrative; the original plotting call was lost in extraction)
sb.boxplot(x=titanic['Fare'])
plt.show()
OUTPUT
Tutorial No. 4
Aim: Understand and demonstrate the use of various types of probabilities including Addition and
Multiplication rule of Probability, Conditional probability using Python.
Theory:
Introduction
What is probability?
The likelihood of the occurrence of any event can be called Probability.
Application of probability
Some of the applications of probability are predicting the outcome when you:
Flip a coin.
Choose a card from the deck.
Throw a die.
Pull a green candy from a bag of red candies.
Win a lottery with odds of 1 in many millions.
Probability theory is widely used in areas of study such as statistics, finance, gambling, artificial
intelligence, machine learning, computer science, game theory, and philosophy.
Event
An event is an outcome of the experiment.
Getting a head or a tail after tossing the coin can be considered an event in our experiment.
Sample Space
It is the set of all the possible outcomes.
In our case of tossing a coin, the set of all possible outcomes is S = {H, T}.
Probability
It is a value that denotes the chances of occurrence of some event.
Let "n" be the total number of possible outcomes, and E be an event. The probability of occurrence of that
event is
P(E) = (number of outcomes favourable to E) / n
Notice that, by this definition, the numerator will always be less than or equal to the denominator. So,
P(E) ≤ 1
Example: Find the probability of getting an even number when a die is rolled.
Solution:
Sample space, S={1,2,3,4,5,6}
Let E be the event of getting an even number.
E={2,4,6}
P(E) = 3/6 = 1/2
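The same computation can be checked in Python by enumerating the sample space:

```python
# sample space for one roll of a die
S = {1, 2, 3, 4, 5, 6}
# event: getting an even number
E = {x for x in S if x % 2 == 0}
p = len(E) / len(S)
print(p)  # 0.5
```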
Addition Theorem
Statement 1: If A and B are two mutually exclusive events, then
P(A ∪ B) = P(A) + P(B)
Example: Two dice are tossed once. Find the probability of getting an even number on the first die or a total of
8.
Solution:
Sample space={(1,1),(1,2),(1,3),(1,4),(1,5),(1,6),
(2,1),(2,2),(2,3),(2,4),(2,5),(2,6),
(3,1),(3,2),(3,3),(3,4),(3,5),(3,6),
(4,1),(4,2),(4,3),(4,4),(4,5),(4,6),
(5,1),(5,2),(5,3),(5,4),(5,5),(5,6),
(6,1),(6,2),(6,3),(6,4),(6,5),(6,6)}
An even number can appear on the first die in 3 ways, because any one of 2, 4, 6 can come, and the other die can
show any of 6 numbers, giving 18 favourable outcomes. Hence P(even number on first die) = 18/36.
P(a total of 8) = 5/36, from the pairs (2,6), (3,5), (4,4), (5,3), (6,2).
These two events are not mutually exclusive (they share (2,6), (4,4), and (6,2)), so Statement 1 alone would
overcount; the correct total probability is computed under Statement 2 below.
Statement 2: If A and B are two events that are not mutually exclusive, then
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Example: Two dice are tossed once. Find the probability of getting an even number on the first die or a total of
8.
Solution: P(even number on first die or a total of 8) = P(even number on first die) + P(total of 8) - P(even number
on first die and a total of 8)
Ordered pairs showing a total of 8 = {(6, 2), (5, 3), (4, 4), (3, 5), (2, 6)}, i.e. 5 pairs, so P(total of 8) = 5/36.
Of these, (6, 2), (4, 4), and (2, 6) also have an even number on the first die, so P(both) = 3/36.
∴ Required Probability = 18/36 + 5/36 - 3/36 = 20/36 = 5/9
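This result can be verified by enumerating all 36 outcomes in Python:

```python
# all ordered outcomes of tossing two dice
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
A = {o for o in outcomes if o[0] % 2 == 0}  # even number on first die
B = {o for o in outcomes if sum(o) == 8}    # total of 8
# counting the union automatically removes the double-counted overlap
p = len(A | B) / len(outcomes)
print(p)  # 20/36
```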
Multiplication Theorem
Statement: If A and B are two independent events, then the probability that both will occur is equal to the
product of their individual probabilities.
P(A∩B)=P(A)xP(B)
Example: A bag contains 5 green and 7 red balls. Two balls are drawn without replacement. Find the probability
that one is green and the other is red.
By the Multiplication Theorem (strictly, the draws here are dependent, so the general form
P(A ∩ B) = P(A) × P(B|A) is used):
P(green first, then red) = (5/12) × (7/11) = 35/132
P(red first, then green) = (7/12) × (5/11) = 35/132
∴ Required Probability = 35/132 + 35/132 = 70/132 = 35/66
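The result can be checked by a brute-force enumeration of all ordered draws in Python:

```python
from itertools import permutations

bag = ['G'] * 5 + ['R'] * 7          # 5 green and 7 red balls
draws = list(permutations(bag, 2))   # all ordered draws of two balls
# favourable: one ball of each colour, in either order
favourable = [d for d in draws if set(d) == {'G', 'R'}]
print(len(favourable) / len(draws))  # 70/132 = 35/66
```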
Conditional Probability
Conditional probabilities arise naturally in the investigation of experiments where the outcome of one trial may
affect the outcome of another. If the probability of an event changes when we take another event into
consideration, the two events are dependent. Examples:
Drawing a second ace from a deck given we got the first ace
Finding the probability of having a disease given you tested positive
Finding the probability of liking Harry Potter given we know the person likes fiction
We can write the conditional probability as P(A|B), the probability of the occurrence of event A given that B
has already occurred:
P(A|B) = P(A ∩ B) / P(B)
Suppose you draw two cards from a deck and you win if you get a jack followed by an ace (without
replacement). What is the probability of winning, given we know that you got a jack in the first turn?
The jack is already drawn, so only the second draw matters. After the jack is removed, 51 cards remain, of
which 4 are aces, so
P(win | jack first) = 4/51
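In Python, this conditional probability reduces to counting the aces among the 51 remaining cards:

```python
# after the jack is drawn, 51 cards remain and 4 of them are aces
remaining_cards = 51
aces_left = 4
p_win_given_jack = aces_left / remaining_cards
print(p_win_given_jack)  # 4/51 ≈ 0.0784
```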
Here we are determining probabilities when we know some conditions, instead of calculating unconditional
probabilities.
Suppose you have a jar containing 6 marbles, 3 black and 3 white. What is the probability of getting a black
marble on the second draw, given that the first draw was black (without replacement)?
P(A) = 3/6 (probability that the first marble is black)
P(B|A) = 2/5 (after one black marble is removed, 2 of the remaining 5 marbles are black)
A research group collected yearly data on road accidents in an accident-prone area, with respect to whether the
people involved followed the traffic rules. They are interested in calculating the
probability of an accident given that a person followed the traffic rules. From the collected data, 50 people who
followed the traffic rules had an accident, while 2000 who followed the rules did not.
Solution:
P(Accident | A person follow Traffic Rule) = P(Accident and follow Traffic Rule) / P(Follow Traffic Rule)
Let us solve this problem using a Python program:
P_Accident_who_follow_Traffic_Rule = 50
P_who_follow_Traffic_Rule = 50 + 2000
Conditional_Probability = P_Accident_who_follow_Traffic_Rule / P_who_follow_Traffic_Rule
print(Conditional_Probability)
Output
0.02439024
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Tutorial No. 5
Aim: Write a Python program for Bayes Theorem and demonstrate a real life example using a sample dataset.
Theory:
Although it is a powerful tool in the field of probability, Bayes Theorem is also widely used in the field of
machine learning in developing models for classification predictive modeling problems such as the Bayes
Optimal Classifier and Naive Bayes.
The Bayes theorem describes the probability of an event based on prior knowledge of the conditions that
might be related to the event. If we know the conditional probability P(A|B), we can use Bayes' rule to find
out the reverse probability P(B|A).
EXAMPLE:
Diagnostic Test: Consider a human population that may or may not have cancer (Cancer is True or False) and
a medical test that returns positive or negative for detecting cancer (Test is Positive or Negative), e.g. like a
mammogram for detecting breast cancer.
Problem Statement: If a randomly selected patient has the test and it comes back positive, what is the
probability that the patient has cancer?
Manual Calculation
Medical diagnostic tests are not perfect; they have error. Sometimes a patient will have cancer, but the test will
not detect it. This capability of the test to detect cancer is referred to as the sensitivity, or the true positive rate.
In this case, we will use a sensitivity value for the test. The test is good, but not great, with a true positive rate
or sensitivity of 85%. That is, of all the people who have cancer and are tested, 85% of them will get a positive
result from the test.
Suppose also that the base rate of cancer in the population is low: P(Cancer=True) = 0.02%, i.e. 0.0002.
We can correctly calculate the probability of a patient having cancer given a positive test result using Bayes
Theorem. Let’s map our scenario onto the equation:
P(Cancer=True | Test=Positive) = P(Test=Positive | Cancer=True) × P(Cancer=True) / P(Test=Positive)
We don’t know P(Test=Positive); it’s not given directly. Instead, we can estimate it using the law of total probability:
P(Test=Positive) = P(Test=Positive | Cancer=True) × P(Cancer=True) + P(Test=Positive | Cancer=False) × P(Cancer=False)
Firstly, we can calculate P(Cancer=False) as the complement of P(Cancer=True), which we already know
P(Cancer=False) = 1 – P(Cancer=True)
= 1 – 0.0002
= 0.9998
We can plug in our known values as follows:
We still do not know the probability of a positive test result given no cancer.
This requires additional information. Specifically, we need to know how good the test is at correctly
identifying people that do not have cancer: returning a negative result (Test=Negative) when the patient
does not have cancer (Cancer=False). This is called the true negative rate, or the specificity. Suppose the test has a specificity of 95%, so P(Test=Positive | Cancer=False) = 1 − 0.95 = 0.05. Then:
P(Test=Positive) = 0.85 × 0.0002 + 0.05 × 0.9998 ≈ 0.0502
Excellent, so the probability of the test returning a positive result, regardless of whether the person has cancer
or not, is about 5%.
We now have enough information to calculate Bayes Theorem and estimate the probability of a randomly
selected person having cancer if they get a positive test result.
The calculation suggests that if the patient is informed the test is positive, there is still only about a 0.339%
chance that they actually have cancer.
Python Code
# calculate the probability of cancer patient and diagnostic test
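A minimal sketch of such a program, mirroring the manual calculation (base rate 0.02%, sensitivity 85%, and a 95% specificity consistent with the 5% positive rate above):

```python
# calculate the probability of cancer given a positive diagnostic test
p_cancer = 0.0002          # P(Cancer=True), i.e. 0.02%
sensitivity = 0.85         # P(Test=Positive | Cancer=True)
specificity = 0.95         # P(Test=Negative | Cancer=False)

p_no_cancer = 1 - p_cancer               # 0.9998
p_pos_given_no_cancer = 1 - specificity  # 0.05

# law of total probability for the denominator
p_positive = sensitivity * p_cancer + p_pos_given_no_cancer * p_no_cancer

# Bayes Theorem
p_cancer_given_positive = sensitivity * p_cancer / p_positive
print('P(Cancer=True | Test=Positive) = %.3f%%' % (p_cancer_given_positive * 100))
```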
Running the example calculates the probability that a patient has cancer given the test returns a positive result,
matching our manual calculation.
P(A|B)=0.339%
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Aim: Write a Python program for Naïve Bayes Theorem and demonstrate a real life example using a sample
dataset.
Theory:
Introduction
You are working on a classification problem and have generated your set of hypotheses, created features and
discussed the importance of variables. Within an hour, stakeholders want to see the first cut of the model.
What will you do? You have hundreds of thousands of data points and quite a few variables in your training
data set. In such a situation, if I were in your place, I would have used ‘Naive Bayes‘, which can be extremely
fast relative to other classification algorithms. It works on Bayes theorem of probability to predict the class of
unknown data sets.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even
if these features depend on each other or upon the existence of the other features, all of these properties
independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity,
Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at
the equation below:
P(c|x) = P(x|c) × P(c) / P(x)
Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
Here we have our data, which comprises the day, outlook, humidity, and wind conditions. The final column is
'Play,' i.e., can we play outside, which we have to predict.
First, we will create a frequency table using each attribute of the dataset.
For each frequency table, we will generate a likelihood table.
Outlook = Rain
Humidity = High
Wind = Weak
Play = ?
So, with the data, we have to predict whether "we can play on that day or not."
Our model predicts that there is a 55% chance there will be a game tomorrow.
Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’. These are the 3 possible classes
of the Y variable.
We have data for the following X variables, all of which are binary (1 or 0).
Long
Sweet
Yellow
The first few rows of the training dataset look like this:
For the sake of computing the probabilities, let’s aggregate the training data to form a counts table like this.
So the objective of the classifier is to predict if a given fruit is a ‘Banana’ or ‘Orange’ or ‘Other’ when only the
3 features (long, sweet and yellow) are known.
Let’s say you are given a fruit that is: Long, Sweet and Yellow, can you predict what fruit it is?
This is the same as predicting the Y when only the X variables in the testing data are known. Let’s solve it by hand
using Naive Bayes.
The idea is to compute the 3 probabilities, that is the probability of the fruit being a banana, orange or other.
Whichever fruit type gets the highest probability wins.
All the information to calculate these probabilities is present in the above tabulation.
Step 1: Compute the ‘Prior’ probabilities for each of the class of fruits.
That is, the proportion of each fruit class out of all the fruits from the population. You can provide the ‘Priors’
from prior information about the population. Otherwise, it can be computed from the training data.
For this case, let’s compute from the training data. Out of 1000 records in training data, you have 500 Bananas,
300 Oranges and 200 Others. So the respective priors are 0.5, 0.3 and 0.2.
Step 2: Compute the probability of evidence that goes in the denominator.
This is nothing but the product of P(x) for all x. This is an optional step because the denominator is the
same for all the classes and so will not affect which class wins.
Step 3: Compute the probability of likelihood of evidences that goes in the numerator.
It is the product of conditional probabilities of the 3 features. If you refer back to the formula, it says P(X1
|Y=k). Here X1 is ‘Long’ and k is ‘Banana’. That means the probability the fruit is ‘Long’ given that it is a
Banana. In the above table, you have 500 Bananas. Out of that 400 is long. So, P(Long | Banana) = 400/500 =
0.8.
So, the overall probability of Likelihood of evidence for Banana = 0.8 * 0.7 * 0.9 = 0.504
Step 4: Substitute all the 3 equations into the Naive Bayes formula, to get the probability that it is a
banana.
Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’. The denominator is the same for
all 3 cases, so it’s optional to compute.
Clearly, Banana gets the highest probability, so that will be our predicted class.
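A sketch of this classifier in Python. The Banana counts (400 long, 350 sweet, 450 yellow out of 500) follow from the probabilities worked out above; the Orange and Other counts are assumed purely for illustration:

```python
# fruit: (total, long, sweet, yellow) -- Banana counts from the text,
# Orange and Other counts are assumed for illustration
counts = {
    'Banana': (500, 400, 350, 450),
    'Orange': (300, 0, 150, 300),
    'Other':  (200, 100, 150, 50),
}
total_fruits = 1000

scores = {}
for fruit, (n, long_, sweet, yellow) in counts.items():
    prior = n / total_fruits
    # likelihood of the evidence (Long, Sweet, Yellow) given the class
    likelihood = (long_ / n) * (sweet / n) * (yellow / n)
    scores[fruit] = prior * likelihood  # denominator P(evidence) omitted

prediction = max(scores, key=scores.get)
print(scores)
print('Predicted:', prediction)  # Banana
```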
For sake of demonstration, let’s use the standard iris dataset to predict the Species of flower using 4 different
features: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
# Import packages
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd
from sklearn import metrics
# Import data
training = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/iris_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/iris_test.csv')
# Create the X, Y, Training and Test
xtrain = training.drop('Species', axis=1)
ytrain = training.loc[:, 'Species']
xtest = test.drop('Species', axis=1)
ytest = test.loc[:, 'Species']
# Init the Gaussian Classifier
model = GaussianNB()
# Train the model
model.fit(xtrain, ytrain)
# Predict Output
pred = model.predict(xtest)
print(metrics.classification_report(ytest, pred))
print(metrics.confusion_matrix(ytest, pred))
A Classification report is used to measure the quality of predictions from a classification algorithm. How many
predictions are True and how many are False. More specifically, True Positives, False Positives, True
negatives and False Negatives are used to predict the metrics of a classification report as shown below.
The report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics
are calculated by using true and false positives, true and false negatives. Positive and negative in this case are
generic names for the predicted classes. There are four ways to check if the predictions are right or wrong:
TP – True Positives
FP – False Positives
TN – True Negatives
FN – False Negatives
Accuracy
Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions
our model got right. Formally, accuracy has the following definition:
Accuracy = Number of correct predictions / Total number of predictions
For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
Let's try calculating accuracy for the following model that classified 100 tumors as malignant (the positive
class) or benign (the negative class):
Accuracy comes out to 0.91, or 91% (91 correct predictions out of 100 total examples). That means our tumor
classifier is doing a great job of identifying malignancies, right?
Actually, let's do a closer analysis of positives and negatives to gain more insight into our model's
performance.
Of the 100 tumor examples, 91 are benign (90 TNs and 1 FP) and 9 are malignant (1 TP and 8 FNs).
Of the 91 benign tumors, the model correctly identifies 90 as benign. That's good. However, of the 9 malignant
tumors, the model only correctly identifies 1 as malignant—a terrible outcome, as 8 out of 9 malignancies go
undiagnosed!
While 91% accuracy may seem good at first glance, another tumor-classifier model that always predicts benign
would achieve the exact same accuracy (91/100 correct predictions) on our examples. In other words, our
model is no better than one that has zero predictive ability to distinguish malignant tumors from benign
tumors.
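The tumor example can be verified with a few lines of Python, using the counts stated above:

```python
# counts from the tumor example: 91 benign, 9 malignant
TP, TN, FP, FN = 1, 90, 1, 8

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.91

# a model that always predicts benign gets the exact same accuracy
always_benign_accuracy = (0 + 91) / 100
print(always_benign_accuracy)  # 0.91
```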
Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set, like this one,
where there is a significant disparity between the number of positive and negative labels.
Confusion Matrix
Coming to the confusion matrix, it is a much more detailed representation of what's going on with your labels. Say there
were 71 points in the first class (label 0). Of these, your model identified 54 correctly as label 0, but 17 were marked as label 4. Similarly, look at the second row: there were 43 points in class 1, but 36 of them were marked correctly, while your classifier predicted 1 in class 3 and 6 in class 4.
Now you can see the pattern this follows. An ideal classifier with 100% accuracy would produce a pure
diagonal matrix, with all the points predicted in their correct class.
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Tutorial No. 6
Theory:
Definition:
A probability distribution is a statistical function that describes all the possible values and probabilities for a
random variable within a given range. This range will be bound by the minimum and maximum possible
values, but where the possible value would be plotted on the probability distribution will be determined by a
number of factors. The mean (average), standard deviation, skewness, and kurtosis of the distribution are
among these factors.
Binomial Distribution
Suppose you won the toss in a cricket match today; this indicates a successful event. You toss again tomorrow but
lose this time. Winning the toss today does not guarantee that you will win the toss tomorrow. Let’s
assign a random variable, say X, to the number of times you won the toss. What can the possible values of
X be? It can be any number from 0 up to the number of times you tossed.
There are only two possible outcomes. Head denoting success and tail denoting failure. Therefore, probability
of getting a head = 0.5 and the probability of failure can be easily computed as: q = 1- p = 0.5.
A distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose and
where the probability of success and failure is same for all the trials is called a Binomial Distribution.
The outcomes need not be equally likely. For instance, if the probability of success in an experiment is 0.2, then the probability of failure can be easily computed as q = 1 – 0.2 = 0.8.
Each trial is independent since the outcome of the previous toss doesn’t determine or affect the outcome of the
current toss. An experiment with only two possible outcomes repeated n number of times is called binomial.
The parameters of a binomial distribution are n and p, where n is the total number of trials and p is the probability of success in each trial.
On the basis of the above explanation, the properties of a Binomial Distribution are:
1. The experiment consists of n identical trials.
2. Each trial has only two possible outcomes: success or failure.
3. The trials are independent of each other.
4. The probability of success p (and of failure q = 1 − p) remains the same for every trial.
Example: A fair coin is tossed 6 times. Find the probability of getting (a) exactly two heads.
Solution:
(a) The repeated tossing of the coin is an example of a Bernoulli trial. According to the problem, n = 6 and p = q = 1/2.
P(x=2) = C(6,2) × (1/2)² × (1/2)⁴ = 15/64 = 0.2343
This can be achieved through programming by calculating the binomial distribution for each values of success.
For 6 tosses of a coin, we may have no of successes (here success=getting a head) from 0 or 1 or 2 or 3 or…..
upto 6.
Consider a random experiment of tossing a fair coin 6 times, where the probability of getting a head is 0.5. If
‘getting a head’ is considered as ‘success’ then the binomial distribution table will contain the probability
of r successes for each possible value r = 0, 1, 2, …, 6.
Using Python to obtain the distribution:
SciPy:
SciPy is an Open Source Python library, used in mathematics, engineering, scientific and technical computing.
Installation:
pip install scipy
The scipy.stats module contains various functions for statistical calculations and tests. The stats() function of
the scipy.stats.binom module can be used to calculate a binomial distribution using the values of n and p.
Syntax : scipy.stats.binom.stats(n, p)
It returns a tuple containing the mean and variance of the distribution in that order.
scipy.stats.binom.pmf() function is used to obtain the probability mass function for a certain value of r, n and
p. We can obtain the distribution by passing all possible values of r(0 to n).
Syntax : scipy.stats.binom.pmf(r, n, p)
Approach :
Define n and p.
Define a list of r values from 0 to n.
Obtain the mean and variance with scipy.stats.binom.stats(n, p).
Calculate the pmf for each r value and print the distribution table.
Python Program
from scipy.stats import binom

n = 6
p = 0.5
x_values = list(range(n + 1))
dist = [binom.pmf(r, n, p) for r in x_values]
mean, var = binom.stats(n, p)
print("x\tp(x)")
for i in range(len(x_values)):
    print(str(x_values[i]), "\t", str(dist[i]))
print("mean = ", str(mean))
print("variance = ", str(var))
OUTPUT:
x p(x)
0 0.015625
1 0.09375000000000003
2 0.23437500000000003
3 0.31249999999999983
4 0.234375
5 0.09375000000000003
6 0.015625
mean = 3.0
variance = 1.5
After getting the distribution, we can use it to solve the question as solved above manually.
i) Probability of getting exactly two heads = P(x=2)
From the distribution obtained in the output, we can directly read the answer: P(x=2) = 0.234375.
ii) We can add the values from the program output for x=4 and x=5 to get this answer: 0.234375 + 0.09375 = 0.328125.
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Tutorial No. 7
Aim: Write a program for various types of correlation. Plot the correlation plot on dataset and visualize giving
an overview of relationships among data on iris data.
Theory:
Correlation is used to test relationships between quantitative variables or categorical variables. In other words,
it’s a measure of how things are related. The study of how variables are correlated is called correlation
analysis.
Some examples of data that have a low correlation (or none at all):
The cost of a car wash and how long it takes to buy a soda inside the station.
Correlations are useful because if you can find out what relationship variables have, you can make predictions
about future behavior. Knowing what the future holds is very important in the social sciences like government
and healthcare. Businesses also use these statistics for budgets and business plans.
A correlation coefficient is a way to put a value to the relationship. Correlation coefficients take values
between -1 and 1. A “0” means there is no relationship between the variables at all, while -1 or 1 means that
there is a perfect negative or positive correlation (negative or positive correlation here refers to the type of
graph the relationship will produce).
Graphs showing a correlation of -1, 0 and +1
Applications
Essentially, correlation analysis is used for spotting patterns within datasets. A positive correlation result
means that both variables increase in relation to each other, while a negative correlation means that as one
variable decreases, the other increases.
There are three common ways of ranking statistical correlation: according to Spearman, Kendall, and
Pearson. Each coefficient represents its end result as ‘r’. Spearman’s Rank and Pearson’s Coefficient are
the two most widely used analytical formulae, the choice depending on the type of data.
When to Use
The two methods outlined above are to be used according to whether there are parameters associated with the
data gathered. The two terms to watch out for are:
Parametric: (Pearson’s Coefficient) Where the data must be handled in relation to the parameters of
populations or probability distributions. Typically used with quantitative data already set out within said
parameters.
Nonparametric: (Spearman’s Rank) Where no assumptions can be made about the probability distribution.
Typically used with qualitative data, but can be used with quantitative data if Spearman’s Rank proves
inadequate.
In cases when both are applicable, statisticians recommend using parametric methods such as Pearson’s
Coefficient, because they tend to be more precise. But that doesn’t mean the non-parametric methods should be
discounted when there isn’t enough data or when a more specific, accurate result is needed.
For example: Up till a certain age, (in most cases) a child’s height will keep increasing as his/her age
increases. Of course, his/her growth depends upon various factors like genes, location, diet, lifestyle, etc.
This approach is based on covariance and is one of the most widely used methods of measuring the relationship between two
variables. The Pearson correlation coefficient is calculated as:
r = (N Σxy − Σx Σy) / √[(N Σx² − (Σx)²)(N Σy² − (Σy)²)]
Where:
N = number of pairs of scores
Σxy = sum of the products of paired scores
Σx, Σy = sums of the x scores and the y scores
Σx², Σy² = sums of the squared x scores and the squared y scores
Step one: Create a Pearson correlation coefficient table. Make a data chart, including both the variables. Label
these variables ‘x’ and ‘y.’ Add three additional columns – (xy), (x^2), and (y^2). Refer to this simple data
chart.
Step two: Use basic multiplication to complete the table. Then sum each column and substitute the column totals into the Pearson formula to obtain r.
If the result is negative, there is a negative correlation relationship between the two variables. If the result is
positive, there is a positive correlation relationship between the variables. Results can also define the strength
of a linear relationship i.e., strong positive relationship, strong negative relationship, medium positive
relationship, and so on.
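The steps above can be carried out in a few lines of Python; the x and y values here are made up for illustration:

```python
import math

# hypothetical paired data
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]
n = len(x)

# column sums: Σx, Σy, Σxy, Σx², Σy²
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Pearson correlation coefficient
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 4))
```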
Correlation Matrix
A correlation matrix is simply a table showing the correlation coefficients between variables.
It is closely related to the covariance matrix (also known as the auto-covariance matrix, dispersion
matrix, variance matrix, or variance-covariance matrix): a correlation matrix is the covariance matrix of the
standardized variables. It is a matrix in which the (i, j) position gives the
correlation between the ith and jth variables of the given data set.
When the data points follow a roughly straight-line trend, the variables are said to have an approximately
linear relationship. In some cases, the data points fall close to a straight line, but more often there is quite a bit
of variability of the points around the straight-line trend. A summary measure called the correlation describes
the strength of the linear association. Correlation summarizes the strength and direction of the linear (straight-
line) association between two quantitative variables. Denoted by r, it takes values between -1 and +1. A
positive value for r indicates a positive association, and a negative value for r indicates a negative association.
The closer r is to ±1, the closer the data points fall to a straight line and the stronger the linear association. The
closer r is to 0, the weaker the linear association.
Here, the variables are represented in the first row, and in the first column:
The table above has used data from the full health data set.
Observations:
We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of
0.89. This makes sense, as the longer we train, the more calories we burn.
We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage
(correlation coefficient of 0.02).
import pandas as pd

# the full health data set is assumed to be saved locally as a CSV file
full_health_data = pd.read_csv("full_health_data.csv")

Corr_Matrix = round(full_health_data.corr(), 2)
print(Corr_Matrix)
OUTPUT:
EXAMPLE 2: Let’s take another example and use House Price dataset for this.
#Load Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Loading data (the file name for the House Price dataset is assumed)
data = pd.read_csv('house_prices.csv')
print(data.shape)
print(data['SalePrice'].describe())

#Histogram of SalePrice
data['SalePrice'].plot(kind="hist")
plt.show()

# Correlation Matrix (numeric columns only)
corrmat = data.corr(numeric_only=True)
print(corrmat)
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Tutorial No. 8
Theory:
Regression analysis is a form of predictive modeling technique which investigates the relationship between a
dependent (target) variable and one or more independent (predictor) variables. This technique is used for forecasting, time
series modeling and finding the causal-effect relationship between variables. For example, the relationship
between rash driving and the number of road accidents by a driver is best studied through regression.
Regression analysis is an important tool for modeling and analyzing data. Here, we fit a curve / line to the data
points in such a manner that the distances of the data points from the curve or line are minimized.
As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s
understand this with an easy example:
Let’s say, you want to estimate growth in sales of a company based on current economic conditions. You have
the recent company data which indicates that the growth in sales is around two and a half times the growth in
the economy. Using this insight, we can predict future sales of the company based on current & past
information.
There are multiple benefits of using regression analysis. They are as follows:
1. It indicates the significant relationships between dependent variable and independent variable.
2. It indicates the strength of impact of multiple independent variables on a dependent variable.
3. It allows us to compare the effects of variables measured on different scales, such as the
effect of price changes and the number of promotional activities.
These benefits help market researchers / data analysts / data scientists to evaluate and select the best set of variables for building predictive
models.
Linear regression assumes a linear or straight line relationship between the input variables (X) and the single
output variable (y). More specifically, that output (y) can be calculated from a linear combination of the input
variables (X). When there is a single input variable, the method is referred to as a simple linear regression.
In simple linear regression we can use statistics on the training data to estimate the coefficients required by the
model to make predictions on new data.
Imagine you have some points, and want to have a line that best fits them like this:
We can place the line "by eye": try to have the line as close as possible to all points, and a similar number of
points above and below the line.
But for better accuracy let's see how to calculate the line using Least Squares Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line :
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x, y) point, calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy
Step 3: Calculate the slope m:
m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
Step 4: Calculate the intercept b:
b = (Σy − m Σx)/ N
Step 5: Assemble the equation of a line:
y = mx + b
Example
Example: Sam found how many hours of sunshine vs how many ice creams were sold at the shop from
Monday to Friday:
"x" "y"
Hours of Sunshine Ice Creams Sold
2 4
3 5
5 7
7 10
9 15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b
x y x2 xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx: 26 Σy: 41 Σx2: 168 Σxy: 263
m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
= (5 × 263 − 26 × 41) / (5 × 168 − 26²)
= 249/164
= 1.5183...
b = (Σy − m Σx)/N
= (41 − 1.5183 × 26)/5
= 0.3049...
y = mx + b
y = 1.518x + 0.305
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Nice fit!
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above
equation to estimate that he will sell
y = 1.518 × 8 + 0.305 = 12.45 ice creams
Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
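The hand calculation above can be reproduced with a short Python sketch:

```python
# hours of sunshine (x) vs ice creams sold (y), from the table above
x = [2, 3, 5, 7, 9]
y = [4, 5, 7, 10, 15]
N = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(a * a for a in x)
sum_xy = sum(a * b for a, b in zip(x, y))

# least-squares slope and intercept
m = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / N
print(round(m, 4), round(b, 4))  # 1.5183 0.3049

# forecast: 8 hours of sun tomorrow
print(round(m * 8 + b, 2))       # 12.45
```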
It works by making the total of the square of the errors as small as possible (that is why it is called "least
squares"):
So, when we square each of those errors and add them all up, the total is as small as possible.
You can imagine (but not accurately) each data point connected to a straight bar by springs:
Application: some of the most popular applications of Linear regression algorithm are in financial portfolio
prediction, salary forecasting, real estate predictions and in traffic in arriving at ETAs.
First Program of linear regression using basic statistical formula
import numpy as np

def estimate_coef(x, y):
    # least-squares estimates of intercept b_0 and slope b_1
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))

if __name__ == "__main__":
    main()
Second Program of linear regression using statsmodels
Dataset
Travel Petrol
20 1
45 3
56 5
34 2
28 1.6
49 3.7
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

tb1 = pd.read_excel('regr.xlsx')
tb1.plot('Travel', 'Petrol', style='o')
plt.ylabel('Petrol')
plt.show()
t = tb1['Travel']
c = tb1['Petrol']
t1 = sm.add_constant(t)
model1 = sm.OLS(c, t1)
result1 = model1.fit()
print(result1.summary())
coeff = result1.params
print("Coefficient 1:", coeff[0])
print("Coefficient 2:", coeff[1])
y = []
for x in tb1['Travel']:
    y.append(coeff[0] + coeff[1] * x)
for i in range(len(y)):
    print(y[i], "\t", c[i])
pred = result1.predict(t1)
print(pred)
Third Program of linear regression using sklearn
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
y_pred = model.predict(x)
print('predicted dependent:', y_pred, sep='\n')
y_pred = model.intercept_ + model.coef_ * x
print('predicted dependent:', y_pred, sep='\n')
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Tutorial No. 9
Theory:
Simple linear regression uses a linear function to predict the value of a target variable y from a single
independent variable x₁:
y = b₀ + b₁x₁
After fitting the linear equation to the observed data, we obtain the values of the parameters b₀ and b₁ that best
fit the data, minimizing the squared error.
For example, to predict the weight based on the height of a person, we can use the above equation.
Here one independent variable is used to predict the weight of the person, Weight = f(Height), creating a
model. We can also create a model that predicts the weight using both height and gender as independent variables,
and this is done with the help of multiple linear regression.
Multiple linear regression uses a linear function to predict the value of a target variable y from n
independent variables x = [x₁, x₂, x₃, …, xn]:
y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + … + bnxn
We obtain the values of the parameters bᵢ, using the same technique as in simple linear regression (least square
error). After fitting the model, we can use the equation to predict the value of the target variable y. In our case,
we use height and gender to predict the weight of a person Weight = f(Height,Gender).
There are two types of variables used in statistics: numerical and categorical variables.
Numerical variables represent values that can be measured and sorted in ascending and descending
order such as the height of a person.
Categorical variables are values that can be sorted in groups or categories such as the gender of a
person.
Multiple linear regression accepts not only numerical variables, but also categorical ones. To include a
categorical variable in a regression model, the variable has to be encoded as a binary variable (dummy
variable). In Pandas, we can easily convert a categorical variable into a dummy variable using the
pandas.get_dummies function. This function returns a dummy-coded data where 1 represents the presence of
the categorical variable and 0 the absence.
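For instance, on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})
dummies = pd.get_dummies(df['Gender'])
print(dummies)
# two indicator columns, 'Female' and 'Male'; each row has a 1 in the
# column matching its category and a 0 elsewhere
```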
b₁ = (Σx₂² Σx₁y − Σx₁x₂ Σx₂y) / (Σx₁² Σx₂² − (Σx₁x₂)²)
and
b₂ = (Σx₁² Σx₂y − Σx₁x₂ Σx₁y) / (Σx₁² Σx₂² − (Σx₁x₂)²)
(where lowercase x₁, x₂ and y denote deviations from the respective means)
At this point, you should notice that all the terms from the one variable case appear in the two variable case. In
the two variable case, the other X variable also appears in the equation. For example, X2 appears in the
equation for b1. Note that terms corresponding to the variance of both X variables occur in the slopes. Also
note that a term corresponding to the covariance of X1 and X2 (sum of deviation cross-products) also appears
in the formula for the slope.
This equation is a straight-forward generalization of the case for one independent variable.
A Numerical Example
Suppose we want to predict the job performance of Chevy mechanics based on mechanical aptitude test scores and
scores from a personality test that measures conscientiousness. (In practice, we would need many more
people, but I wanted to fit this on a PowerPoint slide.)
      y        X1       X2
Y     29.75    139.5    90.25
X1    0.77     1091.8   515.5
X2    0.72     0.68     521.75
The numbers in the table above correspond to sums of squares, cross products, and correlations: the diagonal entries (29.75, 1091.8, 521.75) are sums of squared deviations, the entries above the diagonal are sums of deviation cross-products, and the entries below the diagonal (0.77, 0.72, 0.68) are the correlations among y, X1 and X2.
This technique is used when there is more than one predictor variable in a multivariate regression model and
the model is called a multivariate multiple regression. Termed as one of the simplest supervised machine
learning algorithms by researchers, this regression algorithm is used to predict the response variable for a set
of explanatory variables. This regression technique can be implemented efficiently with the help of matrix
operations and in Python, it can be implemented via the “numpy” library which contains definitions and
operations for matrix object.
Application: Industry application of Multivariate Regression algorithm is seen heavily in the retail sector
where customers make a choice on a number of variables such as brand, price and product. The multivariate
analysis helps decision makers to find the best combination of factors to increase footfalls in the store.
Steps:
1. Install all the libraries by using pip command for each module one by one in Windows command prompt as
follows:
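The modules needed are the ones imported by the program below; assuming a standard Python installation, a command along these lines installs them (openpyxl is the engine pandas uses to read .xlsx files):

```shell
pip install pandas statsmodels numpy openpyxl
```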
2. Save the weight-height data set as an Excel file (weight-height.xlsx) in the folder where your program is saved.
3. Open Python IDLE Window -> New File ->Type the following program
Program:
import pandas as pd
import statsmodels.api as sm

tb1 = pd.read_excel('weight-height.xlsx')
print(tb1)
# Convert the attribute Gender into dummy binary attributes; this creates two
# columns, one for Male and another for Female
dummy = pd.get_dummies(tb1['Gender'])
step_1 = pd.concat([tb1, dummy], axis=1)
step_1.drop(['Gender', 'Male'], inplace=True, axis=1)
print(step_1)
# Perform multiple linear regression using OLS (Ordinary Least Squares) and print the result
result = sm.OLS(step_1['Weight'], sm.add_constant(step_1[['Height', 'Female']])).fit()
print(result.summary())
Output:
After fitting the linear equation, we obtain the following multiple linear regression model, which
we can use to predict values of weight for any combination of gender and height:
Weight = -153.3956+4.8725*Height-24.0633*Gender
Here Gender takes two values- 1 for female and 0 for male.
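The fitted equation can be applied directly. For example, assuming the data set records height in inches and weight in pounds, the predicted weight of a 70-inch-tall male (Gender = 0) is:

```python
# Predicted weight from the fitted model (Gender = 1 for female, 0 for male)
def predict_weight(height, gender):
    return -153.3956 + 4.8725 * height - 24.0633 * gender

print(round(predict_weight(70, 0), 2))  # → 187.68
```

For a female of the same height the prediction is lower by exactly the Gender coefficient, 24.0633.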
MGM University
Jawaharlal Nehru Engineering College, Aurangabad
Tutorial No. 10
Theory:
In regression applications, the dependent variable (i.e., the outcome variable) may only assume discrete values. For
instance, a bank might like to develop an estimated regression equation for predicting whether a person will be
approved for a credit card or not.
The dependent variable can be coded as y=1 if the bank approves the request for a credit card and y=0 if the bank
rejects the request.
Using logistic regression, we can estimate the probability that the bank will approve the request for a credit
card given a particular set of values for the chosen independent variables.
If the two values of the dependent variable y are coded as 0 or 1, the value of E(y) in the equation given below
provides the probability that y=1 given a particular set of values for the independent variables x1, x2, x3,
x4, …, xp:

E(y) = P(y = 1 | x1, x2, …, xp)

Because of the interpretation of E(y) as a probability, the logistic regression equation is written as follows:

ŷ = e^(b0 + b1x1 + b2x2 + … + bpxp) / (1 + e^(b0 + b1x1 + b2x2 + … + bpxp))

Here ŷ provides an estimate of the probability that y=1 given a particular set of values for the independent variables.
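A quick sketch of this equation in Python (the coefficients passed in are arbitrary values of our choosing, just to show the shape of the function; the estimate always lies between 0 and 1):

```python
import math

def logistic(b0, coeffs, xs):
    """Estimated P(y = 1) for predictor values xs under a logistic regression model."""
    z = b0 + sum(b * x for b, x in zip(coeffs, xs))
    return math.exp(z) / (1 + math.exp(z))

# When the linear combination is zero the estimate is exactly 0.5,
# and it rises toward 1 as the combination grows
print(logistic(0.0, [1.0], [0.0]))  # → 0.5
```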
Example:
Let us consider an application of logistic regression involving a direct mail promotion being used by Simmons
Stores. Simmons owns and operates a national chain of women’s apparel stores. Five thousand copies of an
expensive four-color sales catalog have been printed, and each catalog includes a coupon that provides a $50
discount on purchases of $200 or more. The catalogs are expensive and Simmons would like to send them to
only those customers who have the highest probability of using the coupon.
Management thinks that annual spending at Simmons Stores and whether a customer has a Simmons credit
card are two variables that might be helpful in predicting whether a customer who receives the catalog will use
the coupon.
Simmons conducted a pilot study using a random sample of 50 Simmons credit card customers and 50 other
customers who do not have a Simmons credit card. Simmons sent the catalog to each of the 100 customers
selected. At the end of a test period, Simmons noted whether or not each customer used the coupon.
The amount each customer spent last year at Simmons is shown in thousands of dollars and the credit card
information has been coded as 1 if the customer has a Simmons credit card and 0 if not.
In the Coupon column, a 1 is recorded if the sampled customer used the coupon and 0 if not.
Now we can compute ŷ for any set of values of the independent variables.
Example:
P(y = 1 | x1 = 2, x2 = 0) = 0.1880
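The 0.1880 figure can be reproduced from the estimated coefficients for this example (assumed here to be b0 = -2.1464, b1 = 0.3416 for spending, and b2 = 1.0987 for the card indicator, as in the standard treatment of the Simmons case):

```python
import math

b0, b1, b2 = -2.1464, 0.3416, 1.0987  # assumed estimated coefficients
z = b0 + b1 * 2 + b2 * 0              # spending = $2,000, no Simmons card
p = math.exp(z) / (1 + math.exp(z))
print(round(p, 4))  # → 0.188
```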
Steps:
1. Install all the libraries (if not installed) by using pip command for each module one by one in Windows
command prompt as follows:
2. Save the tabular data given above as an Excel file named Credit.xlsx in the folder where your
program is saved.
3. Open Python IDLE Window -> New File ->Type the following program
Program:
import pandas as pd
import statsmodels.api as sm

tb1 = pd.read_excel('Credit.xlsx')
print(tb1)
# Independent variables: credit card indicator and annual spending
x = tb1[['Card', 'Spending']]
# Dependent variable: whether the coupon was used
y = tb1['Coupon']
x1 = sm.add_constant(x)
# Fit the logistic regression model and print the result
logit_model = sm.Logit(y, x1)
result = logit_model.fit()
print(result.summary2())
Output: