
DADS303: Introduction to Machine Learning Manipal University Jaipur (MUJ)

MASTER OF BUSINESS ADMINISTRATION


SEMESTER 3

DADS303
INTRODUCTION TO MACHINE LEARNING


Unit 1
Introduction to Machine Learning
Table of Contents

SL No   Topic                                                         Fig No / Table / Graph   SAQ / Activity
1       Introduction                                                  -                        -
1.1     Learning Objectives                                           -                        -
2       Definitions and Examples                                      -                        1
3       How Machine Learning Works and its Relevance to Businesses    1, 2, 3                  2
4       Type(s) of Machine Learning Model                             -                        3
5       Terminal Questions                                            -                        -
6       Answers                                                       -                        -


1. INTRODUCTION
AI (Artificial Intelligence) has overtaken recent technologies such as quantum computing, cybersecurity, the Internet of Things, and blockchain in the excitement it generates. AI is becoming more popular every day because the resources needed to use it are now available to ordinary people, and developers are using AI to build a wide range of machine learning models. The demand for data science professionals is increasing due to the wide application of machine learning concepts across industry domains.

Machine Learning (ML) is considered a sub-area of AI (Artificial Intelligence). Machine learning aims to understand the structure of collected data and to represent that data through models that people can easily analyze and apply to obtain the required outcomes.

Although this field differs from traditional computing, machine learning algorithms allow one to train on data and obtain outputs within a given range. Unlike traditional algorithms (which are designed to solve a problem through explicitly programmed steps), machine learning algorithms use various statistical techniques during analysis. In addition, machine learning algorithms are used to build models that support better decision-making.

Some common applications of machine learning include facial recognition technology, which is used in many devices as a biometric measure and allows various social media platforms to tag and share photos, and optical character recognition (OCR), which translates images of typed text into machine-encoded text. The self-driving car is yet another recent example of machine learning.

Machine learning is also a continuously growing area. Because of this, there are a few considerations to keep in mind while working with machine learning techniques or analyzing the influence of machine learning developments.

In this unit, we will discuss various merits and demerits of machine learning. In addition, we
also discuss the role of machine learning in various businesses. We also explore the types of
machine learning techniques.


1.1 Learning Objectives

After studying this unit, you should be able to:

❖ Describe Machine Learning and various examples


❖ Discuss the relevance of machine learning in various businesses
❖ Explain the various types of Machine Learning techniques

So, let us start with an Introduction to Machine Learning and its usage.


2. DEFINITION AND EXAMPLES

In 1959, Arthur Samuel introduced the term Machine Learning. He described it as "the study of algorithms that allow computer machines to learn without explicit programming." A second, more recent definition comes from Professor Tom Mitchell of Carnegie Mellon University. Mitchell describes learning in terms of a computer gaining experience from performing a task with an associated performance measurement: the machine learns if its performance on the task, as measured, improves with that experience.

Consider a few examples of machine learning in daily life. You may have noticed that Google can mark certain emails as spam. Spam filtering is one of the many cases where machine learning has been extremely helpful in achieving really good results, and not just in Gmail: any popular email service provider nowadays can mark emails as spam or not with a reasonable degree of accuracy. Have you wondered how popular e-commerce websites come up with surprisingly good recommendations for products you wish to buy? Again, this is done by some very clever application of machine learning algorithms in the backend. Machine learning algorithms can now classify handwritten digits with nearly 100% accuracy. Add to this ABS on cars, Autopilot on Tesla cars, and self-driving cars from Google, and the list of applications becomes very long. Without getting into the details of every application, machine learning has made its presence felt in one way or another in daily life.

This field is not a new one by any means. Over the last few decades, machine learning has combined ideas and methodologies from various fields such as probability, statistics, computer science, and computational biology. Probability, statistics, and computer science are the backbone of modern machine learning theory. It also borrows ideas from biology, genetics, clinical trials, and many other fields of the social sciences.


Self-Assessment Questions - 1

1. Who coined the term Machine Learning?


a) Arthur Samuel
b) Arthur Mike
c) Samuel Arthur
d) None
2. Which of the following belongs to the working of Machine Learning?
a) Feed the Data.
b) Tag the Data.
c) Test the Model
d) All the Above
3. Which of the following doesn’t belong to the need of Machine Learning?
a) Machine Learning is used for Analysis.
b) We can use it for representing data by using visualization concepts.
c) No Data sets are required.
d) All of the above


3. HOW MACHINE LEARNING WORKS AND ITS RELEVANCE TO BUSINESSES

The advent of modern computing power has spurred the growth of machine learning. Anyone with a credit card can spin up a cluster from one of the many web services available, ranging from basic to advanced specifications. This enterprise-scale hardware is available to consumers in a fraction of a minute. Additionally, good results on problems historically considered difficult to solve have given machine learning much popularity in the press. And, most importantly, everybody seems to be talking about it these days.

So, where and how does one learn machine learning? The starting point one always advises is a question, a question with a specific business context in mind. Why is image recognition important? Not because it is cool to do, but because image recognition helps solve a business problem. Similarly, ask whether predicting if a customer will default on a credit card payment is useful from a business perspective.

That is the crux of starting with machine learning: begin with a question. Once you have a question, the relevant data needs to be gathered. Historical data helps the machine learn from it. Once the machine has learned from its historical data, it needs additional data to determine how well it has learned. Think about a semester-long course in any subject, say probability, or, for that matter, the course that we are currently learning. There is classroom material that helps the students learn the concepts, along with some real-life examples.

Once classroom lectures are done, students are evaluated through an examination at the end of the course. What is important is that the questions in the final exam need not be limited to the exact material that you have studied. The same scenario applies here. So, what does data look like? Raw data might come in all sorts of different formats.


Figure.1: Sample Email (Not Spam)

Here is an example from fig.1, which represents an email that is probably not spam. This is
data for one email. As you can see, this is a bunch of text along with sender and receiver
information.

Reading through the text, this is probably not spam. This raw data contains useful
information about whether the email is spam. Our machine learning algorithm must do a
good job of extracting the key information from this unstructured raw data.

Figure.2: Sample Email (Spam)

Here is an illustration, in fig.2, of an email that unmistakably looks like spam. Why is it spam? The sender and the email body contain helpful information concerning the email's validity. The question that arises now is how to teach a machine to extract, from unstructured data, the knowledge that will aid problem-solving and learning.

Feature engineering comes to the rescue here. Feature engineering is the process of turning raw data into meaningful, machine-readable features. You must create features from raw data. Recall that, for the exploratory study, we calculated the average monthly bill for each customer; that is a type of feature engineering as well. Feature engineering is not restricted to averaging: the only limit on the number of features is your ingenuity and imagination. The way you execute feature engineering on a dataset greatly affects the outcomes. Hence, moving from what appears to be unstructured data to something more organized, neat, and structured is where feature engineering starts.

Consider, for example, fig.3, where the raw text of emails is represented in a structured form.

Figure.3: Raw Email texts in tabular form

Each row of the data contains details about one email, and each column corresponds to a feature; the data is organized in a rectangular style. Here, we use three emails to demonstrate some basic feature engineering. The first feature we may build is whether the sender is on the recipient's contact list. This feature, designated x1, is a Boolean variable: the sender either is or is not included in the contact list. The second feature, denoted x2, keeps track of the email's special character count. The last column, Spam, is also of type Boolean and indicates whether the email is spam. The task is to predict this variable using a machine-learning method. This variable is hence known as the response. The terms outcome and dependent variable are also frequently used.

Other names for features include independent variables, explanatory variables, and the like.
Think of any interesting characteristics one could create from the email's raw dataset as an
exercise, and then see what might be useful for feature engineering.
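As a rough illustration only (the example emails below are made up, and the column names x1, x2, and spam simply follow the description above), the structured table in fig.3 could be built with pandas along these lines:

import pandas as pd

# Hypothetical raw emails: (sender, body, sender_in_contacts, is_spam)
raw_emails = [
    ("alice@work.com",  "Meeting notes attached for tomorrow.", True,  False),
    ("win@lottery.biz", "You WON $$$!!! Claim NOW!!!",          False, True),
    ("bob@friends.org", "Lunch on Friday?",                     True,  False),
]

def count_special_chars(text):
    # x2: number of non-alphanumeric, non-space characters in the body
    return sum(1 for ch in text if not ch.isalnum() and not ch.isspace())

rows = []
for sender, body, in_contacts, is_spam in raw_emails:
    rows.append({
        "x1_sender_in_contacts": in_contacts,                 # Boolean feature x1
        "x2_special_char_count": count_special_chars(body),   # numeric feature x2
        "spam": is_spam,                                      # response / dependent variable
    })

df = pd.DataFrame(rows)
print(df)

Each row of the resulting DataFrame corresponds to one email, mirroring the rectangular layout described above.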


The applications and uses of machine learning are nearly limitless, especially given how heavily smartphones are used in our day-to-day activities. At present, smartphones are closely tied to various machine-learning processes. Therefore, using machine learning (through various mobile-based applications) in doing business is also becoming a common practice nowadays.

Businesses using machine learning gain various benefits, such as improvements in their working processes and automation of various day-to-day activities. For example, Uber Technologies Inc. uses machine learning algorithms to optimize pick-up and drop-off times for its customers; Spotify (an audio streaming and media service) uses machine learning to offer personalized marketing; and Dell, which makes laptops and other hardware, uses various algorithms to gather employee and customer feedback and improve its practices based on that feedback.

In addition to the above, there are more business areas where machine learning can assist a business in numerous ways. While using machine learning for business benefits, we need to define a strategy and follow the steps as per that policy. In continuation of the above, below are some areas where machine learning can be used, with promising ideas:

• Monitoring of Social Media Content
• Various Customer Care Services
• Area of Image Processing
• Virtual Assistance in some specific areas
• Recommendations for Quality Products
• Trading and Investment in the Stock Market
• Clinical Decision Support Systems
• Reducing Data Repetition
• Increasing Cyber Security


Self-Assessment Questions - 2
4. In which of the following can we use Machine Learning model(s)?
a) Social Media
b) Quality Check
c) Virtual Assistance
d) All the Above
5. A Machine Learning model can be used to reduce medical expenses.
a) True
b) False
6. Which of the following are the key terms used in Machine Learning?
a) Trend
b) Patterns
c) Data set
d) All the above


4. TYPE(S) OF MACHINE LEARNING MODEL

Tasks in machine learning are classified into various categories. This classification is based on how the system learns: from an existing labelled data set, or from the feedback it receives while making predictions.

Thus, the following classification of machine learning is widely used globally:

• Supervised Learning
• Unsupervised Learning &
• Reinforcement Learning

Supervised Learning

In supervised learning, a response column in the data serves as a teacher for the algorithm. The algorithm receives some instances to learn from and then applies what it has learned to fresh, unseen data. Our example of classifying spam falls within the category of supervised learning problems: the computer is told whether each email in a group is spam or not, and based on that information and some inventive feature engineering, it can attempt to learn how to categorize a new email as spam or not spam. The credit default problem is another illustration of a supervised learning problem. Why? The system examines 29000 customers to determine whether each has missed a payment. It can take what it has learned from this and attempt to apply it to new clients to forecast whether they will default. There are two categories of problems in supervised learning, differentiated by the type of response.

If the response is categorical, taking one of two possible values as in the email spam or credit default examples, the supervised problem is called classification. If the response is continuous, or real-valued, the problem is a regression problem; for example, one would solve a regression problem to estimate a student's final exam score based on performance.
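A minimal sketch of the two kinds of supervised problems, using scikit-learn on made-up data (the feature values and labels below are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: categorical response (spam = 1, not spam = 0)
X_cls = np.array([[0, 12], [1, 1], [0, 25], [1, 0]])   # [in_contacts, special_chars]
y_cls = np.array([1, 0, 1, 0])
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[0, 18]]))          # predicts a class label

# Regression: continuous response (final exam score)
X_reg = np.array([[5], [10], [15], [20]])              # hours of study
y_reg = np.array([48.0, 62.0, 75.0, 88.0])
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[12]]))             # predicts a real-valued score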


Unsupervised Learning

When a machine learns without an instructor, it is said to be performing unsupervised learning. In other words, there is no clear response column in the data. The computer can only analyze its features to hunt for a helpful pattern. The best use of unsupervised learning is to identify groupings of data based on features. One can imagine a scenario requiring the grouping of clients in the credit card data based on their bill amounts and payments, which is exactly such grouping in an uninstructed mode.

Following are some use cases that employ unsupervised learning techniques (see the sketch after this list):

• Customer segmentation: identifying different client groups around which to develop marketing or other company tactics.
• Genetics: for instance, clustering DNA patterns to study evolutionary biology.
• Recommender systems: grouping individuals with comparable viewing habits to suggest related material.
• Fraud detection, or the detection of faulty mechanical components (i.e., predictive maintenance).
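A minimal clustering sketch under these assumptions (the bill and payment figures are invented; scikit-learn's KMeans is used for the grouping):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical credit-card data: [average bill amount, average payment]
customers = np.array([
    [200,  180], [220, 210], [2500, 400],
    [2600, 350], [90,   85], [2400, 500],
])

# Group customers into 2 segments based only on their features (no labels)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # typical bill/payment profile of each segment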

Reinforcement Learning

Reinforcement learning has behavioural psychology at its core. Each data point is presented to the computer successively. It is rewarded for getting it right and penalized for getting it wrong. It frequently makes many blunders when first starting, but it progressively learns. Imagine a baby attempting to handle a steaming cup of coffee. The cup gives off steam, and the baby cannot tell whether it is hot or cold. The poor thing is shocked when it touches the cup for the first time and probably sobs a lot. It touches the cup once more out of curiosity after waiting a little while longer and receives another shock. Over time, it realizes that it is not a good idea to touch something with steam coming out of it. Reinforcement learning is a relatively new approach to various machine learning problems. Given that they are typically defined as a series of decisions, logic games are excellent candidates for reinforcement learning. Reinforcement learning has been used rather successfully in games like chess and poker.
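A toy sketch of the reward-and-penalty idea (an epsilon-greedy choice between two made-up actions; this only illustrates the principle of learning from feedback, not a full reinforcement learning algorithm):

import random

# Two actions; action 1 pays off more often (unknown to the learner)
true_reward_prob = {0: 0.3, 1: 0.8}
estimates = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}

random.seed(42)
for step in range(1000):
    # Mostly exploit the best-looking action, occasionally explore
    if random.random() < 0.1:
        action = random.choice([0, 1])
    else:
        action = max(estimates, key=estimates.get)
    # Reward of 1 for "getting it right", 0 otherwise
    reward = 1 if random.random() < true_reward_prob[action] else 0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the learner gradually discovers that action 1 is better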


Self-Assessment Questions - 3

7. Which of the following belongs to the classification of Machine Learning?


a) Supervised Learning
b) Un-Supervised Learning
c) Both a) & b)
d) None of the Above
8. Which machine learning model is strictly based upon a feedback system?
a) Reinforcement Learning
b) Supervised Learning
c) Both a) & b)
d) None of the Above


5. TERMINAL QUESTIONS

Short Answer Questions

Q.1 Define Machine Learning.

Q.2 Elaborate on the working of Machine Learning.

Q.3 Explain the need for Machine Learning in business.

Long Answer Questions

Q.1: Explain the detailed classification of Machine Learning model(s).

Q.2: Illustrate the areas where one can use Machine Learning, using suitable examples.

6. ANSWERS

Self-Assessment

1. A
2. D
3. C
4. D
5. A
6. D
7. C
8. A

Terminal Questions

Short Answer

Ans 1: Machine Learning is a computer-based approach to learning from experience. One can also think of it as a fancy labelling machine: we teach machines to label things such as Maruti and Hyundai by showing them examples of cars. Eventually, the machines will start labelling cars without any intervention, because they have already been trained using appropriate data sets.

Ans 2: To understand how machine learning works, we first need to specify where we are going to use it, for example for image recognition or for text analysis. A common step in either case is to define "tags". For example, whenever we tag pictures of animals (cow, dog, cat, etc.), we label them with their appropriate names; this process is known as data labelling (an important step in machine learning). For text analysis, we feed the model some text data and then tag it.

So, in general, we can define the basic working process of the machine learning model as per
the below-mentioned steps:

1. Feed the data into the machine learning model (i.e., the input data).
2. Tag the data as per the desired output (e.g., while entering feedback data, we classify it as worst, neutral, average, or best).
3. Now, test the model and try to make predictions.

[Refer Sec 1.2]

Ans 3: After studying the facts above, we can define the need for machine learning as follows.

• Machine learning makes use of data sets for doing the analysis.
• It uses historical data sets for predictions or forecasting, which helps us do the analysis.
• Sometimes, we can also correlate this term with data mining.

This is a technology highly dependent upon data or data sets.

Long Answer(s)

Ans 1: The following classification of machine learning is widely used globally:


• Supervised Learning
• Unsupervised Learning &
• Reinforcement Learning

[Refer to sec 1.4]

Ans 2: Some areas where machine learning can be used, with promising ideas, are:

• Monitoring of Social Media Content
• Various Customer Care Services
• Area of Image Processing
• Virtual Assistance in some specific areas
• Recommendations for Quality Products
• Trading and Investment in the Stock Market
• Clinical Decision Support Systems
• Reducing Data Repetition
• Increasing Cyber Security

[Refer to sec 1.3]


Unit 2
Linear Regression- I
Table of Contents

SL No   Topic                                               Fig No / Table / Graph          SAQ / Activity
1       Introduction                                        -                               -
1.1     Learning Objectives                                 -                               -
2       Simple Linear Regression                            1, 2, 3, 4, 5, 6, 7, 8, 9, 10   -
2.1     Ordinary Least Square (OLS) regression technique    -                               -
3       Regression Assumptions                              -                               1
4       Terminal Questions                                  -                               -
5       Answers                                             -                               -


1. INTRODUCTION
Regression analysis is a method for solving analytical problems that are widely employed. A
regression-based strategy often works well for many business challenges, particularly when
attempting to forecast or comprehend future events using data on consumer behavior that
is now available.

A regression model is used to comprehend and quantify cause-and-effect relationships. To understand the cause-and-effect connection, consider a brand of shampoo whose sales are being examined, and suppose that in one week there is an offer of a 15% reduction in the price of that shampoo. The expectation is that the brand's sales during that week will increase: more people may purchase the product now that the price has dropped, increasing sales. The price reduction is the cause, and the rise in sales is the effect. Regression analysis is used to comprehend and measure such cause-and-effect relationships.

Regression analysis is a statistical method used to determine the size and direction of a potential causal relationship between an observed pattern and the variables that are thought to affect the pattern. Reiterating the sales example, the observed pattern is the change in sales, and the variable assumed to have an impact on the observed pattern is price. If price impacts sales, then when there is a price change, there is an impact on sales as well.

1.1 Learning Objectives

After studying this unit, you should be able to:

❖ Explain the concept of simple linear regression and its underlying principle
❖ Discuss the working of a simple linear regression model
❖ Describe the concept of OLS with a simple demonstration


2. SIMPLE LINEAR REGRESSION

Simple linear regression is a type of regression analysis where the number of independent variables is one, and there is a linear relationship between the independent (x) and dependent (y) variables. Linear regression is a linear model that assumes a linear relationship between the input variable (X) and the single output variable (Y). Remember that the variable we are attempting to understand is the target variable; in this instance, it is the newborn's birth weight. Mathematically, a baby's birth weight is modelled as a function of the gestational weeks.

Birthweight = f (gestation weeks) (1)

where f - is the functional form that needs to be determined

The case that needs to be examined is the relationship between baby birth weight and
gestation period.

Simple linear regression aligns with the concepts from mathematics, where a relationship
between two variables ‘x’ and ‘y’ is represented in a linear (straight-line) form; the equation
is as given below,

𝑌 = 𝑚𝑥 + 𝑐 (2)

where,

‘m’ represents the slope

‘c’ represents the intercept on the straight line

The slope is interpreted as the rate of change of Y when X changes, or the magnitude of the impact of changes in X on Y. The intercept is the value of Y when X = 0.


Figure.1: Straight-Line Relationship

Fig. 1 represents a straight-line relationship with the equation Y = 2 + 3x, where 2 is the intercept and 3 is the slope. Here, when x changes by one unit, Y changes by three units: if X = 1, Y = 5; if X = 2, Y = 8. The rate of change of Y is always constant, which is what a linear relationship between x and Y means. A linear relationship is when we believe there is a straight-line relationship between X and Y: every time X changes by one unit, Y changes by the same amount, which is the slope. If the slope is 0, we have a line parallel to the x-axis, Y = 2, as shown in fig.2.
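The constant rate of change can be checked numerically; a trivial sketch of the line from fig.1:

import numpy as np

x = np.array([0, 1, 2, 3])
y = 2 + 3 * x        # intercept 2, slope 3
print(y)             # [ 2  5  8 11]
print(np.diff(y))    # [3 3 3] -- Y changes by the slope for every unit change in x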

Figure.2: Straight-line representation when slope=0

If the intercept is 0, a straight line passes through the 0,0 point, as shown below in fig.3.


Figure.3: Straight-line representation when intercept=0

In simple linear regression, the assumption is that there exists a straight-line relationship between X and Y, where Y is the target variable. As per equation (1), the target variable is the birthweight and X is the independent variable, the number of weeks of gestation. The linear regression equation is displayed below,

Y = β0 + β1X + e                     (3)

where,

β0 = Intercept

β1 = Slope

e = Error

In equation (3), e represents the error, which is random variation. The assumption is that the average of this random variation is zero. With this, the linear regression model is essentially represented below; it is the straight line Y = mx + c, where m is β1 and c is β0. Knowing the values of β1 and β0, it is easy to understand the relationship between the variables X and Y.

Y = β0 + β1X                     (4)


β0 and β1 are known as the beta coefficients; they help in understanding the rate at which changes in the variable X influence the dependent or predicted variable Y.

2.1 Ordinary Least Square (OLS) Regression Technique

Considering the equation.3., the birth weight example can be represented as,

𝐵𝑖𝑟𝑡ℎ𝑤𝑒𝑖𝑔ℎ𝑡 = 𝛽0 + 𝛽1 ∗ 𝐺𝑒𝑠𝑡𝑎𝑡𝑖𝑜𝑛 𝑝𝑒𝑟𝑖𝑜𝑑 + 𝑒 (5)

The ordinary least square regression technique is used to find the right values for β0 and β1. Ordinary Least Square regression, also called OLS, estimates the coefficients on the variables by identifying the line that minimizes the sum of squared differences between points on the estimated line and the actual values of the dependent variable.

Figure.4: Birthweight vs. Gestation

As per fig.4, a relationship exists between birthweight and gestation. A straight line is to be drawn that captures the maximum number of data points, but many candidate straight lines can be drawn, as in fig.5, because there is random variation and many other factors influence birthweight. To choose the best possible line that captures the maximum data points, the ordinary least square regression technique identifies the best possible line, and its slope and intercept are the beta coefficients.

Figure.5: Multiple straight lines between Birthweight vs. Gestation

The best possible line is the line that is as close as possible to as many points as possible.

1.2.1.1 OLS Estimates

Figure.6: Distance calculation from predicted to actual values


Consider fig.6. A straight line connects the data points, and the distance between each point and the line is then calculated. Some of the points will lie exactly on the line, so their distance will be zero. But only a few data points will be covered by the drawn line; the rest lie above or below it. A point above the line yields a positive distance and a point below the line a negative distance. To negate the signs, one squares the difference of each point from the drawn line and then sums them up. The line with the minimum total sum of squared distances is the best possible line, because that is the line that is as close as possible to as many points as possible.

There exists a mathematical way of calculating the beta coefficients as,

Q = Σ (Yi − β0 − β1Xi)², summed over i = 1 to N                     (6)

where,

β0 = (ΣXi² ΣYi − ΣXi ΣXiYi) / (nΣXi² − (ΣXi)²)

β1 = (nΣXiYi − ΣXi ΣYi) / (nΣXi² − (ΣXi)²)

Once the coefficients are estimated, we arrive at the below equation that covers the
maximum data points.

𝐵𝑖𝑟𝑡ℎ𝑤𝑒𝑖𝑔ℎ𝑡 = 𝐼𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡 𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 + 𝐵𝑒𝑡𝑎 𝐶𝑜𝑒𝑓𝑓 ∗ 𝐺𝑒𝑠𝑡𝑎𝑡𝑖𝑜𝑛 𝑝𝑒𝑟𝑖𝑜𝑑 (7)
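A sketch of these closed-form estimates in code, on a small made-up sample (x = gestation weeks, y = birthweight in grams), checked against numpy's built-in least-squares fit:

import numpy as np

x = np.array([36.0, 38.0, 39.0, 40.0, 41.0])             # gestation weeks (made up)
y = np.array([2600.0, 2900.0, 3100.0, 3400.0, 3500.0])   # birthweight in grams (made up)
n = len(x)

# Closed-form OLS estimates from equation (6)
denom = n * np.sum(x**2) - np.sum(x)**2
beta0 = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / denom
beta1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom

print(beta0, beta1)
# Sanity check against numpy's least-squares fit
print(np.polyfit(x, y, 1))   # returns [slope, intercept]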

1.2.1.2 Implementation Of OLS Using Excel

Excel is bundled with a simple yet powerful data analysis package; one such technique is the
regression technique. The snippets provided below illustrate the analysis of birthweight data
using Excel’s data analysis package.

The following steps are used to perform the regression analysis on any dataset:

• Open the .csv dataset
• If the Data Analysis package is not installed, install it from the Add-ins option under Excel Options
• Once installed, the package will be available in the Data menu with the name "Data Analysis"
• Choose the Data menu, click on the Data Analysis option, and then select the Regression technique
• Specify the Input Y range – this is the dependent variable, birthweight
• Specify the Input X range – this is the independent variable, gestation period
• Run the analysis

Sample Stepwise Snippets:

Figure.7: Birthweight Dataset


Figure.8: Regression technique selection

Figure.9: Specify the X and Y range


Figure.10: Standard Regression Output

Let us understand the coefficient table, which contains the values of β0 and β1.

The birthweight equation takes the form below,

Birthweight = -3245.44 + 166 * Gestate (8)

The above equation helps in estimating a straight line with the equation

𝑌 = 𝑚𝑥 + 𝑐 or 𝑌 = 𝛽0 + 𝛽1 𝑋

where,

β0 = Intercept = -3245.44

β1 = Coefficient on the X variable = 166


Equation 8 says that if gestation increases by one unit, then birth weight is expected, on average, to go up by 166 grams: for every unit increase in the gestation period, we expect an average increase in birthweight of 166 grams. The positive sign on the coefficient of gestation implies a positive relationship between gestation and birthweight: if the gestation period goes up, birth weight is expected to go up as well. If there were a negative sign on a coefficient, the expectation would be an inverse relationship: if X goes up, Y comes down. The intercept itself has no direct physical interpretation here (a gestation period of zero is not a meaningful scenario); it is required to correctly capture the relationship between the birthweight and the gestation period.
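For example, for a gestation period of 40 weeks, equation 8 gives a predicted birthweight of roughly -3245.44 + 166 * 40 = 3394.56 grams (the exact figure depends on the unrounded coefficients in the output).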

In this case, when the gestation period increases by one week, birthweight increases by 166 grams. The reverse is not true: if the birthweight increases by 166 grams, it does not follow that gestation will increase by one week, because of the direction of the cause-and-effect relation. An increase in the gestation period will increase birth weight, but an increase in birthweight does not lead to an increase in gestation, because Y is a function of X, not X a function of Y.

Once the beta coefficients have been examined for the relationship between the variables X and Y, the next point to examine is the p-values, which help determine whether the relationship between X and Y is statistically significant; in our case, one needs to check whether gestate is a statistically significant influencer of birthweight. The p-value is the outcome of a hypothesis test of whether or not the coefficient on the gestate variable is equal to zero. There is a p-value associated with each coefficient. The p-value associated with gestate is 2.54E-166, which is essentially a value very close to zero and helps in rejecting the null hypothesis. This allows us to conclude that gestate is a statistically significant influencer of birthweight. Ideally, the p-value must be less than 0.05 (corresponding to a 95% confidence level). If the p-value is greater than 0.05, the relationship may not be statistically significant.

Analysis of Variance is referred to as ANOVA. It provides details regarding the degrees of variability present in the regression model.

• df is the degree of freedom connected to the sources of variance.
• SS stands for sum of squares. The smaller the Residual SS in comparison to the Total SS, the better your model fits the data.
• MS is the mean square.
• F is the F statistic, or F-test, for the null hypothesis. It can be used to test the overall significance of the model.
• Significance F is the P-value of the F statistic.

One may determine how well the generated linear regression equation matches your data
source by looking at the summary output.

The correlation coefficient known as Multiple R gauges how strongly two variables are related linearly. The higher the absolute value, the stronger the relationship:

• 1 indicates a highly positive association.
• -1 denotes a very negative association.
• Zero indicates no relationship at all.

The Coefficient of Determination, or R Square, denotes the goodness of fit. It indicates how well the points fall on the regression line. Our example's R Square value of 0.49 indicates an average fit. In other words, the independent variable (gestation weeks) accounts for 49% of the variation in the dependent variable (birthweight): 49% of the variation in birthweight is captured or explained by variation in the gestation weeks variable. R Square is not the only indicator of model fit. It is possible to have the same R Square for different models with different fits. R Square also increases with the addition of variables, whether relevant or not, so it is better to use the adjusted R Square measure.

The R Square that adjusts for variables that are not important to the regression model is called adjusted R Square.
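For reference, the usual adjustment penalizes the number of predictors: Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1), where n is the number of observations and k is the number of independent variables.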

Another metric of goodness-of-fit that demonstrates the accuracy of one's regression analysis is the standard error.


3. REGRESSION ASSUMPTIONS

Having seen how ordinary least square regression models are implemented, it is essential to understand the assumptions under which the ordinary least square beta coefficient estimates are valid. A series of assumptions must be considered, including the following necessary ones:

• Linear relationship
• Multivariate normality
• No or little multicollinearity
• No autocorrelation
• Homoscedasticity

Linear relationship

The model is linear in its parameters: the beta coefficients are the fundamental parameters of a linear regression model, and the model is linear in them. There is a linear relationship between X and the mean of Y. Because linear regression is sensitive to outlier effects, it is also crucial to look for outliers. Scatter plots are the most effective way to test the linearity assumption.

Multivariate normality

Since the errors are not connected and are statistically independent of one another, the data
utilised for a linear regression model is effectively a random sample from the underlying
population. For the linear regression analysis, all variables must have multivariate normal
distributions. The best way to verify this assumption is via a histogram or Q-Q-Plot. With a
goodness of fit test, such as the Kolmogorov-Smirnov test, normality can be verified. A non-
linear transformation (such as a log transformation) may be able to resolve this problem
when the data is not normally distributed.

Unit 2 : Linear Regression- I 16


DADS303: Introduction to Machine Learning Manipal University Jaipur (MUJ)

No or little multicollinearity

According to the assumption of linear regression, the data must have little to no
multicollinearity. When the independent variables have an excessive amount of correlation
with one another, multicollinearity occurs.

Four main criteria can be used to test for multicollinearity:

1) Pearson's bivariate correlation matrix: compute the correlation matrix for all independent variables; the correlation coefficients between them should be less than 1 (i.e., no pair should be perfectly correlated).
2) Tolerance: using an initial linear regression analysis, the tolerance quantifies the impact of one independent variable on all other independent variables. For these first-step regression analyses, tolerance is defined as T = 1 – R². There may be multicollinearity in the data with T < 0.1, and there certainly is with T < 0.01.
3) Variance Inflation Factor (VIF): VIF = 1/T is the variance inflation factor for the linear regression. There is a possibility of multicollinearity when the VIF is greater than 5, and there is undeniable multicollinearity when the VIF is greater than 10.
4) Condition Index: the condition index is calculated using factor analysis on the independent variables. Values between 10 and 30 indicate multicollinearity among the linear regression variables, whereas values over 30 indicate significant multicollinearity.

If the data are multicollinear, centering the data, that is, subtracting the variable's mean from each score, may help with the issue. Conducting a factor analysis and rotating the components to ensure the independence of the factors in the linear regression analysis is another solution. However, eliminating independent variables with high VIF values is the most straightforward solution to the issue.
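A sketch of the VIF check with statsmodels on made-up data (the variable names x1, x2, x3 are invented; x3 is deliberately constructed to be nearly collinear with x1):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up independent variables; x3 is almost a copy of x1, so it is collinear
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
X["x3"] = X["x1"] * 0.95 + rng.normal(scale=0.05, size=100)

X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # VIF > 5 suggests possible multicollinearity; > 10 is a clear sign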

No autocorrelation

Linear regression analysis requires little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of one another. For instance, this typically happens with stock prices, because the current price is based on the past price.

The Durbin-Watson test can be used to test the linear regression model for autocorrelation, while a scatterplot can be used to look for autocorrelation visually. Durbin-Watson's d statistic tests the null hypothesis that the residuals are not linearly autocorrelated. d can take values between 0 and 4; values close to 2 suggest no autocorrelation, and values of 1.5 < d < 2.5 generally indicate that there is no autocorrelation in the data. The Durbin-Watson test examines only linear autocorrelation and only between immediate neighbours (first-order effects).
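A small sketch of the Durbin-Watson check using statsmodels on made-up data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Made-up data: fit a simple OLS model and test its residuals
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 + 3 * x + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))   # values near 2 suggest no autocorrelation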

Homoscedasticity

The foundation of linear regression models is the homoscedasticity assumption, which literally means "same variance". The condition is referred to as homoscedastic when the error term (the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) is the same at all values of the independent variables. The scatter plot is a good way to check whether the data are homoscedastic.


Self-Assessment Questions - 1

1. A regression model is a method for solving analytical problems and helps to


quantify _________ relationships.
2. 𝑌 = 𝛽0 + 𝛽1𝑋 + 𝑒, in the given equation which term influences the random
variation.
3. Identify the technique which helps in finding out the suitable beta coefficients.
4. Which values help determine whether the variables 'X' and 'Y' are statistically
significant?
5. _____________ provides details regarding the degrees of variability present in the
regression model.
6. ______________ gauges how strongly two variables are related linearly.
7. Which indicates how many points fall on the regression line and denotes the
goodness of fit?
8. ____________ checks whether all independent variables must have correlation
coefficients that are less than 1 to compute the matrix.
9. There is a possibility of multicollinearity when the Variance Inflation Factor
(VIF) is greater than __________.
10. The ______________ is calculated using factor analysis on the independent
variables.
11. When the error term (the "noise" or random disturbance) in the relationship between the independent variables and the dependent variable is the same at all independent variable values, the condition is referred to as _________________.


4. TERMINAL QUESTIONS

1. Justify how the simple linear regression aligns with the concepts from a straight-line
equation.
2. Explain the concept of the OLS regression technique.
3. Write a short note on the OLS Estimates.
4. Discuss the importance of p-value in the coefficient table.
5. Brief about the various Regression Assumptions.

5. ANSWERS

Self-Assessment Question - Answers

1. Cause-Effect
2. e
3. Ordinary Least Square regression
4. p- values
5. ANOVA
6. Multiple R
7. R square
8. Pearson's bivariate correlation matrix
9. 5
10. condition index
11. homoscedastic

Terminal Questions - Answers

1. Refer to sec 1.2


2. Refer to sec 1.2.1
3. Refer to sec 1.2.1.1
4. Refer to sec 1.2.1.2
5. Refer to sec.1.3


Unit 3
Multiple Linear Regression Model
Table of Contents

SL No   Topic                                               Fig No / Table / Graph   SAQ / Activity
1       Introduction                                        -                        -
1.1     Learning Objectives                                 -                        -
2       Use Case                                            -                        -
3       Validating the Models                               -                        -
3.1     Fit Chart                                           -                        -
4       Bias and Variance                                   -                        -
5       Regularization                                      -                        -
5.1     Lasso regression or regression with L1 penalty      -                        -
5.2     Ridge regression                                    -                        -
5.3     Elastic net regularization                          -                        -
6       Implementation of OLS using statsmodel              -                        1
7       Terminal Questions                                  -                        -
8       Self-Assessment Question Answers                    -                        -


1. INTRODUCTION
In most business situations, there are many factors that simultaneously impact the
dependent variable. The concept of Ordinary Least Square Estimation remains the same even
when the number of independent variables is more than one. The aim is to identify the line
that minimizes the sum of squared residuals across multiple dimensions.

1.1 Learning Objectives

At the end of the topic, you will be able to:

❖ Illustrate how a multiple linear regression is performed and the optimized function
calculated.
❖ Explain Bias and Variance.
❖ Elaborate the need for Regularization.
❖ Explain Lasso and Ridge Regularization.
❖ Write Python code that calculates the linear regression using OLS and predicts the values of unknown data points.


2. USE CASE

Consider the use case of predicting the birth weight of a baby. The birth weight is the predicted, or dependent, variable. The independent variables are the mother's level of education, race, smoking status, and gestation period. We assume that there is a straight-line equation that captures the relationship between the target and the independent variables.

Birthweight = β0 + β1 * Gestation + β2* Years of Education + β3 * Race + β4 * Smoking

Therefore, the number of β coefficients on the independent variables that must be estimated using the Ordinary Least Square method is 4, in addition to the intercept.

Performing linear regression in Excel will generate the following output.

The output contains regression statistics, an ANOVA table, and a coefficient table. Note that, depending on the tool used, some of this information may be presented differently.

The coefficient column and ‘p’ values contain 4 rows since we are estimating 4 β coefficients.
There is also the intercept estimate, which is the 5th β coefficient.


The values in the coefficients column and their signs are an important part of the output. Similarly, the p-values are a significant part of the output. The value of a coefficient reflects the impact of the independent variable on the dependent variable. For example, the years-of-education variable (YearsEduc) has a value of +9.57 in the Coefficients column. It implies that if the years of education go up by 1 year, then the birth weight of the baby, which is the Y variable, goes up on average by 9.57 grams. This also seems logically correct: the more years of education the mother has, the better her financial condition is likely to be and the better the food and nutrition she can afford, so the child's weight is also expected to be quite healthy. Intuitively, this seems correct.

Similarly, the value for race is -168.96. Race is essentially created as a 0/1 dummy variable: '0' implies the race is non-African-American and '1' implies the race is African-American. The coefficient of -168.96 essentially implies that if you move from 0 to 1, the average birth weight of the baby goes down by 168.96 grams.

Very often these relationships can be intuitively known. For example, most often we know
that when the price goes up, the sales will go down. When marketing goes up, sales are
expected to go up. However, this may not be the case always. These regression tests can help
to understand whether or not the relationship actually exists.

Here the relationship is borne out by the data. Note that, there could be non-intuitive results
because either the data is bad, or the hypothesis or expectation is wrong. This needs to be
carefully thought through.

The P value in the result indicates which of these coefficients are statistically significant influencers of the Y variable. In this example, everything is statistically significant except for the years of education at a 5% level of α.

While there is a positive coefficient on the years of education, the model indicates that there is a 13% chance that this is because of random variation. If the aim of the model is to understand impact, then one can note that the years of education are not a significant influencer of the birth weight of a baby. If the intention is to do prediction, one may not want to include relationships that are not statistically significant and that are potentially driven by randomness. In this case, the model can be finalized by dropping the years of education variable because it is not significant.

R2 is an indicator of the results. Typically, R2 (read as R Square) tells us what percentage of the variance in Y is explained by the X variables. In the simple linear regression model, the R Square was 49%. The R Square has now gone up, but only a little bit: it is now 52%. This implies that there is still a lot of unexplained variation in Y over and above the 4 independent variables we have used.

The ANOVA table tests the null hypothesis that none of these β coefficients is different from 0. If at least one β coefficient differs from 0, the null hypothesis will be rejected. In this case, the P value is very close to 0. Therefore, it can be concluded that at least one of these β coefficients is significantly different from 0.

R2 is a good indicator to evaluate the model fit. However, there are other measures available
to assess the quality of the model.
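The same multiple regression can also be estimated outside Excel. A sketch using the statsmodels formula API (the file name birthweight.csv and the column names Birthweight, Gestation, YearsEduc, Race, and Smoker are assumptions about the dataset):

import pandas as pd
import statsmodels.formula.api as smf

# Assumed file and column names; adjust them to the actual dataset
df = pd.read_csv("birthweight.csv")

model = smf.ols("Birthweight ~ Gestation + YearsEduc + Race + Smoker", data=df).fit()
print(model.summary())                      # coefficients, p-values, R-squared, ANOVA F-statistic
print(model.rsquared, model.rsquared_adj)   # R Square and adjusted R Square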


3. VALIDATING THE MODELS

There are many ways of validating a Linear Regression model and all of them should be
evaluated. One of them is R2(R squared). The ‘R2’ explains the amount of variation in Y
because of the X variables. So, the higher the R2, the better the model is. There are other
ways of testing the model - Fit Chart, a MAPE measure and RMSE.

3.1 Fit Chart

According to the model,

birthweight = -2834 + 156.51 * Gestation + 9.57 * Years of Education - 168.9 * Race - 174.8 * Smoking

(Note: these numbers were derived from the coefficients table.)

Now, if this was a good model, or in other words if this was the straight line that best explains the relationship between the X variables and the Y variable, fitted values of the birthweight can be computed. The sample data contains gestation, years of education, race, and smoking, with one thousand one hundred and fifteen observations for each of these variables. "Fitted" means that if we plug the X values into this equation, the Y value can be computed. The values in each observation can be substituted into the equation and the birthweight calculated from the equation.

This can be done manually by taking the equation, multiplying the X values by the coefficients, and coming up with a birthweight, or automatically using a tool, for example Excel.

In other words, it can be calculated as the intercept, plus the first beta coefficient times the first X value (gestate), plus the second beta coefficient times the years of education, plus the third beta coefficient times the value of race, plus the fourth beta coefficient times the value of smoking in the first observation.

The residuals are the difference between the actual Y and the predicted or fitted Y, so this is really like an error.


In a good model, the predicted and actual Y are very close to each other. One way to validate the model is to generate the predicted or fitted values for all values of X and compare the actual Y to the predicted Y.

A line chart can be drawn with the actual and the predicted values. If the graphs are almost superimposed on each other, the model is considered good. In this case, the values are close in some instances, but this is not really a great model because the model is often off; in the top half of the chart, the model is frequently unable to predict weights that are greater than average. Ideally, if this were a good model, there would be a lot of overlap between the actual values and the predicted values. This is called a FIT chart.

The FIT chart can also be used to calculate the Mean Absolute Percentage Error. The difference between the actual value and the fitted value is the error. To calculate MAPE, take the difference between actual and predicted, ignore the sign, express it as a percentage of the actual value, and then average these percentages. This is called the Mean Absolute Percentage Error (MAPE).

In this example, the Mean Absolute Percentage Error is 11%; in other words, the model is off on average by 11%. Ideally, the MAPE values should be 5% or lower. Ideally, a good model will have a high R2 and a low MAPE.

RMSE is a scale-dependent measure of error, meaning that it is sensitive to the units of the
predicted and actual values. It is defined as the square root of the average of the squared
errors over a set of predictions, and is calculated as follows:

RMSE = √((1/n) * ∑ (actual - predicted)²)

Where n is the number of predictions, actual is the actual value for each prediction, and
predicted is the predicted value for each prediction.

One key difference between MAPE and RMSE is how they handle outliers. Because MAPE is
based on the absolute percentage error, it is less sensitive to outliers than RMSE, which is
based on the squared error. This means that MAPE may be a more appropriate measure of
error when there are outliers in the data, while RMSE may be more appropriate when the
data is more homogeneous.


Another difference between MAPE and RMSE is how the error is interpreted. MAPE is expressed as a percentage of the actual value, which can make it easier to interpret and compare the accuracy of predictions on the same scale, such as the percentage change in sales or the percentage change in stock price. RMSE, on the other hand, is expressed in the same units as the predicted and actual values rather than as a percentage, which can make it more difficult to interpret and compare the accuracy of predictions, particularly when the units of the predicted and actual values are different.
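A small sketch computing both measures for a handful of made-up predictions:

import numpy as np

actual = np.array([3100.0, 2900.0, 3350.0, 2700.0])
predicted = np.array([3000.0, 3050.0, 3300.0, 2500.0])

mape = np.mean(np.abs((actual - predicted) / actual)) * 100   # percentage error
rmse = np.sqrt(np.mean((actual - predicted) ** 2))            # same units as Y (grams)

print(f"MAPE = {mape:.1f}%, RMSE = {rmse:.1f}")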

The model that is generated can be used in predictive models. In a final validated model, given the regression equation, we can predict the value of Y for given values of X. For example, if there is a fitted model and the data for a new mother is available as follows:

• 10 years of education
• African-American
• 40 weeks of gestation
• Non-smoker

These values can be provided as X values, and the Y value is computed as 3352 gms. This is a prediction; remember, the baby is not born yet, but we can predict that the baby's birth weight will be approximately 3352 gms.
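As a check using the rounded coefficients above: -2834 + 156.51 * 40 + 9.57 * 10 - 168.9 * 1 - 174.8 * 0 ≈ 3353 grams, which matches the stated prediction of roughly 3352 gms up to rounding.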

However, note that the prediction is not entirely reliable. The better the R2 and the model fit, and the lower the percentage error, the more confident one can be about the prediction. In this example, the model is not very good: R2 is 52%, MAPE is 11%, which is high, and the fit did not look very good when we examined the FIT chart visually. Therefore, the model needs to be improved to produce more reliable predictions. However, the method to derive the equation and evaluate the model remains the same.


4. BIAS AND VARIANCE

Consider a case where there are 100 columns that can be used for predicting Sales. A model is built with all possible predictors, ensuring that the errors are minimal. Assume that the model is built using data from the year 2020. Now, if the same model is used to predict Sales for 2021, the residual error may be large. In this case, the error on the 2020 data (which was used to create the model) is referred to as the "in-sample" error and the error on the 2021 data is referred to as the "out-of-sample" error.

Now, rebuild the model with fewer predictors or columns. It can be observed that the in-sample error is higher, but the out-of-sample error is lower than in the earlier case. Models with many predictors are referred to as "complex models" and models with fewer predictors are referred to as "simple models."

Complex models have low in-sample error but high out-of-sample error, meaning that complex models are able to learn the intricacies of the training data very well, but they fail to generalize. In machine learning terminology, in-sample error is termed 'bias' and out-of-sample error is termed 'variance'. So, complex models have low bias but high variance. This situation, in which a model has low bias but high variance, is termed 'model overfit'. Simple models have high in-sample error but low out-of-sample error, i.e., high bias and low variance. This can be summarized as the bias-variance trade-off, where model complexity has an impact on the bias as well as the variance. The total error of any model is the sum of its bias and variance. The best model, i.e., the model with the lowest total error, is neither very complicated nor very simple but lies somewhere in between.
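The trade-off can be seen in a small experiment. The sketch below uses synthetic data and polynomial models of increasing degree (choices that are purely illustrative) to compare the in-sample and out-of-sample errors as model complexity grows:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 3, 60).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.3, size=60)   # noisy synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in [1, 3, 12]:                      # simple -> moderate -> complex model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    in_sample = mean_squared_error(y_tr, model.predict(X_tr))    # error on the data used to fit
    out_sample = mean_squared_error(y_te, model.predict(X_te))   # error on unseen data
    print(degree, round(in_sample, 3), round(out_sample, 3))

Typically, the most complex model shows the lowest in-sample error but a noticeably higher out-of-sample error, which is exactly the overfitting behaviour described above.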


5. REGULARIZATION

As mentioned earlier, the best model is one which lies between the complicated and the simple model. The challenge is to determine which predictors need to be removed.

Regularization can be summarized as a process where we start with a complex model and, upon regularizing it, end up with an optimal model that has fewer predictors than the complex model. The other predictors effectively get removed as their beta coefficients become very small in magnitude, nearly zero. For a data set with 100+ predictor columns, it may turn out that the optimal model uses only 10 predictors.

5.1 Lasso Regression Or Regression With L1 Penalty

Lasso regression is a type of linear regression that uses L1 regularization. Here, the base cost function is the sum of the squared differences between the actual and predicted values of 'Y', and a penalty term proportional to the absolute value of the coefficients is added to it. This can help to reduce overfitting and improve the interpretability of the model by reducing the number of features that are included in the final model.

In Lasso regression, the cost function is defined as follows:

Cost function = MSE + λ * ∑|β|

Where MSE is the mean squared error, λ is the regularization parameter, and β is the
coefficient for each feature. The regularization term is added to the cost function to
discourage the model from including too many features in the model, which can lead to
overfitting.

To fit a Lasso regression model, one can use an optimization algorithm to minimize the cost
function by adjusting the coefficients for each feature. The model can then be used to make
predictions on new data by using the coefficients to calculate a linear combination of the
features.
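A minimal sketch of fitting a Lasso model with scikit-learn is shown below; the synthetic data and the alpha value (scikit-learn's name for the λ penalty strength) are arbitrary choices for illustration:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Synthetic data: 100 features, of which only 10 actually influence y
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0)          # alpha plays the role of the lambda penalty
lasso.fit(X, y)

# Many coefficients are driven exactly to zero, effectively removing those predictors
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("Predictions for first 3 rows:", lasso.predict(X[:3]))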


5.2 Ridge Regression

Ridge regression is a type of linear regression that uses L2 regularization, which adds a
penalty term to the cost function that is proportional to the square of the coefficients. This
can help to reduce overfitting and improve the interpretability of the model by reducing the
complexity of the model.

In Ridge regression, the cost function is defined as follows:

Cost function = MSE + λ * ∑β²

Where MSE is the mean squared error, λ is the regularization parameter, and β is the
coefficient for each feature. The regularization term is added to the cost function to
discourage the model from fitting the training data too closely, which can lead to overfitting.

To fit a Ridge regression model, you can use an optimization algorithm to minimize the cost
function by adjusting the coefficients for each feature. You can then use the model to make
predictions on new data by using the coefficients to calculate a linear combination of the
features.

Ridge regression is useful when you have a large number of features, and you want to reduce
the complexity of the model and improve its interpretability. It can also be useful when you
want to prevent overfitting and improve the generalization performance of the model.
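A corresponding sketch for Ridge regression with scikit-learn is shown below, on the same kind of synthetic data; again, alpha plays the role of λ and its value is arbitrary:

from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=42)

ridge = Ridge(alpha=10.0)     # alpha is the L2 penalty strength (lambda)
ridge.fit(X, y)

# Unlike Lasso, the coefficients shrink towards zero but typically remain non-zero
print("Smallest absolute coefficient:", abs(ridge.coef_).min())
print("Largest absolute coefficient:", abs(ridge.coef_).max())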

One key difference between Lasso and Ridge regression is the type of regularization used.
Lasso uses L1 regularization, which adds a penalty term to the cost function that is
proportional to the absolute value of the coefficients. Ridge regression uses L2
regularization, which adds a penalty term to the cost function that is proportional to the
square of the coefficients. Another difference is that Lasso tends to produce sparse models,
with many coefficients equal to zero, while Ridge regression produces models with all
coefficients non-zero. This can make Lasso more interpretable and easier to understand,
while Ridge regression may be more flexible and able to capture more complex relationships
in the data.


5.3 Elastic Net Regularization

Elastic net regularization is a combination of L1 regularization, which is used in Lasso regression, and L2 regularization, which is used in Ridge regression. It is a hybrid regularization method that can balance the benefits of both L1 and L2 regularization. It can be useful when you have many features and want to select a subset of the most important ones to include in the model, while also reducing the complexity of the model and improving its interpretability.

In Elastic net regularization, the cost function is defined as follows:

Cost function = MSE + α * λ * ∑|β| + (1 - α) * λ * ∑β²

Where MSE is the mean squared error, λ is the regularization parameter, β is the coefficient
for each feature, and α is a mixing parameter that determines the balance between L1 and
L2 regularization. When α = 1, the cost function reduces to the Lasso cost function, and when
α = 0, it reduces to the Ridge cost function.

To fit an Elastic net model, you can use an optimization algorithm to minimize the cost
function by adjusting the coefficients for each feature. You can then use the model to make
predictions on new data by using the coefficients to calculate a linear combination of the
features.

Elastic net regularization can be useful when there are correlated features in data, and one
wants to select a subset of the most important features to include in the model. It can also be
useful when one wants to balance the benefits of Lasso and Ridge regression and find a good
compromise between the simplicity and interpretability of Lasso and the flexibility and
generalization performance of Ridge regression.
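A minimal sketch of Elastic net with scikit-learn is shown below; note that scikit-learn parameterizes the penalty with alpha (overall strength) and l1_ratio (the L1/L2 mixing parameter, analogous to α above), and the values used here are arbitrary:

from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=42)

# l1_ratio=0.5 gives an equal mix of the L1 and L2 penalties
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X, y)

print("Non-zero coefficients:", (enet.coef_ != 0).sum())

In scikit-learn, l1_ratio=1 reduces Elastic net to Lasso and l1_ratio=0 reduces it to Ridge, mirroring the role of α in the cost function above.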


6. IMPLEMENTATION OF OLS USING STATSMODEL

Python provides several packages for implementing linear regression on data.

The code snippets provided below illustrate the analysis of marketing data to predict the new sales volume using the OLS method implemented in the statsmodels package.

The data is available in the file named mktmix.csv. Before any regression is performed,

- "Exploratory Data Analysis" needs to be performed to understand the data.
- Imputation of the missing data needs to be done.
- Transformation of the data, for example, converting the categorical values to indicator variables, needs to be performed.
- The data needs to be split into train/test sets.
- The model has to be created using the train data and evaluated against the test data.

As a first step, initialize the directory and read the data into a dataframe as shown below.
Display the first five rows using the head() command.

import os
os.chdir('C:\\Users\\MEDIA ENGINEER\\Desktop\\DSP')

import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data=pd.read_csv("mktmix.csv")

data.head()


data.shape

(104, 9)

The output above indicates that there are 104 rows of data and 9 columns.

pd.options.display.float_format = '{:.2f}'.format
data.describe()


Display the list of columns present in the dataframe.

data.columns

Index(['NewVolSales', 'Base_Price', 'Radio ', 'InStore', 'NewspaperInserts',
       'Discount', 'TV', 'Stout', 'Website_Campaign '],
      dtype='object')

The details of the values present in the column 'Base_Price' can be found using the describe() method. The output indicates that the mean value is 15.31 and the median (50%) is about 15.33.

print(data["Base_Price"].describe())

count 104.00
mean 15.31
std 0.53
min 13.74
25% 15.03
50% 15.33
75% 15.64
max 16.28
Name: Base_Price, dtype: float64

Since it is a continuous variable, a histogram of the Base_Price values can be displayed to understand its distribution.

data.Base_Price.plot(kind='hist')


A box plot also helps us to visualize quantiles and outliers.

data.boxplot(column='Base_Price')


The values corresponding to the various quantiles can be printed using the quantile() function.

data['Base_Price'].quantile(np.arange(0,1,0.1))

0.00 13.74
0.10 14.59
0.20 14.97
0.30 15.03
0.40 15.18
0.50 15.33
0.60 15.49
0.70 15.64
0.80 15.80
0.90 15.96
Name: Base_Price, dtype: float64

q = data['Base_Price'].quantile(0.01)

print(q)

13.8779529113

Very often, the outlier values may be due to incorrect entries or other errors. It is a good idea
to check the lower quantile entries as shown below.

data[data.Base_Price < q]


If needed, the values can be replaced. A common practice is to replace such values with the mean.

avg =data['Base_Price'].mean()

data.loc[(data["Base_Price"]<q), 'Base_Price'] = avg

data['Base_Price'].describe()

count 104.00
mean 15.34
std 0.48
min 14.01
25% 15.03
50% 15.33
75% 15.64
max 16.28
Name: Base_Price, dtype: float64

Similar descriptive statistics can be computed for the response variable, 'NewVolSales'.


data['NewVolSales'].describe()

count 104.00
mean 20171.07
std 1578.60
min 17431.00
25% 19048.75
50% 19943.50
75% 20942.75
max 24944.00
Name: NewVolSales, dtype: float64

data.NewVolSales.plot(kind='hist')


The descriptive statistics on the variable 'Radio' indicate that there are missing values in that particular column.

#new = data.loc[(data["NewVolSales"]>20050.53)]
#nl = list(new.index)
#data.drop(nl,axis = 0)

data['Radio '].describe()

count 100.00
mean 256.69
std 86.99
min 0.00
25% 235.00
50% 278.50
75% 313.25
max 399.00
Name: Radio , dtype: float64

R = data['Radio ']
Rd = DataFrame(R)

type(R)

pandas.core.series.Series

To handle the missing values, sklearn provides an imputer class (SimpleImputer in current versions of sklearn; older versions called it Imputer). The imputer can be initialized and the fit() method invoked with the appropriate strategy. In this example, the strategy is 'mean'. Once done, the column can be transformed and a new dataframe can be created as shown below.


from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(Rd)
X = pd.DataFrame(data=imp.transform(Rd))

data = pd.concat([data,X],axis = 1)

data.head()

The newly created column will have to be renamed.

data.rename(columns={0:"NRadio"}, inplace=True)

data['Radio '].isnull().sum()

The replacement of values can also be done using the fillna() method as shown below


# Replace Missing values


data['Radio']=data['Radio '].fillna(data['Radio '].mean())

data['Radio'].describe()

count 104.00
mean 256.69
std 85.29
min 0.00
25% 235.75
50% 276.00
75% 312.25
max 399.00
Name: Radio, dtype: float64

Website_Campaign is a categorical variable that takes values as shown below. The value_counts() method can be used to understand how many entries are present for each category.

data['Website_Campaign '].value_counts(dropna = False)

NaN 90
Website Campaign 6
Twitter 4
Facebook 4
Name: Website_Campaign , dtype: int64


As a part of the initial analysis, it is important to understand the impact of one variable on another. A scatter plot can be used to compare two continuous variables. The code below compares 'NewVolSales' and 'Base_Price' using a scatter plot. It roughly indicates that as 'Base_Price' increases, 'NewVolSales' decreases.

# Base Price
data.plot(x="NewVolSales",y="Base_Price",kind="scatter")

Similarly, the code below compares ‘NewVolSales’ and ‘Radio’ using a scatter plot.

#Radio
data.plot(x="NewVolSales",y="Radio",kind="scatter")


'Correlation' is an important concept that indicates the impact of one variable on another.

pd.options.display.float_format = '{:.2f}'.format
data.corr()


The correlation can be visualized using a heatmap as shown below.

import seaborn as sns


corr = data.corr()
sns.heatmap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values)


In any model creation, categorical variables cannot be consumed as is. They should be converted into indicator values. This is done using the get_dummies() function, which creates a column corresponding to each unique value. These columns take the value 0 or 1 depending on the original value. Very often, continuous variables can also be 'binned' into categorical variables, which are again converted into 'indicator' variables in the model. The code below demonstrates the same.


new = data.corr()
new.to_csv('corr.csv', sep=',')

data.rename(columns={'Website_Campaign ':'Website_Campaign'}, inplace=True)

data.rename(columns={'TV ':"TV"}, inplace=True)

data['log_sales'] = np.log(data.NewVolSales)

def d(x):
    if x < 15:
        return "Low"
    elif 15 < x < 15.33:
        return "Medium"
    else:
        return "High"

data['BKTPrice'] = data['Base_Price'].map(d)

data.head()


data = pd.get_dummies(data, columns=['NewspaperInserts'])


data = pd.get_dummies(data, columns=['Website_Campaign'])
data.columns

Index(['NewVolSales', 'Base_Price', 'Radio ', 'InStore', 'Discount', 'TV',
       'Stout', 'NRadio', 'Radio', 'log_sales', 'BKTPrice',
       'NewspaperInserts_Insert', 'Website_Campaign_Facebook',
       'Website_Campaign_Twitter', 'Website_Campaign_Website Campaign '],
      dtype='object')

As shown in the output above, the columns now contain 'Website_Campaign_Facebook', 'Website_Campaign_Twitter' and so on. 'Facebook' and 'Twitter' were original values of the column named 'Website_Campaign'.

data['Online'] = (data['Website_Campaign_Facebook'] +
                  data['Website_Campaign_Twitter'] +
                  data['Website_Campaign_Website Campaign '])

data["Offline"]=data['TV']+data['InStore']+ data['Radio']
data.columns

Index(['NewVolSales', 'Base_Price', 'Radio ', 'InStore', 'Discount', 'TV',
       'Stout', 'NRadio', 'Radio', 'log_sales', 'BKTPrice',
       'NewspaperInserts_Insert', 'Website_Campaign_Facebook',
       'Website_Campaign_Twitter', 'Website_Campaign_Website Campaign ',
       'Online', 'Offline'],
      dtype='object')

The 'statsmodels' package in Python provides functions to calculate the OLS on given data. The code below demonstrates the same. It calculates the OLS for 'NewVolSales' given 'Base_Price', 'InStore', 'TV', 'Discount', 'Stout', 'Radio', 'Online' and 'NewspaperInserts_Insert'.

The R² of this model is 0.72. The other details of the regression, such as the coefficients, intercept and adjusted R², are displayed as part of the summary.

import statsmodels.formula.api as smf

reg=smf.ols("NewVolSales~Base_Price+InStore+TV+Discount+Stout+Radio+Online+NewspaperInserts_Insert", data=data)
results=reg.fit()
print(results.summary())


The fit() can be repeated by dropping a few columns as shown below.


reg=smf.ols("NewVolSales~Base_Price+InStore+TV+Discount+Stout",data=data)
results=reg.fit()
print(results.summary())

The model that is created can be used to predict. The difference between the actual and predicted values can be plotted to understand the overlap. Typically, the data is split into train and test sets; the test set is then used to understand the capability of the model.


predictions=results.predict(data)
actuals=data['NewVolSales']

## Actual vs Predicted plot


plt.plot(actuals,"b")
plt.plot(predictions,"r")

residuals = results.resid
residualsdf = DataFrame(residuals)
residualsdf.rename(columns={0:"res"}, inplace=True)
plt.scatter(residualsdf, predictions)
plt.xticks([])


Besides the R², the mean absolute error (MAE) and the mean absolute percentage error (MAPE) can be calculated as shown below.

import sklearn.metrics as metrics

## Mean Absolute error


mae = metrics.mean_absolute_error(actuals,predictions)
np.mean(abs((actuals - predictions)/actuals))

0.0276494903340801

In this case, the model seems to fit well since the MAPE is very low (about 2.8%). However, as mentioned earlier, the model needs to be evaluated on unseen data to get a better understanding of its performance.


Self-Assessment Questions - 1

1. When performing a linear regression in Excel with 4 features used for prediction, there are 5 rows in the coefficients column. Which is the additional row?
2. ______________ tells us the percentage of variance in Y that is being explained by the X variables.
3. To calculate the average of the absolute error, calculate the difference
between actual and predicted, ignore the sign and then calculate the average.
This is called the __________________.
4. In machine learning terminology, in-sample error is termed as ________, and
out-of-sample error is termed as ___________.
5. Complex models have _____ bias but ______ variance
6. Simple models have ______bias and _______ variance
7. ___________ can be summarized as a process where we start with a complex
model and upon regularizing this complex model, we end up with an optimal
model
8. _____________ is a type of linear regression that uses L1 regularization, which
adds a penalty term to the cost function that is proportional to the absolute
value of the coefficients.
9. _____________is a type of linear regression that uses L2 regularization, which
adds a penalty term to the cost function that is proportional to the square of
the coefficients.


7. TERMINAL QUESTIONS

1. Explain OLS with an example.


2. Define Bias, Variance with an example
3. Define Regularization and explain the different types of regularization.
4. What are the different parameters that are used to evaluate linear regression? Explain
each of them.
5. Write a code to explain how the OLS can be calculated using statsmodel package.

8. SELF-ASSESSMENT QUESTION ANSWERS

Self-Assessment Question Answers

1. Intercept
2. R2
3. Mean Absolute Percentage Error (MAPE)
4. 'bias', 'variance'
5. low, high
6. high, low
7. Regularization
8. Lasso regression
9. Ridge regression

Terminal Questions – Answers

1. Refer section “Use Case”


2. Refer section “Bias and Variance”
3. Refer section “Regularization”
4. Refer section “Validating the model”
5. Refer section “Implementation of OLS using statsmodel”


Unit 4
Logistic Regression
Table of Contents

SL No   Topic                                          Fig No / Table / Graph   SAQ / Activity   Page No

1       Introduction                                   -                        -                3
1.1     Learning Objectives                            -                        -                3
2       Solving a Classification Problem               1 to 8                   -                4 - 9
3       Logistic Cost Function                         9, 10                    -                10 - 13
4       Measuring Binary Classification Performance    11 to 16                 -                14 - 18
5       Python Demo                                    -                        -                19 - 28
6       Multiclass Logistic Regression                 17 to 20                 -                29 - 32
6.1     "One vs Rest" Approach                         -                        -                29 - 32
6.2     Multiclass Logistic Regression                 -                        -                29 - 32
7       Python Demo                                    -                        1                33 - 39
8       Self-Assessment Question Answers               -                        -                40
9       Terminal Questions                             -                        -                40


1. INTRODUCTION
Linear regression is used to predict a continuous variable such as the cost of a ticket, the price of a stock and so on. The Logistic Regression model can be used to predict a categorical variable that contains a binary or discrete set of values, for example, to predict whether an email is spam or not, or whether the given data implies the presence of cancer or not. This is a supervised learning technique that learns from a labelled dataset and makes predictions.

1.1 Learning Objectives:

At the end of the unit, you will be able to

❖ List the drawbacks for using the linear model to predict a classification feature.
❖ Derive the logistic cost function
❖ List and explain the metrics for evaluating the classification model.
❖ Demonstrate the training and evaluation of a logistic model in Python.
❖ Explain multi-class logistic regression and demonstrate its implementation in Python.


2. SOLVING A CLASSIFICATION PROBLEM

Suppose we have a classification problem where we want to predict, based on the Age and Income of people, whether a customer is a good or a bad customer for a bank. If we create a scatter plot of Age against our response variable, it will look something like the one in Figure 1.

Figure 1 : Scatter plot of Feature1 and Response

The scatter looks like Figure 1 because the response variable takes only two unique values, 0 and 1. If we fit a linear regression model to this data, the fitted model will look like Figure 2.


Figure 2: Scatter plot with fitted model

The variable that is being predicted takes only two values, 0 and 1. However, the predictions
that are made by our model as shown in Figure 2 take many unique values, some predictions
are more than 1, some are less than 0 and some are between 0 and 1. The problem here is
that we are using a linear estimator, which can take any real values, while the variable we
are estimating i.e., the ‘target’ variable can only take two unique values. The issue here is that
there is a mismatch between the ranges of the estimator we are using and the variable we
are predicting. Clearly fitting a linear model doesn’t seem to be a good idea. Therefore,
instead of trying to predict if a given data point will be categorized as 0 or 1, it is better to
predict the chances of an observation belonging to class 1 (Figure 3) .

Figure 3: Proportion Indicating the chances


If we fit a line now to our data, where we are predicting the probability that a given row will
have a label of 1 the graph will be similar to Figure 4.

Figure 4: Model Predicting Probability

The response is now in the range of 0 to 1 as we are estimating the probability. These
estimates, that lie between 0 and 1 make sense as these can be interpreted as probabilities.
But still, there are some predictions that have a magnitude either greater than one or less
than 0. So, there still seems to be a mismatch between the range of values that our target
variable can take and what our estimator is predicting. This needs to be rectified.

To further improve this, the odds ratio can be computed. This measures how likely the event is (Figure 5).


Figure 5: Odds Ratio

The odds ratio measures how likely an event is. For example, people who are 20 years old are more likely to have a label of 1, as shown by the odds ratio. Now suppose we try to predict the odds ratio of someone being a good customer, i.e., the odds ratio that someone will be labelled as 1. Theoretically, the minimum value of the probability of an event is 0 and the maximum value is 1. When the probability of an event approaches zero, the odds ratio also approaches zero. When the probability of an event approaches 1, the odds ratio of that event approaches infinity.

Figure 6 depicts the scenario when we fit a straight line to odds ratio.

Figure 6: Model fitting odds ratio


You can see that the odds ratio can lie between 0 to infinity. Hence these predictions by our
estimator make sense. However, there is still a mismatch between the predicted range of
values and actual values of the odds ratio. To rectify the mismatch in range of values of our
linear estimator and the target variable, we can apply another algebraic transformation – log
odds. Odds ratio lies between 0 to infinity. The log of odds will lie between -infinity to infinity.

Figure 7 is what we get if we fit a straight line to the log of odds ratio.

Figure 7: Model fitted with log odds

We have already seen that log odds can take any real value, so the range of values of the linear estimator and the log of the odds ratio are now in sync. Solving the log-odds equation for the probability gives the 'Logistic Function', also known as the sigmoid function. If we plot the logistic function, we get the S-shaped curve shown in Figure 8.


Figure 8: Sigmoid Function

The logistic function outputs a probability; since probability is bounded between 0 and 1, the curve has this S shape. In standard literature, the logistic function is referred to as a link function. There are other link functions that can do the same job, but over the years the logistic function has become the most popular and is widely used for binary classification.
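A small sketch of the logistic (sigmoid) function is shown below; it maps any real-valued linear score z = β0 + β1x to a probability between 0 and 1:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)          # z plays the role of b0 + b1*x
plt.plot(z, sigmoid(z))
plt.xlabel("z = b0 + b1*x")
plt.ylabel("P(y = 1)")
plt.title("S-shaped logistic curve")
plt.show()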


3. LOGISTIC COST FUNCTION

One of the issues that we need to deal with is how to estimate the parameters of a logistic regression model. When we discussed linear regression, we minimised the residual sum of squares (RSS), and it was intuitive that RSS is the relevant cost function to minimise when we do a linear regression. But in a classification setting, this may not work.

To understand this, assume that we have a data set from an e-retailer and, based on the Age of a customer, the retailer wants to understand who a good customer would be. Assume that good customers have been labelled as 1. We can try to fit a logistic model to this data. Notice that once a model is fit, we can always obtain a probability estimate from it. Also, when we say that we are fitting the model, what we are trying to do is find the optimum values of β0 and β1.

Let's assume that the values of β0 and β1 chosen for this model are β0 = 0.7 and β1 = 1.7. We can fit another model to this data and obtain its probability predictions. Let's assume that the parameters for this model are β0 = 0.3 and β1 = 2.2.


Figure 9: Prediction with two sets of β values

Now between these two models, the model with β0=0.7 and β1=1.7 has a better fit compared
to the model with β0=0.3 and β1=2.2. You can clearly see that the second model makes a
mistake, giving high probability for a customer being good, when it should not. We need to
formalise this intuition and quantify this in the form of a cost function.

The cost function used in logistic regression is:

Cost = − ∑ [ yi log(Pi) + (1 − yi) log(1 − Pi) ]   (summed over all rows i = 1 to n)

Here, Pi is the predicted probability for row i, and yi is the label of the target variable in row i. Notice yi can take only two values, 1 and 0. Another way of writing this cost function is to substitute the model's expression for Pi (the logistic function of β0 + β1xi) directly into the cost. Both these forms are equivalent, since the predicted probability for row i is computed using the model. The second form also explicitly relates the cost to the parameters.


Table 1: Cost Predictions with different values of β

Now, using the definition of the logistic cost, compute the value of the cost for each of our hypothetical models. The first model (β0 = 0.7, β1 = 1.7) turns out to have a lower cost than the second model (β0 = 0.3, β1 = 2.2). So clearly, the cost function seems to capture the "fit" of the model.

Note that the cost function changes its value when we change the model parameters.

The objective while fitting a logistic model is to reduce the cost by choosing suitable values of the parameters β0 and β1.

This cost function is commonly known as the logistic cost function; it is also commonly referred to as 'log loss'. Unlike linear regression, where there is a closed-form solution for minimizing the RSS, in logistic regression the cost function can only be minimised using a numerical procedure such as Newton's Method or Gradient Descent. There is no closed-form solution for this cost function.
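As an illustration, the sketch below evaluates the logistic cost (log loss) for a handful of made-up labels and predicted probabilities, both directly with numpy and with sklearn.metrics.log_loss (both report the average log loss):

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])                 # actual labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])       # predicted P(label = 1)

# Average negative log-likelihood, i.e., the logistic cost divided by n
cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(cost)                    # manual computation
print(log_loss(y, p))          # sklearn computes the same average log loss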

Just like linear regression models, logistic regression models can also be regularised. We can use an L2 as well as an L1 norm-based regularization. To use an L2 norm, we simply add an L2 penalty to our cost function.

As discussed earlier in the linear regression module, lambda is a tuning parameter that is estimated using k-fold cross-validation, and the L2 term does not include the intercept term.


One can also include an L1 regularization term.

There are similarities between the linear regression formulation and this formulation. We
are simply adding a penalty term to our cost function. Instead of beta squares, we are adding
absolute values of beta. Alpha is the tuning parameter and is estimated using k-fold CV. The
L1 penalty does not include the intercept term.


4. MEASURING BINARY CLASSIFICATION PERFORMANCE

Once a model is estimated, the next task is to figure out how accurate the model is. One can report the 'log loss' or 'logistic cost' as a proxy for a model's accuracy, but log loss is not very intuitive to understand and interpret. We will instead need to use the probability output to classify and then measure the accuracy.

Assume, we’ve built a classification model on this dataset. Now, our logistic regression model
gives us these probabilities as shown in Figure 10.

Figure 10: Table with Prediction Percentages

One can assume that any observation for which this model predicts a probability of more than 0.5 of having label 1 can be considered to have a predicted label of 1. Performing this, we will end up with data similar to Figure 11.

Figure 11: Prediction with 100% Accuracy


As you can see, the predicted labels are the same as the actual labels; hence we can say that our model has achieved 100% accuracy. From this discussion, we can come up with the following definition of accuracy for any binary classifier: the accuracy is equal to the number of correctly predicted labels divided by the total number of rows in the data set.

There is another aspect of accuracy. Assume we build another model and obtain these probability predictions. Assuming a probability threshold of 0.5, we obtain the predicted labels shown in Figure 12.

Figure 12: Table in which a few predictions are incorrect

Figure 12 shows that there are errors, but the nature of the errors differs. In some cases, we are misclassifying an event as a non-event, i.e., misclassifying a 1 as a 0; in other cases, we are misclassifying non-events as events, i.e., misclassifying 0s as 1s. It is worth noting here that, by convention, whatever is labelled as 1 in the data is termed an event. Keeping this in mind, we can always create a table like the one shown in Figure 13.

Figure 13: Confusion Matrix

Here we are not only counting how many times we made a misclassification, but we are also
keeping a tab on what kind of misclassification we have made. This is termed as a confusion


matrix. “True Positives” are the counts of instances where predicted events were actually
also events. “False Positives” are the counts of instances where predicted events are actually
non-events. “False Negatives” are the counts of instances where predicted non-events are
actually events. “True Negatives” are the counts of instances where predicted non-events
were non-events. One can construct a host of measures using a confusion matrix. Some
popular measures are Precision and Recall. Precision measures, out of all the predicted events, how many were actually events, while Recall measures, out of all the actual events, how many the model was able to predict as events.
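As a small numeric sketch, the snippet below builds a confusion matrix from made-up actual and predicted labels and computes precision and recall from its entries; sklearn.metrics provides the same measures directly:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]        # actual labels (1 = event)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]        # predicted labels at some threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                     # 3 true positives, 1 false positive, 1 false negative, 3 true negatives

precision = tp / (tp + fp)                # of all predicted events, how many were events
recall = tp / (tp + fn)                   # of all actual events, how many were predicted
print(precision, recall)

# The same values via sklearn
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))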

Consider an example illustrating how changing the probability threshold changes the confusion matrix. If we now assume a threshold of 0.4, you can see that there are no longer any False Negatives. This means that a single confusion matrix is not sufficient to completely capture the accuracy of a model, as the confusion matrix can change merely by changing the threshold. So, if we want to compare two classifiers on their ability to accurately predict binary labels, a confusion matrix alone will not serve the purpose.

Figure 14: Predictions with Threshold as 0.4


We need a single measure of model performance that is independent of the probability threshold we choose.

One way to do this is to enumerate the true positive rate (TPR) and the false positive rate (FPR) across many probability thresholds and arrive at an understanding of the model's ability to classify. We would ideally want a high TPR but a low FPR. In theory, constructing such a table is fine, but from a practical standpoint, analysing tables is time-consuming.

Instead, we can plot the TPR against the FPR. Figure 15 shows how the TPR/FPR profile of a binary classifier usually looks. We also compare our classifier with a naïve classifier, one that randomly assigns labels with equal probability; the straight line refers to the TPR/FPR profile of such a naïve classifier. This curve is referred to as a ROC curve. You can clearly see that our classifier does a far better job than the naïve classifier because, for a given FPR, the TPR of our classifier is higher.

Figure 15: ROC Curve

This can also be used to compare the performance of two models.


Figure 16: Comparing Multiple ROC Curves

In Figure 16, the blue model seems to do a better job than the orange model. We can compute the area under the ROC curve (AUC) corresponding to each model. You can clearly see that the AUC will be larger for the ROC curve coloured blue. The area under the curve is another consolidated measure that captures the classification performance of a binary classifier. The higher the AUC, the better the classification ability of the model. Another point to notice here is that the AUC of the naïve model represented by the straight line is always 0.5, so it is worthwhile to keep in mind that your classifier should have an AUC of more than 0.5.


5. PYTHON DEMO

This section consists of code snippets that demonstrate the use of the sklearn’s logistic
regression methods to create a model that can be used to predict a binary result. This section
uses the “jokes_cleaned.csv” data that is used to predict if a given sentence is funny or not.
The data is split into train and test set to create and evaluate the model respectively.

To begin with, "Exploratory Data Analysis" needs to be performed on the data; therefore, load the libraries as shown below.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Load the data and display the first five rows:

data_dir="E:\ML Course\Logistic Regression\Data"


os.chdir(data_dir)

jokes_data=pd.read_csv('jokes_cleaned.csv')

jokes_data.head()


Check if any of the columns have null values using the isnull().sum() method.

jokes_data.isnull().sum()

Joke 0
Rating 0
Ratings_Cleaned 3
dtype: int64

For the sake of simplicity, we will drop the null values using the dropna() method. Other imputation techniques could also be used to populate the missing data. Check the types of the columns using the "dtypes" attribute.

jokes_data=jokes_data.dropna()
jokes_data.dtypes

Display the statistical characteristics using the describe() method.

jokes_data['Ratings_Cleaned'].describe()


Create a new column named 'Funny'. This will take a value of 1 if the rating is greater than or equal to 4, and 0 otherwise.

## Assume if the rating is 4 or more, the joke is funny


jokes_data['Funny']=jokes_data['Ratings_Cleaned'].map(lambda x: 1 if x>=4 else 0)
jokes_data.head()

Since the input data is in the form of text, we will use a simple method to convert each sentence into a numerical vector. The sklearn package contains a class called CountVectorizer to convert a sentence into a numerical vector.


## Creating Features
demo_text=["This is sentence one.", "This is sentence two.",
           "This is a very very long sentence three."]

import sklearn.feature_extraction.text as text

cv=text.CountVectorizer()
count_matrix=cv.fit_transform(demo_text)
cv.get_feature_names()

[u'is', u'long', u'one', u'sentence', u'this', u'three', u'two', u'very']

count_matrix.toarray()

The resultant matrix, after the sentences are vectorized, contains the number of times every word appears in each sentence, arranged in the form of a matrix.

pd.DataFrame(count_matrix.toarray(),columns=cv.get_feature_names())

Perform the same operation on all the sentences that is stored in the column ‘Joke’ in the
dataframe. Once transformed, ‘X’ contains the vectorised sentences.


## Create Features for the linear classifier


cv=text.CountVectorizer()
X=cv.fit_transform(jokes_data['Joke'])
X.shape

(867, 7068)

The binary values of the column ‘Funny’ which is the predicted value is stored in ‘y’.

y=jokes_data['Funny']

For every training exercise, the data needs to be split into train and test sets. The 'train' set is used to create the model and the 'test' set is used to evaluate it, since it is 'unseen' data. The sklearn package contains the 'model_selection' subpackage, which provides the train_test_split() method that randomly splits the given data into train and test subsets.

import sklearn.model_selection as model_selection


X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=0.2,
random_state=200)

The sklearn package provides classes like LogisticRegression. The LogisticRegression() constructor can be used to create a model that can then be trained.

import sklearn.linear_model as linear_model


clf=linear_model.LogisticRegression()

“Cross Validation” is a technique that is used to understand how a model will generalize.
This is covered in detail in the later chapters. We can use the GridSearchCV() method to train
the data as shown below.


np.random.seed(300)
mod=model_selection.GridSearchCV(clf,
        param_grid={"penalty": ["l1"], "C": np.random.uniform(0, 120, 100)})
mod.fit(X_train,y_train)

We can get the best model after the training and the best score.

mod.best_estimator_

mod.best_score_

0.72871572871572876

The output above indicates that the model achieves a mean cross-validation accuracy of about 73% on the training data.

We can use the trained or fitted model on the X_test and display the probabilities as shown
below.

mod.predict_proba(X_test)


array([[  9.12895311e-01,   8.71046889e-02],
       [  3.79010020e-01,   6.20989980e-01],
       [  9.92413027e-01,   7.58697291e-03],
       [  1.40181610e-02,   9.85981839e-01],
       [  9.99136438e-01,   8.63562254e-04],
       [  9.92947936e-01,   7.05206366e-03],
       [  9.72263747e-01,   2.77362532e-02],
       [  9.99611276e-01,   3.88723623e-04],
       [  9.16690337e-01,   8.33096627e-02],
       [  9.99945001e-01,   5.49992436e-05],
       [  7.44034974e-01,   2.55965026e-01],
       [  9.18009940e-01,   8.19900599e-02],
       [  9.99892027e-01,   1.07972978e-04],
       [  9.96378408e-01,   3.62159151e-03],
       [  7.67714421e-01,   2.32285579e-01],
       [  6.55421776e-02,   9.34457822e-01]])

The output shows the probabilities for the classes '0' and '1'. In other words, for each sentence it displays the probabilities of 'not being funny' and 'being funny' respectively.

mod.classes_

array([0, 1], dtype=int64)

We can compute the values in the roc_curve using the actual values and the computed
probabilities.

import sklearn.metrics as metrics


metrics.roc_curve(y_test,mod.predict_proba(X_test)[:,1])


(array([ 0.        ,  0.        ,  0.        ,  0.00769231,  0.00769231,
         0.01538462,  0.01538462,  0.01538462,  0.01538462,  0.01538462,
         0.03076923,  0.04615385,  0.06153846,  0.06153846,  0.08461538,
         ...,
         0.93076923,  0.95384615,  0.95384615,  0.97692308,  0.97692308,  1.        ]),
 array([ ... ]),
 array([  9.99879200e-01,   9.98759598e-01,   9.98134609e-01,
          9.97902840e-01,   9.97890631e-01,   9.96486702e-01,
          ...,
          8.37032844e-05,   6.33831313e-05,   6.00976743e-06]))

The output is a tuple of three arrays: the false positive rates, the true positive rates and the corresponding probability thresholds (shown truncated here). The FPR and TPR for different thresholds can be computed and displayed in a plot as shown below.

fpr,tpr,thresholds=metrics.roc_curve(y_test,mod.predict_proba(X_test)[:,1])
plt.plot(fpr,tpr,"-")

The roc_auc_score can be computed as shown below:

metrics.roc_auc_score(y_test,mod.predict_proba(X_test)[:,1])

0.65349650349650357

The confusion matrix on the test data can be displayed using the confusion_matrix() method.

metrics.confusion_matrix(y_test,mod.predict(X_test))

array([[113, 17],
[ 27, 17]], dtype=int64)

The output indicates that 113 sentences have been correctly classified as "Not Funny" and 17 have been wrongly classified as "Funny". Similarly, 27 funny sentences have been wrongly classified as "Not Funny" and 17 have been correctly classified as "Funny".

The precision, recall and f1-score can be displayed using the classification_report() method on the test data.


print(metrics.classification_report(y_test,mod.predict(X_test)))


6. MULTICLASS LOGISTIC REGRESSION

So far, we have used the logistic model to do binary classification. We will now extend the model to deal with multiple classes. There are two popular approaches to achieve this goal. One approach is called "One vs Rest" or "One vs All". The other approach is known as Multinomial Logistic Regression.

6.1“One vs Rest” Approach

In the "One vs Rest" (also known as "One vs All") approach, to handle multiple classes we build several binary classifiers and then combine their results to get a final prediction. Some ML frameworks support this feature natively, some don't. Let's take a concrete example to understand this approach.

Let's assume we have a data set from a credit card company and we want to predict the account status. There are three classes:

➢ “Paid” - People who pay back on time,


➢ “Not Paid”- People who never pay back
➢ “Late Payment” - People who pay back after 1 month.

Using the “One Vs All” approach: We first fit a model assuming people who pay back as one
class and people who do not pay back or are 1 month late as another class. Then we build a
model assuming people who do not pay as one class vs people who pay or people who are 1
month late as another class. Lastly, we build a model assuming people who are one month
late as one class vs people who pay or people who do not pay as another class

For a model where people who pay is one class and people who don’t pay or are late on their
payments, is another class, Figure 17 shows the data level view.


Figure 17: Model 1 predicting P(Paid) vs P (Not Paid or Late Payment)

You can notice that rows with status 'Paid' are labelled as 1, while the other rows are labelled as 0 in the 'Target' column.

Similarly, ‘Target’ can be calculated for Model 2 and Model 3.

Figure 18: Three Models for One vs Rest

Now, if we must make a prediction for a new row based on the three models trained earlier, say for a person with Age 23 and income $12K, we will compute the probability predicted by model 1, model 2 and model 3. Assume that model 1 returned 0.81, model 2 predicted 0.62 and model 3 predicted 0.70. Since the probability predicted by model 1 is the maximum, we predict that this person will pay.
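A minimal sketch of this combining step, assuming the three per-class probabilities have already been produced by the three binary models:

import numpy as np

classes = ["Paid", "Not Paid", "Late Payment"]
# Hypothetical probabilities from model 1, model 2 and model 3 for one customer
probs = np.array([0.81, 0.62, 0.70])

predicted_class = classes[np.argmax(probs)]   # pick the class whose model is most confident
print(predicted_class)                        # -> "Paid"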


6.2 Multiclass Logistic Regression

In the One vs. Rest (One vs. All) approach, you would have noticed that the sum of the probabilities for the K classes in each row can exceed 1. For example, in the previous illustration, we had estimated 3 probabilities for a single row and, as you can see, the sum of these 3 probabilities can be more than 1.

To overcome this limitation, where the sum of the K class probabilities exceeds 1, another multiclass logistic regression model is formulated. This formulation makes sure that the sum of the probabilities is always 1. This is ensured by fitting a model where the probability of each class k is estimated as:

P(y = k) = exp(β0k + β1k·x1 + β2k·x2 + ...) / ∑j exp(β0j + β1j·x1 + β2j·x2 + ...)

This is referred to as the Max Entropy classifier; another popular name for this formulation is softmax regression.

We can use the previous data to understand how the maximum entropy classifier works. This is a 3-class classification problem, and for each class the probability is estimated as above. This is akin to fitting one linear model for each of the 3 classes.

Table 2: Payment Status Based on Age and Income

Once the beta coefficients are set, the computation of the probability can be done in two steps: first compute the exponentials, and then calculate the probability by simply using the above formula. As shown in Table 3, the probabilities sum to 1.


Table 3: Computing Probabilities Based on Soft-max Regression
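The two-step computation (exponentials, then normalization) can be sketched as follows for a single row; the linear scores used here are made-up numbers, not the coefficients from the table:

import numpy as np

# Hypothetical linear scores b0k + b1k*Age + b2k*Income for the 3 classes (illustrative only)
scores = np.array([2.0, 0.5, 1.0])

exp_scores = np.exp(scores)                      # step 1: exponentials
probabilities = exp_scores / exp_scores.sum()    # step 2: normalize so they sum to 1

print(probabilities)          # approximately [0.63, 0.14, 0.23]
print(probabilities.sum())    # always 1.0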


7. PYTHON DEMO

“MNIST” is a popular dataset of handwritten digits. The mnist_x.csv contains the 64 pixels
representation of each image and mnist_y.csv contains the digit that each image represents. The sklearn package and the LogisticRegression class within it can be used to predict the digit from the image.

Like the previous demo, we load the dataset and perform initial analysis on the same.

import os
import pandas as pd
import numpy as np
%matplotlib inline

data_dir="E:\ML Course\Logistic Regression\Data"


os.chdir(data_dir)

pixel_values=pd.read_csv("mnist_x.csv")
image_labels=pd.read_csv("mnist_y.csv",header=None)

pixel_values.head()


image_labels.head()

One of the important steps before training the model is to ensure that the data is normalized. Normalization ensures that all the features use a common scale, which helps produce more accurate models. The pixels take values between 0 and 255; therefore, divide the pixel values by 255 so that all values lie between 0 and 1.

## Normalizing the pixel values


pixel_values=pixel_values/255.0

X=pixel_values
y=image_labels[0]


Similar to the previous demo, split the data into train and test subsets.

import sklearn.model_selection as model_selection


X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=0.20,
                                                               random_state=200)

Initialize the Logistic Regression model. Ensure that the multi_class parameter is set to "ovr" (one vs rest). Use GridSearchCV() to perform cross-validation while training the model as shown below.

import sklearn.linear_model as linear_model


clf=linear_model.LogisticRegression(multi_class="ovr",penalty="l2",solver="lbfgs")

np.random.seed(200)
mod=model_selection.GridSearchCV(clf,
        param_grid={"C": np.random.uniform(0.01, 100, 120)})
mod.fit(X_train,y_train)

The model parameters that provided the best model can be obtained using the best_params_
attribute.


mod.best_params_

{'C': 98.180452888741272}

mod.score(X_test,y_test)

0.96666666666666667

The output above indicates that 96.66% accuracy was obtained on the test data.

Use the model to predict the digit corresponding to the first test observation.

mod.predict_proba(X_test.iloc[0].values.reshape(1,-1))

np.argmax(mod.predict_proba(X_test.iloc[0].values.reshape(1,-1)))

The model predicts the value as ‘5’ which is correct.

Create a new model. This time, set the multi_class parameter to "multinomial" and train the model as shown below.


np.random.seed(200)
clf=linear_model.LogisticRegression(multi_class="multinomial",penalty="l2",solver="lbfgs")
mod1=model_selection.GridSearchCV(clf,
        param_grid={"C": np.random.uniform(0.01, 100, 120)})
mod1.fit(X_train,y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='multinomial',
n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
tol=0.0001, verbose=0, warm_start=False),
fit_params=None, iid=True, n_jobs=1,
param_grid={'C': array([ 94.76375, 22.66248, …, 16.85594, 70.5736
])},
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring=None, verbose=0)
mod1.best_params_

{'C': 96.604798457712675}

mod1.best_score_

0.9617258176757133

mod1.score(X_test,y_test)

0.97222222222222221

This provides a better score of 97.2% on the test data.

Consider a sample, for example, the first test observation.


plt.imshow(np.array(X_test.iloc[0]).reshape(8,8),cmap='gray')

<matplotlib.image.AxesImage at 0xd436128>

Predict the probabilities for this first data for the different classes as shown below.

mod1.predict_proba(X_test.iloc[0].values.reshape(1,-1))

np.argmax(mod1.predict_proba(X_test.iloc[0].values.reshape(1,-1)))

The highest probability is for the digit ‘5’.

Thus, multi-class prediction can be performed on a dataset using Python.


Self-Assessment Questions - 1

1. Write the equation for calculating the logistic cost function?


2. What is accuracy?
3. ________are the counts of instances where predicted events are actually non-
events.
4. ________measures out of all the predicted events how many were actually
events.
5. ________measures out of all the actual events, how many a model was able to
predict as events.
6. In an ideal case, a good model is expected to have _____TPR but a ____ FPR.
7. the AUC for our naïve model is always ______.
8. The __________can be used to display the TP, FP, FN, and TN values.
9. Name the method in sklearn that can be used to display the Precision, recall,
f1-score of a result from a model - ________.
10. Name the two approaches to perform multiclass prediction.
11. The cost function for multiclass classification is popularly known as
___________loss.


8. SELF-ASSESSMENT QUESTION ANSWERS

Self-Assessment Question Answers

1. Cost = − ∑ [ yi log(Pi) + (1 − yi) log(1 − Pi) ], where the sum runs over all rows i = 1 to n

2. Accuracy is equal to the number of correctly predicted labels divided by the total
number of rows in the data set.
3. False Positives
4. Precision
5. Recall
6. High, Low
7. 0.5
8. Confusion matrix
9. Classification_report()
10. One Vs. All approach and Multiclass Logistic Regression
11. Cross entropy loss

9. TERMINAL QUESTIONS

1. Describe the challenges in using a linear expression for modelling classification problems and derive the need for a sigmoid function.
2. Elaborate on different measures for binary classification performance.
3. Describe the two ways of performing multiclass logistic regression with an example.

Terminal Questions – Answers

1. Refer to the section “Solving a Classification Problem”.
2. Refer to the section “Measuring Binary Classification Performance”.
3. Refer to the section “Multiclass Logistic Regression”.


Unit 5
SVM
Table of Contents

SL No   Topic                                     Fig No / Table / Graph   SAQ / Activity   Page No

1       Introduction                              -                        -                3
        1.1 Learning Objectives                   -                        -
2       Basic Concepts                            1, 2, 3, 4, 5            -                4 - 6
3       SVM Introduction                          6, 7                     -                7 - 11
        3.1 SVM Optimization                      -                        -
4       Understanding SVM Mechanics               -                        -                12 - 17
        4.1 SVM Kernels                           -                        -
        4.2 Sample demonstration of SVM           -                        -
5       Self-Assessment Question Answers          -                        -                18
6       Terminal Questions                        -                        -                18


1. INTRODUCTION
Support Vector Machines (SVMs) are a type of supervised machine learning algorithm that
can be used for classification or regression tasks. The goal of an SVM is to find the hyperplane
in a high-dimensional space that maximally separates the data points of different classes.
The SVM works by finding the separating line (more generally, a hyperplane) that divides the two categories or, in the regression setting, predicts the continuous value as accurately as possible. This line is chosen such that it separates the two categories with the widest possible margin or predicts the continuous value accurately. The points closest to this line are called "support vectors". SVMs are widely used in machine learning and data analysis and have been applied to various real-world problems.

This unit covers the background knowledge required for a better understanding of SVM and its mechanics, along with the concepts needed to implement SVM in a real-world scenario.

1.1 Learning Objectives:

At the end of this topic, the students will be able to:

❖ Discuss the basic concepts of SVM.


❖ Describe the mechanics of SVM.
❖ Demonstrate effective ways of implementing SVM.


2. BASIC CONCEPTS

Consider a situation with two classes. As per fig.1, the two features provided are represented as x1 and x2. There are ten observations, some marked with circles and the rest with crosses. Classification is accomplished by drawing a line dividing the crosses from the circles. The space is divided into two halves by this line. All points on one side of the line will be marked with a cross, and those on the other side with a circle. The general equation of a straight line in this space is w1 x1 + w2 x2 + b = 0, or in the more condensed dot product form, w.x + b = 0.

Figure.1: Decision surfaces

In this problem, a point is marked as a cross if w.x + b > 0 and as a circle if w.x + b < 0.

One needs to choose the marker for a point based on where the point lies relative to the line. Consequently, this line is referred to as a decision boundary. In this straightforward 2-D problem, the decision boundary is a straight line. If this were a three-dimensional plot, the decision boundary would be a plane, as represented in fig.2.
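As a small illustration of this decision rule, the sketch below evaluates the sign of w.x + b for a few points; the weight vector, bias, and points are made-up values used purely for demonstration.

import numpy as np

# Hypothetical parameters of a decision boundary w.x + b = 0 (illustrative values only)
w = np.array([2.0, -1.0])
b = -3.0

# A few hypothetical 2-D observations (x1, x2)
points = np.array([[4.0, 1.0], [1.0, 2.0], [2.0, 0.5]])

for x in points:
    score = np.dot(w, x) + b
    marker = "cross" if score > 0 else "circle"
    print(f"point {x}: w.x + b = {score:+.2f} -> {marker}")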


Figure.2: Higher Dimensions hyper-plane

A hyperplane is obtained in higher dimensions. However, the equations contain unknown variables such as w1, w2, w3, b, etc.; these are the parameters. A classification algorithm's task includes determining appropriate values for these parameters.

Figure.3: ‘n’ Decision Boundaries

As per fig.3, there can be arbitrarily many decision boundaries that separate the classes. But not all of these boundaries are equally good. The decision boundary selected in fig.4 is an example of a bad decision boundary.


Figure.4: Bad Decision Boundary

This is because the separating line is too close to the points marked with crosses. This implies that this boundary is prone to misclassification: it might mistake a point that is truly a cross for a circle. To avoid this problem and to find a good classifier, the decision boundary should be as far as possible from both classes, as per fig.5. Two parallel lines are drawn, and ‘m’, the distance between them, is typically called the margin. Support Vector Machines (SVM) work by finding the two parallel lines for which the distance between them, m (the margin), is maximum. The actual decision boundary then passes along the middle of these two lines. This type of decision boundary is called a maximum-margin hyperplane. The points through which the two parallel lines pass are called support vectors, which give the algorithm part of its name.

Figure.5: Decision Boundary with separation ‘m’


3. SVM INTRODUCTION

Support Vector Machine is considered one of the most state-of-the-art and powerful
classification techniques. Support Vector Machines (SVMs) are a type of supervised machine
learning algorithm that can be used for classification or regression tasks. SVMs were first
introduced in the 1960s by Vladimir Vapnik and Alexey Chervonenkis, and the modern formulation was developed in the 1990s, with later contributions from researchers such as Bernhard Schölkopf, John C. Platt, and John Shawe-Taylor.

The basic idea behind SVMs is to find the hyperplane in a high-dimensional space that
maximally separates the two classes. This hyperplane is known as the "maximum margin
hyperplane," The data points closest to it are called "support vectors." By finding the
maximum margin hyperplane, SVMs can achieve good generalization performance, even
when the data is not linearly separable.

SVMs have become popular in various fields, including computer vision, natural language
processing, and finance, due to their ability to handle high-dimensional data and their good
generalization performance. They are also relatively simple to implement and efficient to
train, making them a popular choice for many practical applications.

There are a few points that one needs to know before using the SVM classifier (a brief code sketch follows this list):

• If the data has many features, radial kernels may offer only a slight improvement over the linear kernel. In such a case, it might be better to use the linear kernel and tune only one parameter, ‘C’.
• If the data has fewer features and is non-linear, then the radial kernel is the preferred choice among the non-linear kernels.
• It is advisable to avoid fitting polynomial kernels of degree greater than 4, as that might hamper the performance of the classifier and make it slower in operation.
• Be careful while working with polynomial kernels.
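A minimal sketch of this guidance, assuming scikit-learn, is given below; the synthetic dataset and the grid of C values are arbitrary choices made only for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for a problem with many features (assumption for the sketch)
X, y = make_classification(n_samples=300, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear kernel with only C tuned, as suggested for data with many features
grid = GridSearchCV(SVC(kernel="linear"), param_grid={"C": [0.01, 0.1, 1, 10, 100]})
grid.fit(X_train, y_train)

print(grid.best_params_)               # best C found by the search
print(grid.score(X_test, y_test))      # accuracy on the held-out data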


3.1 SVM Optimization


SVM mainly focuses on finding an efficient (maximum) margin value ‘m’. To perform this process, a few tweaks will be made to the existing decision rules, as depicted in fig.6.

Figure.6: Deciding the classes based on parallel lines

For finding the effective margin ‘m’, the decision rules are to be modified as done below,

Original decision rule:


𝑤. 𝑥 + 𝑏 > 0 𝑓𝑜𝑟 𝑥 𝑝𝑜𝑖𝑛𝑡𝑠
𝑤. 𝑥 + 𝑏 < 0 𝑓𝑜𝑟 𝑜 𝑝𝑜𝑖𝑛𝑡𝑠

Modified decision rule:


𝑀𝑎𝑟𝑘 𝑥 𝑖𝑓 𝑤. 𝑥 + 𝑏 ≥ 1
𝑀𝑎𝑟𝑘 𝑜 𝑖𝑓 𝑤. 𝑥 + 𝑏 ≤ −1


Compact form of new decision rule:


𝑦 = 1 𝑖𝑓 𝑝𝑜𝑖𝑛𝑡 ℎ𝑎𝑠 𝑥 𝑚𝑎𝑟𝑘
𝑦 = −1 𝑖𝑓 𝑝𝑜𝑖𝑛𝑡 ℎ𝑎𝑠 𝑜 𝑚𝑎𝑟𝑘
New Rule: 𝒚(𝒘. 𝒙 + 𝒃) ≥ 𝟏
Taking the new rule into consideration, the two parallel straight lines can be of the form shown in fig.7:

Figure.7: Margins

𝑤. 𝑥 + 𝑏 = 𝑐1

𝑤. 𝑥 + 𝑏 = 𝑐2

On applying the formula for the distance between two parallel lines,

m = |C2 − C1| / |w|

If C1 = 1 and C2 = −1, dividing by the L2 norm of w gives us the algebraic expression for the margin of the SVM:

Margin = m = 2 / |w|

The goal is to maximize the margin m, but one must also ensure that the boundary correctly classifies crosses and circles. In other words, for the optimum w vector that is found, one needs to make sure crosses go to the left of the red line and circles go to the right of the blue line. Notice, however, that the quantity of interest in the objective function, the L2 norm of the w vector, appears in the denominator and involves a square root. Both features add complications to solving the mathematical problem. The expression to maximize, even for the simple 2-dimensional case, appears complicated, as depicted below.
Constraints:

𝑦(𝑤. 𝑥 + 𝑏) ≥ 1 𝑓𝑜𝑟 𝑎𝑙𝑙 𝑑𝑎𝑡𝑎 𝑝𝑜𝑖𝑛𝑡𝑠

For the 2-d case, the objective function to maximize is:

m = 2 / √(w₁² + w₂²)

To get around this, here are a couple of mathematical facts:

Maximizing the inverse of a quantity is the same as minimizing the quantity itself; in other words, the larger a number, the smaller its inverse.

Additionally, the larger a number, the larger its square. In both cases, any constants can be safely ignored.

This gives the opportunity to modify the problem so that it looks mathematically more appealing. It can be reformulated as the problem of minimizing half the squared length of the vector w, subject to the constraint that all data points are classified correctly.

3.1.1 SVM Optimization- Example

Consider a simple email spam classification example to learn about the SVM optimization set
up.


Email Classification:

Email No.   Sender in contact list (x1)   Number of special characters (x2)   Spam
1           Yes                           4                                    No
2           No                            23                                   Yes
3           Yes                           45                                   Yes

Recode the column x1 as 0 if the sender is in the contact list and 1 if the sender is not in the contact list.

Email No.   Sender in contact list (x1)   Number of special characters (x2)   Spam
1           0                             4                                    No
2           1                             23                                   Yes
3           0                             45                                   Yes

Recoding the spam column as 1 if the email is spam and -1 if it’s not spam.

Email No.   Sender in contact list (x1)   Number of special characters (x2)   Spam
1           0                             4                                    -1
2           1                             23                                   1
3           0                             45                                   1

Based on the above values, the SVM formulation is to minimize ½(w₁² + w₂²) while accurately classifying the emails into their respective classes:

minimize ½ (w₁² + w₂²)
Such that,

−1 × (𝑤1 × 0 + 𝑤2 × 4 + 𝑏) ≥ 1
1 × (𝑤1 × 1 + 𝑤2 × 23 + 𝑏) ≥ 1
1 × (𝑤1 × 0 + 𝑤2 × 45 + 𝑏) ≥ 1
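To see this optimization in action, the sketch below fits a linear SVC on these three recoded emails and prints the learned w and b; the very large C value, used here to approximate the hard-margin formulation, is an assumption of the sketch.

import numpy as np
from sklearn.svm import SVC

# Recoded email data: x1 = sender not in contact list (0/1), x2 = number of special characters
X = np.array([[0, 4],
              [1, 23],
              [0, 45]])
y = np.array([-1, 1, 1])   # -1 = not spam, 1 = spam

# A very large C approximates the hard-margin SVM (minimize 0.5*||w||^2 subject to the constraints)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0])                          # learned weight vector (w1, w2)
print("b =", clf.intercept_[0])                     # learned bias term
print("margin =", 2 / np.linalg.norm(clf.coef_))    # m = 2 / |w|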


4. UNDERSTANDING SVM MECHANICS

Support Vector Machines (SVMs) are a type of supervised machine learning algorithm that
can be used for classification or regression tasks. The main idea behind SVMs is to find a
hyperplane in a high-dimensional space that maximally separates different classes.

To understand how SVMs work, it's helpful to consider the case of a binary classification
problem, where the goal is to separate two classes with a hyperplane. Here's a step-by-step
breakdown of the mechanics of an SVM:

• The SVM algorithm inputs a set of labeled training data. Each data point is represented
as a dot on a plot, and the goal is to find a hyperplane that maximally separates the two
classes.
• The SVM algorithm finds the hyperplane that maximally separates the two classes by
maximizing the distance between the hyperplane and the closest points of each class,
known as the margin. The points that are closest to the hyperplane are called the support
vectors.
• Once the hyperplane has been found, the SVM can make predictions on new data by
seeing which side of the hyperplane the data point falls on. If it falls on one side, the SVM
will predict one class, and if it falls on the other side, it will predict the other class.

Several kernel functions can be used to modify the behavior of an SVM and allow it to learn
more complex patterns in the data. For example, the radial basis function (RBF) kernel can
be used to transform the data into a higher-dimensional space, where it may be more easily
separable by a hyperplane.

Understanding SVM mechanics for multi class:

Support Vector Machines (SVMs) can be used for multi-class classification by using a one-
versus-all (OVA) or one-versus-one (OVO) strategy.

In the OVA strategy, the SVM is trained to distinguish between one class and all other classes.
For example, if there are three classes (A, B, and C), the SVM would be trained to distinguish
between class A and the combination of classes B and C, class B and the combination of


classes A and C, and class C and the combination of classes A and B. During prediction, the
class with the highest score is chosen as the predicted label.

In the OVO strategy, the SVM is trained to distinguish between every pair of classes. For
example, if there are three classes (A, B, and C), the SVM would be trained to distinguish
between class A and class B, class A and class C, and class B and class C. During prediction,
the class with the highest score from all the binary classifiers is chosen as the predicted label.
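A brief sketch of both strategies using scikit-learn's multiclass wrappers is shown below; the iris dataset is used purely as a convenient three-class example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # three classes of iris flowers
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-vs-rest (OVA): one binary SVM per class against all the others
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X_train, y_train)

# One-vs-one (OVO): one binary SVM for every pair of classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X_train, y_train)

print("OVR accuracy:", ovr.score(X_test, y_test))
print("OVO accuracy:", ovo.score(X_test, y_test))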

4.1 SVM Kernels

In the context of Support Vector Machines (SVMs), a kernel is a function that takes two input
vectors and returns a scalar value. The kernel function is used to compute the dot product of
the input vectors in a transformed feature space, where the dot product can be calculated
more efficiently than in the original input space.

Several different types of kernels can be used with SVMs, including:

• Linear kernel: This is used when the data is linearly separable. It computes the dot
product of the input vectors in the original feature space.
• Polynomial kernel: This is used when the data is not linearly separable. It computes
the dot product of the input vectors in a transformed feature space, where the
transformation is a polynomial function of the original features.
• Radial basis function (RBF) kernel: This is also used when the data is not linearly
separable. It computes the dot product of the input vectors in a transformed feature
space, where the transformation is a radial function (i.e., a function that depends on the
distance between the vectors).
• Sigmoid kernel: This kernel is similar to the RBF kernel, but it uses a sigmoid function
as the transformation.

The choice of kernel depends on the characteristics of the data and the specific requirements
of the problem. It is often recommended to try different kernels and choose the one that gives
the best performance on the validation set.
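As a quick illustration of trying several kernels, the sketch below fits an SVC with each kernel on a synthetic non-linear dataset and compares validation accuracy; the dataset and default parameters are assumptions made only for demonstration.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A simple, non-linearly separable dataset
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Compare the kernels discussed above on the same validation split
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:8s} validation accuracy: {clf.score(X_val, y_val):.3f}")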


4.2 Sample Demonstration of SVM

One way to understand SVMs is to think of them as algorithms that try to find the "maximum
margin" hyperplane that best separates different classes in a dataset. For example, in a
binary classification task with two classes (e.g., "customer will purchase product" and
"customer will not purchase product"), the SVM algorithm will try to find the hyperplane
that maximally separates the two classes.

Here's an example of how you might use an SVM to classify customers in a marketing dataset:

• Collect a dataset of customer information, including their age, income, location, and
whether or not they purchased a product.
• Pre-process the data by handling missing values, scaling numerical features, and encoding
categorical features.
• Split the data into a training set and a test set.
• Train an SVM model on the training set using a suitable kernel (e.g. linear, polynomial, or
radial basis function (RBF)).
• Use the trained model to make predictions on the test set.
• Evaluate the model's performance using accuracy, precision, and recall metrics.

Code Snippets:

The first and foremost step is to import the required libraries, as they support the operations to be executed.

The next step is to load the dataset on which the model is to be developed. Here, a marketing dataset is considered for the study; it records each customer's age, income earned, and location (recorded as ‘1’ if metropolitan and ‘0’ if non-metropolitan), along with whether they have purchased a costly product (recoded as ‘1’ if purchased and ‘0’ otherwise).

Next, the data is split into training and testing sets; here, 80% is used as training data and the remaining 20% as test data.

It is crucial to perform the scaling process, which refers to standardizing the input features
by transforming them to have zero mean and unit variance. Scaling is usually performed
before training the model, and it is often an essential step in the pre-processing of the data.
In SVM, scaling plays a significant role due to the following reasons:

• SVMs are sensitive to the scale of the input features. Scaling the features ensures that all
of them are on the same scale and contributes equally to the distance between the
samples.
• Scaling can improve the convergence of the optimization algorithm used to train the
SVM. The optimization algorithm used to train an SVM is based on the gradient descent
method, which requires small step sizes to converge.


• Scaling can improve the interpretability of the model. If the input features have very
different scales, the coefficients of the model will be influenced by the scale of the
features. Scaling the features ensures that the coefficients of the model are not
influenced by the scale of the features, which can make the model more interpretable.

Training the model is yet another important step in building any machine learning model, as it is the process of fitting the model to a dataset by adjusting the model's internal parameters to minimize the error between the model's predictions and the true labels of the data. The goal of training a model is to learn a function that can accurately predict the output for unseen data based on the patterns and relationships learned from the training data. In our case, the SVC model is built using a linear kernel (one is free to use any kernel, depending on the characteristics of the data and the specific requirements of the problem).

Once the model is trained, it is crucial to check its performance; this is where testing comes into the picture. Testing refers to the process of evaluating the model's performance on a dataset that it has not seen before. Testing a model is important because it provides an unbiased estimate of the model's performance on unseen data.

The final step is to evaluate the model and check its accuracy, based on which a decision can be taken on whether to proceed with the model or rebuild it for better performance. The model here yields an accuracy of 50%; rebuilding the model (for example, with a different kernel or tuned hyperparameters) may give more promising results.
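Since the original code snippets appear as screenshots, the following is a minimal sketch of the pipeline described above. The file name 'marketing.csv' and the column names Age, Income, Metropolitan, and Purchased are assumptions, so adjust them to the actual dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical marketing dataset; file and column names are assumed for this sketch
df = pd.read_csv("marketing.csv")
X = df[["Age", "Income", "Metropolitan"]]
y = df["Purchased"]

# 80/20 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize the features: zero mean and unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train an SVC with a linear kernel and evaluate it on the unseen test data
model = SVC(kernel="linear")
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))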


Self-Assessment Questions - 1

1. For a 3D problem, a decision boundary is a plane, whereas for higher dimensions, it is a __________.
2. A ________ is a good classifier if it produces a decision boundary that is as far away
as possible from both classes.
3. In a simple 2D problem, the decision boundary is a ______________.
4. The equation w.x+b<0 signifies ____________
5. What is the formula for the margin of the SVM?
6. ___________ kernel is used when data is not linearly separable.
7. ___________ kernel is used when data is linearly separable.
8. The ____________ kernel is similar to the RBF kernel.


5. SELF-ASSESSMENT QUESTION ANSWERS

1. hyper-plane
2. binary classifier
3. straight line
4. The point lies on the negative side of the decision boundary and is classified as a circle (‘o’ point).
5. m=2/|w|
6. Polynomial
7. Linear
8. Sigmoid

6. TERMINAL QUESTIONS

1. Brief about the concepts of SVM.


2. How does SVM deal with optimization? Justify it with a simple example.
3. Interpret how SVM deals with binary classification.
4. Describe the usage and importance of SVM kernels.
5. Interpret how SVM deals with multi-class classification.

Terminal Questions – Answers

1. Refer to section 2
2. Refer to section 3.1
3. Refer to section 4
4. Refer to section 4.1
5. Refer to section 4


Unit 6
Decision Tree
Table of Contents

SL No   Topic                                              Fig No / Table / Graph          SAQ / Activity   Page No

1       Introduction                                       -                               -                3
        1.1 Learning Objectives                            -                               -
2       Basic Concepts                                     -                               -                4 - 5
3       Decision Tree                                      1                               -                5 - 11
        3.1 Classification of Data using Decision Tree     2, 3, 4, 1                      -
4       Decision Tree Models                               -                               -                11 - 36
        4.1 Classification Tree                            5, 6, 7, 8, 2                   -
        4.2 Creating a DecisionTreeClassifier              -                               -
        4.3 Regression Tree                                9, 10, 11, 12, 13, 14, 15, 3    -
        4.3.1 Hyperparameters of a Regression tree         16, 17                          -
        4.4 Creating a Regression Tree Model in Python     -                               -
5       Self-Assessment Questions                          -                               1                37
6       Terminal Questions                                 -                               -                37
7       Answers                                            -                               -                38


1. INTRODUCTION
Decision tree is a supervised learning method used in statistics, data mining, and machine
learning. In this formalization, inferences about a set of data are made using a classification
or regression decision tree as a predictive model. The goal of a decision tree's is to assist a
person or an organization in reaching a choice by analyzing several options and the potential
repercussions of each option. A decision tree's tree structure makes it simple for the decision
maker to see the potential outcomes of each option and to assess the advantages and
disadvantages of each option in order to reach the optimal conclusion. In order to make
judgements about investments, resource distribution, and treatment alternatives, decision
trees are frequently used in business, finance, and healthcare. In order to help with complex
issue solving and process optimization, they are also utilized in disciplines including
engineering, computer science, and operations research. The fundamental goal of a decision
tree is to offer a structured and methodical method of analyzing and making decisions based
on a variety of possibilities and their potential outcomes.

1.1 Learning Objectives:


At the end of this topic, the students will be able to:

❖ Discuss the basic concepts of Decision Tree.


❖ Describe the types of Decision Tree.
❖ Demonstrate effective ways of implementing Classification Tree and Regressor Tree.


2. BASIC CONCEPTS
Charles Sanders Peirce, who is credited with inventing the decision-making diagram in the late 1800s, originated the concept of employing a tree structure to make judgements. However, computer scientists built the modern decision tree in the 1950s in order to design programs that could make judgements without requiring human input.

A decision tree can be defined as a type of tree structure that resembles a flowchart and is
used to decide or solve problems. A choice can be made by a human or a computer program
based on the possibilities available when a complicated situation is broken down into
smaller, easier decision points. The nodes of the tree structure represent the decisions that need to be made, while the branches indicate the potential results of those decisions.
The branches that emerge from each node in the tree depict the potential outcomes of that
decision. The final choice or solution to the issue is represented by the leaves, the tree's
terminal points. Decision trees can be used to model and resolve difficult problems in a range
of disciplines, including as business, economics, and computer science.

There are a few basic concepts that are important to understand in order to effectively use decision trees:
a) Root Node: The decision tree's root node is where it all begins. It stands for the first
choice to be made or issue to be solved.
b) Internal node: This is a node where the tree makes a choice. The branches that arise
from the node indicate the potential outcomes of the decision that must be made, and
the node itself represents the decision that must be taken.
c) Leaf node: The decision tree's conclusion is at this node. It stands for the final decision
or outcome of the process.
d) Branch: In a decision tree, this is a path that connects one node to another. It stands for
a potential outcome of a choice.
e) Decision rule: It is a statement that describes the circumstances in which a certain
outcome will take place. It is used to decide which branch in the tree should be taken at
each decision point.


f) Class: This is the consequence of the decision-making process. The class in a


classification tree is the anticipated label or category for a certain input. The projected
numerical value serves as the class in a regression tree.
g) Splitting: Data are split into distinct branches according to the decision-making criteria
at each internal node.
h) Pruning: It involves deleting branches from the decision tree that are redundant or
unnecessary in order to increase its accuracy and clarity.

3. DECISION TREE
The decision tree is a tree-structured classifier with a visual representation of the various decisions that could be made depending on specific circumstances. A tree-like model of decisions and their potential outcomes, including chance event outcomes, resource costs, and utility, is used in this decision-support tool. This method is one way to represent an algorithm that uses only conditional control statements.

It is a non-parametric supervised learning approach used for classification and regression


applications. It is organized hierarchically and has a root node, branches, internal nodes, and
leaf nodes.

Fig 1: Hierarchy of a Tree Structure.

The decision nodes (i.e., root node, internal node) and leaf node can be observed in the above
figure where,

Decision node defines a test or choice of an attribute, with one branch for each result and
leaf node is a classification of an example.


Circle represents decision node, (Root node/Internal Node) - D

Arcs/lines represents the states of nature (decision alternative or outcome)

Rectangle represents leaf node.

A root node serves as the first node in a decision tree and has no outgoing branches.

The internal nodes, sometimes referred to as decision nodes, are fed by the root node's outgoing branches. Both node types undertake assessments based on the available attributes to create homogeneous subsets, which are represented by leaf nodes or terminal nodes.

All of the outcomes within the dataset are represented by the leaf nodes.

3.1 Classification of the Data using Decision Tree


• Given an input, a test is run at each node, and a branch is chosen based on the results.
• This procedure begins at the root and is carried out recursively until it reaches a leaf
node, at which time the value entered in the leaf node serves as the output.
• After that, the input is given the class label connected to the leaf node.

Example: Draw a decision tree on Loan Approval.

The first test will verify the applicant's employment status, and there are only two possible outcomes: yes or no.

If the applicant is employed, there is an additional test to determine the applicant's income. You approve the loan if the applicant's income is high, and you deny the loan if the applicant's income is low.


If the applicant is unemployed, we will perform another test, in which we will look up his
credit score to see if it is high. You approve the loan if the credit score is good, and you refuse
the loan if the credit score is poor.

This can be shown in the below decision tree figure,

[Figure: decision tree with the root node “Employed” (Y/N); internal nodes “Income?” and “Credit Score”, each branching (e.g., High/Low) into leaf nodes Approve and Reject.]

Fig 2: Decision Tree to check whether a loan can be granted or not.

In the above decision tree diagram, we can see three decision nodes (Employed, Income, Credit Score) and four leaf nodes with class labels. If a new loan application is received, then the above constructed decision tree can be used to check whether the loan can be granted or not.

The two main issues to be addressed while learning a decision tree are:
• Determining when to halt data splitting, and
• Determining how to split the training records.

How should the splitting process halt?


The possible criteria to stop splitting are
1. We can stop splitting if all samples for a given node belong to the same class.
2. We can stop splitting if there are no more samples or attributes available for further
partitioning.


How should the training records be divided/split?


Choose an attribute that provides the best split. But how to choose the best
attribute/variable to split on?

The ideal characteristic is one that causes a split that only yields pure child nodes. If all of the
examples in a partition fall under the same class, it is considered to be pure.

Assume that we have a training data set as illustrated below.

The data set has two attributes A1 and A2 and the classification label has positive or negative.

Table 1: Sample Training Data Set

A1 A2 Classification
T T +
T T +
T F -
F F +
F T -
F T -

We'll now examine which attribute offers the best split.

Let's initially divide the data using attribute A1. True and false are the node's two outcomes. On the true side, there are two positive records and one negative record, and on the false side, there is one positive record and two negative records, as shown below.

Fig 3: Attribute A1 Data Split


Let's now divide the data using attribute A2.

Here also the node has true and false as its two outcomes. On the true side, there are two positive records and two negative records, and on the false side, there is one positive record and one negative record, as shown below.

Fig 4: Attribute A2 Data Split

This indicates that the child nodes produced by A1 are purer than those produced by A2. As
a result, A1 is the best attribute for selecting as the root node.

We will mathematically measure the impurity of the node rather than the purity of the node.
There are numerous ways to measure a node's impurity, which "specifies how mixed the
resulting subsets are."

Let's look at some of the typical impurity measures.


1. Gini Index: The Gini Index is a popular impurity metric used to find the optimal split. Consider a dataset segment D with c class labels. Let Pᵢ represent the proportion of examples in D that have the ith class label. Then the Gini Index is given by:

Gini index(D) = 1 − ∑ᵢ₌₁ᶜ Pᵢ²
The purity of the split increases as the Gini index decreases. Therefore, the split that
minimizes the Gini index will be chosen by the decision tree. Let's calculate the Gini index for
the data set given below.


There are 7 training records in the data set; 'accept offer' is the class label here. There are 5 records with class label yes and 2 records with class label no.
Hence the Gini index for the dataset is
Gini Index = 1 − [(5/7)² + (2/7)²]
= 1 − (0.51 + 0.08)
= 0.41
Hence the Gini index for this data set is 0.41.
Entropy is another impurity measure.

2. Entropy: Entropy is a measure of randomness or uncertainty.

Consider a segment D of a dataset having c class labels. Let Pᵢ be the proportion of examples in D having the ith class label.

Entropy(D) = − ∑ᵢ₌₁ᶜ Pᵢ log₂ Pᵢ

Entropy is measured between 0 & 1. The lower the entropy, the purer the node.

Let us compute the entropy for the same data set given above:
Entropy = -5/7 log2 5/7 - 2/7 log2 2/7
= -0.71 log2 0.71 - 0.29 log2 0.29
= -(-0.35) - (-0.52)
= 0.86.

Another impurity measure is a Classification Error


3. Classification Error: It is the ratio of the number of incorrectly classified observations


to the total number of observations.
Classification Error (D) = 1-max[pi]
Let us compute the classification error for the same dataset given above,
Classification error = 1-max [5/7,2/7]
= 1-max [0.71,0.29]
= 1-0.71
= 0.29
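A small sketch that verifies these three impurity calculations for the 5-yes / 2-no split is given below; it is purely illustrative and uses plain NumPy.

import numpy as np

def gini(proportions):
    # Gini index: 1 minus the sum of squared class proportions
    p = np.asarray(proportions, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(proportions):
    # Entropy: -sum(p * log2(p)) over the class proportions
    p = np.asarray(proportions, dtype=float)
    return -np.sum(p * np.log2(p))

def classification_error(proportions):
    # Classification error: 1 minus the largest class proportion
    return 1.0 - max(proportions)

p = [5 / 7, 2 / 7]   # 5 'yes' and 2 'no' records out of 7
print("Gini index:          ", round(gini(p), 2))                   # ~0.41
print("Entropy:             ", round(entropy(p), 2))                # ~0.86
print("Classification error:", round(classification_error(p), 2))   # ~0.29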

4. DECISION TREE MODELS


Generally, we have two types of Decision Tree Algorithms:
1. CART (Classification and Regression Tree)
2. ID3 (Iterative Dichotomiser 3)

CART makes use of Gini index as metric and ID3 uses entropy and information gain as metric.

In this unit, we will learn about the classification and regression tree.

4.1 Classification Tree


A classification tree categorizes, keeps track of, and assigns variables to distinct classes. A
classification tree can also offer a level of assurance that the classification is accurate. Binary
recursive partitioning is the method used to construct classification trees. The data is divided
into partitions in an iterative process, and each branch of the process further divides the
data.


Table 2: Bank Customer Dataset

Assume we have data on a few bank customers and want to create a predictive model to
differentiate profitable and unprofitable consumers.

If we look at this data, we can see that the proportion of profitable and unprofitable customers is 50% each, i.e.,
Total Population = 10
Profitable = 5
Unprofitable = 5
Profitability rate = 50%

In light of this, we will now attempt to subset a data set by the age variable, with one subset
obtained from all the rows where age is greater than 35 (age>35) and the other subset
coming from the rows where age is less than or equal to 35 (age <= 35).

These are the class proportions we get when we subset the data; the diagram below
illustrates them.


Fig 5: Subset a data for age variable

When compared to the proportion of profitable customers in the population, the proportion
of profitable customers in the subset where the age is over 35 is 66%. That suggests there is
a larger likelihood of finding a profitable customer in the data group with an age over 35.

Fig 6: Subset a data for age variable with respect to profit proportion

We can continue to subset the data further. For example, we can subset the segment1 data
by the marital status (i.e., Married or Single) and will get the proportion of profitable
customers as shown below.


Fig 7: Subset a data with respect to marital status

These results can be interpreted as a set of guidelines for dividing clients into profitable and unprofitable categories. Here, it is quite clear that those who are married and older than 35 have a decent chance of being profitable. The DecisionTreeClassifier operates on this principle.

We can produce interesting patterns by recursively subsetting a data collection. The goal is
to divide the data in such a way that one class of the target variable ends up dominating the
subsets of data.

For example, in the married split you can see all the observations have a class of profitable
customer while in the single split all the observations belong to the other class that is the
group of people who are unprofitable.

How do we decide which variable to split on?


As explained before, the variable which produces the greater class imbalance will be
considered.

For example, if we consider the age and gender variable to split, then


Fig 8: Comparison of a class imbalance

There is a greater class imbalance produced by the variable age versus the split caused by the variable gender.
The greater the class imbalance, the better the split.
The classification tree makes use of the Gini index as a purity metric.
Now let us compute the Gini index for both the variables, i.e., age and gender.

Case 1: Computing Gini index for Age variable.

The Gini index is computed for each of the two nodes of the age variable, i.e.:

i. For the age > 35 node, we have 2 classes, profitable and unprofitable. Compute their proportions as given:
Gini = 1 − ∑ Pᵢ²
= 1 − [(4/6)² + (2/6)²]
= 0.44

ii. For the age <= 35 node (this node has the higher class imbalance):
Gini = 1 − ∑ Pᵢ²
= 1 − [(1/4)² + (3/4)²]
= 0.375

Note: Higher class imbalance lead to lower Gini Index value.


Case 2: Computing Gini index for Gender variable.


The Gini index is computed for each of the two nodes of the gender variable, i.e.:

i. For the Gender = Male node, we have 2 classes, profitable and unprofitable. Compute their proportions as given:
Gini = 1 − ∑ Pᵢ²
= 1 − [(3/5)² + (2/5)²]
= 0.48

ii. For the Gender = Female node:
Gini = 1 − ∑ Pᵢ²
= 1 − [(2/5)² + (3/5)²]
= 0.48

After computing Gini values in both cases, we have two Gini values corresponding to each variable. Ideally, we would want a single value per variable, and that can be done by taking the weighted average, i.e., by weighing each part of the split by the ratio of the number of observations in that part to the total number of observations.

Let us consider the age variable now.

The total number of observations in the data is 10, but only 6 observations belong to the age > 35 node. So, the Gini value for this node gets a weight of 6/10, and for the other part of the split, i.e., age <= 35, the Gini value gets a weight of 4/10.


i.e., (6/10) × 0.44 + (4/10) × 0.375
= 0.41

Similarly, if we compute for the Gender variable, we get
(5/10) × 0.48 + (5/10) × 0.48
= 0.48

Now, if we compare the weighted Gini values for age and gender, we can see that the value for age is lower, indicating a better split.
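A short sketch verifying this split comparison is given below; the node class counts are taken from the worked example above.

def gini(counts):
    # Gini index of a node given the class counts in that node
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(nodes):
    # Weighted Gini of a split, weighting each node by its share of observations
    total = sum(sum(c) for c in nodes)
    return sum((sum(c) / total) * gini(c) for c in nodes)

# Age split: age > 35 has 4 profitable / 2 unprofitable; age <= 35 has 1 / 3
print("Age split:   ", round(weighted_gini([(4, 2), (1, 3)]), 2))   # ~0.42 (0.41 with the rounded node Ginis above)

# Gender split: male has 3 / 2; female has 2 / 3
print("Gender split:", round(weighted_gini([(3, 2), (2, 3)]), 2))   # 0.48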

4.2 Creating a DecisionTreeClassifier


1. First, we will import the necessary libraries, i.e., the OS module, the Pandas module, and the pydotplus module.

2. Load the required data set (here we have used the Credit_History dataset).
3. Read this file using the read_csv method of the Pandas module to create a Pandas data frame, and then use the head method to look at the first few observations of the data set.


We'll try to predict the value of this default column. Consider this data set as if it belonged to
a credit card business that was trying to figure out why some of its customers were
defaulting on their payments.

4. Let's perform a small data audit before we develop a model to see if any of our columns
have any missing values.

The output shows that column years has 279 missing values.

5. These missing values might be imputed, if possible. Looking at the distribution of this
variable (years) in our dataset could serve as a good place to start for us in order to
impute the missing values in this particular column.


This is how the years column's distribution looks. A simplistic way to impute the 279 missing values would be to replace them with the median of the data. The current median of the data is 4, so we will use the fillna method to fill in the missing values in the years column of the data set with the number 4, as shown in the below code.

6. After this develop the predictor matrix.

This is how the predictor matrix looks like.

7. Now we can see that the predictor matrix contains some variables that aren't numbers. So, the next step is to give these non-numerical variables a numerical representation using one-hot encoding, which is the simplest method for converting non-numerical variables into numbers, i.e., by using the get_dummies method of a Pandas data frame.

8. After converting non-numerical data to numerical data, this is how the predictive model
looks like.


9. Next, make a target vector by putting all the values in the default column of object Y as
we are predicting the default column.

10. Now, split data set into training and test data components using model_selection
module which has train_test_split method that helps us in splitting a data set into
testing and training components.

11. Now, data is ready to build a decision tree model.


12. We must import the tree module in order to create a decision tree model. We have a class called DecisionTreeClassifier in the tree module. When we create an object of this class, it accepts many parameters (but here we are using only one parameter, i.e., max_depth). We will use the fit method after creating an object of the DecisionTreeClassifier class and passing in the training set. Our DecisionTreeClassifier is trained at this point. Now we will use the DecisionTreeClassifier that we just trained to obtain the accuracy score on our test data set.

0.6274256144890039   # Accuracy is around 62%


13. Next, we can construct other performance metrics, such as the area under the curve, by importing the metrics module, which contains the roc_auc_score method. It only requires two inputs: the actual labels and the predicted probabilities from the decision tree model.

0.6721250820352787

14. Next, we will visualize the decision tree that we have just created. We will need to employ a few packages (i.e., the pydotplus library and Graphviz) in order to view the decision tree.
15. Now use the trained model, which is stored in the object clf, to create a representation that can further be used to create a visual output of our trained model.

In the above code snippet, we use the export_graphviz method found in the tree module, which accepts a few parameters. One of the parameters is our trained model, and the other two important parameters are feature_names (to which the column names of the predictor matrix are passed) and class_names (when building a classification tree, we have to mention the classes in our data set; here, the target variable had only two classes, 0 and 1, as shown in the above code snippet).

16. Next will create a graph representation from the object that we created in the previous
step (i.e., dot_data).


17. Then will use the image module to visualize the tree that we have just built.

This is the visual representation of my tree model.

18. We know that our DecisionTreeClassifier can take many parameters which we might need to tune. We can do this tuning process using grid search cross-validation.

For example, we will do a grid search on the max_depth parameter.

We will use the GridSearchCV method in the model_selection module, specifying the model object and the parameter grid.

Let's do the grid search on max depth.

GridSearchCV(estimator=DecisionTreeClassifier(max_depth=3,random_state=200),
param_grid={'max_depth': [2, 3, 4, 5, 6]})


Now let’s take a look at the best estimator

DecisionTreeClassifier(max_depth=2, random_state=200)

So, our best estimator is for a max depth of 2.

Let’s look at its score,

0.6314528049645112
The score turns out to be .63.

So, in this specific code demonstration, we have shown how to create a straightforward DecisionTreeClassifier, how to visualize it, and how to use a grid search to tune some of its parameters.
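Since the individual code snippets are shown as screenshots in the original, the following is a minimal consolidated sketch of steps 1–18. The file name 'Credit_History.csv' and the column names 'default' and 'years' are assumptions based on the description, so adjust them to the actual dataset.

import pandas as pd
from sklearn import model_selection, tree, metrics

# Load the credit history data (file name assumed for this sketch)
df = pd.read_csv("Credit_History.csv")

# Impute missing values in the 'years' column with the median (4 in the walkthrough)
df["years"] = df["years"].fillna(df["years"].median())

# Build the predictor matrix and target vector; one-hot encode the non-numeric columns
X = pd.get_dummies(df.drop("default", axis=1))
y = df["default"]

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=200)

# Train a shallow classification tree and evaluate it
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=200)
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
print("AUC:", metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Tune max_depth with grid search cross-validation
grid = model_selection.GridSearchCV(clf, param_grid={"max_depth": [2, 3, 4, 5, 6]})
grid.fit(X_train, y_train)
print(grid.best_estimator_)
print("Best CV score:", grid.best_score_)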

4.3 Regression Tree


A decision tree can also be used for regression. A decision tree regressor can be used when the target variable is continuous. In that case, the mean value of the target variable within a node serves as the prediction.

Let’s consider an example, to understand how regression trees work.

Let's say we have a dataset about price of various cars along with other attributes such as:
where the car was manufactured; what is the size of wheel rim; size of tire; and type of car
as shown in the below table.


Table 3: Car Manufacture Dataset

Now the task is to predict the price.


To predict the price, we will build a decision tree model. Here, price is a continuous variable; hence we will build a regression tree.

To build a regression tree, we recursively subset the data. One of the subsets is shown below, and it is obtained using the rim variable. In a regression framework, since the target variable is continuous, we look at the average value of the target variable in each subset.

In the below figure, we can see that there are two nodes; the average price is 17.50 for the Rim_R14 = Yes subset and 26.97 for the Rim_R14 = No subset.

Fig 9: Subset a data for Rim variable


Just like a classification tree, in a regression tree we recursively subset the data, and as a prediction we obtain the average value of the continuous target variable. Just like a classification tree, a regression tree can also be summarized as a series of rules, as shown below.

Fig 10: Subset a data for Tire variable

How does a regression tree split?


While doing regression, we want our predictions to be accurate. In a regression tree, the prediction is the average of the target variable in a decision node. For regression trees, one usually computes either the Mean Squared Error (MSE) or the Residual Sum of Squares (RSS) as a proxy for accuracy in each node.

Let us consider an example to understand how MSE or RSS helps in deciding which variable to choose for a split.

The below figure shows the average value of the subset when the split is done on the variable rim and when the split is done on the variable Germany.


Fig 11: Computing the average value of a subset for Rim and Country Variable.

We can observe that our subsets have changed due to the choice of different variables to
split.

Now, how can we determine whether rim is a better variable to split on than country, or whether country is a more useful variable to split on than rim?

Here the decision to be taken based on the variable that helps in creating more accurate
predictions. We can use Mean Square Error (MSE) or Residual Sum of Square (RSS) to
measure the accuracy.

MSE = (1/n) ∑ (yᵢ − µ)²

MSE is the average of the RSS, which is nothing but the variance of the target variable values in a node.

Here we have a subset of data caused due to the split of Rim_R14=Yes and the other subset
caused due to the split of Rim_R14=No. The prediction for each of the subset is given as
shown below


Fig 12: Computing the prediction for RIM Variable

The predictions are the average value for each node computed by taking the average of the
subset of the target variable created.

Same process will be repeated for the tree with Country – Germany variable as shown below.

Fig 13: Computing the prediction for Country Variable

Case 1: For the rim variable, we will compute the accuracy of the prediction.

We will now use MSE = (1/n) ∑ (yᵢ − µ)² to find out how accurate the prediction is in each node.

a. The accuracy for the Rim_R14 = Yes split is calculated using the MSE formula:
MSE = (1/6) [(11.95 − 17.50)² + (18.90 − 17.50)² + (24.65 − 17.50)² + (13.15 − 17.50)² + (20.22 − 17.50)² + (16.14 − 17.50)²]
MSE = 18.67

b. The accuracy for the Rim_R14 = No split is calculated using the MSE formula:
MSE = (1/4) [(24.76 − 26.97)² + (26.90 − 26.97)² + (33.20 − 26.97)² + (23.04 − 26.97)²]
MSE = 14.78

Case 2: For the country variable, we will compute the accuracy of the prediction.

a. The accuracy for the Country Germany = Yes split is calculated using the MSE formula:
MSE = (1/4) [(26.90 − 25.91)² + (18.90 − 25.91)² + (24.65 − 25.91)² + (33.20 − 25.91)²]
MSE = 26.21

b. The accuracy for the Country Germany = No split is calculated using the MSE formula:
MSE = (1/6) [(11.95 − 18.21)² + (13.15 − 18.21)² + (20.22 − 18.21)² + (16.14 − 18.21)² + (24.76 − 18.21)² + (23.04 − 18.21)²]
MSE = 23.22

Now we have computed the MSE for both the trees for each of its nodes and the same has
been represented in the below diagram.

Fig 14: Computing the MSE for RIM and Country Variable

Now we want to obtain a single number describing the MSE of the whole split, which can be achieved by computing the weighted average of the MSE, as shown in the below figure: for split 1 we get 17.114 and for split 2 we get 24.416.


Fig 15: Computing the weighted average for RIM and Country Variable

Since the MSE for split 1 (i.e., the rim variable) is lower than that for split 2 (i.e., the country variable), rim is a better variable to split on when compared to the country variable.
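The sketch below reproduces these node and split MSE computations with NumPy; the price values come from the worked example, and the membership of the Germany split is inferred from the node means shown above.

import numpy as np

def node_mse(values):
    # Mean squared error of a node when the prediction is the node's mean
    v = np.asarray(values, dtype=float)
    return np.mean((v - v.mean()) ** 2)

def split_mse(*nodes):
    # Weighted average of the node MSEs, weighted by node size
    total = sum(len(n) for n in nodes)
    return sum(len(n) / total * node_mse(n) for n in nodes)

# Split 1: Rim_R14 = Yes / No
rim_yes = [11.95, 18.90, 24.65, 13.15, 20.22, 16.14]
rim_no = [24.76, 26.90, 33.20, 23.04]

# Split 2: Country Germany = Yes / No
germany_yes = [26.90, 18.90, 24.65, 33.20]
germany_no = [11.95, 13.15, 20.22, 16.14, 24.76, 23.04]

print("Rim split MSE:    ", round(split_mse(rim_yes, rim_no), 3))          # ~17.11
print("Country split MSE:", round(split_mse(germany_yes, germany_no), 3))  # ~24.42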

4.3.1 Hyperparameters of a Regression Tree


• Similar to a classification tree, a user must choose the tree's depth, the number of
observations in its terminal nodes, and other factors. Now, one can utilise a grid search
to determine what these hyperparameters' proper values should be.
• We can determine the relative importance of the predictors using both classification
trees and regression trees. This is accomplished by calculating feature importance. The
overall reduction of purity measure produced by a feature is used to calculate its
importance.
• Let us consider an example for a classification tree to show how feature importance is computed. Let us consider the below decision tree model.


Fig 16: Decision Tree Model with two different Features

The first split happens on Feature A and the second split happens on Feature B. Additionally, each node displays the class proportions and the number of observations for each split.

Now we have two features, Feature A and Feature B. Which feature is more important? While computing variable importance, both the sequence of the splits and the purity of the nodes should be considered. As you can see, Feature A precedes Feature B, but Feature B seems to create a greater class imbalance, i.e., greater node purity in terms of the Gini measure (the ability to create class imbalance can be measured by the reduction in Gini from one split to the other), as shown in the below figure.


Fig 17: Computing the Gini index value and Gini reduction for Feature A and Feature B

The feature importance of variable A is computed as:

The feature importance of variable B is computed as:

As you can see, the drop in Gini for Variable B is larger, but since this node contains only 400 observations, the term's total influence is smaller.

For a regression tree, one would examine the MSE or RSS reduction caused by each feature and weight this reduction appropriately to determine feature relevance.


4.4 Creating a Regression Tree Model in Python


1. First, we will import the necessary libraries, i.e., the OS module and the Pandas module.

2. Set the working directory and load the required Data Set (Here we have used dm.csv
dataset).

It shows the column names of the specific file I just read.

3. Use the head method to look at the first few observations of the data set

The above data is an e-retailer dataset. The e-retailer has recorded the demographic details of each of its clients and also keeps track of the quarterly mean expenditure by each of them.

Now, for demonstration purposes, let us consider that the e-retailer wants to predict the AmountSpent column using the other information available about the customers.

Since the amount spent column is a continuous column, we will be using a decision tree regressor.

Unit 6: Decision Tree 32


DADS303: Introduction to Machine Learning Manipal University Jaipur (MUJ)

4. Next, we will create a predictor matrix by dropping the AmountSpent variable and Cust_Id. (The amount spent variable cannot be a component of our predictor matrix because it is the column that we wish to forecast. The customer Id column shouldn't be used as a predictor either; consequently, we are also removing the customer Id column.)

5. Next, we will have a look at our predictor matrix.

We can observe that there are a few non-numeric variables in our dataset. Therefore, we must convert these non-numeric variables into a numeric form.

6. We will use the get_dummies method to convert the non-numeric data to numeric data.

7. Now we will again have a look at our predictor matrix.


This is how the predictor matrix looks after converting the non-numeric variables into their numeric representation.

8. Next, we will create the target vector.

9. Now, we will divide the data set into training and test components before creating our decision tree regressor model, using the model_selection module and its train_test_split method.

Now we are ready to build a decision tree regressor model by importing the tree module, which has a class called DecisionTreeRegressor. This class is used to instantiate an object (reg) of the decision tree regressor. Once we instantiate the reg object, we use the fit method to fit the training data on the decision tree regressor model.

10. Now we will check the score of the regression tree on our test data set using the score method.


This displays the model's R² score (coefficient of determination) on the test set of data.

11. We can also have a look at which predictors are important by printing out their feature importances.

This gives us an array of the feature importances of all the predictors that we have used.

A better way is to create a pandas Series from the NumPy array of feature importances, indexed by the predictor names. After the Series object is created, we will sort its values in descending order and print the top five predictors according to the importance of each feature, i.e.,

Here Salary is the most important feature followed by catalogs and so on.

12. Now let’s visualize the decision tree regressor model that we have just created.


This is how the visual representation of our decision tree regressor looks.


5. SELF-ASSESSMENT QUESTIONS

SELF-ASSESSMENT QUESTIONS – 1

1. The _____________ node is where the tree makes a choice.


2. The decision tree's conclusion is at the _________ node.
3. The _____________ is a tree-structured classifier that displays different decisions
that might be taken based on particular situations.
4. __________ is a measure of randomness or uncertainty.
5. The ____________ is a popular impurity metric used to find the optimal split.
6. In a ___________________, the greater the class imbalance, the better the split.
7. The tuning process will be performed using _____________.
8. A ____________ can be used when the target variable is continuous.

6. TERMINAL QUESTIONS
1. Define decision tree and mention DT models. (1, 3)
2. Briefly explain how the data is classified using Decision Tree. (2.1)
3. Describe how the training records be split in classification tree (2.1)
4. Explain the impurity controls with a suitable example. (2.1)
5. How does a regression tree split? (3.3)
6. Elucidate the hyperparameters of a regression tree. (3.3.1)


7. ANSWERS

Self-Assessment Questions
1. Internal Node
2. Leaf
3. Decision tree
4. Entropy
5. Gini Index
6. Classification Tree
7. Grid Search Cross Validation.
8. Decision Tree Regressor

Terminal Questions
1. Refer Section 1 and 3
2. Refer Section 2.1
3. Refer Section 2.1
4. Refer Section 2.1
5. Refer Section 3.3
6. Refer Section 3.3.1


Unit 7
Machine Learning Algorithm K-Nearest
Neighbours
Table of Contents
SL No | Topic                                                          | Fig No / Table / Graph | SAQ / Activity | Page No
1     | Introduction (1.1 Objectives)                                  | -                      | -              | 3
2     | Concept and Motivation for K-Nearest Neighbour (KNN)           | 1                      | -              | 4-5
3     | Use cases                                                      | -                      | -              | 5-6
4     | Algorithm                                                      | -                      | -              | 6
5     | Distance Measurement                                           | -                      | -              | 7-8
6     | Selecting ideal value for K                                    | -                      | -              | 8-9
7     | Benefits and Weakness of K-Nearest Neighbours Algorithm        | -                      | -              | 9
8     | Implementation of K-nearest neighbor using Python programming  | -                      | -              | 10-18
9     | KNN and Regression                                             | -                      | -              | 19
10    | Self-Assessment Questions                                      | -                      | 1              | 20
11    | Terminal Questions                                             | -                      | -              | 21
12    | Answers                                                        | -                      | -              | 21


1. INTRODUCTION
The K-Nearest Neighbor (KNN) algorithm is one of the simplest algorithms that can be used for
prediction. It is based on the concept of a 'neighborhood'. Given a set of known data points
whose features and results are available, the 'feature similarity' of a new data point to these
points is used to identify its neighbors. The result for the new point is then computed as the
mean or mode of the results of these 'K' nearest neighbors. The method was first developed by
Evelyn Fix and Joseph Hodges in 1951 and later expanded upon by Thomas Cover.

1.1 Objectives
After learning this unit, you would be able to:

❖ Understand the concept of K-NN


❖ Explain the need for K-NN
❖ List the Building blocks of K-NN
❖ Describe the working Principles of K-NN
❖ List Step by Step Methods of K-NN
❖ Describe the selection of the value of K
❖ Calculate the Euclidian Distance
❖ List the Pros & Cons of K-NN
❖ Implement K-NN using Python


2. CONCEPT AND MOTIVATION FOR K-NEAREST NEIGHBOUR (KNN)


K Nearest Neighbor (KNN) is a type of supervised machine learning algorithm that can be
used for both classification and regression. In KNN, the output (i.e., the class label or value)
of a new data point is determined by comparing the data point to its "nearest" neighbors in
the training data, according to some distance metric.

Consider the diagram shown below. There are four groups of data represented by various
classes - red (3), blue (0), orange (1) and green (2). Consider a new data point represented in
black. To predict the category of this new point, the 'k' nearest neighbors are identified, and
the majority category among these neighbors is chosen as the predicted value. In this case,
since most of the neighbors belong to Category Red (3), the prediction for the new data point
is Category Red (3).

Fig 1: Concept of KNN


Unlike the linear regression algorithm, there is no computation of parameters or derivation
of a function as part of the training. K-Nearest Neighbor is one of the simplest machine
learning algorithms and is a non-parametric ML technique, which means it does not make any
assumptions about the underlying data. The KNN approach does not learn from the training
data in advance; it does the computation at classification time itself. Therefore, we call it a
lazy learner algorithm. A new point is assigned a value based on how closely it resembles the
points in the training set. In other words, the KNN algorithm uses 'feature similarity' to
predict the values of any new data points.

In spite of its simplicity, KNN has been demonstrated to be extremely effective on specific
problems. The KNN algorithm is by far more commonly used for classification challenges and
problems, and it is often applied to establish a baseline.

3. USE CASES
KNN can be used for both classification and regression use cases.

Consider the use case of a credit card company that needs to identify fraudulent transactions
in a dataset of credit card transactions. Let us assume that there is a historical dataset of
transactions that are labelled as either "fraudulent" or "not fraudulent".

In such cases, K Nearest Neighbor Classification can be used to identify fraudulent
transactions. This method works by finding the K "nearest" transactions in the dataset
(based on their characteristics, such as the transaction amount, the merchant's location, etc.)
and using their labels to determine the label of the new transaction.

For example, let's say you set ‘n_neighbors’ or K=50, meaning the labels of the 50 nearest
transactions are checked to determine the label of the new transaction. The ‘mode’ or the
maximum occurring label in the 50 nearest transactions is used to predict the label of the
incoming transaction. K Nearest Neighbor Classification can be useful for categorizing data
based on previous examples.

Consider use-case to predict the sale price of a house based on its characteristics (e.g., square
footage, number of bedrooms and bathrooms, etc.). If the dataset related to the previous
house sales in the area, including their characteristics and sale prices are available, one could


use K Nearest Neighbor Regression to make predictions. This method works by finding the K
"nearest" houses in the dataset (based on their characteristics) and using their sale prices to
estimate the sale price of the new house.

For example, let's say you set ‘n_neighbors’ or K=5, meaning the sale prices of the 5 nearest
houses are used to calculate the sale price of the new house. The 5 nearest houses have been
sold previously for Rs50,00,000, Rs 55,00,000, Rs60,00,000, Rs 65,00,000, and Rs 70,00,000.
In this case, you could use the average of these sale prices, Rs60,00,000, as your prediction
for the new house.

Overall, K Nearest Neighbor Regression and Classification can be useful for making
predictions based on previous data. It can be particularly effective when there is a large
dataset with many examples that are similar to the new data point.

4. ALGORITHM
The KNN algorithm involves the following steps:
• Let m be the number of training examples, and let q be a new, unlabelled data point.
• Store the training examples in an array of data points, array[], where each element of
the array represents a tuple (x, y) of features and label.
• For i = 0 to m-1: compute the distance d(array[i], q).
• Collect the set S of the K training points with the smallest distances.
• If it is a classification problem, determine the majority label among S; if it is a
regression problem, calculate the average of the labels in S.
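The following is a minimal sketch of these steps in Python; the toy training data is made up purely for illustration, and this is not an optimized implementation.

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, q, k=3):
    # Distance from the query point q to every training point (Euclidean)
    distances = [np.sqrt(np.sum((np.array(x) - np.array(q)) ** 2)) for x in train_X]
    # Indices of the k smallest distances (the set S)
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels (classification)
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Hypothetical toy data: two features per point, two classes (0 and 1)
train_X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, q=[6, 5], k=3))   # expected to print 1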


5. DISTANCE MEASUREMENT
There are several different distance metrics that can be used with the K-nearest neighbors
(KNN) algorithm. They are

1. Euclidean distance: This is the most commonly used distance metric, and it simply
calculates the straight-line distance between the features representing two points.
Euclidean Distance = √((X2 – X1)² + (Y2 – Y1)²)

2. Manhattan distance: This distance metric calculates the distance between two points
by summing the absolute differences of their coordinates along each dimension.
Manhattan Distance = |X2 – X1| + |Y2 – Y1|

3. Cosine similarity: This distance metric calculates the cosine of the angle between two
vectors. It is commonly used in text classification and other applications where the data
is represented as high-dimensional vectors.
4. Hamming distance: This distance metric is commonly used when working with
categorical data. It counts the number of positions at which the corresponding values of
two points differ.


5. Minkowski distance: The Minkowski distance generalises both the Euclidean and
Manhattan distance metrics. The behaviour is determined by the "p" parameter: the
Manhattan distance is obtained when "p" is set to 1, while the Euclidean distance is
obtained when "p" is set to 2.

The choice of distance metric will depend on the specific data and problem. Some distance
metrics may be more suitable than others for certain types of data or certain applications.
It's often helpful to experiment with different distance metrics to see which one works best
for the particular problem.
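As a quick illustration, the sketch below computes a few of these metrics for two hypothetical two-dimensional points using NumPy; the points are made up for demonstration.

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))            # 5.0
manhattan = np.sum(np.abs(a - b))                    # 7.0
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)    # Manhattan when p=1, Euclidean when p=2
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, minkowski, cosine_sim)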

6. SELECTING IDEAL VALUE FOR K


There is no single value of 'k' that would work for all cases, and it becomes necessary to
experiment with different values of 'k' to find the one that is ideal for the use-case. The
optimum value for 'k' in the K-nearest neighbors (KNN) algorithm will depend on the dataset.

Often an ‘elbow method’ is used to choose a good value for the number of neighbours to use
when fitting the model. It is called the "elbow method" because the optimal number of
neighbours is often found at the "elbow" of a graph of the model's performance on the
training data as a function of the number of neighbours. To use the elbow method, you would
typically plot the model's performance metric on the training data as a function of the
number of neighbours, and look for the "elbow" in the plot, which is the point of diminishing
returns in model performance. The number of neighbours at the elbow is then chosen as the
optimal number.

For example, suppose you are using the KNN algorithm for classification and you plot the
classification accuracy as a function of the number of neighbours. If the plot looks like an arm
bent at the elbow, then the "elbow" is the point beyond which adding more neighbours gives
little further improvement (or the accuracy starts to decrease). The number of neighbours at
this point would be chosen as the optimal number.

Grid Search with Cross Validation is another method used to determine the ideal 'k'. This
involves training the model using a range of different values for 'k' and selecting the value
that produces the best performance. Cross-validation is recommended when selecting


the ideal value for "k". This involves splitting the data into a training set and a validation set.
The model is then trained on the training set using different values of 'k' and evaluated
against the validation set. The value of 'k' that generates the best performance on the
validation set can then be used for prediction.

A common heuristic is to choose a small odd value of 'k' when the number of classes is even,
and a small even value when the number of classes is odd. This helps in resolving ties where
two or more classes receive the same number of votes.
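A rough sketch of this selection process is shown below: it scans a range of candidate 'k' values with cross-validation and plots the mean accuracy. The dataset (iris) and the range of values are placeholders chosen for illustration.

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)   # any labelled dataset can be used here

k_values = range(1, 26)
mean_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross-validated accuracy for this value of k
    mean_scores.append(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

plt.plot(k_values, mean_scores, marker='o')
plt.xlabel('Number of neighbours (k)')
plt.ylabel('Cross-validated accuracy')
plt.show()   # look for the "elbow" where the curve flattens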

7. BENEFITS AND WEAKNESS OF K-NEAREST NEIGHBOURS ALGORITHM


The advantages of the KNN algorithm are that it is straightforward and uncomplicated to
implement. Therefore, it is often used during the initial stages of creating a model to establish
a baseline.

The main disadvantage of the KNN algorithm is that it can be computationally intensive to
compute the distance between an incoming data point and all of the existing data points in a
large dataset. This is because the distance must be calculated for each point in the dataset,
and this can require a significant amount of computational power, especially for large
datasets with many dimensions.

However, there are several ways to make this process more efficient. One common approach
is to use approximate nearest neighbour (ANN) algorithms or tree-based structures such as
kd-trees and ball-trees. These use techniques such as locality-sensitive hashing or spatial
partitioning to quickly find the nearest neighbours without having to calculate the distance
to every point in the dataset.

Another approach is to use parallel computing, where the distance calculation is performed
in parallel on multiple processors or computers, which can greatly reduce the computational
time required.


8. IMPLEMENTATION OF K-NEAREST NEIGHBOR USING PYTHON PROGRAMMING
The sklearn package contains pre-processed datasets that can be used for understanding and
evaluating machine learning algorithms. The 'iris' dataset is one such dataset. This dataset
consists of 3 different types of irises (Setosa, Versicolour, and Virginica). The petal and sepal
length and width of the different flowers are stored in a 150x4 numpy.ndarray. Using the petal
and sepal length and width, the type of iris can be predicted.

We will use the iris dataset along with the KNN algorithm to predict the type of iris. The steps
involved are:
o Exploratory data analysis
o Data processing phase
o Fitting the KNN algorithm to the training dataset
o Predicting the test outcome (result)
o Testing the correctness of the outcome (constructing the confusion matrix)

Data Processing Step:


The scikit-learn or sklearn package of ‘Python’ consists of implementation of various
machine learning algorithms. It provides a robust and flexible set of methods to train and
tune the machine learning models. As a first step, import the necessary libraries as shown
below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Datasets is a sub-package that contains curated datasets. Load the ‘iris’ dataset into a
dataframe as shown below and display the first five rows of data.

df = datasets.load_iris( as_frame=True).frame
df.head()


Output

As a part of pre-processing, it is important to check for missing data or incorrect data.


The code below checks for any ‘nulls’ in the data.

df.isnull().sum()

Output

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

There are no missing data in the datasets.


Compute the basic statistics for the data – max, min, mean, standard deviation, quantiles.
This helps us to identify if there are any outliers or incorrect values. This also helps us to
understand any skewness in the data.

df.describe()


Output

Univariate analysis involves plotting the values of a variable in a graphical format to
understand the characteristics of the single variable. The graphical format will depend on
the type of data. In this case, all the variables of the dataset are 'continuous variables'. They
can take any values within a range. Therefore, a histogram can be plotted on these variables.
The code below prints a different histogram for each category of 'iris'. This helps us to get an
understanding of the influence of these variables.


plt.figure(figsize=(20,10))
plt.subplot(2, 2, 1)
plt.hist(df[df['target'] == 0]['sepal length (cm)'], color='red')
plt.hist(df[df['target'] == 1]['sepal length (cm)'], color='blue')
plt.hist(df[df['target'] == 2]['sepal length (cm)'], color='green')
plt.title('sepal length (cm)')
plt.subplot(2, 2, 2)
plt.hist(df[df['target'] == 0]['petal width (cm)'], color='red')
plt.hist(df[df['target'] == 1]['petal width (cm)'], color='blue')
plt.hist(df[df['target'] == 2]['petal width (cm)'], color='green')
plt.title('petal width (cm)')
plt.subplot(2, 2, 3)
plt.hist(df[df['target'] == 0]['petal length (cm)'], color='red')
plt.hist(df[df['target'] == 1]['petal length (cm)'], color='blue')
plt.hist(df[df['target'] == 2]['petal length (cm)'], color='green')
plt.title('petal length (cm)')
plt.subplot(2, 2, 4)
plt.hist(df[df['target'] == 0]['sepal width (cm)'], color='red')
plt.hist(df[df['target'] == 1]['sepal width (cm)'], color='blue')
plt.hist(df[df['target'] == 2]['sepal width (cm)'], color='green')
plt.title('sepal width (cm)')

Output

'Bivariate' plots involve comparison of two or more variables in a graphical format. Since all
of them are continuous variables, we can use scatter plots to perform a 'bivariate' analysis.


The code below demonstrates the same. In the code below, the bivariate analysis is
differentiated using the ‘target’ column.

plt.figure(figsize=(20,10))
plt.subplot(2, 2, 1)
plt.scatter(df['petal length (cm)'], df['petal width (cm)'], c=df['target'])
plt.title ('petal length (cm) vs petal width (cm)')
plt.subplot(2, 2, 2)
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'], c=df['target'])
plt.title ('sepal length (cm) vs sepal width (cm)')
plt.subplot(2, 2, 3)
plt.scatter(df['sepal length (cm)'], df['petal length (cm)'], c=df['target'])
plt.title ('sepal length (cm) vs petal length (cm)')
plt.subplot(2, 2, 4)
plt.scatter(df['sepal width (cm)'], df['petal width (cm)'], c=df['target'])
plt.title ('sepal width (cm) vs petal width (cm)')

Output

The graphical images indicate that there is a clear distinction between classes when certain
attributes are considered.


Once the data is pre-processed and analysed, the data should be split into ‘x’ and ‘y’. The ‘x’
contains the features that are used in prediction and is available for the new data. ‘y’ contains
the predicted variable.

To test any model, we need to check the efficacy on unseen data. To enable this, the available
data is split into two- train and the test set. The sklearn package provides the method
train_test_split() to randomly divide the data into two subsets. The percentage split can be
provided as a parameter to the method. The code below splits the data into two sets – 80%
for training and 20% for testing.

x = df.iloc[:, :4]
y = df.iloc[:, 4]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

KNeighborsClassifier is a class in sklearn that contains the methods, attributes and
implementation of the KNN algorithm. An instance of the model can be created using
KNeighborsClassifier(). It takes as a parameter the value of 'k', i.e., n_neighbors, that needs
to be considered. It also takes the distance metric as a parameter. The code below uses the
'minkowski' metric; this, along with a value of p equal to '2', results in the algorithm using the
Euclidean distance to compute the distances between two points. Note that if the metric is
custom or unique to a use-case, then a custom 'metric' can be provided.

#Fitting K-NN classifier to the training set


from sklearn.neighbors import KNeighborsClassifier
knn= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )

The basic model that we created needs to be optimized for the best value of parameters. For
KNN, the only parameter that is of interest is n_neighbours or the ‘k’ nearest neighbours. To
find the best value of ‘k’, we can use grid search technique as shown below. In this case, we
are using the values between 1 and 25 and repeating the KNN algorithm. Please note that the
‘cv’ parameter is set to 10. Therefore, cross validation is also performed with the grid search
while determining the best value. The scoring parameter that we are considering is accuracy.


'Accuracy', as defined earlier, is the number of correct classifications divided by the total
number of test examples. Depending on the use-case, other criteria like precision, f1-score
and so on can also be specified.

from sklearn.model_selection import GridSearchCV


param_grid = dict(n_neighbors= list(range(1, 25)))
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy',
return_train_score=False,verbose=1)
grid.fit(x_train, np.ravel(y_train) )

Once the grid search is complete, the search instance can be used to identify the most
optimal parameter. As shown in the code below, the best value of 'k' or 'n_neighbors' was
determined to be 11.

grid.best_params_

{'n_neighbors': 11}

Now this optimized instance can be used to predict using the unknown data. To check how
generalized the model is, use the unseen data , i.e., x_test and y_test and calculate the
accuracy as shown below.

y_pred= grid.predict(x_test)
y_pred


array([0, 1, 2, 1, 1, 2, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 0, 0, 2, 2, 1, 1,
2, 1, 1, 2, 1, 0, 0, 0])

We can create a confusion matrix using the predicted values and the actual values using the
code shown below.

from sklearn.metrics import confusion_matrix


cm= confusion_matrix(y_test, y_pred)
cm

array([[ 8, 0, 0],
[ 0, 13, 1],
[ 0, 1, 7]], dtype=int64)

The above output indicates that all eight samples of the first class were correctly classified.
For the second class, 13 were correct and 1 was misclassified; similarly, for the third class,
7 were correct and 1 was misclassified as the second class.

The accuracy score and the classification report can be computed using the inbuilt functions
as shown below.

accuracy_score(y_test, y_pred)

0.9333333333333333

print ( classification_report(y_test, y_pred) )


Output

We can see that the simple code with KNN algorithm is able to produce an accuracy of 93%
with the iris dataset. As mentioned earlier, this is an algorithm that can be easily
implemented and tested. Therefore, it is commonly used by data scientists to create a
baseline.

Note that when the 'brute' search method is used, a new incoming point is compared to all
the points in the training set and the best subset is chosen. However, we can use algorithms
like kd-tree and ball-tree, which spatially partition the data, so that computation of the
nearest neighbours is quicker and more efficient.
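As a small sketch, the algorithm parameter of KNeighborsClassifier can be used to request a tree-based search explicitly (continuing with the variables from the example above):

knn_tree = KNeighborsClassifier(n_neighbors=11, algorithm='kd_tree', metric='minkowski', p=2)
knn_tree.fit(x_train, np.ravel(y_train))
print(knn_tree.score(x_test, y_test))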


9. KNN AND REGRESSION


The KNN algorithm can also be used for regression problems, for example, to compute the
mileage of a car, to predict a stock price, and so on. Please note that in this case the average
of the 'k' nearest neighbours is used to predict the value for a new data point. The code below
shows how the KNeighborsRegressor class can be used to train and predict values. Note that
since it is a regression use-case, the scoring is based on the 'r2_score'.

X = [[0], [1], [2], [3]]

y = [2, 3, 7, 11]

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=2)

knn.fit(X, y)

print (knn.score(X,y))

print(knn.predict([[5]]))

Output:

0.8325123152709359
[9.]


10. SELF-ASSESSMENT QUESTIONS

SELF-ASSESSMENT QUESTIONS – 1

1. K Nearest Neighbor (KNN) is a type of supervised machine learning algorithm
that can be used for both classification and regression - True or False?


2. KNN is called a ____________ algorithm because it does not learn from the training
data in advance but does the computation during classification.
3. The KNN algorithm uses _____________ to predict the values of any new data points.
4. If k is set as 25 and the KNN algorithm is implemented with a dataset for a
classification problem, then the _________________ of the 25 nearest occurring
data points is computed for prediction.
5. ____________ is the most commonly used distance metric, and it simply calculates
the straight-line distance between features representing two points.
6. The main disadvantage of the KNN algorithm is that it can be
______________________________
7. from __________ import ______________ This line of code imports the class that
implements the KNN classification algorithm.
8. If metric='minkowski'and p=2 , then the algorithm uses the _______________ to
compute the distance metrics.
9. The consolidated report displaying the precision, f1-score, accuracy and so on
can be obtained by using _________________ method.


11. TERMINAL QUESTIONS


1. Describe the different metrics that can be used in KNN Algorithm to compute the
distances between two points with examples .
2. Describe the advantages and disadvantages of using KNN algorithm. How are the
disadvantages mitigated?
3. How is the ideal value of 'k' selected for a KNN Algorithm
4. Describe with a code snippet how regression using KNN Algorithm can be performed
using sklearn package in python

12. ANSWERS
Self-Assessment Questions
1. True
2. lazy learner algorithm
3. ‘feature similarity’
4. mode or the maximum occurring label
5. Euclidean distance
6. computationally intensive to compute the distance between an incoming data point and
all of the existing data points in a large dataset
7. sklearn.neighbors , KNeighborsClassifier
8. Euclidean distance
9. classification_report()

Terminal Questions
1. Refer section “Distance Measurement”
2. Refer section “Benefits and Weakness of K-Nearest Neighbours Algorithm”
3. Refer section “Selecting ideal value for K”
4. Refer section “Implementation of K-nearest neighbor using Python programming”


Unit 8
Naïve Bayes
Table of Contents

SL No | Topic                                                    | Fig No / Table / Graph | SAQ / Activity | Page No
1     | Introduction (1.1 Learning Objectives)                   | -                      | -              | 3
2     | Naïve Bayes - Basics Concepts                            | -                      | -              | 4 - 6
      |   2.1 Concepts of Conditional Probability                | -                      | -              |
      |   2.2 Concepts of Bayes Theorem                          | -                      | -              |
3     | Naïve Bayes Classifiers                                  | -                      | -              | 7
4     | Naïve Bayes Algorithm and its working                    | -                      | -              | 8
5     | Naïve Bayes Implementation                               | -                      | -              | 9 - 12
6     | Challenges and limitations of Naïve Bayes                | -                      | -              | 13
7     | Variations and extensions of the Naïve Bayes algorithm   | -                      | 1              | 14 - 16
8     | Self-Assessment Question Answers                         | -                      | -              | 17
9     | Terminal Questions                                       | -                      | -              | 17


1. INTRODUCTION
Naive Bayes is a supervised learning algorithm used for classification tasks. As with other
supervised learning techniques, Naive Bayes predicts a target variable using features. It is a
simple and effective technique for constructing classifiers. The Naive Bayes algorithm is a
probabilistic classifier: it is based on probability models that make strong assumptions about
the independence of the features. These independence assumptions frequently do not hold in
reality. The algorithm relies on the principle of conditional probability.

This unit covers the background details of Naïve Bayes, its working principle, the types of
classifiers, the implementation aspects of Naïve Bayes and a few of its challenges.

1.1 Learning Objectives:

At the end of this topic, the students will be able to:

❖ Discuss the basic concepts of Naïve Bayes


❖ Explain the algorithm principle and different classifiers
❖ Demonstrate the working of Naïve Bayes with a real-time dataset
❖ Describe a few of its limitations and other extended features.


2. NAÏVE BAYES - BASICS CONCEPTS

Naïve Bayes is a type of machine learning algorithm that is used for classification tasks. It is
based on using Bayes' Theorem, a mathematical formula for calculating probabilities, to
predict the likelihood of an event occurring.

In the context of Naïve Bayes, the event that we are interested in predicting is whether a
given input belongs to a particular class or not. To make these predictions, the naive Bayes
algorithm uses the probabilities of certain features occurring, given a particular class, to
calculate the probability that a new set of features belongs to that class.

For example, consider a spam detector that is trying to predict whether an email is a spam
or not spam. The Naïve Bayes classifier might consider the following features:

• The presence or absence of certain words, such as "apple" or "orange."


• The presence or absence of certain characters, such as exclamation points or dollar
signs.
• The sender of the email.
• The domain of the email.

Using these features, the classifier would calculate the probability that an email is a spam,
given its observed features. It would then make a prediction based on this probability.

Probability is a fundamental machine learning concept used to make predictions and
measure uncertainty. In machine learning, the probability is a way of quantifying the
likelihood of an event occurring. It is usually expressed as a value between 0 and 1, where 0
indicates an event is impossible, and 1 indicates that it is sure to happen.

Probability is a valuable concept in machine learning because it allows us to predict the
likelihood of different outcomes occurring. Probability is also used in machine learning to
measure the uncertainty of a prediction. For example, suppose a classifier is highly confident
that an email is spam (i.e., it assigns a probability close to 1). In that case, we can be more
confident that the prediction is correct. On the other hand, if the classifier is less confident
(i.e., it assigns a probability closer to 0.5), then we should be less sure about the prediction.


2.1 Concepts of Conditional Probability


A conditional probability is the probability of an event occurring, given that another event
has occurred. For example, one might want to know the probability of it raining tomorrow,
given that it is cloudy today.

Conditional probabilities are written as P(A|B), which is read as "the probability of A given
B." In the example above, A would be the event of it raining tomorrow, and B would be cloudy
today.

To calculate a conditional probability, one needs to know the probability of both events
occurring together (the joint probability) and the probability of the conditioning event. For
example, suppose one knows that the probability of it raining tomorrow and being cloudy
today is 0.1, and the probability of it being cloudy today is 0.5. In that case, one can use
these probabilities to calculate the probability of it raining tomorrow, given that it is cloudy
today:

P (rain tomorrow | cloudy today) = P (rain tomorrow and cloudy today) / P (cloudy today)

Plugging in the values from the example above, we get:

P (rain tomorrow | cloudy today) = 0.1 / 0.5 = 0.2

So, in this example, the probability of it raining tomorrow, given that it is cloudy today, is 0.2.

Conditional probabilities are useful in machine learning because they allow us to make
predictions about the likelihood of an event occurring, given certain conditions. For example,
a spam filter might use conditional probabilities to calculate the probability that an email is
a spam, given certain features of the email (such as the presence of certain words or the
sender of the email).

In Naïve Bayes, conditional probability is used to calculate the probability that a given input
belongs to a particular class, given certain input features. Considering the same email spam
filter example, if one is building a spam filter, one might use conditional probability to
calculate the probability that an email is spam, given the presence of certain words or the
email's sender.


To do this, the Naïve Bayes algorithm first estimates the probability of each class occurring
independently (i.e., the prior probability). It then estimates the probability of each feature
occurring, given each class (i.e., the likelihood). Using these probabilities, it can then
calculate the probability of a given set of features belonging to each class using Bayes'
Theorem. The use of conditional probability is a crucial aspect of the naive Bayes algorithm.
It allows it to make predictions about the likelihood of an input belonging to a particular class
based on the probabilities of certain features occurring.

2.2 Concepts of Bayes Theorem

Bayes' Theorem is a mathematical formula used to calculate the probability of an event
occurring, given specific prior knowledge. It is named after the Reverend Thomas Bayes, who
first proposed it in the 18th century.

In the context of Naïve Bayes, Bayes' Theorem calculates the probability that a given input
belongs to a particular class, given certain input features. For example, if we are building a
spam filter, we could use Bayes' Theorem to calculate the probability that an email is a spam,
given the presence of certain words or the email's sender.

The theorem is written as follows:

P(A|B) = P(B|A) * P(A) / P(B)


where,
P(A|B) is the probability of A occurring, given that B has occurred (i.e., the conditional
probability)
P(B|A) is the probability of B occurring, given that A has occurred (i.e., the likelihood)
P(A) is the probability of A occurring independently (i.e., the prior probability)
P(B) is the probability of B occurring independently
In the context of naive Bayes, A is the class that one is trying to predict (e.g., spam or not
spam), and B is the set of features of the input (e.g., the presence of certain words or the
sender of the email). Using Bayes' Theorem, given its features, one can calculate the
probability of the input belonging to a particular class.
Bayes' Theorem is also a vital aspect of the naive Bayes algorithm.
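As a quick worked illustration with assumed numbers (these figures are made up for demonstration and are not taken from any real dataset): suppose 80% of spam emails contain the word "free", so P(B|A) = 0.8; 40% of all emails are spam, so P(A) = 0.4; and 50% of all emails contain the word "free", so P(B) = 0.5. Then the probability that an email is spam, given that it contains the word "free", is P(A|B) = (0.8 * 0.4) / 0.5 = 0.64.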


3. NAÏVE BAYES CLASSIFIERS


There are several types of Naïve Bayes classifiers, each of which makes different
assumptions about the distribution of the features. The most common types of Naïve Bayes
classifiers are:

• Gaussian Naïve Bayes: This classifier assumes that the features are normally distributed
(i.e., they follow a bell curve). It is often used in cases where the features are continuous
variables, such as the height and weight of a person.
• Multinomial Naïve Bayes: This classifier is used when the features are count data (e.g.,
the number of times a word appears in a document). It is often used in text classification
tasks.
• Bernoulli Naïve Bayes: This classifier is like the multinomial naive Bayes but is used
when the features are binary variables (i.e., they take on only two values). It is also often
used in text classification tasks.
• Complement Naïve Bayes: This classifier is an adaptation of multinomial naive Bayes
that is particularly suited to imbalanced class distributions. It is less commonly used than
the other types of naive Bayes classifiers.
• Adaptive Naïve Bayes: This classifier is like the multinomial naive Bayes but allows the
class prior probabilities to be adjusted dynamically. It is less commonly used than the
other types of naive Bayes classifiers.

The choice of which type of naive Bayes classifier to use depends on the nature of the data
and the specific classification task.


4. NAÏVE BAYES ALGORITHM AND ITS WORKING

The Naive Bayes algorithm is a simple probabilistic classifier based on applying Bayes'
theorem with a strong assumption of independence between the features. Bayes' theorem
states that the probability of a hypothesis (in this case, a class label) given some observed
data is equal to the prior probability of the hypothesis multiplied by the likelihood of the data
given the hypothesis.

Here's how the algorithm works:

• First, the algorithm needs to be trained on a dataset. This dataset consists of records
labeled with the class they belong to. For example, if one tries to classify emails as spam,
the dataset might consist of several emails labeled as either spam or not.

• During training, the algorithm calculates the probability of each class (e.g., spam or not
spam) and the probability of each feature (e.g., a particular word occurring in an email)
given each class. For example, it might calculate the probability that an email is a spam,
given that the word "free" appears in it.

• Once the algorithm has been trained, it can classify new records. To classify a new record,
the algorithm calculates the probability of each class given the feature values of the
record. For example, it might calculate the probability that an email is spam given that it
contains the words "free" and "viagra."

• The class with the highest probability is then chosen as the predicted class for the record.

The assumption of independence between features is strong and is often not met in
real-world data. However, the algorithm works well in practice, especially when the
assumption is reasonable.


5. NAÏVE BAYES IMPLEMENTATION


The algorithm is called "naive" because it assumes that all the data’s features are
independent, which is not always accurate in real-world data. Despite this assumption, the
algorithm works well in practice and is often used for text classification tasks, such as spam
filtering.

Here is an example of how to implement a Naive Bayes classifier in Python for a spam filter:

Import the necessary libraries

Sample training and test data are created with the help of numpy array
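The original screenshots are not reproduced here; the sketch below uses small, made-up email strings and labels purely for illustration (the actual arrays used in the original example are not shown).

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical sample data: 1 = spam, 0 = not spam
train_emails = np.array([
    "win a free prize now",
    "free money claim your reward",
    "meeting agenda for tomorrow",
    "please review the attached report",
    "lowest price offer just for you",
])
train_labels = np.array([1, 1, 0, 0, 1])

test_emails = np.array([
    "claim your free reward now",
    "agenda for the project meeting",
])
test_labels = np.array([1, 0])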

The next step is to create a CountVectorizer() object, which refers to a class in the sci-kit-
learn library in Python that is used to convert a collection of text documents to a matrix of
token counts. It can encode documents as numerical feature vectors, which can be used in
machine learning models. CountVectorizer works by tokenizing the input text into individual
words and then creating a vocabulary of all the words in the text. It then counts the number
of occurrences of each word in each document and creates a matrix where each row
represents a document, and each column represents a word in the vocabulary. The resulting
matrix is known as a document-term matrix.

The next step is to fit and transform the training data.
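Continuing the sketch above, this step might look as follows:

vectorizer = CountVectorizer()
# Learn the vocabulary from the training emails and build the document-term matrix
X_train = vectorizer.fit_transform(train_emails)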


The fit method is used to estimate the parameters of a model that best fits the training data.
This is typically done by minimizing a loss function, which measures the difference between
the model's predictions and the true labels in the training data.

The transform method is then used to apply the estimated parameters to the training data
or to new data to create a new representation of the data. This new representation is often
more suitable for tasks such as classification or regression.

For example, in the case of CountVectorizer, the ‘fit’ method is used to learn the vocabulary
of the training data, and the transform method is used to create the document-term matrix
from the training data. The resulting document-term matrix can then be used as input to a
machine-learning model for classification or clustering.

The fit_transform method combines both the fit and transform methods into a single call,
which is often convenient when you want to fit a model to the training data and then
immediately transform the data.

The next step is to transform the test data.
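Continuing the sketch, the test data is transformed with the same fitted vectorizer:

# Reuse the vocabulary learned from the training data
X_test = vectorizer.transform(test_emails)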

The transformation that was fitted on the training data must also be applied to the test data,
using the transform method only (without refitting). Using the same transformation on both
the training and test data helps ensure that the model is evaluated fairly, as it is being tested
on data that has been transformed in the same way as

the training data. This helps prevent overfitting, which is the tendency of a model to perform
well on the training data but poorly on new, unseen data.

The next step is to create a Multinomial Naïve Bayes object.

The purpose of creating a Multinomial Naive Bayes object is to train a model on labeled
training data and use it to predict the class labels for new, unseen data. To create a
Multinomial Naive Bayes object, one must import the MultinomialNB class from the
sklearn.naive_bayes module and create an instance of the class.

The purpose of fitting the training data is to learn the patterns in the data that can be used
to make predictions on new, unseen data. Once the model is fitted to the training data, it can
then be used to make predictions on new emails that it has not seen before. The 'fit' method
requires two inputs: the training data and the training labels.
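Continuing the sketch, creating the classifier and fitting it on the training data might look like this:

clf = MultinomialNB()
clf.fit(X_train, train_labels)   # training data and training labels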


The final step is to check the accuracy of the model we have built; in our case, the accuracy
of the built model comes to 80%.
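Continuing the sketch, this final step might look as follows; note that the 80% figure above comes from the original (unshown) sample data, so the value obtained with the toy data used here will differ.

predictions = clf.predict(X_test)
print(accuracy_score(test_labels, predictions))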


6. CHALLENGES AND LIMITATIONS OF NAÏVE BAYES


There are several challenges and limitations to using the Naive Bayes algorithm:

• Assumption of independence: The algorithm assumes that all the data’s features are
independent, which is not always the case in real-world data. This can lead to poor
performance of the model.
• Sensitivity to irrelevant features: The algorithm is sensitive to the presence of irrelevant
features, which can lead to poor performance.
• Data requirements: The algorithm requires a large amount of data to make reliable
predictions. The model may only generalize well if the data is representative of the
population.
• Assumption of a normal distribution: The algorithm assumes that the features follow a
normal distribution, which may not always be the case.
• Limited to binary and categorical data: The algorithm is limited to binary and
categorical data and cannot be used with continuous data.
• Limited to linear decision boundaries: The algorithm is limited to linear decision
boundaries and may need to perform better on data with complex patterns.

Overall, while the Naïve Bayes algorithm is simple and efficient, it may not always be the best
choice for a given task due to its assumptions and limitations. It is always essential to
evaluate the algorithm’s performance on your specific data and choose the best model for
your needs.


7. VARIATIONS AND EXTENSIONS OF THE NAÏVE BAYES ALGORITHM


Several variations and extensions of the Naive Bayes algorithm have been developed to
address some limitations and improve its performance. Here are a few examples:

• Gaussian Naive Bayes: This algorithm variation assumes that the features follow a
normal distribution and are suitable for continuous data.
• Multinomial Naive Bayes: This variation is suitable for data distributed according to a
multinomial distribution and is commonly used for classification tasks such as text
classification.
• Complement Naive Bayes: This variation is designed to correct for the bias introduced
by imbalanced class distributions.
• Logistic Regression Naive Bayes: This variation combines the Naive Bayes algorithm
with logistic regression and is suitable for data with many features.
• Bayesian Network: This is an extension of the Naive Bayes algorithm that allows for
modeling dependencies between features.
• Kernel Naive Bayes: This extension allows the algorithm to model non-linear decision
boundaries using kernel functions.
• Bayesian network: A Bayesian network is a graphical model representing the
probabilistic relationships between different variables in a system. It can encode
probabilistic knowledge and predict the variables based on their dependencies and
available evidence. Bayesian networks generalize the naive Bayes algorithm and can
be used to model more complex systems with dependencies between variables.
• Markov chain Monte Carlo (MCMC) method: The MCMC method is a computational
algorithm that can estimate the posterior distribution of a set of random variables,
given some observed data. It works by sampling from the posterior distribution using
a Markov chain, a sequence of random variables that satisfies the Markov property (i.e.,
the probability of transitioning to the next state depends only on the current state).
MCMC can be used to make inferences about the posterior distribution in situations
where the distribution cannot be computed analytically, such as Bayesian networks
with many variables.


• Conditional Random Field (CRF): A CRF is a probabilistic graphical model that can label
or segment sequential data, such as text or time series. It is like a Markov model but
allows for incorporating features that capture dependencies between the output
variables and the input data. CRFs are often used in natural language processing tasks,
such as part-of-speech tagging and named entity recognition.
• Hidden Markov Model (HMM): An HMM is a statistical model representing a sequence
of observations generated by a Markov process with hidden (i.e., unobserved) states.
HMMs are commonly used in speech recognition and machine translation, as they can
capture the dependencies between the observed data and the underlying hidden states.
• Support Vector Machine (SVM): An SVM is a supervised learning algorithm that can be
used for classification and regression tasks. It works by finding the hyperplane in a
high-dimensional space that maximally separates the different classes in the training
data. SVMs are often used in text classification and image classification tasks.

The choice of which variation or extension of the Naive Bayes algorithm depends on the
data's characteristics and the task's specific requirements.


Self-Assessment Questions - 1

1. Naïve Bayes is a type of machine learning algorithm based on ____________


2. A ______________ probability is the probability of an event occurring, given that
another event has occurred.
3. The __________ classifier assumes that the features are normally distributed.
4. The _____________ classifier is used when the features are count data and is often
used in text classification tasks.
5. ______________classifier is like the multinomial naive Bayes but is used when the
features are binary variables.
6. ______________ classifier is like the multinomial naive Bayes but allows the class
prior probabilities to be adjusted dynamically.
7. The Naïve Bayes algorithm is called "naive" because it assumes that all the
data’s features are ________.
8. ___________ works by tokenizing the input text into individual words and then
creating a vocabulary of all the words in the text.
9. The Naïve Bayes algorithm is limited to binary and ________________.
10. __________________algorithm variation assumes that the features follow a normal
distribution and are suitable for continuous data.
11. ______________This is an extension of the Naive Bayes algorithm that allows for
modeling dependencies between features.
12. ______________method is a computational algorithm that can estimate the
posterior distribution of a set of random variables, given some observed data.
13. ______________is a probabilistic graphical model that can label or segment
sequential data, such as text or time series.


8. SELF-ASSESSMENT QUESTIONS - ANSWERS


1. Bayes' Theorem
2. Conditional
3. Gaussian Naïve Bayes
4. Multinomial Naïve Bayes
5. Bernoulli Naïve Bayes
6. Adaptive Naïve Bayes
7. Independent
8. CountVectorizer
9. Categorical data
10. Gaussian Naive Bayes
11. Bayesian Network
12. Markov chain Monte Carlo
13. Conditional Random Field

9. TERMINAL QUESTIONS

1. Brief about the concepts of Naïve Bayes.


2. Explain the various types of Naïve Bayes.
3. Outline the algorithmic working on Naïve Bayes.
4. List the various challenges and limitations of Naïve Bayes.
5. Discuss the variations of Naïve Bayes algorithm.

Terminal Questions – Answers

1. Refer to section “2”


2. Refer to section “3”
3. Refer to section “4”
4. Refer to section “6”
5. Refer to section “7”


Unit 9
Gradient Descent Algorithms
Table of Contents

SL No | Topic                                        | Fig No / Table / Graph | SAQ / Activity | Page No
1     | Introduction (1.1 Learning Objectives)       | -                      | -              | 3
2     | Gradient Descent with Linear Regression      | 1, 2                   | -              | 4-6
3     | Gradient Descent with Logistic Regression    | -                      | -              | 7
4     | Types of Gradient Descent                    | -                      | -              | 8-9
      |   4.1 Stochastic Gradient Descent            | -                      | -              |
      |   4.2 Mini-Batch Gradient Descent            | -                      | -              |
5     | Hyper Parameters                             | -                      | -              | 10
6     | Adaptive Gradient Descent Algorithms         | -                      | -              | 11
7     | Advantages and Disadvantages                 | -                      | -              | 12 - 13
8     | Code Demo                                    | -                      | 1              | 14 - 22
9     | Terminal Questions                           | -                      | -              | 23
10    | Self-Assessment Question Answers             | -                      | -              | 23
11    | Terminal Question Answers                    | -                      | -              | 23


1. INTRODUCTION
Gradient descent is an optimization algorithm used to minimize a function. It is a widely used
algorithm in machine learning and deep learning to optimize the parameters of a model.

Gradient descent typically works by iteratively adjusting the parameters of a function in the
direction that reduces the function's loss, error or cost. The algorithm starts with an initial
set of parameters, and in each iteration, it computes the gradient of the function with
respect to the parameters. The gradient is a vector that points in the direction of the steepest
increase of the function. The algorithm then adjusts the parameters in the opposite direction
of the gradient until a minimum of the function is obtained.

1.1 Learning Objectives

At the end of this unit, you will be able to

❖ Describe the concept behind gradient descent


❖ Explain how optimization of parameters is done in linear and logistic regression
❖ List and explain the hyperparameters of gradient descent algorithm
❖ Describe adaptive gradient descent technique
❖ Write a code to use the stochastic gradient descent implementation of scikit learn package
to predict a classification use-case


2. GRADIENT DESCENT WITH LINEAR REGRESSION

Let us consider a prediction using linear regression represented as y = wx + b. Here, w and b
are the coefficients of the function, otherwise referred to as the 'parameters' of the model.

Figure 1: Linear regression (source: https://en.wikipedia.org/wiki/Regression_analysis)

We can derive the best values of 'w' and 'b' using the gradient descent algorithm. The loss is
the mean of the squared differences between the actual and predicted values. Figure 2 plots
different values of 'w' and the loss corresponding to each value. As you can observe, the loss
curve is a convex function, and there is an ideal value of 'w' where the loss is minimum. The
goal of the gradient descent algorithm is to find this 'w' and reach the minimum loss point.
At each point on the graph, we can compute the gradient and use it to move towards this
minimum point.


Figure 2 : ‘w’ vs loss curve

Let us understand this concept in detail. In linear regression, the loss or the error is
computed as

E = (1/n) * Σ (y_i − ŷ_i)²,   where the sum runs over the n training examples.

Here, y_i is the actual value and ŷ_i is the predicted value. Substituting ŷ_i = w*x_i + b:

E = (1/n) * Σ (y_i − (w*x_i + b))²

The gradient of the loss is computed using the partial derivative approach. To find the
gradient, we need the partial derivatives with respect to w and b.

The gradient with respect to w is:

∂E/∂w = (1/n) * Σ 2 (y_i − (w*x_i + b)) * (−x_i)


      = (−2/n) * Σ x_i (y_i − ŷ_i)

Similarly, the gradient with respect to b is:

∂E/∂b = (−2/n) * Σ (y_i − ŷ_i)

Once the gradient is computed, it is used to move in the direction of the minimum. In other
words, we iteratively adjust 'w' and 'b' in the direction that reduces the error between the
predicted values and the true values.

Ideally, we could simply subtract the gradient from the current 'w' and 'b' values. However,
it is recommended to multiply the gradient by a learning rate, η, which controls the size of
each step and helps the algorithm converge reliably.

w_new = w_old − η * (gradient with respect to 'w')

b_new = b_old − η * (gradient with respect to 'b')

This is repeated several times until 'convergence' is reached, i.e., the minimum is obtained.
These values of 'w' and 'b' can then be considered the best values for the prediction function.
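A minimal sketch of these update rules in Python is shown below; the toy data, learning rate and number of iterations are assumptions chosen purely for illustration.

import numpy as np

# Hypothetical toy data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

w, b = 0.0, 0.0      # initial parameters
eta = 0.01           # learning rate
n = len(x)

for _ in range(5000):                                 # repeat until (approximate) convergence
    y_hat = w * x + b
    grad_w = (-2.0 / n) * np.sum(x * (y - y_hat))     # gradient with respect to w
    grad_b = (-2.0 / n) * np.sum(y - y_hat)           # gradient with respect to b
    w = w - eta * grad_w
    b = b - eta * grad_b

print(w, b)   # should be close to 2 and 1 for this data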


3. GRADIENT DESCENT WITH LOGISTIC REGRESSION

Similar to linear regression, gradient descent can be used to optimize the parameters of a
logistic regression model. The goal of logistic regression is to find the set of parameters that
maximizes the likelihood of the observed data, given the input features. Equivalently, the
cost function to be minimized is the negative log-likelihood, which is a function of the model
parameters.

In logistic regression, the log-likelihood function is often defined as the sum of the logarithms
of the conditional probabilities of the observed data, given the input features and the model
parameters.

Cost (the negative log-likelihood for one observation) = −y_i * log(ŷ_i) − (1 − y_i) * log(1 − ŷ_i)

Here, ŷ_i = 1 / (1 + e^(−(w*x_i + b)))

The partial derivative, or gradient, of this cost is computed with respect to w and b, as in the
previous case.

The gradient of the cost with respect to w is:

∂Cost/∂w = Σ (ŷ_i − y_i) * x_i

The gradient of the cost with respect to b is:

∂Cost/∂b = Σ (ŷ_i − y_i)

To find the best set of parameters, the gradient descent algorithm starts with an initial set of
parameters and iteratively updates them in the direction of the negative gradient of the cost
function, with the aim of reaching a minimum of the cost (equivalently, a maximum of the
likelihood).

In each iteration, the algorithm computes the gradient of the cost with respect to the
parameters and adjusts the parameters in the opposite direction of the gradient. The
algorithm stops when the parameters converge, or when a stopping criterion is met, such as
a maximum number of iterations or a small change in the parameters.
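As an illustration, the same iterative procedure for logistic regression can be sketched as
follows; the toy data, the learning rate and the iteration count are assumptions made only for
demonstration.

import numpy as np

# Illustrative sketch of gradient descent for logistic regression on toy
# 1-D data; data values, learning rate and iteration count are assumptions.
x = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])                # binary labels

w, b = 0.0, 0.0
eta = 0.05                                      # learning rate

for epoch in range(2000):
    y_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))  # sigmoid predictions
    grad_w = np.sum((y_hat - y) * x)            # gradient of the cost w.r.t. w
    grad_b = np.sum(y_hat - y)                  # gradient of the cost w.r.t. b
    w = w - eta * grad_w
    b = b - eta * grad_b

print(w, b)   # a positive w and a negative b separate the two classes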


4. TYPES OF GRADIENT DESCENT

Typically, gradient descent uses the entire training set to compute each update. This is
referred to as the Batch Gradient Descent algorithm. However, this can be computationally
expensive for large datasets. Therefore, there are variations of the gradient descent
algorithm:

1. Stochastic Gradient Descent


2. Mini-batch Gradient Descent

4.1 Stochastic Gradient Descent

The term "stochastic" refers to something that is random or determined by chance. In the
context of optimization algorithms like stochastic gradient descent (SGD), it means that the
algorithm makes updates to the model parameters based on random samples of the data
rather than the entire dataset.

In other words, instead of using the full dataset to compute the gradient of the cost function
at each iteration, SGD uses a subset of the data to estimate the gradient. This subset is
randomly chosen from the dataset at each iteration, leading to randomness or stochasticity
in the algorithm. The use of a random subset of the data at each iteration can introduce some
noise in the optimization process, but it also has the benefit of allowing the algorithm to
escape from local minima and converge to a better solution.

4.2 Mini-Batch Gradient Descent

Stochastic gradient descent (SGD) and mini-batch gradient descent are both variations of the
gradient descent algorithm. Stochastic gradient descent uses only one training example to
calculate the gradient at each step. It is the cheapest variant per update, but it is also the
noisiest and its convergence can be erratic. Mini-batch gradient descent, on the other hand,
uses a small subset of the training examples, typically between 10 and 1000, to calculate the
gradient at each step. It is less noisy than SGD and tends to converge more smoothly, and it
is still far more efficient per update than batch gradient descent, which uses the whole
dataset.

In other words, SGD uses only a single example, while mini-batch uses a subset of examples
smaller than the entire dataset. Theoretically, SGD is computationally cheaper per update
than mini-batch, but mini-batch gradient descent is less noisy and tends to converge faster.
It is also more common in practice, especially when dealing with large datasets. A code
sketch of the mini-batch variant is given below.
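The difference between the variants is essentially how many rows are used per update. The
sketch below is an illustrative addition (the synthetic data and hyperparameter values are
assumptions): setting batch_size to 1 gives SGD, a small value such as 32 gives mini-batch
gradient descent, and setting it to the full dataset size gives batch gradient descent.

import numpy as np

# Illustrative mini-batch gradient descent for simple linear regression;
# the synthetic data, learning rate and batch size are assumptions.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 0.5, 200)

w, b, eta, batch_size = 0.0, 0.0, 0.01, 32
n = len(x)

for epoch in range(100):
    idx = rng.permutation(n)                       # shuffle the data each epoch
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        y_hat = w * xb + b
        w -= eta * (-2 / len(xb)) * np.sum(xb * (yb - y_hat))
        b -= eta * (-2 / len(xb)) * np.sum(yb - y_hat)

print(w, b)   # approaches 2 and 1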


5. HYPER PARAMETERS

The main hyperparameters for the stochastic gradient descent (SGD) algorithm are:

1) Learning rate (also called step size or alpha or eta η ): Learning rate is the step size or
proportion of the gradient that is considered in the next iteration. A smaller learning
rate may result in a more accurate solution, but it will take longer to converge. A larger
learning rate may converge faster, but the solution may be less accurate.
2) Epoch (Number of iterations): This controls the number of times the algorithm will go
through the dataset. More iterations can result in a more accurate solution, but it will
take more time.
3) Mini-batch size: This controls the number of samples used in each iteration of the
algorithm. A larger mini-batch size results in a more stable gradient estimate, but it
requires more computational resources.
4) Regularization term: This controls the amount of regularization applied to the weights.
Regularization helps to prevent overfitting by adding a penalty term to the cost
function.
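As an illustration, this is roughly how these hyperparameters appear when configuring
scikit-learn's SGDClassifier; the values shown are arbitrary examples, and note that this
particular implementation processes one sample at a time, so a mini-batch size is not exposed
as a parameter.

from sklearn.linear_model import SGDClassifier

# Illustrative mapping of the hyperparameters above onto SGDClassifier;
# the values are examples only.
model = SGDClassifier(
    learning_rate='constant',  # keep the step size fixed across updates
    eta0=0.01,                 # learning rate (step size)
    max_iter=1000,             # number of epochs (passes over the data)
    alpha=0.0001,              # strength of the regularization term
    penalty='l2',              # type of regularization applied to the weights
)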


6. ADAPTIVE GRADIENT DESCENT ALGORITHMS

Adaptive gradient descent optimization algorithms are a class of optimization algorithms
that adjust the learning rate during training. Unlike traditional gradient descent, which uses
a fixed learning rate, they adapt the learning rate for each parameter as training progresses.
This often allows the algorithm to converge faster and to a better final solution.

Some examples of adaptive gradient descent optimization algorithms include:

Adagrad: This algorithm adapts the learning rate for each parameter based on the historical
gradient information.

RMSprop: This algorithm adapts the learning rate based on the historical gradient
information, similar to Adagrad, but also uses a moving average over the second moments of
the gradients.

Adam: This algorithm adapts the learning rate based on the historical gradient and second-
moment information, similar to RMSprop, but also has a bias correction mechanism. It
computes adaptive learning rates for each parameter by using running averages of the
second moments of the gradients; the term "Adam" is an acronym for "Adaptive Moment
Estimation".

These algorithms usually converge faster and more reliably than plain gradient descent with a fixed learning rate.

In this case, there may be additional hyperparameters like

1) Decay rate: This controls the rate at which the learning rate is reduced over time. This
can help the algorithm converge faster and avoid getting stuck in local minima.
2) Momentum: This controls the amount of previous gradient information that is
retained when updating the weights. A higher momentum can help the algorithm
converge faster, but it can also cause the algorithm to overshoot the optimal solution.
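For illustration, the core of the Adam update for a single parameter vector can be sketched as
follows; this is a simplified, assumed implementation (the gradient is taken as given), not
library code.

import numpy as np

# Simplified sketch of one Adam update step; 'grad' is assumed to be the
# gradient of the loss with respect to the parameters 'theta' at step t.
def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # running average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # bias correction for the second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v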


7. ADVANTAGES AND DISADVANTAGES

Gradient descent is a popular optimization algorithm that can be used for a wide range of
machine learning tasks, such as linear regression and logistic regression, and it is used
extensively when training neural networks.

SGD would be a good fit for training a large-scale linear regression model. Linear regression
is a supervised learning algorithm that is used to predict a continuous target variable based
on one or more input variables. The algorithm aims to find the best-fitting line that
minimizes the sum of the squared differences between the predicted and actual values.
When the dataset is very large and the number of features is high, it can be computationally
expensive to use other methods (such as the closed-form least-squares solution) to find the
best parameters. SGD, on the other hand, updates the model's parameters after each training
example, which makes it more computationally efficient.

Gradient descent, or one of its variations, is often used for deep learning models that have a
large number of parameters, since it can be computationally infeasible to use other methods.
In this case, SGD is a good choice for optimization as it is computationally efficient and can
handle large datasets and a large number of parameters.

There are several disadvantages of the gradient descent algorithm:

1) Local Minima: Gradient descent can get stuck in local minima, which are points in the
parameter space that are lower than the surrounding points, but not the global
minimum. This can happen if the function has multiple local minima or if the initial
parameters are not set to a good starting point.
2) Slow convergence: The convergence of the gradient descent algorithm can be slow,
especially if the function is highly non-linear or has a large number of parameters. The
learning rate, which controls the step size of the update, also affects the speed of
convergence. A large learning rate can cause the algorithm to overshoot the minimum,
while a small learning rate can cause the algorithm to converge too slowly.
3) Sensitivity to the learning rate: Gradient descent is sensitive to the choice of learning
rate. If the learning rate is too high, the algorithm may overshoot the minimum and
oscillate around it. If the learning rate is too low, the algorithm may converge too slowly.
4) Can be sensitive to the initialization: The initial values of the parameters can also affect
the performance of the algorithm. If the parameters are initialized to a poor starting
point, the algorithm may converge to a poor solution.


8. CODE DEMO

We will use the pre-processed ‘iris’ dataset that ships with the sklearn package to
demonstrate the stochastic gradient descent algorithm. This dataset consists of 3 different
types of irises (Setosa, Versicolour, and Virginica). The petal and sepal measurements of the
flowers are stored in a 150x4 numpy.ndarray. Using the petal and sepal length and width,
the type of iris can be predicted.

To begin with, the necessary libraries are imported. The implementation for the Stochastic
Gradient Descent algorithm for Classification is available in the SGDClassifier class. Besides
this, we will import the train_test_split method, metrics for measuring the model and the
GridSearchCV class.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score

from sklearn.metrics import classification_report

from sklearn.metrics import recall_score , precision_score , roc_auc_score ,roc_curve

from sklearn.metrics import confusion_matrix


Load the data into a dataframe and display the first five rows.

df = datasets.load_iris( as_frame=True).frame

df.head()

Output:

It is always good practice to check whether there are any missing values in the data. This is
done using the isnull() and sum() methods.

df.isnull().sum()

Output:

sepal length (cm) 0


sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
target 0
dtype: int64

Check the descriptive statistics of the data using the describe() method.

df.describe()


Output:

This displays the count, mean, standard deviation, minimum, maximum and the values at the
25th, 50th and 75th percentiles. This helps in understanding whether there is any skewness in the data.

We will use a pairplot from the seaborn library to display the relationship between each
pair of features. This step helps in visually understanding whether there is any relationship
between the features. We can also use the corr() method or a heatmap to do the same.

sns.pairplot(df)


Output:

We can see that the petal length and petal width have a positive correlation.

We can also use a 3D representation to understand the relationship between more than two
features. We can use the target variable to color code the points. This will help us to
understand the position of the target with respect to the features.


fig = plt.figure(figsize=(15,15))

ax = plt.axes(projection='3d')

# plot three different features, colour-coded by the target class
ax.scatter3D(df['sepal length (cm)'], df['sepal width (cm)'], df['petal length (cm)'],
             c=df['target']);

Output:


Once the EDA and Visualisation is done, we will have to split the data into features and
predictions. The features are stored in ‘x’ and the predicted variable in ‘y’. We split the data
into train and test subsets using the train_test_split() method.

# the four measurement columns as features, the 'target' column as the label
x = df.iloc[:, :4]

y = df.iloc[:, 4:]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

Initialize the model by instantiating the SGDClassifier with the default parameters.

model = SGDClassifier()

The loss hyperparameter of this model can be set to log loss or hinge loss; note that ‘hinge’ is
the loss used in SVMs. Besides this, the learning-rate schedule, the initial learning rate ‘eta0’
and the regularization strength ‘alpha’ can be configured. The candidate values for these
hyperparameters are provided as a parameter grid to GridSearchCV. The cross-validation is
set to 10 folds and the model is fit with the best hyperparameters.

from sklearn.model_selection import GridSearchCV

param_grid = dict({'loss': ['log', 'hinge'],
                   'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
                   'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
                   'eta0': [1, 10, 100]})

grid = GridSearchCV(model, param_grid, cv=10, scoring='accuracy',
                    return_train_score=False, verbose=1)

grid.fit(x_train, np.ravel(y_train))

The best parameters are stored in the attribute best_params_.


grid.best_params_

Output:

{'alpha': 0.001, 'eta0': 1, 'learning_rate': 'adaptive', 'loss': 'log'}

The data that is unseen by the model (x_test and y_test) can be used to evaluate it. The tuned
model is applied to x_test to produce the predictions y_pred, and the known values y_test are
compared against y_pred. The confusion matrix can be used to display the True Positives,
True Negatives, False Positives and False Negatives for the different classes or targets.

y_pred= grid.predict(x_test)

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test, y_pred)

cm

Output:

array([[ 9, 0, 0],
[ 0, 8, 2],
[ 0, 1, 10]], dtype=int64)

The above output shows that there were 2 + 1 = 3 misclassifications.

y_pred and y_test can be used to compute the accuracy score as shown below.

accuracy_score(y_test, y_pred)


Output:

0.9

The classification_report() function can be used to display the overall performance report as
shown below.

print ( classification_report(y_test, y_pred) )

Output:

The output above displays the precision, recall and f1 values.

Note the code provided above has not been completely optimized. This is just to demonstrate
how the implementations available in the scikit-learn library can be used to perform
predictions using the stochastic gradient descent algorithm.


Self-Assessment Questions – 1

1) Gradient descent is an _________ algorithm used to minimize a function


2) ____________ controls the amount of previous gradient information that is retained
when updating the weights
3) Give an example of adaptive gradient descent optimization algorithm
4) _______ optimization algorithms are a class of optimization algorithms that adjust
the learning rate during training.
5) _______ is the step size or proportion of the gradient that is considered in the
next iteration
6) A smaller learning rate may result in a ________ accurate solution
7) The plot of the coefficient and the loss of a linear regression problem is a _________
curve
8) We can obtain the gradient of a curve using the ________ approach.
9) In logistic regression, the loss is expressed in the form of ______________
10) One of the disadvantages of gradient descent algorithm is that it can get stuck
in a ____________.


9. TERMINAL QUESTIONS

1. Explain how the parameters of the linear equation are optimized using the gradient
descent algorithm.
2. Explain what hyperparameters are used in the gradient descent algorithm.
3. What are adaptive gradient descent algorithms? Name a few algorithms that use this
technique.
4. Name a few advantages and disadvantages of the gradient descent algorithm.

10. SELF-ASSESSMENT QUESTION ANSWERS

1) optimization
2) Momentum
3) Adagrad, RMSprop, Adam
4) Adaptive gradient descent
5) Learning rate
6) more
7) convex
8) partial differentiation
9) the negative log-likelihood (log loss)
10) local minimum

11. TERMINAL QUESTION ANSWERS

1. Refer section “Gradient Descent with Linear Regression”


2. Refer section “Hyper Parameters”
3. Refer section “Adaptive Gradient Descent Algorithms“
4. Refer section “Advantages and Disadvantages”


Unit 10
Ensemble Learning
Table of Contents

SL No  Topic                                                Fig No / Table / Graph   SAQ / Activity   Page No

1      Introduction                                         -                        -                3
       1.1 Learning Objectives                              -                        -

2      Ensemble Learning                                    1, 2, 3, 4, 5, 6, 7      -                4 - 12
       2.1 Tree-Based Ensemble Models                       -                        -
       2.2 Bagged Trees / Bagging / Bootstrap Aggregation   -                        -
       2.3 Characteristics of a Bagged Model                -                        -
       2.4 Random Forests                                   -                        -

3      Creating a Bagged Tree and Random Forest Model       -                        -                13 - 34
       3.1 Creating a Bagged Tree                           -                        -
       3.2 Creating Random Forest Tree                      -                        -

4      Tree-Based Ensemble Adaboost and Gradient Boosting   8                        -                35 - 36
       4.1 Data Re-Weighing Strategy                        -                        -

5      Gradient Boosting                                    9, 10, 11, 12, 13, 14    -                37 - 40
       5.1 Partial Dependence Plot                          -                        -

6      Creating Boosted Tree Ensemble                       -                        1                41 - 50

7      Self-Assessment Questions - Answers                  -                        -                51

8      Terminal Questions                                   -                        -                51

9      Terminal Questions - Answers                         -                        -                51


1. INTRODUCTION
Machine learning models can be made more accurate and robust using ensemble learning. It
combines the results of various models to create a more precise and consistent output. This
is especially helpful when dealing with complicated and ambiguous issues because not every
pattern in the data may be captured by a single model. Ensemble approaches can improve
performance on unobserved data by reducing overfitting and increasing the diversity of
solutions by integrating the strengths of various models.

1.1 Learning Objectives:

At the end of this unit, you will be able to:

❖ Explain the concept of Ensemble Learning and its importance.


❖ Explain the concept of Tree-Based Ensemble Models.
❖ Differentiate ADA Boost Model with Bagged Trees
❖ Implement Tree-Based Ensemble Models.


2. ENSEMBLE LEARNING

Ensemble learning is a machine learning technique that combines several trained models to
enhance performance on a specific task, such as classification or regression; in other words,
it combines individual models to improve stability and predictive power. This can be
accomplished by averaging the predictions of different model types, such as decision trees
and support vector machines, or by training several models on different subsets of the data.
Ensemble learning can be used to reduce overfitting and enhance a model's ability to
generalize.

Up to this point we have discussed single models: one regression model or one classification
model at a time. When we say that we are developing ensemble models, we mean that we
merge a number of simple models into one larger metamodel that we refer to as an
ensemble.

One can produce very effective models by employing an ensemble technique. Regression
ensembles and classification ensembles can both be built.

The final prediction of an ensemble model is a simple average of the predictions from all of
the simple regression models that make up this ensemble. Regression ensembles combine
numerous simple regressors. Similar to this, an ensemble classifier's final prediction is
determined by a majority vote of all the predictions given by the individual classifiers that
comprise this ensemble.

Figure 1 shows a schematic of how an ensemble works, with three models making up the
ensemble. The final prediction is either an average of the predictions, if the constituent
models are regressors, or a majority vote, if the constituent models are classifiers. An
ensemble can be built by combining any set of simple models, but the most popular base
models are tree models.


Figure 1: Schematic working of Ensembles

Decision tree models are frequently used as the basic model in ensembles. This is because
decision trees can handle both numerical and categorical data and are simple to interpret.

Another feature of the tree-based ensemble model is that each base learner only uses a
portion of the total data during training, and the way this data set is supplied to each of the
base learners is based on a data sampling technique. Different sampling techniques result in
different kinds of ensembles based on trees.

2.1 Tree-Based Ensemble Models

The three popular tree-based ensemble models are

(i) Bagged Trees


(ii) Random Forests
(iii) Boosted Trees

All these models differ from each other mainly in terms of the data sampling strategy they
use.

2.2 Bagged Trees / Bagging / Bootstrap Aggregation

Bagging is a technique in ensemble learning where many instances of a basic model (such as
decision trees) are trained on various subsets of the training data.


The subsets are produced by randomly choosing the training data. By adding randomness in
the training process, bagging aims to lower the variance of the base model.

This is accomplished by building several models, each trained on a separate subset of the
data. In other words, bagging is a homogeneous ensemble of weak learners, where multiple
instances of the same algorithm are trained on subsets of the dataset that are randomly
drawn from the training data. The ensemble's predictions are then averaged, or put to a vote,
to create a final prediction, as shown in figure 2.

Figure 2: Bagging Method

In the above figure, we can see how the bagging model works. The data has been split into a
training dataset and a test dataset, used for training and validation respectively. Multiple
models of the same algorithm are used, but each is trained on a different, randomly drawn
data set (also called a bootstrap sample), named DS1, DS2 and so on. Each dataset trains one
model: DS1 trains model M1, DS2 trains model M2, and so on, as shown in figure 2. Once the
models have been trained, they are combined into a final ensemble model M*, which is
stronger than any of the individual models M1, M2, etc., because its accuracy and predictive
power are higher and its error rate is lower. This can be verified using the test dataset.

For example, bagging is frequently used in combination with decision trees to produce
random forests, a powerful and popular machine learning approach.

Using tree-based models as base learners, bagged ensemble models reduce the in-sample
error. During training, the individual tree models in a tree-based ensemble can grow many
levels deep, which ensures that the details of the training data are thoroughly captured and
keeps the in-sample error low.

To reduce the out-of-sample error, every tree model receives resampled data. In the case of
bagged trees, each of the unpruned trees is fed a bootstrapped sample of the original data
set.

Figure 3: Bootstrapped Sampling to reduce out-of-sample error.

But, what is Bootstrapped Sampling?

Bootstrapped sampling simply refers to sampling with replacement. In figure 4 (a) below,
we can see a sample drawn with replacement: some observations are repeated in the sample
more often than they appear in the original data (notice that the blue and red dots appear
more often in the samples than in the original data). A small code sketch of this idea is given
after the list below.


Figure 4 (a): Bootstrapped Sample with the replacements Figure 4 (b): Bootstrapped
sample at a data level

In figure 4(b) we can see the bootstrapped sampling at the data level and you can observe
that some rows get repeated more often in the samples.

How does bootstrapping reduce the out-of-sample error?

Suppose we fit an unpruned decision tree to a training data set. The model will have a very
high out-of-sample error, because the test data can be quite different from the training data
and the unpruned tree overfits the training data.

Hypothetically, the error would be very small if we could fit our unpruned tree model to the
data of the entire population, because the model would then have seen every possible
variation and variety in the data. This hypothetical situation, in which we have access to data
on the entire population, is of course infeasible in practice.

The purpose of the Bootstrapped Sampling is to,

- Simulate the population data


- Realistic view of the data generation process
- Synthetically generate variation and variety in the training data itself.
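The idea of sampling with replacement can be illustrated with a small sketch; the toy
dataframe below is an assumption made purely for demonstration.

import numpy as np
import pandas as pd

# Toy data used only to illustrate bootstrapped (with-replacement) sampling.
df = pd.DataFrame({'x1': [10, 20, 30, 40, 50], 'x2': [1, 2, 3, 4, 5]})

rng = np.random.default_rng(42)
rows = rng.integers(0, len(df), size=len(df))   # row indices drawn with replacement
bootstrap_sample = df.iloc[rows]

print(bootstrap_sample)   # some rows appear more than once, others are left out (out of bag)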


2.3 Characteristics Of A Bagged Model

The bagged model is characterized by:

(i) The use of unpruned decision trees as base learners and it is trained using a different
random subset of the training data.
(ii) Use of bootstrapped sampling to create samples that are fed to each of the base
learners.
(iii) The final forecast is often made by averaging the predictions of all the individual
models or, in classification problems, by conducting a majority vote.
(iv) By averaging the results of numerous models, bagged models lower the likelihood of
overfitting and lower the variance of the final forecast.
(v) They are especially beneficial for decision tree-based models because they can lessen
the variance brought on by the models' high sensitivity to small differences in the
training set.
(vi) These models have minimal variation and are resistant to noise.

One of the peculiarities of a bagged tree ensemble is that, since the model is comprised of
100s of decision trees, such a model is not as interpretable as a linear model or a simple
decision tree. Having said that, there are some qualitative statements that one can still make
even if they are using an ensemble model like a bagged tree.

For example, one can always figure out which predictors are more important by looking at a
metric called variable importance.

Variable importance is computed by averaging or summing the improvement in a purity
metric, such as Gini or entropy for a classification model and RSS for a regression model,
over all the variables.

Since a bagged tree ensemble is made up of many tree models, we can determine the
importance of each variable in each of the constituent trees. This variable importance is
measured by tracking the decrease in the Gini metric and weighing this decrease
appropriately.


Bagged tree models are also parametrized: the user can specify the number of trees used to
build the ensemble, the depth of each tree, and the minimum number of observations per
node of a tree. These become the user-specified parameters, or hyperparameters.

These user-provided parameters are significant since they can result in a different ensemble
model depending on how they are specified.

For example, we can have a model with 100 trees and a depth of 4 or a model with 150 trees
and a depth of 3 or a model with 500 trees and a depth of 4.

Now, out of these models which model is to be chosen or which is best among these three
models?

This can be decided using K-fold cross-validation to get an estimate of the out-of-sample
error for each of these models.

The model with the lowest out-of-sample error is selected. The drawback is that K-fold
cross-validation is computationally very expensive.

For most tree-based ensembles, a cheaper way to estimate out-of-sample model performance
is to compute something known as the Out-of-Bag (OOB) error.

Out-of-Bag Error – In bootstrap sampling, each sample usually leaves out some observations
from the original data.

Figure (5): Out-of-Bag Error


For example, if we look at figure (5), in the sample supplied to the first tree, the second row
of the original data is left out. This happens with all the trees in the model. These left-out
observations are called out-of-bag observations, and each tree is used to make predictions
on its out-of-bag observations to arrive at an estimate of out-of-sample model performance.

On average, when bootstrapped sampling is done, roughly one-third (about 37%) of the
observations are out of the bag for any given tree. The benefit of the OOB error measure is
that it requires almost no additional work, because the out-of-bag observations are
identified as soon as the bootstrapped samples are taken.

2.4 Random Forests

This is another very popular tree-based ensemble model, and it is similar to bagged trees.

The only difference is the extra randomness added at each split: as with bagged trees, a
random bootstrapped sample is taken for each tree, but at each split of a tree only a random
subset of the features is considered.

Working Procedure:

Let’s consider we have a dataset with n variables.

• To build a random forest model we take one bootstrapped sample.


• Use the sample to build a decision tree model.
• For the first split of the decision tree, instead of choosing across all n features, we
randomly sample only 3 features (just as an example; in reality this number can be
different from 3) and, out of these three features, we decide which one is most suitable
for the split, as shown in figure 6.
• The same random sampling of features is repeated at every subsequent split. The
number of features sampled at each split is a hyperparameter that the user of the
algorithm specifies.

Like this n number of tree models can be created based on the user requirement.


Figure 6: Bootstrapped Sample in Random Forest

Similar to a bagged tree model, Random Forests also use tree models as their base learners.
As a result, one may extract variable importance and compute OOB error to gain an estimate
of the out-of-sample model performance for parameter tuning.

Parameters of random Forests:

1. Number of trees - determines how many decision trees make up the ensemble.
2. Depth of the tree models - determines the maximum depth to which each decision tree
is grown.
3. Minimum number of observations in a node - determines how many samples a node
must contain before it can be split further.
4. Number of features considered for each split - determines the maximum number of
features that are evaluated when choosing each split.

It is crucial to comprehend how these variables operate and how to modify them if necessary
because they can significantly affect the Random Forest model's performance and accuracy.


3. CREATING A BAGGED TREE AND RANDOM FOREST MODEL

Two popular ensemble learning techniques for classification and regression are bagged trees
and random forests.

3.1 Creating A Bagged Tree

1. First will import the necessary library i.e., OS module, Pandas module and pydotplus
module.

2. Load the required Data Set (Here we have used the hr.csv dataset) and read the dataset.

3. Use the head method to look at the first few observations of the data set.

The columns here correspond to the metrics that an HR department would be interested
in, including a column indicating whether a given employee has left the organization or
not.

Now, using this dataset, let us try to build a classification model that uses these features
to predict whether an employee will leave the organization or not.


4. Before building a model, let’s do the data audit just to check for any missing values in
the data.

5. Check the data types of the different columns to spot any discrepancies.

Here we have a column called “Sales” and now will look at the unique values of the Sales
column.


6. The “sales” column contains values like ‘sales’, ‘accounting’, ‘hr’, etc., which indicate the
department an employee belongs to, so the column has not been named appropriately.

So, let’s rename this column so that once we build the model, it will be easier for us to
interpret this variable.

Here we have renamed the column “sales” to “Dept”.

Similarly, will have a look at the variable/column called “salary” in our dataset.

Here we can observe that salary is a non-numeric variable.


7. Before we build our classifier model, let’s split our dataset into a predictor matrix as
well as a target column. Since we are predicting the column “left” in our original dataset,
so we will not include this column in the predictor matrix.

Now, will have a look at Predictor Matrix,

8. Here we can see there are some non-numeric columns. We would want to have a
numeric representation for these columns. One way to do that is to one hot encode
them or to create dummy variables out of them.

Let’s take a look at our predictor matrix after we have one-hot encoded the non-numeric
variables; as you can see, everything in our predictor matrix is now numeric.


9. Before building the model, let’s divide our data into testing and training components.

10. For building a bagging tree ensemble, we will need to import two classes, the first class
is called the “BaggingClassifier” class which is in the ensemble module, the second class
is the “DecisionTreeClassifier” class which is in the tree module.

Since we want to build a bagged tree ensemble, we will need the decision tree classifier
class so that we can specify that our base learner is a decision tree classifier.

Here BaggingClassifier is a general class and can use a variety of learners. Here we
specifically want to use the decision tree as a base learner, so we are importing the
decision tree classifier class as well.

11. Here we are instantiating an object of the BaggingClassifier class, and you can see that
for the base estimator parameter we are passing a decision tree classifier as the input.
You can also see there is a parameter called “oob_score”, which controls whether the
out-of-bag error will be computed or not. Here we are supplying a value of True, so the
out-of-bag error will be computed for whatever model is fitted on this object.

12. Let’s instantiate this object and let’s call the fit method to fit our bagging tree ensemble.


13. After we fit the model, let’s take a look at its out-of-bag error score.

Here the accuracy of OOB is 98%.

14. Let’s look at the accuracy of this model on our test data

Here also the accuracy is around 98%.
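The following is a consolidated sketch of steps 10–14; x_train, x_test, y_train and y_test are
assumed to be the splits created in step 9, and the parameter values shown are only
illustrative starting points. Depending on the scikit-learn version, the base-learner argument
is called base_estimator or estimator.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Sketch of steps 10-14: a bagged tree ensemble with OOB scoring enabled.
# x_train/x_test/y_train/y_test are assumed from step 9; values are illustrative.
bag_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                            n_estimators=100,
                            oob_score=True,      # compute the out-of-bag accuracy
                            n_jobs=-1,
                            random_state=42)
bag_clf.fit(x_train, y_train)

print(bag_clf.oob_score_)               # out-of-bag accuracy (step 13)
print(bag_clf.score(x_test, y_test))    # accuracy on the test data (step 14)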

15. Now we may want to do parameter tuning for some of the parameters in the bagging
tree classifier. One of the user-defined parameters is the number of estimators that we
would want to have in this ensemble.

The number of estimators in this scenario would mean the number of trees that we
would want to have in our bagging classifier estimator.

Now as discussed earlier, the out-of-bag error is a good proxy for the Out-of-sample
accuracy of ensemble tree models, so instead of using grid search CV, we will be running
a for loop to build a bagging tree classifier for different values of estimators and then
we will be reporting the out-of-bag error.


This helps us find a good value for the number of estimators.

After this for loop is run, it prints the OOB score corresponding to each number of trees
in our bagged tree classifier. Up to 150 estimators the out-of-bag score increases, and
beyond that we notice a slight dip or tapering in the score, so we can say that a good
value for the number of trees in our bagged tree classifier on this dataset is 150.
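A sketch of the tuning loop described in step 15 could look as follows; the candidate values
for the number of estimators are assumptions, and x_train/y_train are the splits assumed
earlier.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Sketch of step 15: compare OOB scores for different numbers of trees;
# the candidate values are illustrative and x_train/y_train are assumed.
for n_trees in [50, 100, 150, 200, 250]:
    clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                            n_estimators=n_trees,
                            oob_score=True,
                            n_jobs=-1,
                            random_state=42)
    clf.fit(x_train, y_train)
    print(n_trees, clf.oob_score_)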

16. Now we will build our bagged tree classifier again, this time with 150 trees.

So, we have fit a bagged tree classifier model with 150 trees. Now let’s talk about how
we can extract feature importance from this bagged tree classifier, since this classifier
object is made up of many trees.

17. Let’s see if we can list them all. This classifier, after fitting the models, will have an
attribute called “estimators_”.

It lists all the trees that are a part of this particular bagged model.

[DecisionTreeClassifier(random_state=1859118377),
 DecisionTreeClassifier(random_state=1559503733),
 DecisionTreeClassifier(random_state=1714796222),
 DecisionTreeClassifier(random_state=1907120008),
 ...
 DecisionTreeClassifier(random_state=149885973),
 DecisionTreeClassifier(random_state=1634624159)]

(The output lists all 150 decision trees in the ensemble; only the first few and the last entries are shown.)

18. We can get details of each tree as well.

let’s extract the first tree model as shown below.


This is the first tree model instance and within this tree model instance, there is an
attribute of feature importance.

So, these are the feature importance for the first tree of our bagged classifier.

Now we can loop through all the trees and get the feature importance of all the variables
across all the trees, as shown below.

19. After we get the feature importance of all the variables in all the trees, we can get a
mean estimate of each feature by just computing the mean of each feature across all
trees.

20. Now let’s convert our summarized feature importance measures into a Series object
and list the feature importances sorted in descending order of magnitude.
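A sketch of steps 17–20 could look as follows, assuming bag_clf is the fitted 150-tree bagged
classifier and x_train is the one-hot encoded predictor matrix from the earlier steps.

import numpy as np
import pandas as pd

# Sketch of steps 17-20: average the per-tree feature importances of the
# fitted bagged ensemble; bag_clf and x_train are assumed from earlier steps.
all_importances = [tree.feature_importances_ for tree in bag_clf.estimators_]
mean_importances = np.mean(all_importances, axis=0)

feat_imp = pd.Series(mean_importances, index=x_train.columns)
print(feat_imp.sort_values(ascending=False))   # most important predictors first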


From the above output we can see that satisfaction level seems to be the most
important variable followed by the last evaluation and time spent in the company.

We can also visually represent the feature importance that we have just computed.


3.2 Creating Random Forest Tree

Now, we will fit a Random Forest classifier on the same dataset.

1. To fit a random forest classifier, we will import the random forest classifier class from
the ensemble module

2. Once we have this class loaded, we can simply instantiate an object of this class with
some hyperparameters such as n_estimators, oob_score, n_jobs and random_state.
In this initial instantiation, we specify a random forest classifier with 80 trees;
oob_score controls whether the OOB accuracy metric will be computed or not.

3. So, let’s instantiate a random forest classifier object and let’s call the fit method to train
our initial model.

4. Now, let’s have a look at the out-of-bag error or out-of-bag accuracy score, it is around
99 %.

5. Again, we might want to figure out a good value for the number of estimators, so we
can run a simple grid search using a for loop, as shown below.


After running this loop, we can see the OOB score corresponding to each number of
estimators, i.e., different numbers of trees in the random forest ensemble.

Looking at the output, the OOB score increases until the number of estimators reaches
190, after which we see a slight dip in the score.

6. So, we can finalize a random forest classifier with 190 trees.


7. Now, let’s fit our model.

8. Now, we will look at its OOB score; it is around 99%.

9. For a random forest classifier, the scikit-learn API lets you extract the feature
importances directly, so you don’t have to run a loop, as shown below.

So, these are the feature importances for the different predictors in the data.

10. To make this more interpretable, we will convert it into a Series object and give each
value the name of the column to which that feature importance belongs.

11. Let’s arrange the feature importances in descending order,


and we can observe that our random forest classifier also suggests that the satisfaction
level is the most important predictor, followed by the number of projects, time spent in
the company, etc.

12. Now we will plot this to visually represent the feature importances.


4. TREE-BASED ENSEMBLE ADABOOST AND GRADIENT BOOSTING

Boosting is another ensemble technique that can use decision trees as base learners. Boosted
trees work differently as compared to bagged trees and random forests.

Instead of taking bootstrapped samples, boosted trees use a data re-weighing strategy to
build an ensemble.

Adaboost is a very popular boosting technique. Unlike bagged trees and random forests, the
tree models in a boosted ensemble aren’t grown very deep; mostly the trees are grown to a
depth of two or three levels.

4.1 Data Re-Weighing Strategy (in the context of Adaboost ensembles)

Let’s consider that we have a dataset and we are doing a classification task. In an Adaboost
ensemble, the sequence of events looks as follows.

The dataset is passed to a decision tree learner; usually this is a shallow tree, hardly two to
three levels deep.

This model is then used to score the dataset, and the model makes mistakes on some rows.
This is where the re-weighing comes into the picture: wherever the model makes a mistake,
that row is given more importance.

In figure 7, we can see that rows 1 and 2 have been mislabelled by the model, so these rows
are given more weight.

This re-weighted dataset is then passed to another decision tree model (say T2) which again
scores the dataset and wherever it makes a mistake, that row is given more importance. This
process is repeated many times in succession.


As mentioned earlier, the tree models will be shallow; they could even be stumps (single-split
trees). The final model is a weighted combination of these trees.

Figure 7: Data Re-weighing Strategy

The rationale of this re-weighing strategy is that we are making sure that each successive
tree pays more attention to the parts of the data that preceding trees have failed to correctly
predict. In this way, successive trees try to improve the error rate.
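For illustration, an Adaboost ensemble with shallow trees as base learners can be built as
sketched below; the HR training split (x_train, y_train, x_test, y_test) and the hyperparameter
values are assumptions, and the name of the base-learner argument (base_estimator or
estimator) depends on the scikit-learn version.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Sketch of an Adaboost ensemble built from decision stumps (depth-1 trees);
# x_train/y_train/x_test/y_test and the parameter values are assumptions.
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=100,
                             learning_rate=0.5,
                             random_state=42)
ada_clf.fit(x_train, y_train)
print(ada_clf.score(x_test, y_test))   # accuracy of the boosted ensemble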


5. GRADIENT BOOSTING

Gradient Boosting is another popular boosting technique.

Gradient Boosting working procedure:

Let us consider an example of regression.

Just like Adaboost, Gradient Boosting is an iterative algorithm.

Step 1: Suppose we have a dataset with one predictor and one target variable. We fit a simple
tree model to this data and obtain predictions. Notice that there is some error in our
predictions. This error is captured in the column of residuals, which are nothing but the
differences between the actual values of the target variable and the predicted values. We
now fit another tree model on these residuals.

Step 2: We apply the combination of the two tree models to the training dataset again and
obtain new predictions. This time our predictions have improved, as can be seen in the
residual column. We again fit a tree model to the residuals obtained from the combination of
tree 1 and tree 2.

We keep repeating this process many times and eventually end up with an ensemble of trees.
Gradient Boosting is a general ensemble framework; here we have discussed boosting with
decision trees as base learners, but other base learners can be used as well.
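The residual-fitting loop described above can be sketched by hand for a regression problem;
the synthetic data, tree depth, learning rate and number of rounds below are all assumptions
used only to illustrate the idea.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hand-rolled sketch of the gradient boosting idea for regression:
# each new shallow tree is fit to the residuals of the current ensemble.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, 200)

prediction = np.zeros_like(y)          # ensemble prediction, initially zero
learning_rate = 0.1
trees = []

for step in range(100):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)      # shallow base learner
    tree.fit(x, residuals)
    prediction += learning_rate * tree.predict(x)  # add the new tree's contribution
    trees.append(tree)

print(np.mean((y - prediction) ** 2))              # training error shrinks round by round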


Figure 8: Gradient Boosting

5.1 Partial Dependence Plot

The ensemble models discussed so far are not as interpretable as simple models such as
decision trees or linear models.

The ensembles do give us important predictors by computing variable importance measures.

For example, one might want to know if a given predictor positively or negatively impacts
the dependent variable.

One way to understand the relationship between a dependent variable and an independent
variable is to create a partial dependence plot. This plot helps in establishing the direction of
the impact of a predictor on the target variable.

One drawback of partial dependence plots is that only bivariate relationships can be
examined, so unearthing interaction effects can be challenging. Also, creating partial
dependence plots is computationally very expensive.

Depending upon which machine learning framework you are using, partial dependence plots
for the ensembles may or may not be supported.

Partial dependence plots tell us how the value of the target variable changes as the value of
a predictor variable is changed after considering the effect of all the other variables.


For example, figure 9 below shows three partial dependence plots.

Figure 9: Partial Dependence Plot

The x-axis represents the value of a predictor while the y-axis represents the value of a target
variable.

As can be seen in the first two plots, the value of the target variable doesn’t change while
changing the value of the predictor variable.

In the third plot, you can see that there is a positive relationship between the target variable
and the predictor variable.

Now, we will see how partial dependence plots can be created.

Let us consider figure 10(a), which shows what the predictor matrix looks like for our dataset.

Figure 10(a): Predictor Matrix; Figure 10(b): Sequence of events with respect to predictor 1;
Figure 10(c): Varied Prediction

We have predictor 1, predictor 2 and predictor 3 and we have their corresponding values.

To create a partial dependence plot for predictor 1, i.e., to examine the relationship between
predictor 1 and the target variable, the sequence of events is as shown in figure 10(b).


We first enumerate, for each unique value of predictor 1, all the possible combinations of values for predictors 2 and 3. Then, we pass each row (for example, the first row 1, 2, 3, then the second row and so on) to our model to obtain a prediction of the target variable (the last column in figure 10(b)).

We repeat this process for each unique value of the first predictor variable and then average the predictions that the model returns for the corresponding rows.

Now once we do that, we obtain a table shown in figure 10(c) which tells us how the
prediction varies as the values of the first predictor are changed, considering all the possible
values for the other predictors.

Now, this can be easily plotted in a bivariate scatterplot as shown in figure 11, to unearth the
relationship between the first predictor and the target variable.

Figure 11: Partial Dependence plot (bivariate Scatterplot)

Here we can observe that to create a partial dependence plot, we’ll have to look at all the
possible combinations of the other predictor variables as well.

This makes the whole process of finding out the partial dependence plots computationally
very expensive.
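To illustrate the procedure, the sketch below computes the partial dependence of a fitted model on one predictor by hand. Note that, instead of enumerating every combination of the other predictors, the common implementation averages over the rows actually observed in the data; the model, the column names p1, p2 and p3, and the grid size are assumptions made for the example.

# Sketch: partial dependence of a model on predictor 'p1', computed by hand.
# For each grid value, the p1 column is overwritten with that value, the model
# predicts for every row, and the predictions are averaged.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["p1", "p2", "p3"])
y = 2 * X["p1"] + rng.normal(size=200)             # toy target driven by p1

model = GradientBoostingRegressor().fit(X, y)

grid = np.linspace(X["p1"].min(), X["p1"].max(), 20)
partial_dependence = []
for value in grid:
    X_mod = X.copy()
    X_mod["p1"] = value                            # fix predictor 1 at this value
    partial_dependence.append(model.predict(X_mod).mean())

# 'grid' versus 'partial_dependence' can now be plotted as a bivariate scatterplot.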


6. CREATING BOOSTED TREE ENSEMBLE

1. First, we will import the necessary libraries, i.e., the os, pandas and pydotplus modules.

2. Load the required dataset (here we have used the hr.csv dataset) and read it into a dataframe.

3. Use the head method to look at the first few observations of the data set.

We are using a dataset collected by the HR department of a company; using these features, we will try to predict whether someone is likely to leave the organization or not.

4. Let's do a bit of a data audit to check whether there are any missing values.
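A minimal sketch of steps 1-4 is given below (the original code was shown as screenshots); the working-directory path is an assumption, while hr.csv is the file named above.

# Steps 1-4 (sketch): import libraries, load the HR dataset, preview it,
# and audit it for missing values.
import os
import pandas as pd
import pydotplus   # mentioned in step 1; used for tree visualisation

os.chdir(r"C:\path\to\data")    # assumed working directory containing hr.csv
hr = pd.read_csv("hr.csv")      # dataset collected by the HR department

print(hr.head())                # first few observations
print(hr.isnull().sum())        # count of missing values per column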


5. Now we will check the data types.

6. Next, we will look at the column called sales and check what unique values it contains.


It shows that this column needs to be renamed from sales to department.

7. We will also check the salary column in our dataset, and you can observe that this is not a numeric column but a categorical one (so we'll need to handle the non-numeric columns later on).
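Continuing the same sketch, steps 5-7 might look as follows; the column names sales and salary come from the text, and the rename reflects the observation that sales actually holds department names.

# Steps 5-7 (sketch): check data types, inspect the 'sales' column,
# rename it to 'department', and look at the categorical 'salary' column.
print(hr.dtypes)

print(hr["sales"].unique())                  # department names, not sales figures
hr = hr.rename(columns={"sales": "department"})

print(hr["salary"].value_counts())           # non-numeric, categorical values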

8. Now, let’s create a predictor matrix and the target vector.

9. Let’s look at our predictor matrix.


Here you can observe some non-numeric variables.

10. Let's one-hot encode them, i.e., create dummy variables out of them, and then again look at a snapshot of our data once the non-numeric variables have been one-hot encoded.

11. Now let’s split our data into test and training components.
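Steps 8-11 could be sketched as below; the target column name left and the 80/20 split are assumptions, since the text does not spell them out.

# Steps 8-11 (sketch): predictor matrix and target vector, one-hot encoding,
# and the train/test split.
from sklearn.model_selection import train_test_split

y = hr["left"]                       # assumed name of the target column
X = hr.drop(columns=["left"])        # predictor matrix

X = pd.get_dummies(X)                # dummy variables for department and salary

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)    # assumed split proportions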

12. Next, we will import the GradientBoostingClassifier class from the ensemble module.

13. Let’s instantiate the gradient-boosting classifier object.

While instantiating, this object accepts many parameters, some of which are hyperparameters of the gradient-boosting classifier.


One of the hyperparameters is n_estimators, which specifies how many trees we should have in our gradient-boosting classifier.

Here we are specifying a value of 80 (later on we will do a grid search to figure out what a good value for the number of trees in our gradient-boosting classifier could be).

14. Now let's call the fit method on our training data. This will fit the gradient-boosting classifier.

15. Let's see what the accuracy score on the test data is.

The accuracy on the test data is around 97%.
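Steps 12-15 are sketched below with the n_estimators value of 80 mentioned above; the remaining hyperparameters are left at their scikit-learn defaults.

# Steps 12-15 (sketch): import, instantiate, fit and score the classifier.
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=80, random_state=42)
gbc.fit(X_train, y_train)

print(gbc.score(X_test, y_test))     # accuracy on the test data (about 0.97 in the text)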

16. Now, to tune the hyperparameters of the gradient-boosted classifier, we will make use of GridSearchCV again.

Since no bootstrapped samples are taken when we build a gradient-boosted or AdaBoost kind of ensemble, we cannot use out-of-bag errors; hence, we again resort to GridSearchCV.

17. Now we will use the GridSearchCV API on the number of estimators (the grid of values is 60, 80, 100, 120, 140, 160).


18. Now let's see which is the best classifier, or best estimator, out of the grid of models that we have searched over.

So, the best model seems to be the one where the number of estimators is 160.

19. Let’s use this suggestion by our grid search to now create a classifier with 160
estimators or 160 trees and retrain our model.

20. Now, let's look at the score; the accuracy is again around 97%.
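Steps 16-20 might be implemented as in the sketch below; the grid of estimator counts comes from step 17, while the use of 5-fold cross-validation is an assumption.

# Steps 16-20 (sketch): grid search over n_estimators, inspect the best
# estimator, and retrain with the suggested number of trees.
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [60, 80, 100, 120, 140, 160]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5)        # cv=5 is an assumed choice
grid.fit(X_train, y_train)

print(grid.best_estimator_)                  # best model (160 trees in the text)

gbc = GradientBoostingClassifier(n_estimators=160, random_state=42)
gbc.fit(X_train, y_train)
print(gbc.score(X_test, y_test))             # accuracy again around 97%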


21. Let's look at feature importance. For the gradient-boosted classifier, the scikit-learn API has an attribute called feature_importances_, which returns an array of importance values, one per feature. We will convert this array into a series and assign names to these numbers so that we can see which importance value corresponds to which variable in our data.

22. Hence, we will convert this feature-importance array into a series object, assign the names and then arrange it in descending order to check which variables come out on top by feature importance.

Here, the satisfaction level is a top variable by feature importance, followed by the
number of projects, followed by average monthly hours and so on.

23. We can also visualize all of this; the feature-importance plot looks as shown below.
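Steps 21-23 can be sketched as follows; the horizontal bar chart is an assumed plotting style.

# Steps 21-23 (sketch): name and sort the feature importances, then plot them.
import pandas as pd
import matplotlib.pyplot as plt

importances = pd.Series(gbc.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False)
print(importances.head())        # satisfaction level, number of projects, ...

importances.plot(kind="barh")
plt.show()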


To build a partial dependence plot with our ensemble model:

1. Scikit-learn allows us to build partial dependence plots for gradient-boosted ensembles; it does not allow us to build partial dependence plots for random forests or bagged trees.
2. So, let's look at the scikit-learn API to build a partial dependence plot on the gradient-boosted classifier.

So, first we will import the plot_partial_dependence method from the partial_dependence module.

3. Next, we will call the plot_partial_dependence method, passing the model that we have just trained, the predictor matrix, and the variable(s) for which we want a partial dependence plot. (Remember that in a partial dependence plot, the y-axis is always the target variable and the x-axis is always the predictor.)


So, the first predictor in the dataset is satisfaction level, and the plot indicates that fewer people will leave the organisation as their satisfaction level increases.


Self-Assessment Questions - 1

1. _____________ is a machine learning technique that combines numerous trained models to enhance performance on a specific task.
2. ____________ models are frequently used as the base model in ensembles.
3. ____________ is a technique in ensemble learning where many instances of a
basic model (such as decision trees) are trained on various subsets of the
training data.
4. ___________have minimal variation and are resistant to noise.
5. ____________is computed by averaging or summing the improvement in a purity
metric such as Gini or Entropy for a classification model and RSS for a
regression model for all the variables.
6. Usually in bootstrap sampling, within each sample it is possible to leave out
some observations from the original data called _______________
7. ____________ is a very popular boosting technique
8. _____________is used to build an ensemble in boosted trees
9. Bootstrapped samples are used to build an ensemble in ___________ & ___________
10. __________ is an iterative algorithm.


8. SELF-ASSESSMENT QUESTIONS- ANSWERS

1. Ensemble Learning
2. Decision tree
3. Bagging
4. Bagged models
5. Variable importance
6. Out of Bag Error
7. Adaboost
8. Data re-weighing strategy
9. Bagged Trees and Random Forest
10. Gradient Boosting

9. TERMINAL QUESTIONS

1. Elucidate the working of the Ensemble Model.
2. Briefly explain Bagging with a neat diagram.
3. Define bootstrapped sampling and explain bootstrapped samples with replacement.
4. How does bootstrapping reduce out-of-sample error?
5. Briefly explain Out-of-Bag Error.
6. Elucidate bootstrapped sampling in Random Forest.
7. Briefly explain AdaBoost and Gradient Boosting.
8. Explain the Partial Dependence Plot with an example.

10. TERMINAL QUESTIONS- ANSWERS

1. Refer to Section 1
2. Refer to Section 2.1
3. Refer to Section 2.1
4. Refer to Section 2.1
5. Refer to Section 2.2
6. Refer to Section 2.3
7. Refer to Sections 4 & 5
8. Refer to Section 5.1


Unit 11
Validation Measures and Tuning of Models
Table of Contents

SL Topic Fig No / Table SAQ / Page No


No / Graph Activity

1 Introduction - -
3
1.1 Learning Objectives - -

2 Validation measures - -

2.1 Importance - - 4 - 14
2.2 Different Validation measures - -

2.3 Usage - -

3 Tuning Models - 1

3.1 Importance - - 15 - 21

3.2 Different Tuning models and its working - -

4 Summary - - 22

5 Self- Assessment Questions- Answers - - 23

6 Terminal Questions - - 23

7 Terminal Questions- Answers - - 23


1. INTRODUCTION

Validation measures are used to evaluate the performance of a model on a validation dataset,
which is a set of data that is separate from the training data and is used to gauge the model's
ability to generalize to unseen data. Common validation measures include accuracy, precision, recall, F1 score, and area under the receiver operating characteristic (ROC) curve. Validation
measures are important because they provide an objective way to evaluate the performance
of a model on unseen data, help to identify areas where the model can be improved, and can
provide different perspectives on the model's performance.

Tuning of models, also known as hyperparameter tuning, is the process of adjusting the
parameters of a model to optimize its performance on the validation set. This can be done
through techniques such as grid search and random search. It is an important step in the
model development process as it can significantly improve the performance of the model on
unseen data. Tuning models is important because it can help to improve the performance of
a model on unseen data, prevent overfitting, save computational resources, and reduce
variance of the model's performance.

1.1 Learning Objectives

At the end of this unit, students will be able to:

❖ Learn the basic concepts of validation measures and tuning models


❖ Interpret the process of calculating the validation measures and tuning models
❖ Discuss the importance and usage of validation measures and tuning models


2. VALIDATION MEASURES

Validation measures are used to evaluate the performance of a model on a validation dataset,
which is a set of data that is separate from the training data and is used to gauge the model's
ability to generalize to unseen data. These measures provide an objective way to quantify
how well a model can make predictions on unseen data. They are used to assess the quality
of a model and to identify areas where the model can be improved.

There are different validation measures for different types of problems, such as classification
and regression. For classification problems, common validation measures include accuracy,
precision, recall, F1 score, and area under the receiver operating characteristic (ROC) curve.
For regression problems, common validation measures include mean squared error (MSE)
and root mean squared error (RMSE).

It is important to choose the right validation measure for the problem at hand, as different
measures provide different perspectives on the model's performance. Additionally, it's also
important to have a clear understanding of the problem and the data before selecting the
validation measure.

2.1 Importance

Validation measures are important because they provide an objective way to evaluate the
performance of a model on a dataset that is separate from the training data. This is important
because a model that performs well on the training data may not necessarily perform well
on unseen data. By evaluating the model's performance on a validation dataset, we can get a
better idea of how well the model is likely to perform on new, unseen data.

Validation measures also help us to identify the areas in which the model is performing well
and the areas where it is not, which can help us to improve the model. For example, if a model
has a high accuracy but a low recall, this may indicate that the model is not identifying all the
positive cases in the dataset. This information can be used to adjust the model and improve
its overall performance.


Additionally, different validation measures can provide different perspectives on the model's
performance. For example, accuracy is an easy and straightforward measure to understand, but in some cases it may not be a good indicator of a model's performance. In an imbalanced dataset where the positive class is rare, the model can achieve a high accuracy by always predicting the majority class. In this case, metrics such as precision, recall, F1-score, AUC-ROC, AUC-PR and Log Loss can provide more insight into the model's
performance.

2.2 Different Validation Measures

a. Accuracy:

Accuracy is a commonly used validation measure for classification problems. It is defined as the proportion of correct predictions made by the model out of all predictions made. It can be calculated using the following formula:

Accuracy = (Number of correct predictions) / (Total number of predictions)

For example, let's say that a model is trained to classify images of animals as either cats
or dogs. The validation dataset consists of 100 images, of which 50 are cats and 50 are
dogs. The model makes the following predictions:

40 cats are correctly classified as cats

10 cats are misclassified as dogs

45 dogs are correctly classified as dogs

5 dogs are misclassified as cats

The accuracy of the model can be calculated as follows:

Accuracy = (40 + 45) / 100 = 0.85

This means that the model correctly classified 85% of the images in the validation
dataset.


It is important to note that accuracy can be a misleading measure in some cases, such as when the classes are imbalanced (i.e., one class has many more instances than
the other) or when the costs of false positives and false negatives are different. In these
cases, other measures such as precision, recall, F1 score, and area under the receiver
operating characteristic (ROC) curve may provide a more informative picture of the
model's performance.

b. Precision:

Precision is a validation measure for classification problems that is used to quantify the
proportion of true positive predictions made by a model out of all positive predictions
made. It can be calculated using the following formula:

Precision = (True Positives) / (True Positives + False Positives)

True positives (TP) are the number of instances that the model correctly predicted as
positive. False positives (FP) are the number of instances that the model predicted as
positive but are actually negative.

For example, let's say that a model is trained to classify images of animals as either cats
or dogs. The validation dataset consists of 100 images, of which 50 are cats and 50 are
dogs. The model makes the following predictions:

40 cats are correctly classified as cats

10 cats are misclassified as dogs

45 dogs are correctly classified as dogs

5 dogs are misclassified as cats

The precision of the model for cats can be calculated as follows:

Precision = (40) / (40 + 5) = 0.89

This means that the model correctly classified 89% of the instances it predicted as cats.


It is important to note that precision gives an idea of how reliable the classifier is when
it predicts the positive class. High precision indicates that the classifier has a low false
positive rate and is unlikely to label a negative instance as positive. However, precision
alone does not give the full picture and it's usually combined with recall, which is
another measure.

c. Recall:

Recall, also known as sensitivity or true positive rate, is a validation measure for
classification problems that is used to quantify the proportion of true positive
predictions made by a model out of all actual positive instances. It can be calculated
using the following formula:

Recall = (True Positives) / (True Positives + False Negatives)

True positives (TP) are the number of instances that the model correctly predicted as
positive. False negatives (FN) are the number of instances that the model predicted as
negative but are positive.

For example, let us say that a model is trained to classify images of animals as either
cats or dogs. The validation dataset consists of 100 images, of which 50 are cats and 50
are dogs. The model makes the following predictions:

40 cats are correctly classified as cats

10 cats are misclassified as dogs

45 dogs are correctly classified as dogs

5 dogs are misclassified as cats

The recall of the model for cats can be calculated as follows:

Recall = (40) / (40 + 10) = 0.8

This means that the model correctly identified 80% of the cats in the validation dataset.


It's important to note that recall gives an idea of how many of the total actual positive
instances are captured by the classifier. High recall indicates that the classifier has a
low false negative rate and can identify most of the actual positive instances. However,
recall alone does not give the full picture and it's usually combined with precision,
which is another measure.

d. F1 Score:

The F1 score is a harmonic mean of precision and recall and is a commonly used
validation measure for classification problems. The F1 score is a single metric that
represents the balance between precision and recall and is defined as:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Where Precision is the ratio of correctly predicted positive observations to the total
predicted positive observations, and Recall is the ratio of correctly predicted positive
observations to all observations in actual class.

For example, let's say that a model is trained to classify images of animals as either cats
or dogs. The validation dataset consists of 100 images, of which 50 are cats and 50 are
dogs. The model makes the following predictions:

40 cats are correctly classified as cats

10 cats are misclassified as dogs

45 dogs are correctly classified as dogs

5 dogs are misclassified as cats

The precision and recall of the model for cats can be calculated as follows:

Precision = (40) / (40 + 5) = 0.89

Recall = (40) / (40 + 10) = 0.8

The F1 Score of the model for cats can be calculated as:


F1 Score = 2 * (0.89 * 0.8) / (0.89 + 0.8) = 0.84
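These measures can also be obtained directly from scikit-learn. The sketch below reproduces the cats-and-dogs example above, treating cat as the positive class (1 = cat, 0 = dog).

# Sketch: accuracy, precision, recall and F1 for the cats/dogs example.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = np.array([1] * 50 + [0] * 50)                 # 50 cats, 50 dogs
y_pred = np.array([1] * 40 + [0] * 10 +                # cats: 40 right, 10 wrong
                  [0] * 45 + [1] * 5)                  # dogs: 45 right, 5 wrong

print(accuracy_score(y_true, y_pred))    # 0.85
print(precision_score(y_true, y_pred))   # ~0.89
print(recall_score(y_true, y_pred))      # 0.80
print(f1_score(y_true, y_pred))          # ~0.84
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted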

e. Area Under the Receiver Operating Characteristic (ROC) Curve:

The Receiver Operating Characteristic (ROC) curve is a graphical representation of the performance of a binary classifier. The ROC curve plots the True Positive Rate (TPR)
against the False Positive Rate (FPR) at various threshold settings. TPR is also known
as Sensitivity or Recall, and it is calculated as the number of true positive predictions
made by the model divided by the total number of actual positive instances. FPR is the
number of false positive predictions made by the model divided by the total number of
actual negative instances.

The Area Under the Receiver Operating Characteristic (ROC) Curve (AUC-ROC) is a
single scalar value that summarizes the overall performance of a binary classifier. The
AUC-ROC value ranges from 0 to 1, with 1 indicating a perfect classifier and 0.5
indicating a classifier with no predictive power.

f. Confusion Matrix

A confusion matrix is a table that is used to evaluate the performance of a binary classifier. It summarizes the true positive (TP), false positive (FP), true negative (TN),
and false negative (FN) predictions made by the classifier. The confusion matrix is
typically represented in the form of a 2x2 table.

Consider an example of a confusion matrix for a binary classifier:

Actual Positive Actual Negative

Predicted Positive TP FP

Predicted Negative FN TN

The terms TP, FP, TN, and FN can be defined as follows:

True Positive (TP): The number of instances correctly classified as positive.

False Positive (FP): The number of instances incorrectly classified as positive.


True Negative (TN): The number of instances correctly classified as negative.

False Negative (FN): The number of instances incorrectly classified as negative.

The confusion matrix can be used to compute various performance metrics, such as
accuracy, precision, recall, and F1-score.

g. R2/ Adjusted R2:

R2 (Coefficient of Determination) and Adjusted R2 are metrics used in regression analysis to evaluate the goodness of fit of a model.

R2 is defined as the proportion of variance in the dependent variable (y) that is predictable from the independent variables (X). It ranges from 0 to 1, where 1 means
that all the variance in the dependent variable can be explained by the model and 0
means that the model cannot explain any variance in the dependent variable. R2 is
calculated using the below formula:

R2 = 1 - (sum of squared residuals / total sum of squares)

Adjusted R2 considers the number of independent variables in the model and adjusts
R2 to penalize models that include too many independent variables. It is defined as:

Adjusted R2 = 1 - ( (1 - R2) * (n - 1) / (n - k - 1) )

where n is the sample size and k is the number of independent variables in the model.

In general, a higher R2 or Adjusted R2 value indicates a better fit of the model to the
data. However, R2 has some limitations. For example, R2 increases as the number of
independent variables increases, even if the new variables do not provide any
significant improvement in the fit of the model. This is where Adjusted R2 comes into
play.

h. AUC-PR (Precision-Recall Curve):

The Precision-Recall (PR) curve is a graphical representation of the performance of a binary classifier. The PR curve plots the Precision against Recall (also known as Sensitivity or True Positive Rate) at various threshold settings. Precision is defined as the number of true positive predictions made by the model divided by the sum of true positive and false positive predictions. Recall is defined as the number of true positive predictions made by the model divided by the total number of actual positive instances.

The Area Under the Precision-Recall (AUC-PR) curve is a single scalar value that summarizes the overall performance of a binary classifier. The AUC-PR value ranges from 0 to 1, with 1 indicating a perfect classifier; for a classifier with no predictive power, the AUC-PR is approximately equal to the proportion of positive instances in the data.

i. Mean Squared Error (MSE):

Mean Squared Error (MSE) is a commonly used regression evaluation metric. It measures the average of the squared differences between the predicted and actual values for a set of data. MSE is defined as:

MSE = (1/N) * sum((y_pred - y_true)^2)

where N is the number of data points, y_pred is the predicted value, and y_true is the
actual value.

The MSE value provides information about the magnitude of the error between the predicted and actual values. A lower MSE value indicates a better fit of the model to the data, while a higher MSE value indicates a poor fit.

j. Log Loss

Log Loss, also known as cross-entropy loss, is a commonly used evaluation metric for
classification models, especially for models that predict probabilities of class
membership. Log Loss measures the difference between the predicted probabilities
and the true class labels in a logarithmic scale.

The Log Loss is defined as:


Log Loss = -(1/N) * sum(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))

where N is the number of data points, y_pred is the predicted probability of the positive
class, and y_true is the binary indicator of the actual class (1 for positive, 0 for negative).

The Log Loss value ranges from 0 to infinity, with a lower value indicating a better fit
of the model to the data, and a higher value indicating a poor fit. A Log Loss value of 0
indicates a perfect fit of the model to the data, with the predicted probabilities exactly
matching the true class labels.

k. Root Mean Squared Error (RMSE)

Root Mean Squared Error (RMSE) is a commonly used evaluation metric for regression
models. It measures the average magnitude of the differences between the predicted
and actual values for a set of data. RMSE is defined as the square root of the Mean
Squared Error (MSE), as follows:

RMSE = sqrt(MSE) = sqrt((1/N) * sum((y_pred - y_true)^2))

where N is the number of data points, y_pred is the predicted value, and y_true is the
actual value.

The RMSE value provides information about the magnitude of the error between the
predicted and actual values in the same units as the target variable. A lower RMSE value
indicates a better fit of the model to the data, while a higher RMSE value indicates a
poor fit.
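The regression metrics above, together with R2 and adjusted R2 from earlier, can be computed as in the sketch below; the numbers and the assumed count of two predictors are purely illustrative.

# Sketch: MSE, RMSE, R2 and adjusted R2 on illustrative regression data.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 10.6])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

n, k = len(y_true), 2                       # k = assumed number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mse, rmse, r2, adj_r2)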

l. Cross Validation:

Cross-validation is a technique used to evaluate the performance of a machine learning model on unseen data. The goal of cross-validation is to get an estimate of the model's performance that is not biased by the choice of training and testing data.

There are several types of cross-validation, but the most common one is k-fold cross-
validation. In k-fold cross-validation, the original data is partitioned into k subsets of
equal size. The model is trained on k-1 subsets and evaluated on the remaining subset.


This process is repeated k times, with each subset being used as the validation set
exactly once. The performance metric is calculated as the average of the performance
metrics across the k folds.

Cross-validation is a time-consuming process, especially for complex models and large datasets. However, it provides a more robust estimate of the model's performance and
helps to prevent overfitting, which is when a model is too complex and fits the training
data too well, but performs poorly on unseen data.
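A minimal k-fold cross-validation sketch is shown below; the built-in breast-cancer dataset, the logistic-regression model and k = 5 are illustrative choices.

# Sketch: 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)          # one accuracy value per fold
print(scores.mean())   # average performance across the 5 folds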

2.3 Usage of Validation Measures

Validation measures are used to evaluate the performance of machine learning models. The
choice of validation measure depends on the type of problem being solved, such as
classification or regression, and the desired properties of the model. Here are some of the
common use cases for different validation measures:

Accuracy: Used for classification problems to measure the proportion of correct predictions
made by the model.

Precision: Used for classification problems where the goal is to minimize false positive
errors. Precision measures the proportion of true positive predictions among all positive
predictions.

Recall: Used for classification problems where the goal is to minimize false negative errors.
Recall measures the proportion of true positive predictions among all actual positive
instances.

F1 Score: Used for classification problems where the goal is to balance precision and recall.
The F1 score is the harmonic mean of precision and recall.

AUC-ROC (Receiver Operating Characteristic): Used for classification problems to measure the model's ability to distinguish between positive and negative classes. AUC-ROC is the area
under the ROC curve, which plots the true positive rate against the false positive rate.


Confusion Matrix: Used for classification problems to summarize the performance of the
model and to provide a more detailed view of the true positive, true negative, false positive,
and false negative predictions.

R2 and Adjusted R2: R2 and Adjusted R2 are useful metrics for evaluating the fit of regression
models, but they should be used together with other metrics, such as residual plots and
cross-validation, to get a complete picture of the performance of the model.

AUC-PR (Precision-Recall Curve): Used for classification problems when the positive class is
rare or the class imbalance is high. AUC-PR is the area under the Precision-Recall curve,
which plots the precision against recall.

Mean Squared Error (MSE): Used for regression problems to measure the average magnitude
of the difference between the predicted and actual values.

Log Loss: Used for classification problems with multiple classes or for models that predict
probabilities. Log loss measures the accuracy of the predicted probabilities relative to the
actual class labels.

Root Mean Squared Error (RMSE): Used for regression problems to measure the average
magnitude of the difference between the predicted and actual values in the same units as the
target variable.

Cross Validation: The main use of cross-validation is to provide an estimate of a model's performance on unseen data, which can be used to select the best model among several
candidates. It also provides an estimate of the uncertainty of the performance estimate,
which is important for making informed decisions about model selection and deployment.


3. TUNING MODELS

3.1 Importance

Tuning models is important because it helps to optimize model performance by finding the set of hyperparameters that results in the best prediction accuracy. Without tuning, models may underperform or overfit the training data, leading to poor generalization on unseen data. Tuning also helps to ensure that the model is not overcomplicating the problem, improving its ability to generalize to new data. By finding the best hyperparameters, model tuning can additionally improve training speed and reduce the risk of overfitting.

3.2 Different Tuning models

a. Grid Search:

Grid search is a method of model tuning in which a range of hyperparameters is defined and a search is conducted over all possible combinations of these hyperparameters. For
each combination of hyperparameters, the model is trained and evaluated, and the
hyperparameter set that results in the best performance is selected. Grid search can be
computationally expensive, especially for models with many hyperparameters or a
large search space, but it is a simple and effective method for finding good
hyperparameters for a given model. Grid search can also be combined with other tuning
methods such as random search or be guided by prior knowledge or experience to
improve its efficiency.

Here is the process of grid search tuning:

Define the hyperparameter search space: The first step is to specify the
hyperparameters that will be searched over, along with their respective ranges.

Train the model for each combination of hyperparameters: The next step is to train the
model for each combination of hyperparameters in the search space. This can be done
using cross-validation to obtain an estimate of the model's performance.


Evaluate the model performance: For each combination of hyperparameters, the performance of the model is evaluated using a pre-defined metric, such as accuracy or F1 score.

Select the best hyperparameters: The combination of hyperparameters that results in the best performance is selected as the final set of hyperparameters for the model.

Refit the model using the best hyperparameters: Finally, the model is refitted using the
best hyperparameters on the entire training dataset. This will be the final model that
can be used for making predictions on new data.

Note that grid search can be computationally expensive, especially for large search
spaces, so it's important to keep the number of combinations to a minimum.
Alternatives such as random search and Bayesian optimization can be used to find
optimal hyperparameters more efficiently.
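A hedged sketch of this process using scikit-learn's GridSearchCV is given below; the random-forest estimator and the hyperparameter ranges are assumptions made for illustration.

# Sketch: grid search over two random-forest hyperparameters with 5-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_estimators": [50, 100, 200],     # assumed search space
              "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)                                   # trains every combination

print(search.best_params_)                         # best hyperparameters
print(search.best_score_)                          # their cross-validated score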

b. Random Search:

Random search is another method for hyperparameter tuning in machine learning. It involves randomly sampling hyperparameters from a predefined distribution, training
the model with the selected hyperparameters, and evaluating the model's performance.
The process is repeated multiple times until a satisfactory set of hyperparameters is
found.

Here is the process of random search tuning:

Define the hyperparameter search space: The first step is to specify the
hyperparameters that will be searched over, along with their respective distributions.

Sample hyperparameters randomly: The next step is to randomly sample hyperparameters from the defined distributions. This can be done multiple times to obtain multiple combinations of hyperparameters.

Train the model for each combination of hyperparameters: The model is trained for
each combination of hyperparameters, using cross-validation to obtain an estimate of
the model's performance.


Evaluate the model performance: The performance of the model is evaluated for each
combination of hyperparameters, using a pre-defined metric such as accuracy or F1
score.

Select the best hyperparameters: The combination of hyperparameters that results in the best performance is selected as the final set of hyperparameters for the model.

Refit the model using the best hyperparameters: Finally, the model is refitted using the
best hyperparameters on the entire training dataset. This will be the final model that
can be used for making predictions on new data.

Random search is computationally more efficient than grid search, as it only requires
evaluating a smaller number of combinations of hyperparameters. It also has the
advantage of exploring the search space more thoroughly, as it does not limit the search
to a pre-defined grid of values. However, random search may still be computationally
expensive for large search spaces, in which case other methods such as Bayesian
optimization can be used.

c. Bayesian Optimization:

Bayesian Optimization is a global optimization method for finding the maximum or minimum of an unknown function. It is mainly used in situations where the function is expensive to evaluate or has a noisy output.

The working process of Bayesian Optimization involves the following steps:

Initialization: Start with a small set of initial points to evaluate the function.

Modeling: Build a probabilistic model of the function based on the observed values at
the initial points.

Acquisition Function: Use the model to select the next point that is most likely to
improve the current best observation.

Evaluation: Evaluate the function at the selected point and update the model with the
new observation.


Repeat: Repeat steps 3 and 4 until a stopping criterion is met.

The probabilistic model used in Bayesian Optimization can be a Gaussian Process (GP)
or a Tree-structured Parzen Estimator (TPE). The acquisition function balances
exploration and exploitation, allowing the algorithm to choose points that are likely to
be high-performing while also exploring uncharted regions of the space.

Bayesian Optimization has been applied to various fields, including machine learning
hyperparameter tuning, robotic control, and chemical design.

d. Evolutionary algorithms:

Evolutionary algorithms are a family of optimization algorithms inspired by the process of natural selection. They are used to find the optimal solution to a problem by mimicking the process of evolution in nature.

The working process of an evolutionary algorithm typically consists of the following steps:

Initialization: Generate an initial population of candidate solutions.

Evaluation: Evaluate the fitness of each candidate solution.

Selection: Choose the best solutions (parents) from the current population to generate
the next generation.

Variation: Apply genetic operators such as mutation and crossover to generate new
solutions (offspring) from the selected parents.

Evaluation: Evaluate the fitness of the new solutions.

Repeat: Repeat steps 3-5 until a stopping criterion is met, such as reaching a maximum
number of generations or finding a satisfactory solution.

The choice of genetic operators, selection methods, and stopping criteria can significantly affect the performance of the evolutionary algorithm. Some popular variations of evolutionary algorithms include genetic algorithms, differential evolution, and particle swarm optimization. These algorithms have been applied to various fields, including optimization, machine learning, and engineering design.

e. Gradient-based optimization:

Gradient-based optimization is a family of optimization algorithms that use the gradient of the objective function to iteratively improve the solution. The gradient is a vector that points in the direction of the steepest increase in the function, and the optimization algorithm uses this information to move towards the optimum.

The working process of a gradient-based optimization algorithm typically consists of the following steps:

Initialization: Choose an initial guess for the solution.

Evaluation: Evaluate the objective function and its gradient at the current solution.

Update: Use the gradient information to update the current solution in the direction
that decreases the objective function. This step may involve choosing a step size or
learning rate, which determines the size of the update.

Repeat: Repeat steps 2 and 3 until a stopping criterion is met, such as reaching a certain
number of iterations or a satisfactory solution.

Gradient-based optimization algorithms include gradient descent, conjugate gradient, BFGS, and L-BFGS. These algorithms have been widely used in machine learning,
particularly for optimizing neural networks, and in various other fields, including
engineering design and optimization. The choice of optimization algorithm and its
parameters can significantly affect the performance and convergence of the
optimization process.

f. Hyperband:

Hyperband is an algorithm for performing a large number of expensive trials of a hyperparameter optimization problem. It works by dividing the total budget for hyperparameter tuning into smaller, overlapping brackets and only keeping the best models from each bracket for further optimization. This way, it avoids wasting resources on poorly performing models and focuses more on the most promising ones. The algorithm uses early stopping and a successive halving approach to speed up the process. The basic steps of Hyperband include:

Selecting the initial number of trials

Dividing the trials into brackets

Selecting the models with the best performance from each bracket

Halving the number of trials for each subsequent bracket

Repeating the process until the final bracket is reached

Selecting the best model from the final bracket as the output.

Hyperband has been shown to be faster and more efficient than traditional grid search
and random search methods for hyperparameter tuning.

g. Optuna:

Optuna is a Python library for hyperparameter optimization. It automates the process of tuning machine learning models by searching for the best set of hyperparameters that optimize a given objective.

The working process of Optuna can be summarized as follows:

Define the objective function: The objective function is a function that takes in
hyperparameters and returns a score representing how well the model performs with
those hyperparameters.

Specify the hyperparameter search space: The search space defines the range of values
that the hyperparameters can take. Optuna supports a variety of search distributions,
including uniform, log-uniform, and categorical.


Run the optimization: Optuna uses a variety of algorithms to search for the best set of
hyperparameters. These algorithms include random search, grid search, Bayesian
optimization, and gradient-based optimization.

Obtain the best set of hyperparameters: Optuna returns the best set of
hyperparameters that optimize the objective function. These hyperparameters can
then be used to train the final machine learning model.

Optuna also provides a number of features to help manage the optimization process,
such as early stopping, resuming interrupted runs, and parallel execution.
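A minimal Optuna sketch is shown below; the objective function, search space and number of trials are assumptions, and the optuna package must be installed separately.

# Sketch: tuning a random forest with Optuna (assumes `pip install optuna`).
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Search space: each call suggests one candidate set of hyperparameters.
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 2, 10)
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

print(study.best_params)   # best set of hyperparameters found
print(study.best_value)    # its cross-validated accuracy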

Self-Assessment Questions - 1

1. ___________ measures are used to evaluate the performance of a model.


2. Tuning of models, also known as ___________
3. Accuracy is a commonly used validation measure for ___________ problems
4. The ___________ is a harmonic mean of precision and recall
5. A ____________ is a table that is used to evaluate the performance of a binary
classifier.
6. __________ measures the difference between the predicted probabilities and the
true class labels in a logarithmic scale.
7. Root Mean Squared Error (RMSE) is a commonly used evaluation metric for
__________ models.
8. __________ is an algorithm for performing a large number of expensive trials of
a hyperparameter optimization problem.
9. _____________ automates the process of tuning machine learning models by
searching for the best set of hyperparameters.
10. Bayesian Optimization is a ______________ tuning method.


4. SUMMARY

Validation measures and tuning models are essential components in the machine learning
process that help ensure the quality of the model predictions and improve the model
performance.

Validation measures are metrics used to evaluate the performance of a machine learning
model on a validation set, separate from the training data. Examples of common validation
measures include accuracy, precision, recall, F1-score, and AUC.

Tuning models is the process of adjusting the hyperparameters of a machine learning model
to optimize its performance on the validation set, as measured by the validation measures.
This process involves finding the set of hyperparameters that result in the best performance
on the validation set.

The combination of validation measures and tuning models allows for an iterative process
of training and refining machine learning models to reach an acceptable level of
performance.


5. SELF-ASSESSMENT QUESTIONS - ANSWERS

1. Validation
2. hyperparameter tuning
3. classification
4. F1 score
5. confusion matrix
6. Log Loss
7. regression
8. Hyperband
9. Optuna
10. probabilistic model-based

6. TERMINAL QUESTIONS:

1. What is the importance of validation measures?


2. Explain the different validation measures.
3. Brief about the importance of tuning models.
4. Write in detail about the various tuning models.
5. Illustrate the working process of the tuning models.

7. TERMINAL QUESTIONS- ANSWERS:

1. Refer to section 2.1
2. Refer to section 2.2
3. Refer to section 3.1
4. Refer to section 3.2
5. Refer to section 3.2


Unit 12
Clustering Algorithms
Table of Contents
SL Topic Fig No / Table SAQ / Page No
No / Graph Activity

1 Introduction - -
3
1.1 Learning Objectives - -

2 Unsupervised Learning 1, 2 -
4-5

3 KMeans Clustering – Concept 3, 4, 5, 6 -

3.1 KMeans Clustering – Algorithm - - 6-8

3.2 Scaling and Creating Dummy Variables - -

4 Number of Clusters 7, 8, 9 - 9 – 21

5 Agglomerative Clustering 10, 11, 12, 13, 1


14, 15, 16, 17, 22 - 32
18, 19, 20

6 Terminal Questions - - 33

7 Self-Assessment Questions – Answers - - 33

8 Terminal Questions – Answers - - 33


1. INTRODUCTION

The previous topics cover the concepts of ‘Supervised Learning Algorithms’. In supervised
learning algorithms, the algorithm is trained on labelled data, with the goal of predicting the
output for unseen data based on the patterns in the labelled data. During the training, it has
a known output for each input example. In unsupervised learning, the algorithm is trained
on unlabeled data, with the goal of discovering structure or patterns in the data. During the
training, there is no known output. ‘Clustering’ is an unsupervised machine learning concept
where the data is segmented into various clusters or groups.

1.1 Learning Objectives

At the end of this unit, you will be able to:

❖ Explain the concept of unsupervised learning and clustering.


❖ Explain the concept of K-Means Algorithm.
❖ List the steps in K-Means Algorithm.
❖ Elaborate on the techniques available to decide on the number of clusters.
❖ Explain the concept and technique of agglomerative clustering technique with an
example.


2. UNSUPERVISED LEARNING
Often, we may not want to predict a particular feature of the data. We may instead want to find interesting patterns in the data and make business decisions based on them. Consider the percentage sales of a product along with the revenue of the store, as shown in Table 1.

Table 1: Sales vs Revenue

If we plot a scatter plot with the percentage sales of a product and the revenue of the store,
we see that there are 3 different groups of data. One in which the percentage of sales and
revenue is less, the other where both are moderate and the third in which the percentage of
sales is less, but the revenue of the store is quite high.

Figure 1: Scatter plot of Sales vs Revenue

Note that in this case there is no classification or regression being performed, and there is no target variable. However, we can still get interesting categories of stores. This type of unsupervised learning, where we segregate the data into categories, groups or clusters, is referred to as ‘Clustering’.

Unit 12 : Clustering Algorithms 4


DADS303: Introduction to Machine Learning Manipal University Jaipur (MUJ)

Clustering is used in several areas. Let us look at another example. Suppose we have a web analytics platform. We can capture data about the number of hits on our webpage, the duration of people's stay on a given webpage, the number of pages visited, and so on. With this data, many categories can potentially be discovered. For example, we may discover a group of people with a high hit rate and a small amount of money spent, a group of people with short-duration stays and a large amount of money spent, and so on.

In Figure 1, we could say that there are three groups because we were able to visualize the data, and we could visualize the data because there were only two variables. In practice, there are usually more than two features in a dataset. In that case, visualizing the data and finding out the number of clusters would be impossible. Therefore, we need an algorithm to find the clusters in our data. One of the most popular algorithms is KMeans.


3. KMEANS CLUSTERING – CONCEPT


KMeans is an iterative algorithm. The ‘K’ in KMeans stands for the number of clusters to be
be found in the data. That is a user-specified parameter. The output of the KMeans algorithm
is the cluster label for each row of the data. When we use any clustering algorithm such as
KMeans, we end up finding out which data point belongs to which group.

Table 2: Sales Revenue with Cluster Labels

At a data level, this is akin to finding out which row in our data belongs to which group. The
creation of this label column is one of the potential outputs of a clustering algorithm.

3.1 KMeans Clustering – Algorithm

1) The first step in a KMeans clustering algorithm is to assign ‘k’ points as the centres of ‘k’ different clusters. If we assume that we want k=3 clusters, then to begin with we randomly assign 3 points as the centres of these 3 clusters.
2) In the next step, we will find out the distance of each of the points in our data from these
centres and based on the closeness of that point to a particular cluster centre we will
say that point belongs to that cluster.
3) Next, we will try to find the centre of each of these newly formed clusters.
4) We will again compute the distances of each of the points from these three cluster
centres and assign cluster labels and after that, we will again re-compute the cluster
centres.
5) We will keep on doing this task until we obtain a stable solution. Now a stable solution
would be one in which in successive iterations, no point changes its class membership
or cluster membership.
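A minimal sketch of running these steps with scikit-learn's KMeans is shown below; the toy two-dimensional data and k = 3 are illustrative.

# Sketch: KMeans with k=3 on toy two-dimensional data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster label for each row of the data

print(labels[:10])
print(kmeans.cluster_centers_)      # the final cluster centres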


Figure 2: Clustering Algorithm Steps

3.2 Scaling And Creating Dummy Variables

There are a couple of things that we need to keep in mind while using the KMeans algorithm. We need to make sure that the data we are using to build a KMeans model is always numeric and always on the same scale. We also need to figure out how many clusters the data has. So far, figuring out the number of clusters was very straightforward, as we could plot the data and take a call. In real-life scenarios, knowing how many clusters the data has is a more involved process.

In a typical business dataset, we have categorical variables like colour, size (S, M, L) and so on. For clustering, all our data should be numeric. One way to convert categorical data into numeric data is to create dummy variables. This process is also known as one-hot encoding the categorical data.


Figure 3: Example of converting a categorical variable into dummy variables

The other thing we need to keep in mind is that all the variables in the data should be on the same scale, so before we feed data to a clustering algorithm it is imperative to bring our data onto the same scale. There are a couple of ways in which one can scale the data; the most common way is to compute a Z-transform, also called Z-standardisation.

In Z-standardization, the data is transformed to have a mean of zero and a standard deviation
of one. This is done by subtracting the mean from each data point and then dividing by the
standard deviation. The resulting transformed data is referred to as standard scores, or Z-
scores, which allow for comparison of data values across different distributions.

Figure 4: Computation of Z-Scores


The above figure shows the z transformation of the income and children’s data. As you can observe, the values are now on a comparable scale, centred around zero.

4. NUMBER OF CLUSTERS

The number of clusters can sometimes be derived visually or from the business context. Very often, however, neither strategy works, for example when there is no context that determines the number of clusters to be considered. In such cases, we must determine the best number iteratively. A good choice of the “number of clusters” should lead to compact and well-separated clusters.

One algorithmic way to find an optimum number of clusters is to calculate some measure of average cluster compactness for a series of values of K. If, say, for 3 clusters the measure of compactness attains a minimum or becomes asymptotic, then we will choose 3 as the optimum number of clusters.

To measure the compactness of a cluster, we compute a measure called WSS or “Within Sum of Squares” for that cluster, i.e., the total squared distance of each point of the cluster from its cluster centre. To arrive at a consolidated measure of cluster compactness, we can sum up the WSS for each cluster or take an average. We can do this computation for each value of K and then take a decision.

We usually plot the total “Within SS” for each value of K to arrive at an idea of what would be
the optimum number of clusters. Such a plot is known as Scree plot or Elbow plot.


Figure 5: Scree plot or Elbow plot

Now if you look at the Scree plot, we can see that the decrease in WSS is not substantial after
around 8 clusters. This means that the improvement in cluster compactness is very minimal
after 8 clusters and this improvement flattens out at around 12 clusters. So, it seems like 8
to 12 clusters can be the optimum number of clusters. There may be some subjectivity
involved on the part of the analyst while determining the ballpark number of optimal
clusters.

Once we determine the range of optimal clusters, the next task is to create cluster profiles: once the cluster models are created, we should try to understand what each cluster represents. Based on this we finalize whether, say, the 8-cluster model or the 11-cluster model suggested by the elbow curve is better.

Once we do clustering, we end up with a data view like the one in Figure 6 where we know
which row in the data stands for which cluster. We can always find out what would be the
mean values of the variables in each of the clusters. And we can also find out the average
value of the variables in the data itself. Let’s call these average values the global means. Now
doing this can help us in deciding if the clusters that we’ve created are meaningful from a
business point of view or not. Our clusters will be more meaningful if they are different.

Now we can easily see that compared to the global average in cluster 1, both the average
revenue is high and the percentage of sales are also high. For cluster 2, the revenue and the
percentage of sales are near the global average. For cluster 3, the revenue and percentage of
sales are low compared to the global average.

Figure 6: Global mean of each cluster


To aid in the uniform comparison of differences, instead of looking at the absolute difference
we can compute the z values for each variable in each cluster (Figure 6).

Figure 6: Global mean of each cluster with z-transformation

Figure 6 (with z-transformation) represents the same information as the table of cluster means above, but here we compute z values to compare each cluster mean with the global mean. z values with high positive magnitude signify that the cluster mean is larger than the global mean; z values with negative magnitude signify that the cluster mean is smaller than the global mean. Variable profiling helps in understanding whether the clusters created are meaningful or not.
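As a hedged sketch of the z-transformation used here (notation assumed): for variable $j$ in cluster $k$,

$z_{k,j} = \dfrac{\bar{x}_{k,j} - \bar{x}_j}{s_j}$

where $\bar{x}_{k,j}$ is the mean of variable $j$ within cluster $k$, and $\bar{x}_j$ and $s_j$ are the global mean and standard deviation of variable $j$. This is exactly what the get_zprofiles() function in the code demo below computes.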

To summarize the process of finding a good value of K: we first check whether the problem statement has any context that hints at the number of clusters to be created. If there is such a context, we create that many clusters. If the context is missing, we first create an Elbow plot to get a ballpark figure for the number of clusters. Then, for each number of clusters suggested by the Elbow plot, we create cluster profiles, and by looking at the cluster profiles we decide the value of ‘k’.

We can also use the Silhouette Measure to assess the effectiveness of ‘k’. This is a number that lies between -1 and 1; a higher value represents a better fit of the points given their cluster assignments.

The silhouette measure is a way of quantifying how well each sample in a dataset is assigned
to its corresponding cluster. It's calculated as follows:


1. For each sample x in the dataset, calculate its average distance a(x) to all other points
   in the same cluster.
2. For each sample x, also calculate the average distance b(x) to all samples in the nearest
   cluster to which x does not belong.
3. The silhouette score for each sample is then given by s(x) = (b(x) - a(x)) / max(a(x),
   b(x)).
4. Finally, the overall silhouette score for the clustering is calculated as the average of s(x)
   for all samples x.

This score ranges from -1 to 1, with a score of 1 indicating that the sample is well-matched
to its own cluster and poorly matched to other clusters, and a score of -1 indicating that the
sample is assigned to the wrong cluster.

We can also use the Silhouette Measure to arrive at an idea of the optimal number of clusters: whichever number of clusters gives an average Silhouette value closest to 1, we choose that many clusters.

Code Demo

The code snippets below demonstrate the clustering of ‘housing data’ using the KMeans
algorithm.

To begin with we import the required libraries and then read the housing data into a
dataframe.

import os

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline


data_dir=r'C:\Users\User1\IML\KMeans'

os.chdir(data_dir)

data=pd.read_csv("kc_housingdata.csv")

data.head()

Output

The dtypes attribute displays the columns with their corresponding types.

data.dtypes


Output

id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
floors float64
waterfront int64
view int64
condition int64
grade int64
sqft_above int64
sqft_basement int64
yr_built int64
yr_renovated int64
zipcode int64
lat float64
long float64
sqft_living15 int64
sqft_lot15 int64
dtype: object

Though some of the columns like “waterfront”, “view” are marked as int64, they represent
Boolean values. The code below selects those columns that have true continuous values.

## Choose columns that are numeric and have a numeric interpretation

data_num=data[['price','bedrooms','bathrooms','sqft_living']]

data_num.dtypes

Output

price float64
bedrooms int64
bathrooms float64
sqft_living int64
dtype: object


The values in these columns should be scaled. Scaling involves subtracting the mean and dividing by the standard deviation. This results in the mean becoming zero and the standard deviation becoming 1 (the values are not bounded, but most of them will lie within a few standard deviations of zero). The code below does this transformation.

## Scale the data, using pandas

def scale(x):
    return (x-np.mean(x))/np.std(x)

data_scaled=data_num.apply(scale,axis=0)

data_scaled.head()

Output

Another way of doing this is by using the methods in the ‘preprocessing’ library of sklearn.
The code below uses the scale() method to perform a similar operation as above.

## Scale the data using sklearn

import sklearn.preprocessing as preprocessing

dat_scaled=preprocessing.scale(data_num,axis=0)

print (dat_scaled)

print ("Type of output is "+str(type(dat_scaled)))

print ("Shape of the object is "+str(dat_scaled.shape))


Output

[[-0.86671733 -0.39873715 -1.44746357 -0.97983502]


[-0.00568792 -0.39873715 0.1756067 0.53363434]
[-0.98084935 -1.47395936 -1.44746357 -1.42625404]
...
[-0.37586519 -1.47395936 -1.77207762 -1.15404732]
[-0.38158814 -0.39873715 0.50022075 -0.52252773]
[-0.58588173 -1.47395936 -1.77207762 -1.15404732]]
Type of output is <class 'numpy.ndarray'>
Shape of the object is (21613, 4)

As you can observe, the values are the same; only the datatype of the output is different.

The sklearn package contains the class called KMeans that performs the KMeans Clustering.
It takes as parameter the number of clusters that need to be created. The class can be
instantiated and the data can be fitted as shown below.

## Create a cluster model

import sklearn.cluster as cluster

kmeans=cluster.KMeans(n_clusters=3,init="k-means++")

kmeans=kmeans.fit(dat_scaled)

Fitting will result in labels or ‘cluster value’ being assigned to each row of the data.

kmeans.labels_

Output

array([0, 1, 0, ..., 0, 1, 0])

The details of the cluster can be displayed as shown below.

kmeans.cluster_centers_


Output

array([[-0.46344721, -0.72134068, -0.85923278, -0.78805675],


[-0.0295446 , 0.36219246, 0.37636536, 0.23320139],
[ 1.91042099, 1.08474966, 1.54690937, 1.93696596]])

We can find the ideal value of ‘k’ using the ‘Elbow’ method as shown below. For different numbers of clusters, the average of each point’s minimum distance to the cluster centres is computed.

## Elbow method

from scipy.spatial.distance import cdist

K=range(1,20)
wss = []

for k in K:
    kmeans = cluster.KMeans(n_clusters=k,init="k-means++")
    kmeans.fit(dat_scaled)
    wss.append(sum(np.min(cdist(dat_scaled, kmeans.cluster_centers_, 'euclidean'), axis=1)) / dat_scaled.shape[0])

The number of clusters vs the average distance to the cluster center is plotted in a graph as
shown below.

plt.plot(K, wss, 'bx')

plt.xlabel('k')

plt.ylabel('WSS')

plt.title('Selecting k with the Elbow Method')

plt.show()


Output

It can be observed that after 12, the benefit of increasing the number of clusters is minimal.
The ideal value could be between 8 and 12. This can be fixed by bringing in the business
context or displaying the cluster profiles for each of these values.

The code below calculates the ‘silhouette score’ with 8 clusters.

import sklearn.metrics as metrics

labels=cluster.KMeans(n_clusters=8,random_state=200).fit(dat_scaled).labels_

metrics.silhouette_score(dat_scaled,labels,metric="euclidean",sample_size=10000,random_state=200)

Output

0.2831038800191117

The code below displays the silhouette score for different cluster values. Note that the closer
the value to 1, the better compactness of the cluster.

for i in range(7,13):
    labels=cluster.KMeans(n_clusters=i,random_state=200).fit(dat_scaled).labels_
    print ("Silhoutte score for k= "+str(i)+" is "+str(metrics.silhouette_score(dat_scaled,labels,metric="euclidean",sample_size=1000,random_state=200)))


Output

Silhoutte score for k= 7 is 0.2763712840436325


Silhoutte score for k= 8 is 0.29083073029268
Silhoutte score for k= 9 is 0.2849435506657983
Silhoutte score for k= 10 is 0.28029817865745554
Silhoutte score for k= 11 is 0.275321067405934
Silhoutte score for k= 12 is 0.28071596714612507

Create a Python file named cluster_profiles.py (the module imported below) with the functions shown here. They compute cluster profiles: the cluster means expressed as z values relative to the global mean, and the cluster means relative to the global mean without the z-transformation.

def get_zprofiles(data,kmeans):
    data['Labels']=kmeans.labels_
    profile=data.groupby('Labels').mean().subtract(data.drop('Labels',axis=1).mean(),axis=1)
    profile=profile.divide(data.drop('Labels',axis=1).std(),axis=1)
    profile['Size']=data['Labels'].value_counts()
    return profile

def get_profiles(data,kmeans):
    data['Labels']=kmeans.labels_
    profile=data.groupby('Labels').mean().divide(data.drop('Labels',axis=1).mean(),axis=1)
    profile['Size']=data['Labels'].value_counts()
    return profile


Import the library into the notebook as shown below.

import cluster_profiles as cluster_profiles

Fit the model for ‘8’ clusters as shown in the code below.

## Let's look for profiles for 8,9,10 clusters

kmeans=cluster.KMeans(n_clusters=8,random_state=200).fit(dat_scaled)

Display the zprofiles for the model.

cluster_profiles.get_zprofiles(data=data_num.copy(),kmeans=kmeans)

Output

The global mean without transformation can be displayed as shown below.

cluster_profiles.get_profiles(data=data_num.copy(),kmeans=kmeans)


Output

Thus, clustering algorithms can be used to categorize the dataset into different categories.


5. AGGLOMERATIVE CLUSTERING
Another very popular algorithm that is used to create clusters in data is called agglomerative
clustering or hierarchical clustering.

Agglomerative clustering, as the name suggests, takes a bottom-up approach. In this kind of clustering each data point starts as an individual cluster, and based on the similarity of the data points we progressively merge them into larger clusters.

Figure 7: Agglomerative clustering

For example, let’s say we have five points in our data: p, q, r, s and t. In the first iteration of the algorithm, we figure out which two points are the most similar. Suppose it turns out that points s and t are the most similar of the five; we then cluster them together and say that {s, t} forms one cluster.

In the second iteration, we try to find out which of cluster {s,t}, point {r}, point {p} and point {q} are closest together. Suppose it turns out that points p and q are the most similar to each other; so, in the second iteration, points p and q form one cluster.

In the third iteration, we have to decide what to merge next: cluster {p,q}, point {r} or cluster {s,t}. Let’s say we figure out that point {r} is closest to cluster {s,t}; we therefore merge them, so that {r,s,t} forms one cluster and {p,q} remains the second cluster. At the final iteration of the algorithm, all the data points are merged into a single cluster.

To figure out the similarity between points or the similarity between points and clusters, let
us assume that we have seven rows of data across two variables.

Table 3: Data with Two Values

Just like K-means clustering, hierarchical or agglomerative clustering requires the data to be scaled and numeric. We then need to find the distance between points 1 and 2, points 1 and 3, points 1 and 4, and so on for all possible pairs of points, which is what the dissimilarity matrix below records.

Table 4: Dissimilarity Matrix

Now, this dissimilarity matrix is nothing but an application of the simple distance formula from high-school coordinate geometry. Here you can think of each row of the data as one point, and using these points we can find the distance between them. Once we create this dissimilarity matrix, let us see what each number in it means. The number “4.03112887” represents the distance between points 1 and 2, and so on. The distance between a point and itself is always 0, hence all the diagonal entries are 0.


Within this dissimilarity matrix, the smallest entry corresponds to the pair of points 1 and 4, which means the distance between point 1 and point 4 is the least; so we use simple Cartesian distances to figure out the similarity between points. A short code sketch of how such a matrix can be computed is shown below.
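A minimal code sketch of how such a dissimilarity matrix can be computed (the three points below are hypothetical scaled values, not the values of Table 3):

## Pairwise Euclidean distances for a small, already-scaled two-variable dataset
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[1.0, 2.0],    # hypothetical scaled rows
                   [4.0, 3.5],
                   [2.5, 0.5]])

## squareform() turns the condensed distance vector into the full symmetric matrix
diss = squareform(pdist(points, metric='euclidean'))
print(diss)    # zeros on the diagonal, distance between points i and j elsewhere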

Now let us look at a slightly more involved numerical example to see how the algorithm works through different iterations. Assume that we have five data points a, b, c, d, e, and that this data is scaled and numeric; we can then create a dissimilarity matrix out of this data.

Figure 8: Dissimilarity matrix for 5 points

As you can see, once we create the dissimilarity matrix we know that points a and b are most similar to each other. So at the end of the first iteration we combine points {a, b} and say that they form one cluster, while points {c}, {d} and {e} are individually recorded as single-point clusters; we also record that points a and b were agglomerated at a dissimilarity level of 2. In the next step, we need to find the distance between cluster {a, b} and point {c}, between cluster {a, b} and point {d}, and so on.

Figure 9: Dissimilarity matrix after creating 1 cluster

The distance between a cluster and another cluster, or between a cluster and a point, is computed slightly differently from the distance between a pair of points. Within hierarchical or agglomerative clustering, this computation involves the use of something known as linkages. Linkages are mathematical constructs that help us figure out the distance or dissimilarity between a cluster and a point (or another cluster).

Here is an example of the calculation of dissimilarity between two clusters, one of the ways
to define dissimilarity between clusters is this:

Figure 10: Formula to compute distance between clusters

So, the distance between any two clusters r and q is equal to 1 divided by (the number of elements in cluster r multiplied by the number of elements in cluster q), multiplied by the sum of the pairwise distances between the individual elements of the two clusters. A LaTeX rendering of this average-linkage formula is sketched below.
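A LaTeX rendering of this average-linkage idea (notation assumed): for clusters $r$ and $q$ with $n_r$ and $n_q$ elements,

$d(r, q) = \dfrac{1}{n_r \, n_q} \sum_{x_i \in r} \sum_{x_j \in q} d(x_i, x_j)$

i.e., the average of all pairwise distances between the members of the two clusters. Other linkages (single, complete, Ward) use different rules for the same purpose.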

Now let us take an example to understand how this formula would work,

Figure 11: Computation of distance between cluster and point

To figure out the distance between cluster {a,b} and point {c}, count the elements in cluster {a,b}, then substitute into the formula in Figure 10 the distance between point a and point c and the distance between point b and point c. We can then compute the distance between cluster {a,b} and point {c} and complete the dissimilarity matrix.

Figure 12: Updated Dissimilarity Matrix


Looking at the updated dissimilarity matrix, we can see that points {d} and {e} are the most similar, so {d} and {e} are combined in iteration 2 of the algorithm; after iteration 2 we have three clusters – {d, e}, {a, b} and {c}. We then repeat the process for the remaining clusters and points.

Figure 13: Computation of distance between cluster and point in the 2nd iteration

We have to find out which pair of points or clusters are similar.

Figure 14: Updated Dissimilarity Matrix after iteration 2

Point {c} is closest to cluster {d, e}, so we merge {c} with {d, e}. Hence, at the end of step 3 we are left with only two clusters: {a, b} and {c, d, e}. In step 4, we combine these two clusters together.

This is the visual representation of what just happened,

Figure 15: Dendrogram


This visual representation is referred to as a dendrogram and within this dendrogram, you
can also see the information about at what level of dissimilarity were these agglomerations
done. Now when a clustering algorithm runs through the whole number of iterations and
combines all the data points into a single cluster, it also remembers at each step what
happened.

For example, if we want to query from this algorithm only three clusters then the clustering
procedure would remember that at step 3 there were three clusters and point {d , e} was one
cluster, {a ,b} was one cluster and point {c} was another cluster and hence it’ll be able to label
the rows appropriately with cluster labels.

To decide the optimum number of clusters for a hierarchical clustering model, just like K-
means clustering, if the business problem that we are trying to solve has some context
around it which gives us an implicit or explicit hint about how many clusters are to be used
then based on that we will decide how many clusters are to be extracted out of a hierarchical
clustering model. For example, a marketer might want to look at only four segments in a
given data then we will just use hierarchical clustering to look at four clusters and see what
those four clusters mean. If we want to figure out theoretically or algorithmically what would be an optimum number of clusters then, as discussed in the previous section, we can make use of the silhouette measure: whichever number of clusters brings the silhouette measure closest to one, we may want to go with that hierarchical clustering model. Thirdly, it is also important to look at the clusters you have created in the context of the business by closely scrutinizing the cluster profiles; if the cluster profiles make business sense we go with that model, otherwise we keep iterating. A hedged code sketch of the silhouette check is shown below.
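A hedged sketch of the silhouette check described above, assuming a scaled numeric array data_scaled (as produced in the code demos of this unit); sklearn's AgglomerativeClustering is used here for convenience:

import sklearn.cluster as cluster
import sklearn.metrics as metrics

## Fit hierarchical models for a few candidate values of k and compare silhouettes
for k in range(2, 7):
    labels = cluster.AgglomerativeClustering(n_clusters=k, linkage='ward').fit(data_scaled).labels_
    print("Silhouette score for k =", k, "is",
          metrics.silhouette_score(data_scaled, labels, metric='euclidean'))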

Code Demo


import os

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

data_dir=r'c:\Users\sreejaj\Downloads'

os.chdir(data_dir)

data=pd.read_csv("pollution_india_2010.csv",na_values='Null')

data.head()

Output

data.isnull().sum()


Output

City 0
NO2 4
PM10 1
SO2 5
State 0
dtype: int64

data.shape

Output

(181, 5)

data=data.dropna()

data.shape

Output

(175, 5)

data.dtypes

Output

City object
NO2 float64
PM10 float64
SO2 float64
State object
dtype: object

data_pol=data.groupby('State',as_index=False)[['NO2','PM10','SO2']].agg(np.sum)

data_pol.head()


Output

def scale(x):
    return (x-np.mean(x))/np.std(x)

data_num=data_pol.drop("State",axis=1)

## Scale each pollutant column (axis=0), as in the K-means demo earlier in this unit
data_scaled=data_num.apply(scale,axis=0)

data_scaled.head()

Output

from scipy.cluster.hierarchy import dendrogram, linkage

data_scaled=np.array(data_scaled)

Z=linkage(data_scaled,method="ward")


fig, ax = plt.subplots(figsize=(15, 20))

ax=dendrogram(Z,orientation="right",labels=np.array(data_pol['State']),leaf_rotation=30,leaf_font_size=16)

plt.tight_layout()

plt.show()

Output
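Once the dendrogram has been inspected, cluster labels can be extracted for a chosen number of clusters. A minimal sketch, assuming the linkage matrix Z and the data frame data_pol from the demo above (the choice of 4 clusters is only illustrative):

from scipy.cluster.hierarchy import fcluster

## Cut the dendrogram so that at most 4 clusters remain and attach the labels
labels = fcluster(Z, t=4, criterion='maxclust')
data_pol['Cluster'] = labels
print(data_pol[['State', 'Cluster']].head())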


Self-Assessment Questions - 1

1) In ______________, the algorithm is trained on unlabeled data, with the goal of


discovering structure or patterns in the data. During the training, there is no
known output.
2) The output of the KMeans algorithm is the _________ for each row of the data
3) In the KMeans algorithm, once we find the distances to the center of the
cluster and recompute the cluster the point belongs to , we have to
_____________________.
4) The KMeans algorithm stops when ___________________________________.
5) What are the two conditions needed to perform KMeans Algorithm ?
6) Name one method to ensure that the data is uniformly scaled
7) After z-standardisation, the mean will be ___ and standard deviation will be
__________.
8) The plot of the total “Within SS” for each value of K to arrive at an idea of what
would be the optimum number of clusters. Such a plot is known as
_________________.
9) The Silhouette Measure takes value between ____________.
10)____________ is a kind of clustering where each data point is considered an
individual cluster and based on the similarity of data points we form clusters.


6. TERMINAL QUESTIONS

1) Explain the steps in KMeans Algorithm


2) Explain how scaling can be performed with an example.
3) Explain how the categorical variables are converted into numerical form with example.
4) How is the ‘ideal’ k computed in the KMeans algorithm?
5) Explain Agglomerative clustering with example

7. SELF-ASSESSMENT QUESTIONS – ANSWERS

1) Unsupervised learning,
2) Cluster label
3) Recompute the center of the cluster.
4) no point changes its class membership or cluster membership.
5) All features should be uniformly scaled and should be numerical.
6) Z-Standardisation
7) 0 and 1.
8) Scree plot or Elbow plot.
9) -1 and 1.
10)Agglomerative clustering

8. TERMINAL QUESTIONS – ANSWERS

1) Refer section “KMeans Clustering – Algorithm “


2) Refer section “Scaling and Creating Dummy Variables”
3) Refer section “Scaling and Creating Dummy Variables”
4) Refer section “Number of Clusters”
5) Refer section “Agglomerative Clustering”


Unit 13
Recommendation System
Table of Contents
SL Fig No / Table SAQ /
Topic Page No
No / Graph Activity
1 Introduction 1 -
3
1.1 Learning Objectives - -
2 Recommendation Engine 2 - 4-5
3 Implementation of Collaborative Filtering 3, 4, 5, 6 -
3.1 Memory Based / Neighborhood Methods - -
6 - 10
3.2 User-Based Collaborative Filtering - -
3.3 Item-Based Collaborative Filtering - -
4 Building Collaborative Filtering Model Using
- - 11 - 18
Memory-Based Approach
5 Model-Based Recommendation Engines - -
5.1 Matrix Factorization 7 -
5.2 Prediction and Estimation Using SVD and
8, 9 - 19 - 25
NMF
5.3 Singular Value Decomposition 10, 11 -
5.4 Non-Negative Matrix Factorization - -
6 Code Demo - 1 26 - 28
7 Self-Assessment Questions – Answers - - 29
8 Terminal Questions - - 29
9 Terminal Questions – Answers - - 29


1. INTRODUCTION

A recommendation system is an artificial intelligence (AI) algorithm, commonly associated with machine learning, that uses big data to suggest or recommend additional products to customers. These suggestions can be based on a number of parameters, including previous purchases, search history, demographic data, and other factors. Recommender systems are extremely helpful because they guide consumers to goods and services they might not have found on their own.

There are many ways to create recommendation engines, but, in this module, we focus only
on collaborative filtering. We can build collaborative filtering models using memory-based
methods as well as model-based methods.

Figure 1: Methods to build Collaborative Filtering

1.1 Learning Objectives

At the end of this topic, the students will be able to:

❖ Understand the user's preferences and behaviour in order to make personalized recommendations to the users.
❖ Predict the user's future interests and preferences.
❖ Improve the user's experience by providing relevant and useful suggestions
❖ Analyze the effectiveness of the recommendation algorithm by measuring metrics such as
accuracy and diversity.


2. RECOMMENDATION ENGINE

Software that creates product recommendations for customers based on their preferences
or previous behaviour is known as a recommendation engine. By recommending items like
books, movies, music, or products that the user may find interesting, these suggestions can
be utilised to enhance the user experience.

Building a recommendation engine can be done in a number of ways, such as:

• Collaborative Filtering
• Content-Based Filtering
• Hybrid Approaches

In recommendation systems, a technique called collaborative filtering is used to forecast a


user's preferences or rating for a product based on both their prior activity and the
behaviour of other users.

Collaborative filtering can be divided into two basic categories: user-based and item-based.
In user-based collaborative filtering, suggestions are made for a user based on both that
person's prior behaviour and the behaviour of users who are similar to them. In item-based
collaborative filtering, suggestions are created based on the user's previous behaviour and
the connections between items.

You require information about user preferences or item ratings in order to create a
collaborative filtering recommendation engine. With the use of this information, a user-item
matrix can be created, in which the rows correspond to users and the columns to items. Then,
a variety of methods can be used to create suggestions, such as assessing the similarity of
items based on the behaviour of users who have interacted with both items or identifying
the k nearest neighbours to a user based on their past activity.

Recommendation engines can also be classified as unsupervised or semi-supervised learning


algorithms. Collaborative filtering is necessarily unsupervised.

Predicting User Preferences


Recommender systems use historical user preference data to predict user preferences. This
is a common use case for recommendation algorithms.

For instance, if we visit amazon.com, Amazon attempts to suggest similar items to us based on our browsing history. Companies like Netflix and Amazon sell a wide variety of goods, so being able to cross-sell and up-sell their products becomes essential for such organizations.

In this module, we will focus solely on collaborative filtering. Collaborative filtering works
on the premise of similar users, selecting similar kinds of products or if we know how similar
one product is to a set of products, then relevant recommendations can be made.

For doing collaborative filtering, we usually work with a user-item rating matrix as shown in
figure 2, the numbers in this matrix, for example, this number 3, 5 & 1 (First 3 column values
of User 1), represent the ratings of user-item pair. In the context of a company such as
Netflix.com, the items would be movies or TV series and the user-item rating matrix would
be the user movie rating matrix.

Figure 2: User Item Rating Matrix

Usually, a user-item rating matrix is partially populated as not all users tend to rate all items.
Here, user 1 has rated all the items, but user 2 has not rated item number 5. Similarly, the
other users have also not rated certain items.

Recommendation engines usually try to predict these missing ratings. If the predicted value
of these ratings is high enough, then this item is recommended to the user.


3. IMPLEMENTATION OF COLLABORATIVE FILTERING


Collaborative filtering can be implemented in 2 ways, i.e.,

• one using Memory based or Neighborhood methods


• The other uses model-based methods.

3.1 Memory Based / Neighborhood Methods

Memory-based or neighborhood methods are further divided into 2 types,

• one is user based and


• the other is item based.

User-Based Collaborative Filtering is a method for predicting the products that a user could
like based on the ratings given to that product by other users whose interests are similar to
the target users. Collaborative filtering is a popular method for creating recommendation
systems on websites, whereas Item Based Collaborative Filtering is a kind of
recommendation system where things are compared for similarity based on user ratings.

Figure 3: User-Based Collaborative Filtering Figure 4: Item-Based Collaborative Filtering

In Figures 3 & 4, the dashed lines show a recommendation, while the solid lines show the
user's preference.


Neighborhood strategies are among the most popular approaches for putting collaborative filtering into practice. Because the procedure is simple to understand, it is also not too difficult to implement.

Both user-based and item-based collaborative filtering can be implemented using a neighborhood approach.

For example, in user-based collaborative filtering, assume we have a user-item rating matrix
as shown in figure 5 and also assume we want to predict the rating for item 5 for Alice.

Figure 5a: User Item Rating Matrix Figure 5b: Evaluating the rating for item 5 based
on user 1 &2 ratings.

First, we will find users who are similar to Alice. We can use any measure of similarity, such as Pearson correlation or cosine similarity. Then, based on the most similar users, we predict what Alice's rating for item 5 would be.

Let’s say, user 1 and user 2 turn out to be the most similar users to Alice, then the rating for
item 5 for Alice will be predicted based on how user 1 and user 2 have rated item 5 as shown
in figure 5(b).

3.2 User-Based Collaborative Filtering

Let’s consider a numerical example to understand the notion of similarity of users and how
based on similar users, and how we will eventually come up with a prediction for rating for
a given user.

As mentioned earlier, we want to predict the rating for user Alice for item 5. Now we have
this user-item rating data as shown in figure 5a.


• First find out the similarity between Alice and user1, Alice and user2, Alice and user3
and Alice and user4.
• To find out the similarity, calculate the correlation between the ratings of Alice and
each of the users.

Another way to find the similarity between a user and another user based on their
ratings is known as cosine similarity.

The formula to calculate the cosine similarity between the ratings of users is as shown
below,
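As a hedged sketch in standard notation (symbols assumed): if $r_a$ and $r_b$ are the rating vectors of users a and b over the items both have rated, then

$\text{sim}(a, b) = \cos(r_a, r_b) = \dfrac{\sum_i r_{a,i}\, r_{b,i}}{\sqrt{\sum_i r_{a,i}^2}\; \sqrt{\sum_i r_{b,i}^2}}$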

• For our collaborative filtering algorithm to work, we have to specify how many of the closest neighbours should be looked at to decide the final rating for any given user.
  Hence, in this example, let us consider only the two users most similar to Alice (user 1 and user 2, based on the correlation values) and decide the predicted rating based on them.
• Now, to do rating prediction, various formulations can be used, depending on the machine learning framework that is used to make the predictions.
  Most machine learning frameworks support either predicted-rating methodology (i) or (ii).

Formula (i) Formula (ii)

• Formula (i) is the weighted-average predicted-rating methodology, which can be represented as:
  {(similarity of user 1 × user 1's rating for the item to be predicted) + (similarity of user 2 × user 2's rating for the item to be predicted)} / (similarity of user 1 + similarity of user 2).
• Formula (ii) is used to do rating prediction while taking the average rating of each user into account.
• Hence, we first find the average rating of Alice and then add to it:
  {similarity of user 1 × (the rating given by user 1 to item 5 − the average of user 1's ratings) + similarity of user 2 × (the rating given by user 2 to item 5 − the average of user 2's ratings)} / (similarity of user 1 + similarity of user 2).
A LaTeX sketch of both formulas follows.
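A LaTeX sketch of the two formulations just described (standard neighbourhood formulas; the notation is assumed): let $sim(a,u)$ be the similarity between the active user $a$ (Alice) and a neighbour $u$, $r_{u,i}$ the rating of $u$ for item $i$, and $\bar{r}_u$ the average rating of user $u$. Then

(i) $\hat{r}_{a,i} = \dfrac{\sum_{u} sim(a,u)\, r_{u,i}}{\sum_{u} sim(a,u)}$

(ii) $\hat{r}_{a,i} = \bar{r}_a + \dfrac{\sum_{u} sim(a,u)\,(r_{u,i} - \bar{r}_u)}{\sum_{u} sim(a,u)}$

where the sums run over the chosen nearest neighbours.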

3.3 Item-Based Collaborative Filtering

Let’s consider the same example, assuming that we still want to predict the rating for item 5
for user Alice.
Now we will find the items similar to item 5 that have been rated by Alice, using either correlation or cosine similarity. Then, based on the items most similar to item 5, we will compute the rating for item 5.

Assume that item 1 and item 4 end up being the items that item 5 is most comparable to. The
prediction regarding a rating for item 5 would now be made based on Alice's ratings for items
1 and 4.

Let’s consider a numerical example to show how item-based collaborative filtering works.

Let’s assume that we have a user-item rating matrix as shown in figure 5a, and we want to
predict the rating for item 5 for user Alice.

In item-based collaborative filtering, we have to find which items are similar to the item for which we want to predict a rating, i.e., item 5. So, first, we find the similarity between item 1 and item 5, item 2 and item 5, item 3 and item 5, and item 4 and item 5, using correlation or cosine similarity.


The cosine similarity is computed using the dot product between the ratings of the two items.

For example, to find the cosine similarity between item 1 and item 5, a dot product is taken between item 1's ratings and item 5's ratings across the users who rated both (user 1's ratings of items 1 and 5, user 2's ratings of items 1 and 5, and so on). This dot product is then divided by the product of the square roots of the sums of squared ratings of the two items.

Then, we have to decide on how many neighbors we consider for creating a rating.

Next, we have to make predictions. There are 2 ways to make predictions.

• Considering only ratings and similarities,


• Considering only average effects

Considering only the ratings and similarities, we use the first formulation; considering the average effects, we use the second. Which formulation is applied depends on the machine learning framework being used. A LaTeX sketch of both item-based formulations is given below.
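As a hedged LaTeX sketch of the two item-based formulations (standard forms; notation assumed): let $sim(i,j)$ be the similarity between items $i$ and $j$, $r_{a,j}$ the rating of user $a$ (Alice) for item $j$, and $\bar{r}_j$ the average rating of item $j$. Then

(ratings and similarities) $\hat{r}_{a,i} = \dfrac{\sum_{j} sim(i,j)\, r_{a,j}}{\sum_{j} sim(i,j)}$

(average effects) $\hat{r}_{a,i} = \bar{r}_i + \dfrac{\sum_{j} sim(i,j)\,(r_{a,j} - \bar{r}_j)}{\sum_{j} sim(i,j)}$

where the sums run over the items $j$, rated by the user, that are most similar to item $i$.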


4. BUILDING COLLABORATIVE FILTERING MODEL USING MEMORY-


BASED APPROACH

1. First will Import the required standard libraries like OS module, the panda’s module and
the surprise library

import os
import pandas as pd
import surprise  # For this module we use a Python library known as surprise. To run this code, the surprise library must be installed in your Python installation (for example, via pip install scikit-surprise).

2. Next read the data (here we have used sample_data.csv file) on our system inside Python
such that surprise can use the dataset to build various kinds of recommendation engines.
df=pd.read_csv("sample_data.csv")
df.head()

3. Now, we need to create a Reader object so that surprise can interpret the data frame. In the Reader we specify the line format, i.e., the sequence in which the user ID, the rating and the item ID occur, and the rating scale, which here is 1 to 5 (according to the data).

reader=surprise.dataset.Reader(line_format='user rating item',rating_scale=(1,5))

4. After creating the reader object, we use the load_from_df method within the surprise library to convert the data frame into an object from which surprise can build different recommendation engines.

data=surprise.dataset.Dataset.load_from_df(df,reader=reader)


Within this object, we have an attribute called raw_ratings.


data.raw_ratings

It contains a list of tuples which contain this data now in the below format.

5. Let's consider a bigger dataset (here we use the ratings.csv dataset). First, read the file as a pandas data frame and then print the first few observations. We can observe that the columns are the user ID, movie ID, rating and timestamp.
Since the timestamp will not be used by the recommendation engine, we drop it. We also rename the columns, because surprise expects the columns to have the names defined in the reader.
mr=pd.read_csv("ratings.csv")
print(mr.head())
mr.drop('timestamp',axis=1,inplace=True)
mr.rename(columns={'userId':'user','movieId':'item','rating':'rating'},inplace=True)

6. As before, will have to create a reader object.


reader=surprise.dataset.Reader(line_format='user item rating', rating_scale=(1,5))

7. Read the full dataset from the data frame mr (which we have created in step 5) based on
the reader object (which we have created in step 6). After executing this code, it will create
an object called mr_train.
mr_train=surprise.dataset.Dataset.load_from_df(mr,reader=reader)
This will have all the data that are required to build a recommendation engine.


8. After creating this object, we have to create a training set object using the method called build_full_trainset. (All these steps are required because surprise expects them to be done before it can build recommendation engines.)
mr_trainset=mr_train.build_full_trainset()

9. Now let’s use this training set that we have created to build some collaborative filtering
models both user-based and item based. Now within surprise, you have a prediction
algorithm module, that has various prediction algorithms and some of the algorithms are
based on the neighborhood approach. Hence, let’s import the module which contains
libraries that are based on the neighborhood approach like KNNBasic which implements the
very basic collaborative filtering model.

Here the k stands for the number of neighbors that we would consider for the closeness of a
user or an item and we have a parameter called similarity options (sim_options), here we
will be defining the options.

import surprise.prediction_algorithms.knns as knns

knnbasic=knns.KNNBasic(k=40,min_k=1,sim_options={'name':'cosine','user_based':True})

So, we will instantiate an object of KNNBasic class and then we will use the training method
to work on the training dataset that we created to build a user-based collaborative filtering
model based on the cosine similarity.

knnbasic.train(mr_trainset)

10. Now the model is created, let’s see the data points in that model by executing this
command.

mr.head()


11. Now once this model is trained then we can use it to predict the rating for a certain user
or a certain item ID.

Let’s do a rating prediction for user 1 by defining the parameters in predict function for Uid
1, for item ID 31 also we know that the actual rating here is 2.5.

knnbasic.predict(uid=1,iid=31,r_ui=2.5)

After executing this the predicted rating comes out to be 2.91.

12. Now, let’s build an item-based collaborative filtering model with the KNNBasic method
using the same parameters but the only change is to set the user-based value of FALSE to
build an item-based collaborative filtering model.

knnbasic=knns.KNNBasic(k=40,min_k=1,sim_options={'name':'cosine','user_based':False})

13. After executing the above code then train the model.

knnbasic.train(mr_trainset)

and now do the prediction for the same user id1 based on Item-Based collaborative filtering.

knnbasic.predict(uid=1,iid=31)

Here also the estimated rating is 2.91 which is similar to user-based collaborative filtering.

14. Now the average ratings of users and items can be obtained by using the KNNWithMeans
method.

Let’s build an item-based model (as we have given a value of False to the user-based
parameter in step 12).

knnbasic=knns.KNNWithMeans(k=40,min_k=1,sim_options={'name':'pearson','user_based':False})

Let us train before making the prediction.

knnbasic.train(mr_trainset)

Now, will predict user ID 1 and item ID 31.


knnbasic.predict(uid=1,iid=31)

Now we can observe that the predicted rating is 2.09 and the actual rating was 2.5

15. We can get an estimate of model performance by splitting the total data into three folds, so that the model is trained on two folds and tested on the remaining fold. For that, we use the split method to split the data and then the evaluate method to evaluate the KNNBasic algorithm on these three folds.

mr_train.split(n_folds=3)

surprise.evaluate(knns.KNNBasic(k=40,sim_options={'name':'cosine','user_based':False}),mr_train)

Here it gives us an idea of the out-of-sample performance of this model which here is being
measured by Root Mean Squared Error.

Repeat the same process with the model which considers the average effects. We can
evaluate this model on three folds out of the sample.

surprise.evaluate(knns.KNNWithMeans(k=40,sim_options={'name':'cosine','user_based':False}),mr_train)


So, if we compare the accuracy of the model which considers mean effects with the model which doesn't, the model which considers the mean effects appears to be slightly more accurate.

16. Now let's perform a grid search. For this to work, we'll first define a parameter grid.

Let’s define a grid that considers the number of neighbors of 10 or 20, and within the
similarity options, will be searching between the cosine similarity and the msd and will only
be building an item-based collaborative filtering model.

param_grid = {'k': [10, 20],


'sim_options': {'name': ['msd', 'cosine'],
'user_based': [False]}
}

Next, define an estimator.

algo=knns.KNNWithMeans


Then the surprise module has a GridSearch method which will supply the algorithm and the
parameter grid and the accuracy measures that we want to do the defined grid search.

grid_search = surprise.GridSearch(algo,param_grid=param_grid, measures=['RMSE', 'MAE'])

Now we will use the evaluate function on the dataset. Grid search is time-consuming since
the algorithm goes through multiple permutations and combinations.

grid_search.evaluate(mr_train)


After executing this grid search, we can check the best parameters based on the Root Mean
Squared Error as a metric or Mean Absolute Error as a metric.

print(grid_search.best_params['RMSE'])

print(grid_search.best_params['MAE'])

It shows that the parameters were similar, and after executing a grid search, a model with
20 neighbors which is an item-based model and uses msd as a similarity metric is the best
model.

Now we can also take a look at the best score obtained on rmse and mae.

print(grid_search.best_score['RMSE'])

print(grid_search.best_score['MAE'])

The best score on Root Mean Squared error is 0.93 and on Mean Absolute Error is 0.75.


5. MODEL-BASED RECOMMENDATION ENGINES


Model-based recommendation engines are algorithms that produce individualised
recommendations for users based on mathematical models. In order to make suggestions for
products that are most likely to be of interest to the user, these models assess user behaviour,
preferences, and item features. Collaborative filtering, matrix factorization, and neural
networks are a few examples of how model-based recommendation engines might be
implemented. These engines are frequently used in services like video-on-demand, online e-
commerce, and music streaming.

5.1 Matrix Factorization

The matrix factorization technique represents users and items as latent factors or features. The aim of matrix factorization is to factorise the user-item matrix into two lower-dimensional matrices that represent the latent properties of users and items. These latent features capture the underlying relationships and patterns between users and items and can therefore be used to generate recommendations.

The two primary types of matrix factorization methods are:

1. Singular value decomposition (SVD).


2. Non-negative matrix factorization (NMF)

Let’s consider the User item rating Matrix as shown in figure 6.

Figure 6: User Item Rating Matrix

Assume that the ratings are on a scale of 1 to 5. In the context of recommendation engines, matrix factorization means finding two matrices, the first being a User factor matrix and the second an Item factor matrix, whose product approximates the rating matrix. These two factor matrices are then used for prediction and recommendation.


Observe the four movies listed here, the first two movies are of one genre which is science
fiction while the last two movies are of a different genre, maybe fantasy movies. Also, we can
see that user 1 and user 2 have rated science fiction movies highly.

If we look at the User factor matrix, it has two columns. We can represent these two columns
by the genres of the movie i.e., the first factor as representing the sci-fi movie.

We can think of rows of this User factor matrix representing how much a user likes one genre
of movie. We can see that users 1 and 2 have high values corresponding to the sci-fi genre.
And we also know that these users have given high ratings to sci-fi movies in the original
matrix as shown below.

Let’s now consider the Item factor matrix. The Item factor matrix can be thought of as a
matrix where the numbers represent how much each factor, each of the items has.

Considering the first factor which is the science fiction genre movies, we can see that Star
Trek and Avatar have a higher score on the sci-fi factor. Similarly, we can interpret other
parts of the User Item factor matrix.

For example, the second factor in both the User and Item factor matrix can be thought of as
a factor representing the fantasy genre. In the User factor matrix, user 3 and user 4 have a
higher score on the fantasy factor and looking at the original rating matrix, will realize that
these were the users who had given higher ratings to fantasy movies. If we look at the Item
factor matrix, we would realize that Spiderman and Hulk have a higher score on the fantasy
factor.

Here, a user-item rating matrix with m users and n items has m rows and n columns. When we factorise this rating matrix, the resulting User factor matrix has m rows (one per user) and k columns; we call this matrix p.

The other matrix is the Item factor matrix, with as many rows as the number of factors, k, and as many columns as the number of items, n. We call this matrix q-transpose. A sketch in matrix notation is given below.
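In matrix notation (a sketch; symbols assumed), the factorization approximates

$R_{m \times n} \approx P_{m \times k}\, Q^{T}_{k \times n}$

so that an individual rating is approximated by the dot product $\hat{r}_{ui} \approx p_u \cdot q_i$ of the user's factor row and the item's factor column.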


The number of factors k is chosen by the user, and it is always less than the minimum of the number of rows and the number of columns of the original rating matrix.

5.2 Prediction and Estimation Using SVD and NMF

Singular value decomposition (SVD) and non-negative matrix factorization (NMF) are mathematical techniques used for data analysis, dimension reduction, and data compression.

In SVD, a matrix is divided into three matrices to reflect its right and left singular vectors as
well as its singular values. Data compression, denoising, and the discovery of latent features
in data are all common uses for SVD.

Conversely, NMF divides a non-negative matrix into two non-negative matrices with non-
negative components. Applications for NMF include document clustering, text and image
data analysis, and recommendation systems.


The general identity used to make a prediction for a user u and item i is sketched below. Let's understand this identity in more detail by considering a numerical example.
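One common form of this identity (a sketch; exact definitions of the bias terms vary across implementations) is

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$

where $\mu$ is the global average rating, $b_u$ and $b_i$ are the user and item terms (called bµ and bi in the discussion below), and $q_i^{T} p_u$ is the dot product of the item and user factor vectors.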

Let’s consider the user-item rating matrix is shown in figure 7.

Figure 7: User-item Rating Matrix with no ratings for a few fields

You can observe that there are many places in this matrix where we don’t know the rating
and also there is no rating available for user 4 and the movie Avatar.

Now, let us compute the value of bi (the average rating for each item, i.e., the column averages) and bµ (the average rating for each user, i.e., the row averages).

Figure 8: Computed bi and bµ Value

We can compute the global average of ratings as well i.e., 1.5 (average of all the ratings in the
data) and we call this µ.

Let’s compute the rating for movie 4 by user 1.

(Using the computed µ, bi bµ values)


The remaining quantity (the user-item interaction term) can be computed by taking the respective row of the p matrix and column of the q-transpose matrix and then computing the dot product of these two vectors.

How is the Matrix Factorization done?

Now let us understand what matrix factorization is and how it can be used to do rating
prediction, in the context of recommendation engines.

To do matrix factorization there are two popular algorithms.

1. Singular Value Decomposition


2. Non-Negative Matrix Factorization.

5.3 Singular Value Decomposition

In SVD, a matrix is divided into three matrices to reflect its right and left singular vectors as
well as its singular values. Data compression, denoising, and the discovery of latent features
in data are all common uses for SVD.

Assume we have the rating matrix shown in figure 9(a). Since there are users and items for which we don't have any rating history, we ignore those items and users to obtain the matrix shown in figure 9(b).

Figure 9 (a): Rating Matrix Figure 9(b): Matrix after ignoring the unknown ratings

In figure 9(b) we still have one missing entry. Depending on the implementation, it is assigned either the mean rating or 0 before the matrix factorization is performed.

Once we have this matrix, we use the prediction identity introduced earlier: we initialise the matrices q and p (typically at random) and use the identity to make predictions.


Since we have the actual ratings as well as the predicted values, we can compute an error matrix using the formula below.
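As a sketch (notation assumed), the error for a known rating is

$e_{ui} = r_{ui} - \hat{r}_{ui}$

and the total squared error is $\sum_{(u,i)} e_{ui}^2$, summed over the user-item pairs whose ratings are known.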

But just minimizing the error can lead to overfitting. So, regularization is mostly used to
guard against overfitting.

A more commonly used cost function is:
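A commonly used regularized objective (a sketch; the exact form differs across implementations) is

$\min_{p_u, q_i, b_u, b_i} \; \sum_{(u,i) \in \kappa} \left(r_{ui} - \mu - b_u - b_i - q_i^{T} p_u\right)^2 + \lambda \left(b_u^2 + b_i^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$

where $\kappa$ is the set of user-item pairs with known ratings.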

Here λ (lambda) is a hyperparameter that controls the strength of the regularization term. All the parameters of SVD, that is pu, qi, bu and bi, are estimated by minimizing this cost function.

5.4 Non-Negative Matrix Factorization

NMF divides a non-negative matrix into two non-negative matrices with non-negative
components. Applications for NMF include document clustering, text and image data
analysis, and recommendation systems.

Now Non-Negative Matrix Factorization in the context of recommendation engines is very


similar to SVD. The only difference is that the factors estimated are non-negative. So, the
same cost function is minimized as in the case of SVD but with a constraint that pu and qi are
to be positive.


To minimize the cost function, numerical optimization procedures such as stochastic gradient descent are used. Apart from stochastic gradient descent, another popular algorithm for minimizing the cost function is Alternating Least Squares.

Depending on what machine learning framework is used, the exact details of


implementations can differ. The cost functions discussed here are only indicative of what is
normally used by most implementations.

The hyperparameters of model-based methods are:

• The number of factors to be considered,
• The value of the regularization term,
• The learning rates,
• The number of epochs to be used (when Stochastic Gradient Descent is used to estimate the parameters).

These become the hyperparameters of a matrix factorization model, and their optimum values are found using grid search and cross-validation.


6. CODE DEMO
In this code demo we will show how we can build a model-based recommendation engine
using the surprise library in Python. Let’s get started.

Let's do the required standard imports: the os module, the pandas module, the NumPy module and the surprise module (this will help in building various kinds of model-based recommendation engines).

import os
import pandas as pd
import numpy as np
import surprise

Read the file (Dataset – here we have used ratings.csv).

mr=pd.read_csv("ratings.csv")
mr.head()

Now, drop the column which is not required (here we have dropped timestamp column) and
also will rename the columns to the names that the surprise expects.

mr.drop('timestamp',axis=1,inplace=True)
mr.rename(columns={'userId':'user','movieId':'item','rating':'rating'}
,inplace=True)
Next will create a reader object, specifying the line format (i.e., first column is the column of
user IDs, second column is the column of item IDs and the third column is the column of
ratings) and the ratings are on a scale of 1 to 5.

reader=surprise.dataset.Reader(line_format='user item rating',


rating_scale=(1,5))
Now, will create a training object from which will create a train set.

mr_train=surprise.dataset.Dataset.load_from_df(mr,reader=reader)
mr_trainset=mr_train.build_full_trainset()


Now from surprise let’s import a class called SVD. This will help us in creating model-based
recommenders using Singular Value Decomposition.

from surprise import SVD

Now let's factorize the user-item matrix into 20 factors; hence n_factors is set to 20 in the code.

model=SVD(n_factors=20)
And will use the train method of the model object that we just created to train the model.

model.train(mr_trainset)
Now let’s take a look at our raw data.

mr.head()

Now, let’s make a prediction for user whose raw ID is 1 and item ID is 31. Then based on the
SVD model,

model.predict(uid=1,iid=31,r_ui=2.5)

the prediction comes out to be 2.40 (approximately 2.5).

Now, let’s try and build a matrix factorization model based on Non-Negative Matrix
Factorization.

First will import the NMF class from surprise module and train the model taking 20 factors.

from surprise import NMF


model1=NMF(n_factors=20,biased=True,)


Next train the model.

model1.train(mr_trainset)

Let’s make a prediction for user ID 1 with item ID 31 and according to NMF the prediction is
around 2.8 as shown below.

model1.predict(uid=1,iid=31,r_ui=2.5)

So, it seems that at least for this user the SVD model is more accurate. A more systematic out-of-sample comparison is sketched below.
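A more systematic way to compare the two models is to evaluate them out of sample, reusing the split/evaluate pattern shown for the memory-based models earlier in this unit (a hedged sketch; variable names as in this demo):

## Split the data into three folds and compare SVD and NMF by RMSE/MAE
mr_train.split(n_folds=3)
surprise.evaluate(SVD(n_factors=20), mr_train)
surprise.evaluate(NMF(n_factors=20, biased=True), mr_train)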

Self-Assessment Questions - 1
1. In ________________ suggestions are made for a user based on both that person's prior behaviour and the behaviour of users who are similar to them.
2. In ______________ suggestions are created based on the user's previous behaviour and the connections between items.
3. Collaborative filtering can be implemented in ______________ & ______________.
4. The measure of similarity between a user and another user based on their ratings is known as ________.
5. The cosine similarity is computed by doing a ___________ between the ratings of different items.
6. ______________ recommendation engines are algorithms that produce individualised recommendations for users based on mathematical models.
7. The _____________ technique is used to represent users and items as latent factors or features.
8. A matrix is divided into three matrices to reflect its right and left singular vectors as well as its singular values in ____________.
9. ______________ divides a non-negative matrix into two non-negative matrices with non-negative components.
10. Popular algorithms to minimize the cost function are _______________ & _______________.


7. SELF-ASSESSMENT ANSWERS
1. user-based collaborative filtering
2. item-based collaborative filtering,
3. Memory based & Model based
4. cosine similarity.
5. dot product
6. Model-based
7. matrix factorization
8. Singular Value Decomposition
9. Non-Negative Matrix Factorization
10. Alternating Least Squares or stochastic gradient descent

8. TERMINAL QUESTIONS
1. Briefly explain the user-based and item-based collaborative filtering with a neat diagram.
2. Explain the cosine similarity in user based collaborative filtering
3. Explain the two different ways to make predictions in item-based collaborative filtering
4. Mention the general identity to make predictions for the user and item. Explain each term.

9. TERMINAL QUESTIONS – ANSWERS

1. Refer section 3.1


2. Refer section 3.2
3. Refer section 3.3
4. Refer section 5.2
5. Refer section 5.3 & 5.4


Unit 14
Case Study
Table of Contents
SL Topic Fig No / Table SAQ / Page No
No / Graph Activity

1 Identify Customers For Personal Loan Campaign - - 3 - 5

2 Exploratory Data Analysis - - 6 - 8

3 Data Visualization - - 9 - 20

4 Data Preprocessing - - 21 - 23

5 Split Train And Test Data - - 24 - 25

6 Model Evaluation - - 26 - 39

  6.1 Logistic Regression - -

  6.2 Gaussian Naive Bayes - -

  6.3 KNN Classifier - -

  6.4 SVM - -

  6.5 Stochastic Gradient Descent - -

  6.6 Random Forest Classifier - -


1. IDENTIFY CUSTOMERS FOR PERSONAL LOAN CAMPAIGN

As a part of this use case, we have to create a model that can tell if a customer is likely to
avail a personal loan. This will help in targeted marketing to the customers. This is a
classification problem.

The dataset is available at the following location:

https://www.kaggle.com/datasets/itsmesunil/bank-loan-modelling

“This case is about a bank (Thera Bank) which has a growing customer base. Majority of
these customers are liability customers (depositors) with varying size of deposits. The
number of customers who are also borrowers (asset customers) is quite small, and the bank
is interested in expanding this base rapidly to bring in more loan business and in the process,
earn more through the interest on loans. In particular, the management wants to explore
ways of converting its liability customers to personal loan customers (while retaining them
as depositors). A campaign that the bank ran last year for liability customers showed a
healthy conversion rate of over 9% success. This has encouraged the retail marketing
department to devise campaigns to better target marketing to increase the success ratio with
a minimal budget. The department wants to build a model that will help them identify the
potential customers who have a higher probability of purchasing the loan. This will increase
the success ratio while at the same time reduce the cost of the campaign.”

To identify a solution for this problem, we are going to do the following steps:

1) Exploratory Data Analysis : This involves identifying the types and count of the data.
   This also calculates the 5 point summary.
2) Data Visualization : In this step, we check the distribution of the continuous and the
   categorical variables to check whether they are appropriately distributed and the data
   is not extremely skewed or has many outliers.
3) Data Imputation/Data Preprocessing : This step involves identifying if any of the rows
   has missing or incorrect data and replacing it with a 0 or a mean value. We also check
   the object type to ensure that the continuous variables are of the right type and not
   incorrectly marked as object. In this step, we transform the categorical variables using
   the one-hot encoding technique and also normalize the continuous variables.


4) Split Training And Test Data


5) Evaluate Various Techniques : We execute various techniques (with different
   parameters) on the training data and identify the best parameters based on a cut-off
   criteria.
6) Identification of the best Model : We compare the various results to identify a good
   model for the given problem. In this case, the model that gives us good accuracy,
   maximum 'True Positives' and 'True Negatives', and minimum 'False Negatives' will be a
   good candidate, as this will ensure maximum translation into personal loans.

i.e., the model with a good "Accuracy" and "Recall" ratio will be a good model for this
problem.

[1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

[2]:
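A plausible version of this cell, loading the Thera Bank data and previewing the first rows (the file name is an assumption; adjust the path and reader to the file actually downloaded from Kaggle), is:

# Reconstructed cell: load the Thera Bank data and preview the first rows.
# The file name below is an assumption; use pd.read_excel(...) if the download is an Excel file.
df = pd.read_csv("Bank_Personal_Loan_Modelling.csv")
df.head()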


[2]: ID Age Experience Income ZIP Code Family CCAvg Education Mortgage \
0 1 25 1 49 91107 4 1.6 1 0
1 2 45 19 34 90089 3 1.5 1 0
2 3 39 15 11 94720 1 1.0 1 0
3 4 35 9 100 94112 1 2.7 2 0
4 5 35 8 45 91330 4 1.0 2 0

Personal Loan Securities Account CD Account Online CreditCard


0 0 1 0 0 0
1 0 1 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 1


2. EXPLORATORY DATA ANALYSIS


As a part of EDA we are going to analyze the following

1) Find the shape of the data and the data type of individual columns
2) Check the presence of missing values
3) Descriptive stats of numerical columns
4) Find the distribution of numerical columns and the associated skewness and presence
   of outliers
5) Distribution of categorical columns
6) 5 point summary

[3]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):

# Column Non-Null Count Dtype

0 ID 5000 non-null int64


1 Age 5000 non-null int64
2 Experience 5000 non-null int64
3 Income 5000 non-null int64
4 ZIP Code 5000 non-null int64
5 Family 5000 non-null int64
6 CCAvg 5000 non-null float64
7 Education 5000 non-null int64
8 Mortgage 5000 non-null int64
9 Personal Loan 5000 non-null int64
10 Securities Account 5000 non-null int64
11 CD Account 5000 non-null int64
12 Online 5000 non-null int64
13 CreditCard 5000 non-null int64

dtypes: float64(1), int64(13)
memory usage: 547.0 KB


[4]: df.describe().transpose()

[4]: count mean std min 25% \


ID 5000.0 2500.500000 1443.520003 1.0 1250.75
Age 5000.0 45.338400 11.463166 23.0 35.00
Experience 5000.0 20.104600 11.467954 -3.0 10.00
Income 5000.0 73.774200 46.033729 8.0 39.00
ZIP Code 5000.0 93152.503000 2121.852197 9307.0 91911.00
Family 5000.0 2.396400 1.147663 1.0 1.00
CCAvg 5000.0 1.937938 1.747659 0.0 0.70
Education 5000.0 1.881000 0.839869 1.0 1.00
Mortgage 5000.0 56.498800 101.713802 0.0 0.00
Personal Loan 5000.0 0.096000 0.294621 0.0 0.00
Securities Account 5000.0 0.104400 0.305809 0.0 0.00
CD Account 5000.0 0.060400 0.238250 0.0 0.00
Online 5000.0 0.596800 0.490589 0.0 0.00
CreditCard 5000.0 0.294000 0.455637 0.0 0.00

50% 75% max


ID 2500.5 3750.25 5000.0
Age 45.0 55.00 67.0
Experience 20.0 30.00 43.0
Income 64.0 98.00 224.0
ZIP Code 93437.0 94608.00 96651.0
Family 2.0 3.00 4.0
CCAvg 1.5 2.50 10.0
Education 2.0 3.00 3.0
Mortgage 0.0 101.00 635.0
Personal Loan 0.0 0.00 1.0
Securities Account 0.0 0.00 1.0
CD Account 0.0 0.00 1.0
Online 1.0 1.00 1.0
CreditCard 0.0 1.00 1.0


It can be observed that the columns ‘Personal Loan’, ’ Securities Account’, ‘CD Account’,
‘Online’ and ‘Credit Card’ are all boolean values stored as integers. Family and Education are
categorical columns. ID and ZipCode may not be very relevant for this use-case.

From the difference between the mean and the 50% (median), we can understand that some
of the data is skewed, for example Mortgage, Income, etc.
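If we want to quantify this skewness rather than just eyeball it, a small optional check (not part of the original notebook) is pandas' skew() method:

# Optional check: numeric skewness of a few columns; values well above 0
# confirm the right-skew visible for Mortgage, Income and CCAvg.
print(df[['Mortgage', 'Income', 'CCAvg']].skew())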

[5]:
[5]: ID Age Experience Income ZIP Code Family CCAvg Education \
False 5000 5000 5000 5000 5000 5000 5000 5000

Mortgage Personal Loan Securities Account CD AccountOnline \


False 5000 5000 5000 5000

CreditCard
False 5000

Currently, there seem to be no missing values.

[6]: df[df.duplicated()]

[6] : Empty DataFrame

Columns: [ID, Age, Experience, Income, ZIP Code, Family, CCAvg,


Education, Mortgage, Personal Loan, Securities Account, CD Account,
Online, CreditCard] Index: []

The output indicates that there are no duplicates


3. DATA VISUALIZATION
[7]:

#Plots to see the distribution of the continuous features individually

plt.figure(figsize= (20,15))
plt.subplot(5,5,1)
sns.histplot(df.Income, kde=True)
plt.xlabel('Income')

plt.subplot(5,5,2)
sns.histplot(df.CCAvg, kde=True)
plt.xlabel('CCAvg')

plt.subplot(5,5,3)
sns.histplot(df.Age, kde=True)
plt.xlabel('Age')

plt.subplot(5,5,4)
sns.histplot(df.Mortgage, kde=True)
plt.xlabel('Mortgage')

plt.subplot(5,5,5)
sns.histplot(df.Experience, kde=True)
plt.xlabel('Experience')

plt.show()

There are more people with lower income. Also, there are more people who have no mortgage in
this data set, and only a low number of people with a high CCAvg. Age and Experience seem to
be uniformly distributed.

[8]:


[8] : Text(0.5, 1.0, 'CreditCard Holder Distribution')

The dataset has a very small percentage of people who have taken the loan compared to those
who haven't. Ideally we should have a balanced dataset; however, this is not always possible.
Therefore, we may have to handle this during modelling.


[9]:

plt.title('CreditCard Holder Distribution')

plt.subplot(3,3,2)
sns.countplot(x=df['Online'])
plt.title('Online User Distribution')

plt.subplot(3,3,3)
sns.countplot(x=df['Education'])
plt.title('Educational Qualification Distribution')

plt.subplot(3,3,4)
sns.countplot(x=df['Family'])
plt.title('Family Distribution')

plt.subplot(3,3,5)
sns.countplot(x=df['Securities Account'])
plt.title('Securities Account Distribution')

plt.subplot(3,3,6)
sns.countplot(x=df['CD Account'])
plt.title('CD Account Distribution')

[9] : Text(0.5, 1.0, 'CD Account Distribution')


[10]:
# Analysis of the distribution of the dependent variable against other continuous features

sns.catplot(x="Personal Loan", y="Age", kind="violin", data=df);


sns.catplot(x="Personal Loan", y="Mortgage", kind="violin", data=df);
sns.catplot(x="Personal Loan", y="Experience", kind="violin", data=df);
sns.catplot(x="Personal Loan", y="CCAvg", kind="violin", data=df);
sns.catplot(x="Personal Loan", y="Income", kind="violin", data=df);

[Violin plots of Age, Mortgage, Experience, CCAvg and Income against Personal Loan]

Age and Experience seem to have no direct impact on the target variable, but there is a
visible difference for the other features.

[11]:
# Lets display the bivariate distribution for all the features
# The pairplot() function in seaborn library is a good tool for that

# let us drop the fields that we know are not significant
df_temp = df.drop(columns=['ID', 'ZIP Code'])

# Let us mark the categorical variables. This will give us a better plot
df_temp['Family'] = df['Family'].astype("category")
df_temp['Education'] = df['Education'].astype("category")
df_temp['Personal Loan'] = df['Personal Loan'].astype("category")
df_temp['Securities Account'] = df['Securities Account'].astype("category")
df_temp['CD Account'] = df['CD Account'].astype("category")
df_temp['Online'] = df['Online'].astype("category")
df_temp['CreditCard'] = df['CreditCard'].astype("category")


sns.pairplot ( df_temp )

[11] : <seaborn.axisgrid.PairGrid at 0x21e12edf910>

We can see that the Age and Experience columns seem to be highly correlated.


[12]:
# let us use scatter plots with the target variable to understand multi-feature relationships

plt.figure(figsize=(20,25))

plt.subplot(3,3,1)
sns.scatterplot ( x='Income', y='Experience', hue='Personal Loan', data=df)
plt.subplot(3,3,2)

[12] : <AxesSubplot:xlabel='Age', ylabel='Experience'>

We can see that people with a higher income, irrespective of their experience, seem to be more
keen on the loan. Similarly, people with a larger mortgage also seem to be more potential
candidates for loans.


4. DATA PREPROCESSING
We first need to perform the necessary preprocessing before we begin to apply the
algorithms. Remember that these preprocessing steps also need to be performed on the data
for which the prediction needs to be done (this is one step that people tend to miss or forget).
We can use the concept of a 'pipeline' to make this process simpler; however, for the sake of
simplicity, we have not included this in this case study.
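For reference only, a minimal sketch of how such a pipeline could be assembled with scikit-learn's ColumnTransformer and Pipeline is shown below; the column lists and the choice of classifier are assumptions for illustration, and this is not the approach used in the rest of the case study.

# A minimal sketch (not used in this case study) of bundling the preprocessing
# steps so they are applied consistently to training data and to any new data.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

categorical_cols = ['Education', 'Family']                      # assumed categorical inputs
numeric_cols = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']

preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first'), categorical_cols),  # one-hot encode categories
    ('scale', StandardScaler(), numeric_cols),                  # standardize numeric features
], remainder='passthrough')                                     # keep remaining binary columns as-is

pipe = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression()),
])
# pipe.fit(...) on raw training features would then apply exactly the same
# preprocessing to any data later passed to pipe.predict(...).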

[13]:
# The first step is to replace the category variables with one hot encoding
dummy_var1=pd.get_dummies(df['Education'],drop_first=True)
df_new=pd.concat([df,dummy_var1],axis=1)
df_new=df_new.rename(columns={2: "Edu2" , 3: "Edu3"})

dummy_var2=pd.get_dummies(df['Family'],drop_first=True)
df_new=pd.concat([df_new,dummy_var2],axis=1)
df_new=df_new.rename(columns={2: "Fam2" , 3: "Fam3", 4: "Fam4", 5: "Fam5" })
df_new.head()

df = df_new.drop (['ID', 'ZIP Code', 'Family', 'Education'], axis = 1)

[14]:
[14]: Age Experience Income CCAvg Mortgage Personal Loan \
0 25 1 49 1.6 0 0
1 45 19 34 1.5 0 0
2 39 15 11 1.0 0 0
3 35 9 100 2.7 0 0
4 35 8 45 1.0 0 0
… … … … … … …
4995 29 3 40 1.9 0 0
4996 30 4 15 0.4 85 0
4997 63 39 24 0.3 0 0
4998 65 40 49 0.5 0 0
4999 28 4 83 0.8 0 0


Securities Account CD Account Online CreditCard Edu2 Edu3 Fam2 \


0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 1 0 0 1 1 0 0
… … … … … … … …
4995 0 0 1 1 0 0 1 0
4996 0 0 1 1 0 0 0 0
4997 0 0 0 0 0 0 1 1
4998 0 0 1 1 0 1 0 0
4999 0 0 1 1 1 0 0 0

Fam3 Fam4
0 0 1
1 1 0
2 0 0
3 0 0
4 0 1
… … …

4995 0 0
4996 0 1
4997 0 0
4998 1 0
4999 1 0

[5000 rows x 15 columns]

[15]:

# As a second step, perform standardisation. This will ensure that the mean and
# the range remain constant for different data and that the scale of a feature
# does not influence the results.

from scipy.stats import zscore

vars = ['Mortgage','Age', 'CCAvg', 'Experience', 'Income']


X = df[vars]
df[vars] = X.apply(zscore)
df.head()


[15]: Age Experience Income CCAvg Mortgage Personal Loan \


0 -1.774417 -1.666078 -0.538229 -0.193385 -0.555524 0
1 -0.029524 -0.096330 -0.864109 -0.250611 -0.555524 0
2 -0.552992 -0.445163 -1.363793 -0.536736 -0.555524 0
3 -0.901970 -0.968413 0.569765 0.436091 -0.555524 0
4 -0.901970 -1.055621 -0.625130 -0.536736 -0.555524 0

Securities Account CD Account Online CreditCard Edu2 Edu3 Fam2 Fam3


0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 0
4 0 0 0 1 1 0 0 0

Fam4
0 1
1 0
2 0
3 0
4 1


5. SPLIT TRAIN AND TEST DATA


We need to prepare the data for modelling. As a first step, we split the data into 'X' (the
independent features) and 'y' (the target outcome). We also split the data into train and test
sets so that validation can be done on data that the model has never seen. This is the reason
why we need to drop duplicates, else there is a possibility of a leak into the training.

[16]:

from sklearn.model_selection import train_test_split

# Removing the dependent column
X_data = df.drop(columns=["Personal Loan"])
Y_data = df["Personal Loan"]

x_train, x_test, y_train, y_test = train_test_split(X_data, Y_data, test_size=0.4, random_state=1)

df_ytrain = y_train.value_counts().to_frame()
df_ytest = y_test.value_counts().to_frame()
print ( "Train Data: Ratio of People who have availed loan" , df_ytrain.iloc[1]/(df_ytrain.iloc[0]+ df_ytrain.iloc[1]) )
print ( "Test Data: Ratio of People who have availed loan" , df_ytest.iloc[1]/(df_ytest.iloc[0]+ df_ytest.iloc[1]) )

Train Data: Ratio of People who have availed loan Personal Loan 0.095667
dtype: float64
Test Data: Ratio of People who have availed loan Personal Loan 0.0965
dtype: float64


[17]:

from imblearn.over_sampling import SMOTE

# We see that the data is highly imbalanced. Therefore, it is important to
# create synthetic data before training the model.
# We will use SMOTE (Synthetic Minority Oversampling Technique) in this case study.

oversample = SMOTE()
x_train, y_train = oversample.fit_resample(x_train, y_train)

df_ytrain = y_train.value_counts().to_frame()
print ( "Train Data: Ratio of People who have availed loan" , df_ytrain.iloc[1]/(df_ytrain.iloc[0]+ df_ytrain.iloc[1]) )

Train Data: Ratio of People who have availed loan Personal Loan 0.5

dtype: float64


6. MODEL EVALUATION
We will apply various algorithms to this dataset and compare them. We will also tune each of
the algorithms for the best possible hyperparameters.

[18]:
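The content of this cell is presumably just the initialization of the dictionary in which the metrics of each model are collected (it is updated via results.update(...) after every model below); a minimal assumed version is:

# Presumed content of this cell: a container for the Accuracy/Recall of each model.
results = {}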

6.1 Logistic Regression

Logistic regression is a commonly used algorithm in classification problems. It involves
identifying the best parameters to fit a sigmoid curve that can classify the data. It uses
the log likelihood as the cost function. The code below uses the LogisticRegression class from
the sklearn library. We use GridSearchCV to identify the best hyperparameters and
StratifiedKFold for performing the cross validation.


[19] :

from sklearn import metrics


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Create a parameter grid that contains all the values to be
# considered to derive the best hyper-parameter value.
param_grid = {
    'penalty' : ['l1', 'l2'],
    'C' : np.logspace(-2, 4, 10),
    'solver' : ['liblinear']
}

# We will initialize the StratifiedKFold class. This

# will be used for performing cross validation during Grid Search

cv = StratifiedKFold(n_splits=10,
shuffle=True,
random_state=2)

# Perform a gridsearch on an instance of LogisticRegression for all combinations
# of parameters that have been provided in the parameter grid. In this particular
# use case, we use recall as the evaluation metric.

model = GridSearchCV(LogisticRegression(), param_grid, cv=cv, scoring='recall')

# Fit the model on the training data


model.fit(x_train, y_train)

# Display the best hyper parameters for the model


print ( "Best parameters: " + str( model.best_params_ ) )

# Evaluate the model with unseen data or the test data.


y_predict = model.predict(x_test)

# Display the score on the test data


model_score = model.score(x_test, y_test)
print ("Score: " + str(model_score) )

# Display the confusion matrix to identify how many classifications were


# correctly identified and those that were incorrect


cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Confusion Matrix:" )
print(cm)

# Display the list of metrics like Accuracy, Precision, Recall and F1-Score

print("Final Metrics:")
accuracy = str( metrics.accuracy_score(y_test, y_predict))
recall = str(metrics.recall_score(y_test, y_predict))
precision = str(metrics.precision_score(y_test, y_predict))
f1 = str(metrics.f1_score(y_test, y_predict) )
print( " Accuracy: " + accuracy )
print( " Recall: " + recall)
print( " Precision: " + precision)
print( " F1: " + f1)

results.update({ 'Logistic Regression':{ 'Accuracy': accuracy, 'Recall': recall }})

Best parameters: {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
Score: 0.9015544041450777
Confusion Matrix:
[[ 174 19]
[ 244 1563]]
Final Metrics:
Accuracy: 0.8685
Recall: 0.9015544041450777
Precision: 0.41626794258373206
F1: 0.5695581014729951


6.2 Gaussian Naive Bayes

Gaussian Naive Bayes Algorithm is a probabilistic algorithm based on Bayes Theorem.

[20]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

# Initialize the parameters and CV class

param_grid = {'var_smoothing': np.logspace(0,-9, num=100)}
cv = StratifiedKFold(n_splits=10,
                     shuffle=True,
                     random_state=2)

# Perform the GridSearch to derive the best parameters


model = GridSearchCV(GaussianNB() , param_grid, cv=cv , scoring = 'recall')
model.fit(x_train, y_train)
print ( "Best parameters: " + str( model.best_params_ ) )
# Evaluate the model on unseen data
y_predict = model.predict(x_test)
model_score = model.score(x_test, y_test)
print ("Score: " + str(model_score) )

# Display the confusion matrix to identify how many classifications were


# correctly identified and those that were incorrect
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Confusion Matrix:" )
print(cm)

# Display the list of metrics like Accuracy, Precision, Recall and F1-Score
print("Final Metrics:")
accuracy = str( metrics.accuracy_score(y_test, y_predict))
recall = str(metrics.recall_score(y_test, y_predict))
precision = str(metrics.precision_score(y_test, y_predict))
f1 = str(metrics.f1_score(y_test, y_predict) )
print( " Accuracy: " + accuracy )
print( " Recall: " + recall)
print( " Precision: " + precision)
print( " F1: " + f1)

results.update({ 'Gaussian Naive Bayes':{ 'Accuracy': accuracy, 'Recall': recall }})


Best parameters: {'var_smoothing': 0.0657933224657568}
Score: 0.7927461139896373
Confusion Matrix:
[[ 153 40]
[ 232 1575]]
Final Metrics:
Accuracy: 0.864
Recall: 0.7927461139896373
Precision: 0.3974025974025974
F1: 0.5294117647058825

6.3 KNN Classifier

The K-Nearest Neighbour algorithm computes the 'k' closest rows (from the known data) for
an incoming data point. It then uses the 'target' or 'predicted' column of these 'k' neighbours
to compute the 'target' column for the incoming data point.

[21]:

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

# KNN: evaluate a range of values for the number of neighbours


param_grid = {'n_neighbors': range(2,30)}


cv = StratifiedKFold(n_splits=10,
shuffle=True,
random_state=2)

# As a part of gridsearch, the model will be evaluated for the
# different values of neighbours provided.

model = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, scoring='recall')

model.fit(x_train, y_train)

# The ideal value of 'k' or the number of neighbours is available
# after the 'fit' operation

print ( "Best parameters: " + str( model.best_params_ ) )

# Evaluate the model on the unseen data


y_predict = model.predict(x_test)
model_score = model.score(x_test, y_test)
print ("Score: " + str(model_score) )

# Display the confusion matrix to identify how many classifications were
# correctly identified and those that were incorrect
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Confusion Matrix:" )

print(cm)

# Display the list of metrics like Accuracy, Precision, Recall and F1-Score
print("Final Metrics:")
accuracy = str( metrics.accuracy_score(y_test, y_predict))
recall = str(metrics.recall_score(y_test, y_predict))
precision = str(metrics.precision_score(y_test, y_predict))
f1 = str(metrics.f1_score(y_test, y_predict) )
print( " Accuracy: " + accuracy )
print( " Recall: " + recall)
print( " Precision: " + precision)
print( " F1: " + f1)

results.update({ 'KNN':{ 'Accuracy': accuracy, 'Recall': recall }})


Best parameters: {'n_neighbors': 29}


Score: 0.9015544041450777
Confusion Matrix:
[[ 174 19]
[ 225 1582]]
Final Metrics:
Accuracy: 0.878
Recall: 0.9015544041450777
Precision: 0.43609022556390975
F1: 0.5878378378378378

From the above output, we can see that the model provided the best output when the number
of neighbours considered was 29.


6.4 SVM

The SVM algorithm finds the optimal hyperplane that effectively classifies the data points.
'C' and 'gamma' are important parameters of SVM that help to prevent overfitting.

[22]:

from sklearn import svm


from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# The important tuning parameters of SVM are 'C', 'gamma'
# and the 'kernel'. Create a parameter grid with different
# values for each of these that need to be evaluated.

param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf', 'linear']}
cv = StratifiedKFold(n_splits=5,
                     shuffle=True,
                     random_state=2)
model = GridSearchCV(svm.SVC(), param_grid, cv=cv, scoring='recall')

# Perform the fit operation on the initialized model to train it

model.fit(x_train, y_train)

# Display the parameters that provided the best results during the training
print ( "Best parameters: " + str( model.best_params_ ) )
# Evaluate the model using the unseen data
y_predict = model.predict(x_test)
model_score = model.score(x_test, y_test)
print ("Score: " + str(model_score) )

# Display the confusion matrix to identify how many classifications were


# correctly identified and those that were incorrect
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Confusion Matrix:" )
print(cm)

# Display the list of metrics like Accuracy, Precision, Recall and F1-Score
print("Final Metrics:")


accuracy = str( metrics.accuracy_score(y_test, y_predict))


recall = str(metrics.recall_score(y_test, y_predict))
precision = str(metrics.precision_score(y_test, y_predict))
f1 = str(metrics.f1_score(y_test, y_predict) )
print( " Accuracy: " + accuracy
) print( " Recall: " + recall)
print( " Precision: " + precision)
print( " F1: " + f1)
results.update({ 'SVM':{ 'Accuracy': accuracy, 'Recall': recall }})

Best parameters: {'C': 100, 'gamma': 1, 'kernel': 'rbf'}


Score: 0.6269430051813472
Confusion Matrix:
[[ 121 72]
[ 54 1753]]
Final Metrics:
Accuracy: 0.937
Recall: 0.6269430051813472
Precision: 0.6914285714285714
F1: 0.657608695652174

6.5 Stochastic Gradient Descent

The stochastic gradient descent algorithm computes the 'gradient' of the loss and updates the
coefficients in the negative direction of the gradient. It continuously performs this operation
till a 'minimum' of the loss is obtained. In stochastic gradient descent, we update the
coefficients after evaluating every row of data.


[23]:

from sklearn.linear_model import SGDClassifier

# The parameters of stochastic gradient descent are the loss to be considered,
# the learning rate and the regularization parameters

param_grid = {
    "loss" : ["hinge", "log_loss"],
    "alpha" : [0.0001, 0.001, 0.01, 0.1]
}

cv = StratifiedKFold(n_splits=10,
                     shuffle=True,
                     random_state=2)
model = GridSearchCV(SGDClassifier(), param_grid, cv=cv, scoring='recall')

# Fit the model. This will perform the gridsearch on all the combinations of
# parameters provided. This uses cross-validation during each evaluation.
model.fit(x_train, y_train)

# Display the best parameters

print ( "Best parameters: " + str( model.best_params_ ) )

# Evaluate the model on unseen data


y_predict = model.predict(x_test)
model_score = model.score(x_test, y_test)
print ("Score: " + str(model_score) )

# Display the confusion matrix to identify how many classifications were


# correctly identified and those that were incorrect
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Confusion Matrix:" )
print(cm)

# Display the list of metrics like Accuracy, Precision, Recall and F1-Score
print("Final Metrics:")
accuracy = str( metrics.accuracy_score(y_test, y_predict))
recall = str(metrics.recall_score(y_test, y_predict))
precision = str(metrics.precision_score(y_test, y_predict))
f1 = str(metrics.f1_score(y_test, y_predict) )
print( " Accuracy: " + accuracy )
print( " Recall: " + recall)
print( " Precision: " + precision)
print( " F1: " + f1)

results.update({ 'SGD':{ 'Accuracy': accuracy, 'Recall': recall }})


Best parameters: {'alpha': 0.1, 'loss': 'hinge'}
Score: 0.8911917098445595
Confusion Matrix:
[[ 172 21]
[ 250 1557]]
Final Metrics:
Accuracy: 0.8645
Recall: 0.8911917098445595
Precision: 0.4075829383886256
F1: 0.5593495934959348

6.6 Random Forest Classifier

Random Forest Classifier is an ensemble model that creates a forest of 'trees', each built from
a different subset of the data (rows or columns). The final value for the 'target' column is
derived from the outputs generated by each of these trees.

[24]:

from sklearn.ensemble import RandomForestClassifier

# The evaluation parameters and the different values they can take
# for this algorithm are initialized as a grid as shown below
param_grid = {'max_features':[1,3,10],
              'min_samples_split':[2,3,10],
              'min_samples_leaf':[1,3,10],


              'bootstrap':[False, True],
              'n_estimators':[50],
              'criterion':['gini']}
cv = StratifiedKFold(n_splits=5,
                     shuffle=True,
                     random_state=2)
model = GridSearchCV(RandomForestClassifier(), param_grid, cv=cv, scoring='recall')

# 'Fit' the model to derive the best value

model.fit(x_train, y_train)

# Display the parameters that yielded the best results

print ( "Best parameters: " + str( model.best_params_ ) )

# Evaluate the model on unseen data


y_predict = model.predict(x_test)
model_score = model.score(x_test, y_test)
print ("Score: " + str(model_score) )

# Display the confusion matrix to identify how many classifications were
# correctly identified and those that were incorrect
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
print("Confusion Matrix:" )
print(cm)

# Display the list of metrics like Accuracy, Precision, Recall and F1-Score

print("Final Metrics:")
accuracy = str( metrics.accuracy_score(y_test, y_predict))
recall = str(metrics.recall_score(y_test, y_predict))
precision = str(metrics.precision_score(y_test, y_predict))
f1 = str(metrics.f1_score(y_test, y_predict) )
print( " Accuracy: " + accuracy )
print( " Recall: " + recall)
print( " Precision: " + precision)
print( " F1: " + f1)

results.update({ 'Random Forest Classifier':{ 'Accuracy': accuracy, 'Recall': recall }})


Best parameters: {'bootstrap': False, 'criterion': 'gini', 'max_features': 3,
'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Score: 0.9119170984455959
Confusion Matrix:
[[ 176 17]
[ 34 1773]]
Final Metrics:
Accuracy: 0.9745
Recall: 0.9119170984455959
Precision: 0.8380952380952381
F1: 0.8734491315136477

This problem requires us to maximize the 'True Positives' and 'True Negatives' (Accuracy) and
minimize the 'False Negatives' (Recall ratio). This will ensure that the maximum number of
customers who are predicted to take the loan actually avail it, and that customers who would
avail the loan are not missed in the prediction.

[25]:
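A plausible reconstruction of this cell, which tabulates the collected metrics with the models as rows (the exact call is an assumption), is:

# Plausible reconstruction: summarize the collected metrics, one row per model.
pd.DataFrame(results).T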

[25]: Accuracy Recall


Logistic Regression 0.8685 0.9015544041450777
Gaussian Naive Bayes 0.864 0.7927461139896373
KNN 0.878 0.9015544041450777
SVM 0.937 0.6269430051813472
SGD 0.8645 0.8911917098445595
Random Forest Classifier 0.9745 0.9119170984455959

From the above table, we can see that the best results are obtained from the Random Forest
Classifier based model.


Unit 14
Case Study Regression Analysis
Table of Contents
SL Topic Fig No / SAQ / Page No
No Table / Activity
Graph

1 Case Study for Predicting the price of a car based on its Age, Selling Price,
  Present Price, Kms_Driven, Fuel_Type, Seller_Type, and Transmission - - 3 - 32

  1.1 Problem Statement - -

  1.2 Data Collection - -

  1.3 Data Preprocessing - -

  1.4 Explore information about the created dataframe using the following methods - -

  1.5 Exploratory Data Analysis (EDA) - -

  1.6 Model Building - -

  1.7 Perform scaling using StandardScaler() function - -

  1.8 Model Evaluation and Deployment - -


1. CASE STUDY FOR PREDICTING THE PRICE OF A CAR BASED ON ITS
AGE, SELLING PRICE, PRESENT PRICE, KMS_DRIVEN, FUEL_TYPE,
SELLER_TYPE, AND TRANSMISSION

Here an end-to-end case study analysis for regression is done, and the following steps are
required for performing the analysis process:

• Problem Statement: Clearly define the problem that you want to solve using regression
analysis. For example, one wants to predict the prices of a used car based on the other
variables.
• Data Collection: Collect the data required to solve the problem. For example, a dataset
containing information about 300 used cars, with the following variables: 'Car_Name',
'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven', 'Fuel_Type',
'Seller_Type', 'Transmission', 'Owner'.
• Data Preprocessing: Prepare the data for analysis by cleaning, transforming, and
encoding the data if necessary. For example, check for missing values, remove or
impute missing values, split the data into dependent and independent variables, and
convert categorical variables into numerical variables.
• Exploratory Data Analysis (EDA): Analyze the data to gain insights and understand the
relationships between the dependent and independent variables. For example, plot a
scatter plot to check the relationship between the dependent and independent
variables, calculate descriptive statistics, and identify outliers.
• Model Building: Build the regression model using an appropriate method, such as
simple linear regression, multiple linear regression, or polynomial regression. For
example, create an instance of the LinearRegression model and fit the model to the
training data.
• Model Evaluation: Evaluate the performance of the model by comparing the predicted
values with the actual values. For example, calculate the mean squared error (MSE), the
coefficient of determination (R^2), and other performance metrics (a minimal sketch of
this step is shown just after this list).
• Model Deployment: Use the fitted model to make predictions for new data.
• Model Refinement: Refine the model if necessary by changing the variables, algorithms,
or parameters.


• This process is repeated until an optimal model is obtained that meets the desired
performance criteria. The final model can then be used for prediction and decision
making.
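As referenced in the Model Evaluation step above, the sketch below shows how MSE and R^2 can be computed with scikit-learn; the synthetic data generated by make_regression is only a stand-in so the example is self-contained, and the variable names are not the ones used later in this case study.

# Minimal sketch of the evaluation step: fit a linear model, then report MSE and R^2.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (200 rows, 5 features) used instead of the car dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MSE :", mean_squared_error(y_test, y_pred))
print("R^2 :", r2_score(y_test, y_pred))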

1.1. Problem Statement

Predicting the price of a car

1.2. Data Collection

The data is collected from the Kaggle dataset: kaggle kernels output
rezasemyari/car-price-using-linear-regression -p /path/to/dest

1.3. Data Preprocessing

[1]: #import the necessary libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

from sklearn import metrics


import sklearn
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures,StandardScaler

from statsmodels.stats.outliers_influence import variance_inflation_factor

[2]:
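A plausible version of this cell, reading the used-car dataset into a dataframe (the file name is an assumption and should match the CSV downloaded from Kaggle), is:

# Reconstructed cell: load the used-car data and display it.
# The file name below is an assumption; point it at the downloaded CSV.
df = pd.read_csv("car data.csv")
df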


[2]: Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type \


0 ritz 2014 3.35 5.59 27000 Petrol
1 sx4 2013 4.75 9.54 43000 Diesel
2 ciaz 2017 7.25 9.85 6900 Petrol
3 wagon r 2011 2.85 4.15 5200 Petrol
4 swift 2014 4.60 6.87 42450 Diesel
.. … … … … … …
296 city 2016 9.50 11.60 33988 Diesel
297 brio 2015 4.00 5.90 60000 Petrol
298 city 2009 3.35 11.00 87934 Petrol
299 city 2017 11.50 12.50 9000 Diesel
300 brio 2016 5.30 5.90 5464 Petrol

Seller_Type Transmission Owner


0 Dealer Manual 0
1 Dealer Manual 0
2 Dealer Manual 0
3 Dealer Manual 0
4 Dealer Manual 0
.. … … …
296 Dealer Manual 0
297 Dealer Manual 0
298 Dealer Manual 0
299 Dealer Manual 0
300 Dealer Manual 0

[301 rows x 9 columns]

1.4 Explore Information About The Created Dataframe Using The


Following Methods

a. df.shape: The df.shape attribute returns the dimensions of a pandas DataFrame as a tuple,
where the first element is the number of rows and the second element is the number of
columns. Here 'df' is a dataframe: 301 refers to the number of rows and 9 refers to the
number of columns.

[3]: df.shape

[3]: (301, 9)

b. df.info(): The df.info() method is used to get a summary of the DataFrame, including
the number of non-null values, the data type of each column, and memory usage.


[4]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
# Column Non-Null Count Dtype
--- ---------- ------------------- -------
0 Car_Name 301 non-null object
1 Year 301 non-null int64
2 Selling_Price 301 non-null float64
3 Present_Price 301 non-null float64
4 Kms_Driven 301 non-null int64
5 Fuel_Type 301 non-null object
6 Seller_Type 301 non-null object
7 Transmission 301 non-null object
8 Owner 301 non-null int64
dtypes: float64(2), int64(3), object(4) memory usage: 21.3+ KB

c. df.describe(include=‘all’): The df.describe(include=‘all’) method is used to generate


descriptive statistics of the DataFrame, including count, mean, standard deviation, minimum,
25th percentile, 50th percentile (median), 75th percentile, and maximum. By default, only
the numerical columns are included in the summary, but you can use the include parameter
to specify the data types of the columns to be included in the summary.

[5]: df.describe(include='all')
[5]: Car_Name Year Selling_Price Present_Price Kms_Driven \
count 301 301.000000 301.000000 301.000000 301.000000
unique 98 NaN NaN NaN NaN
top city NaN NaN NaN NaN
freq 26 NaN NaN NaN NaN
mean NaN 2013.627907 4.661296 7.628472 36947.205980
std NaN 2.891554 5.082812 8.644115 38886.883882
min NaN 2003.000000 0.100000 0.320000 500.000000
25% NaN 2012.000000 0.900000 1.200000 15000.000000
50% NaN 2014.000000 3.600000 6.400000 32000.000000
75% NaN 2016.000000 6.000000 9.900000 48767.000000
max NaN 2018.000000 35.000000 92.600000 500000.000000

Fuel_Type Seller_Type Transmission Owner


count 301 301 301 301.000000
unique 3 2 2 NaN
top Petrol Dealer Manual NaN
freq 239 195 261 NaN
mean NaN NaN NaN 0.043189


std NaN NaN NaN 0.247915


min NaN NaN NaN 0.000000
25% NaN NaN NaN 0.000000
50% NaN NaN NaN 0.000000
75% NaN NaN NaN 0.000000
max NaN NaN NaN 3.000000

The NaN entries in the summary above simply indicate statistics that do not apply to a
particular column type (for example, the mean of a categorical column). We still need to check
explicitly for missing values, check for the presence of outliers, and drop the columns that are
not relevant to the target variable.

[6]:

[7]:

[7]: Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_ Type


0 2014 3.35 5.59 27000 Petrol Dealer
1 2013 4.75 9.54 43000 Diesel Dealer
2 2017 7.25 9.85 6900 Petrol Dealer
3 2011 2.85 4.15 5200 Petrol Dealer
4 2014 4.60 6.87 42450 Diesel Dealer
.. … … … … … …
296 2016 9.50 11.60 33988 Diesel Dealer
297 2015 4.00 5.90 60000 Petrol Dealer
298 2009 3.35 11.00 87934 Petrol Dealer
299 2017 11.50 12.50 9000 Diesel Dealer
300 2016 5.30 5.90 5464 Petrol Dealer

Transmission Owner
0 Manual 0
1 Manual 0
2 Manual 0
3 Manual 0
4 Manual 0
.. … …
296 Manual 0
297 Manual 0
298 Manual 0
299 Manual 0
300 Manual 0

[301 rows x 8 columns]


For ease of processing of the data, the Age column is calculated from the Year column.

[8]:
maximum = df['Year'].max()
Age = df['Year'].apply(lambda x: (maximum+1) - x)
df.drop('Year', axis=1, inplace=True)
df.insert(0, 'Age', Age)
df

[8]: Age Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type \


0 5 3.35 5.59 27000 Petrol Dealer
1 6 4.75 9.54 43000 Diesel Dealer
2 2 7.25 9.85 6900 Petrol Dealer
3 8 2.85 4.15 5200 Petrol Dealer
4 5 4.60 6.87 42450 Diesel Dealer
.. … … … … … …
296 3 9.50 11.60 33988 Diesel Dealer
297 4 4.00 5.90 60000 Petrol Dealer
298 10 3.35 11.00 87934 Petrol Dealer
299 2 11.50 12.50 9000 Diesel Dealer
300 3 5.30 5.90 5464 Petrol Dealer

Transmission Owner
0 Manual 0
1 Manual 0
2 Manual 0
3 Manual 0
4 Manual 0
.. … …
296 Manual 0
297 Manual 0
298 Manual 0
299 Manual 0
300 Manual 0

[301 rows x 8 columns]

This code performs several operations on a pandas DataFrame df. Let’s break down each line
of code:

• maximum = df[‘Year’].max(): This line of code finds the maximum value in the ‘Year’
column of the DataFrame df. The max() method returns the maximum value of a pandas
Series.


• Age = df[‘Year’].apply(lambda x: (maximum+1) - x): This line of code creates a new


column ‘Age’ by subtracting each value in the ‘Year’ column from the maximum value
of the ‘Year’ column, plus 1. The apply() method is used to apply a lambda function to
each element of the ‘Year’ column. The lambda function takes the value of each element
and subtracts it from the maximum value of the ‘Year’ column, plus 1.
• df.drop(‘Year’, axis=1, inplace=True): This line of code drops the ‘Year’ column from the
DataFrame df. The drop() method is used to drop one or more columns from the
DataFrame. The axis parameter is set to 1 to indicate that we want to drop a column,
not a row. The inplace parameter is set to True to indicate that we want the changes to
be made in-place, without creating a new DataFrame.
• df.insert(0, ‘Age’, Age): This line of code inserts the ‘Age’ column at the beginning (index
0) of the DataFrame df. The insert() method is used to insert a column into a DataFrame
at a specified location. The first argument is the index where the column should be
inserted, the second argument is the name of the column, and the third argument is the
values of the column.
• df: Finally, the last line of code returns the DataFrame df, which now contains the ‘Age’
column at the beginning and no longer contains the ‘Year’ column.
• Overall, this code creates a new ‘Age’ column that represents the difference between
the maximum value of the ‘Year’ column and each value in the ‘Year’ column, then drops
the ‘Year’ column from the DataFrame and inserts the ‘Age’ column at the beginning.

1.5 Exploratory Data Analysis (EDA)

Outlier Detection

[9]: fig=plt.figure(figsize=(20,15))
fs=mpl.gridspec.GridSpec(2,2)
ax0=fig.add_subplot(fs[0:1,0:1])
ax0.scatter(df['Age'],df['Selling_Price'])

ax1=fig.add_subplot(fs[0:1,1:])
ax1.scatter(df['Present_Price'],df['Selling_Price'])


[9]: <matplotlib.collections.PathCollection at 0x24869d22f40>

As seen above, there is a presence of outliers or noise that needs to be removed and fixed; to
do this, the indexes of the outliers need to be known.

[10]:

[10]:
Age Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type \
86 9 35.0 92.6 78000 Diesel Dealer

Transmission Owner
86 Manual 0


[11]:

[11]:
Age Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type \
196 11 0.17 0.52 500000 Petrol Individual

Transmission Owner
196 Automatic 0

[12]:

[13]:

df_show = ['Age', 'Present_Price', 'Kms_Driven', 'Owner']
fig, ax = plt.subplots(4, 1, figsize=(10, 15))
for index,item in enumerate(df_show):
    ax[index].scatter(df[item], df['Selling_Price'], color='b')

[Scatter plots of Age, Present_Price, Kms_Driven and Owner against Selling_Price]

The above code creates a set of scatter plots to visualize the relationship between several
variables in a pandas DataFrame df and the target variable ‘Selling_Price’, which goes as:

• df_show = [‘Age’, ‘Present_Price’, ‘Kms_Driven’, ‘Owner’]: This line of code defines a list
of variables that will be plotted against the target variable ‘Selling_Price’.
• fig, ax = plt.subplots(4, 1, figsize=(10, 15)): This line of code creates a set of subplots
using the subplots method from the matplotlib library. The subplots method creates a
figure and one or more subplots that are contained within the figure. The first argument
is the number of rows and the second argument is the number of columns of subplots.
The figsize argument sets the size of the figure in inches.
• for index, item in enumerate(df_show): This line of code starts a for loop that will
iterate through the list df_show. The enumerate function returns a tuple with the index
and value of each item in the list.
• ax[index].scatter(df[item], df[‘Selling_Price’], color=‘b’): This line of code creates a
scatter plot of each variable in the list df_show against the target variable ‘Selling_Price’.
The scatter method is used to create a scatter plot and the arguments are the x-axis
values (df[item]) and the y-axis values (df[‘Selling_Price’]). The color argument sets the
color of the points in the scatter plot. The ax[index] argument specifies which subplot
the scatter plot should be plotted in. The subplot index is determined by the index
variable in the for loop.

Check if any NaNs are present:

[14]: df.isnull().sum()


[14]: Age 0
Selling_Price 0
Present_Price 0
Kms_Driven 0
Fuel_Type 0
Seller_Type 0
Transmission 0
Owner 0
dtype: int64

[15]:

[15]: Age Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type \


0 5 3.35 5.59 27000 Petrol Dealer
1 6 4.75 9.54 43000 Diesel Dealer
2 2 7.25 9.85 6900 Petrol Dealer
3 8 2.85 4.15 5200 Petrol Dealer
4 5 4.60 6.87 42450 Diesel Dealer
.. … … … …
296 3 9.50 11.60 33988 Diesel Dealer
297 4 4.00 5.90 60000 Petrol Dealer
298 10 3.35 11.00 87934 Petrol Dealer
299 2 11.50 12.50 9000 Diesel Dealer
300 3 5.30 5.90 5464 Petrol Dealer

Transmission Owner
0 Manual 0
1 Manual 0
2 Manual 0
3 Manual 0
4 Manual 0
.. … …
296 Manual 0
297 Manual 0
298 Manual 0
299 Manual 0
300 Manual 0

[299 rows x 8 columns]

This shows that the data does not have any NaNs.

As the columns 'Fuel_Type', 'Seller_Type' and 'Transmission' contain categorical variables,
converting them to a numerical representation can help a machine learning model to better
understand the relationship between the features and the target variable.


[16]:

# One-hot encode the categorical columns; the exact call below is inferred from
# the dummy column names shown in the output.
df1 = pd.get_dummies(df, drop_first=True)
df1

[16]: Age Selling_Price Present_Price Kms_Driven Owner Fuel_Type_Diesel \


0 5 3.35 5.59 27000 0 0
1 6 4.75 9.54 43000 0 1
2 2 7.25 9.85 6900 0 0
3 8 2.85 4.15 5200 0 0
4 5 4.60 6.87 42450 0 1
.. … … … … … …
296 3 9.50 11.60 33988 0 1
297 4 4.00 5.90 60000 0 0
298 10 3.35 11.00 87934 0 0
299 2 11.50 12.50 9000 0 1
300 3 5.30 5.90 5464 0 0

Fuel_Type_Petrol Seller_Type_Individual Transmission_Manual

0 1 0 1
1 0 0 1
2 1 0 1
3 1 0 1
4 0 0 1
.. … … …
296 0 0 1
297 1 0 1
298 1 0 1
299 0 0 1
300 1 0 1

[299 rows x 9 columns]

After the above step, the next step is to identify the correlation between the columns. The
purpose of finding the correlation between the features is to identify the strength and
direction of the relationships between the variables in the dataset. This information can be
used to inform the feature selection process for building a machine learning model.
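For reference, the values reported by a correlation matrix are Pearson correlation coefficients, r = cov(x, y) / (s_x * s_y), where cov(x, y) is the covariance of the two features and s_x, s_y are their standard deviations; r always lies between -1 (perfect negative relationship) and +1 (perfect positive relationship), with values near 0 indicating little linear relationship.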

[17]: df1.corr()
[17]: Age Selling_Price Present_Price Kms_Driven \
Age 1.000000 -0.275162 0.014112 0.617777
Selling_Price -0.275162 1.000000 0.883903 0.062810
Present_Price 0.014112 0.883903 1.000000 0.337736
Kms_Driven 0.617777 0.062810 0.337736 1.000000


Owner 0.185671 -0.091101 0.016221 0.134016


Fuel_Type_Diesel -0.070641 0.549127 0.498088 0.257992
Fuel_Type_Petrol 0.065955 -0.537184 -0.489685 -0.259208
Seller_Type_Individual 0.034617 -0.570043 -0.591886 -0.212191
Transmission_Manual 0.014893 -0.412520 -0.453428 -0.087296

Owner Fuel_Type_Diesel Fuel_Type_Petrol \


Age 0.185671 -0.070641 0.065955
Selling_Price -0.091101 0.549127 -0.537184
Present_Price 0.016221 0.498088 -0.489685
Kms_Driven 0.134016 0.257992 -0.259208
Owner 1.000000 -0.052974 0.055223
Fuel_Type_Diesel -0.052974 1.000000 -0.979364
Fuel_Type_Petrol 0.055223 -0.979364 1.000000
Seller_Type_Individual 0.125139 -0.347161 0.355065
Transmission_Manual -0.052166 -0.107406 0.099645

Seller_Type_Individual Transmission_Manual
Age 0.034617 0.014893
Selling_Price -0.570043 -0.412520
Present_Price -0.591886 -0.453428
Kms_Driven -0.212191 -0.087296
Owner 0.125139 -0.052166
Fuel_Type_Diesel -0.347161 -0.107406
Fuel_Type_Petrol 0.355065 0.099645
Seller_Type_Individual 1.000000 0.076886
Transmission_Manual 0.076886 1.000000

Now, let us visualize the relationships between the features and the target variable, and
between the features themselves.

[18]:


[18]: <AxesSubplot:>

As we have dropped the outliers, the indexes may now be discontinuous; to bring them back to
a continuous sequence of integers, it is important to reset the indexes.

[19]:
df1.reset_index(drop=True, inplace=True)
df1


[19]:
Age Selling_Price Present_Price Kms_Driven Owner Fuel_Type_Diesel
0 5 3.35 5.59 27000 0 0
1 6 4.75 9.54 43000 0 1
2 2 7.25 9.85 6900 0 0
3 8 2.85 4.15 5200 0 0
4 5 4.60 6.87 42450 0 1
.. … … … … … …
294 3 9.50 11.60 33988 0 1
295 4 4.00 5.90 60000 0 0
296 10 3.35 11.00 87934 0 0
297 2 11.50 12.50 9000 0 1
298 3 5.30 5.90 5464 0 0

Fuel_Type_Petrol Seller_Type_Individual Transmission_Manual


0 1 0 1
1 0 0 1
2 1 0 1
3 1 0 1
4 0 0 1
.. … … …
294 0 0 1
295 1 0 1
296 1 0 1
297 0 0 1
298 1 0 1

[299 rows x 9 columns]

The code df1.reset_index(drop=True, inplace=True) resets the index of the


DataFrame df1. In pandas, a DataFrame’s index is a unique identifier for each row in the
DataFrame. By default, the index is an integer sequence starting from 0, but it can be changed
to any unique identifier. When index values are changed, the old index values can still remain
as a column in the DataFrame.

The reset_index method is used to reset the index of a DataFrame to the default integer
sequence. The drop parameter is set to True, which means that the old index values will not
be added as a new column to the DataFrame. The inplace parameter is set to True, which
means that the changes will be made directly to the DataFrame without creating a copy.


1.6 Model Building

Model building involves first identifying the relationship and correlation between the
features; we also split the 'X' and 'Y' variables for further processing.

[20]:
# Split the target (Y) and the independent features (X); the exact selection is
# inferred from the columns shown in the output below.
Y = df1['Selling_Price']
X = df1.drop('Selling_Price', axis=1)
X

[20]: Age Present_Price Kms_Driven Owner Fuel_Type_Diesel \


0 5 5.59 27000 0 0
1 6 9.54 43000 0 1
2 2 9.85 6900 0 0
3 8 4.15 5200 0 0
4 5 6.87 42450 0 1
.. … … … … …
294 3 11.60 33988 0 1
295 4 5.90 60000 0 0
296 10 11.00 87934 0 0
297 2 12.50 9000 0 1
298 3 5.90 5464 0 0

Fuel_Type_Petrol Seller_Type_Individual Transmission_Manual


0 1 0 1
1 0 0 1
2 1 0 1
3 1 0 1
4 0 0 1
.. … … …
294 0 0 1
295 1 0 1
296 1 0 1
297 0 0 1
298 1 0 1

[299 rows x 8 columns]

As shown above, X now holds the full feature dataframe. Since finding the relationships and
correlations between the features involves making changes to X, it is also good to save a copy
of it in a different variable, X1.


[21]:

Having seen the plots, it is clear that in addition to linear relationships there exist non-linear
relationships between the features. As a result, Polynomial Features are used to create
higher-degree features from the existing features in a dataset. The purpose of creating
polynomial features is to capture the non-linear relationships between the features and the
target variable in a dataset. By creating polynomial features, the model can capture more
complex relationships between the features and the target. Additionally, polynomial features
can help to improve the model's performance by increasing the model's ability to fit the data.
By including higher-degree features, the model can fit more complex relationships in the data,
leading to improved accuracy.
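As a small illustration (not part of the original notebook), for two input features a and b, a degree-2 PolynomialFeatures transform produces the columns a, b, a^2, a*b and b^2:

# Illustration only: degree-2 polynomial expansion of two features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

demo = np.array([[2.0, 3.0]])                    # one row with features a=2, b=3
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(demo))                  # [[2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(['a', 'b']))    # ['a' 'b' 'a^2' 'a b' 'b^2']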

Once the Polynomial Features are ready, they can be saved in a dataframe and checked for
multicollinearity. This is done by checking the correlation between the features with the
Variance Inflation Factor (VIF), which is used to measure the multicollinearity between
features in a multiple regression model. In Python, this can be calculated with the help of
'statsmodels'.

The VIF values are calculated against each feature using the variance_inflation_factor
function available in 'statsmodels'.
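For reference, the VIF of feature i is VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared obtained by regressing feature i on all the other features; values much larger than about 5-10 are commonly read as a sign of strong multicollinearity, which is why the code below drops features whose VIF exceeds a chosen threshold.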
[22]:
degrees=[1,2]
for degree in degrees:
    vif_data=pd.DataFrame()
    poly_features=PolynomialFeatures(degree=degree,include_bias=False)
    X_poly=poly_features.fit_transform(X)
    X_poly=pd.DataFrame(X_poly,columns=poly_features.get_feature_names_out(X.columns))

    vif_data['feature']=X_poly.columns
    vif_data['VIF']=[variance_inflation_factor(X_poly.values,i) for i in range(len(X_poly.columns))]

    print(vif_data)
    if(degree==1):
        if((vif_data['VIF']>10).any()):
            viff=vif_data[vif_data['VIF']>10].feature
            X_poly.drop(viff,axis=1,inplace=True)
    elif(degree==2):
        if((vif_data['VIF']>1.5e+02).any()):
            print(1.5e+02)
            viff=vif_data[vif_data['VIF']>1.5e+02].feature
            X_poly.drop(viff,axis=1,inplace=True)

    reg20=LinearRegression()
    X_train,X_test,Y_train,Y_test=train_test_split(X_poly,Y,random_state=0,test_size=0.3)
    reg20.fit(X_train,Y_train)
    r2=reg20.score(X_train,Y_train)
    r3=reg20.score(X_test,Y_test)
    print('Polynomial degree{0}: r2_score_train={1} and r2_score_test={2}'.format(degree,r2,r3))
    print('#################################################################################')
    X=X_poly
    print(X)
    print(1.5e+02)

feature VIF
0 Age 8.150440
1 Present_Price 5.143695
2 Kms_Driven 5.337965
3 Owner 1.098706
4 Fuel_Type_Diesel 6.114416
5 Fuel_Type_Petrol 16.860725
6 Seller_Type_Individual 2.606120
7 Transmission_Manual 9.590729

Polynomial degree1: r2_score_train=0.8792315715981194 and r2_score_test=0.887361727858974
#################################################################################
Age Present_Price Kms_Driven Owner Fuel_Type_Diesel \
0 5.0 5.59 27000.0 0.0 0.0
1 6.0 9.54 43000.0 0.0 1.0
2 2.0 9.85 6900.0 0.0 0.0
3 8.0 4.15 5200.0 0.0 0.0
4 5.0 6.87 42450.0 0.0 1.0
.. … … … … …
294 3.0 11.60 33988.0 0.0 1.0
295 4.0 5.90 60000.0 0.0 0.0
296 10.0 11.00 87934.0 0.0 0.0
297 2.0 12.50 9000.0 0.0 1.0
298 3.0 5.90 5464.0 0.0 0.0


Seller_Type_Individual Transmission_Manual
0 0.0 1.0
1 0.0 1.0
2 0.0 1.0
3 0.0 1.0
4 0.0 1.0
.. … …
294 0.0 1.0
295 0.0 1.0
296 0.0 1.0
297 0.0 1.0
298 0.0 1.0

[299 rows x 7 columns]

150.0
feature VIF
0 Age 2.873644e+02
1 Present_Price 9.484446e+01
2 Kms_Driven 2.456942e+02
3 Owner 6.433714e+14
4 Fuel_Type_Diesel 1.501200e+15
5 Seller_Type_Individual 2.001600e+14
6 Transmission_Manual 1.801440e+15
7 Age^2 8.281797e+01
8 Age Present_Price 9.248660e+01
9 Age Kms_Driven 9.043112e+01
10 Age Owner 1.246664e+02
11 Age Fuel_Type_Diesel 2.762261e+01
12 Age Seller_Type_Individual 3.371661e+01
13 Age Transmission_Manual 9.829982e+01
14 Present_Price^2 5.824951e+01
15 Present_Price Kms_Driven 1.000068e+02
16 Present_Price Owner 9.363854e+02
17 Present_Price Fuel_Type_Diesel 4.153571e+01
18 Present_Price Seller_Type_Individual 1.780492e+01
19 Present_Price Transmission_Manual 3.088424e+01
20 Kms_Driven^2 2.108720e+01
21 Kms_Driven Owner 2.679716e+02
22 Kms_Driven Fuel_Type_Diesel 2.833071e+01
23 Kms_Driven Seller_Type_Individual 2.079318e+01
24 Kms_Driven Transmission_Manual 8.731800e+01
25 Owner^2 4.892558e+12
26 Owner Fuel_Type_Diesel 2.504373e+00
27 Owner Seller_Type_Individual 6.930659e+01
28 Owner Transmission_Manual 6.002799e+11


29 Fuel_Type_Diesel^2 3.880741e+12
30 Fuel_Type_Diesel Seller_Type_Individual 1.244385e+01
31 Fuel_Type_Diesel Transmission_Manual 5.048446e+01
32 Seller_Type_Individual^2 3.336000e+14
33 Seller_Type_Individual Transmission_Manual 3.241417e+01
34 Transmission_Manual^2 5.848831e+13
150.0
Polynomial degree2: r2_score_train=0.9776676672531582 and r2_score_test=0.9807987512114504
#################################################################################

Present_Price Age^2 Age Present_Price Age Kms_Driven Age Owner \


0 5.59 25.0 27.95 135000.0 0.0
1 9.54 36.0 57.24 258000.0 0.0
2 9.85 4.0 19.70 13800.0 0.0
3 4.15 64.0 33.20 41600.0 0.0
4 6.87 25.0 34.35 212250.0 0.0
.. … … … … …
294 11.60 9.0 34.80 101964.0 0.0
295 5.90 16.0 23.60 240000.0 0.0
296 11.00 100.0 110.00 879340.0 0.0
297 12.50 4.0 25.00 18000.0 0.0
298 5.90 9.0 17.70 16392.0 0.0

Age Fuel_Type_Diesel Age Seller_Type_Individual \


0 0.0 0.0
1 6.0 0.0
2 0.0 0.0
3 0.0 0.0
4 5.0 0.0
.. … …
294 3.0 0.0
295 0.0 0.0
296 0.0 0.0
297 2.0 0.0
298 0.0 0.0

Age Transmission_Manual Present_Price^2 Present_Price Kms_Driven …\


0 5.0 31.2481 150930.0 …
1 6.0 91.0116 410220.0 …
2 2.0 97.0225 67965.0 …
3 8.0 17.2225 21580.0 …


4 5.0 47.1969 291631.5 …


.. … … … … …
294 3.0 134.5600 394260.8
295 4.0 34.8100 354000.0 …
296 10.0 121.0000 967274.0 …
297 2.0 156.2500 112500.0 …
298 3.0 34.8100 32237.6 …

Present_Price Transmission_Manual Kms_Driven^2 \


0 5.59 7.290000e+08
1 9.54 1.849000e+09
2 9.85 4.761000e+07
3 4.15 2.704000e+07
4 6.87 1.802002e+09
.. … …
294 11.60 1.155184e+09
295 5.90 3.600000e+09
296 11.00 7.732388e+09
297 12.50 8.100000e+07
298 5.90 2.985530e+07

Kms_Driven Fuel_Type_Diesel Kms_Driven Seller_Type_Individual \

0 0.0 0.0
1 43000.0 0.0
2 0.0 0.0
3 0.0 0.0
4 42450.0 0.0
.. … …
294 33988.0 0.0
295 0.0 0.0
296 0.0 0.0
297 9000.0 0.0
298 0.0 0.0


Kms_Driven Transmission_Manual Owner Fuel_Type_Diesel \


0 27000.0 0.0
1 43000.0 0.0
2 6900.0 0.0
3 5200.0 0.0
4 42450.0 0.0
.. … …
294 33988.0 0.0
295 60000.0 0.0
296 87934.0 0.0
297 9000.0 0.0
298 5464.0 0.0

Owner Seller_Type_Individual Fuel_Type_Diesel Seller_Type_Individual \


0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
.. … …
294 0.0 0.0
295 0.0 0.0
296 0.0 0.0
297 0.0 0.0
298 0.0 0.0

Fuel_Type_Diesel Transmission_Manual \
0 0.0
1 1.0
2 0.0
3 0.0
4 1.0
.. …
294 1.0
295 0.0
296 0.0
297 1.0
298 0.0


Seller_Type_Individual Transmission_Manual
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
.. …
294 0.0
295 0.0
296 0.0
297 0.0
298 0.0
[299 rows x 22 columns]
150.0

The code above is for checking and removing multicollinearity in the features of the dataset.
The code first creates a loop over two degrees of polynomial features (1 and 2). Within the
loop, it first creates polynomial features of the input data X, and stores it in X_poly. The code
then creates a dataframe “vif_data” which contains the VIF scores for each feature in X_poly.
The code then checks if any VIF score is greater than 10 for degree 1 and 1.5e+02 for degree
2. If yes, it removes the features with high VIF scores.

The code then splits the data into training and testing sets (X_train, X_test, Y_train, Y_test) and fits a linear regression model on X_poly and Y. It calculates the R-squared score of the model on both the training and the testing set and prints the results. Finally, it replaces X with the reduced-feature X_poly and continues the loop, so the output shows the VIF scores and R-squared scores for each degree of polynomial features.

The next step is to fit the linear regression model, but before that the polynomial features must be concatenated with the Y column (Selling_Price) so that the entire dataset is available in a single DataFrame, df2.
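The concatenation itself is not shown in this extract; a minimal sketch, assuming X holds the 22 reduced polynomial features and Y the Selling_Price series with aligned indexes:

df2 = pd.concat([X, Y], axis=1)   # 22 polynomial features + Selling_Price = 23 columns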

[23]:
df2

[23]: Present_Price Age^2 Age Present_Price Age Kms_Driven Age Owner \


0 5.59 25.0 27.95 135000.0 0.0
1 9.54 36.0 57.24 258000.0 0.0
2 9.85 4.0 19.70 13800.0 0.0
3 4.15 64.0 33.20 41600.0 0.0


4 6.87 25.0 34.35 212250.0 0.0


.. … … … … …
294 11.60 9.0 34.80 101964.0 0.0
295 5.90 16.0 23.60 240000.0 0.0
296 11.00 100.0 110.00 879340.0 0.0
297 12.50 4.0 25.00 18000.0 0.0
298 5.90 9.0 17.70 16392.0 0.0

Age Fuel_Type_Diesel Age Seller_Type_Individual \


0 0.0 0.0
1 6.0 0.0
2 0.0 0.0
3 0.0 0.0
4 5.0 0.0
.. … …
294 3.0 0.0
295 0.0 0.0
296 0.0 0.0
297 2.0 0.0
298 0.0 0.0

Age Transmission_Manual Present_Price^2 Present_Price Kms_Driven …\


0 5.0 31.2481 150930.0 …
1 6.0 91.0116 410220.0 …
2 2.0 97.0225 67965.0 …
3 8.0 17.2225 21580.0 …
4 5.0 47.1969 291631.5 …
.. … … … … …
294 3.0 134.5600 394260.8
295 4.0 34.8100 354000.0 …
296 10.0 121.0000 967274.0 …
297 2.0 156.2500 112500.0 …
298 3.0 34.8100 32237.6 …

Kms_Driven^2 Kms_Driven Fuel_Type_Diesel \


0 7.290000e+08 0.0
1 1.849000e+09 43000.0
2 4.761000e+07 0.0
3 2.704000e+07 0.0
4 1.802002e+09 42450.0
.. … …
294 1.155184e+09 33988.0
295 3.600000e+09 0.0
296 7.732388e+09 0.0
297 8.100000e+07 9000.0
298 2.985530e+07 0.0


Kms_Driven Seller_Type_Individual Kms_Driven Transmission_Manual \


0 0.0 27000.0
1 0.0 43000.0
2 0.0 6900.0
3 0.0 5200.0
4 0.0 42450.0
.. … …
294 0.0 33988.0
295 0.0 60000.0
296 0.0 87934.0
297 0.0 9000.0
298 0.0 5464.0

Owner Fuel_Type_Diesel Owner Seller_Type_Individual \


0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
.. … …
294 0.0 0.0
295 0.0 0.0
296 0.0 0.0
297 0.0 0.0
298 0.0 0.0

Fuel_Type_Diesel Seller_Type_Individual \
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
.. …
294 0.0
295 0.0
296 0.0
297 0.0
298 0.0

Fuel_Type_Diesel Transmission_Manual \
0 0.0
1 1.0
2 0.0
3 0.0


4 1.0
.. …
294 1.0
295 0.0
296 0.0
297 1.0
298 0.0

Seller_Type_Individual Transmission_Manual Selling_Price


0 0.0 3.35
1 0.0 4.75
2 0.0 7.25
3 0.0 2.85
4 0.0 4.60
.. … …
294 0.0 9.50
295 0.0 4.00
296 0.0 3.35
297 0.0 11.50
298 0.0 5.30

[299 rows x 23 columns]

1.7 Perform Scaling Using the StandardScaler() Function

To improve the efficiency of the model, the features should be properly scaled/normalized.

[24]:
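The code of this cell was not captured in this extract; a minimal sketch of how StandardScaler could be applied to the predictors of df2 (the variable names used here are assumptions):

from sklearn.preprocessing import StandardScaler   # assumed import

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(df2.drop('Selling_Price', axis=1)),
                 columns=df2.drop('Selling_Price', axis=1).columns)   # mean 0, std 1 per feature
Y = df2['Selling_Price']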

In the cells below, the data is split into training and test datasets using train_test_split, and an object of the LinearRegression() class is created and fitted to the training data.

[25]:

[26]:

[26]: LinearRegression()
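Cells [25] and [26] are not shown here; a hedged reconstruction that is consistent with the output above and with the model name reg3 used later (the test_size and random_state values are assumptions carried over from the earlier cell):

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
reg3 = LinearRegression()
reg3.fit(X_train, Y_train)   # the fitted estimator is echoed as "LinearRegression()" above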


1.8 Model Evaluation and Deployment

[27]:

y_pred is the array of predicted values generated by the model reg3 for the test data X_test. reg3 is the linear regression model trained above to predict the target variable from the input features. X_test is a 2D array (matrix) of test samples, where each row represents a sample and each column represents a feature of that sample. The predict method of the reg3 model generates predictions from X_test, and its output is stored in the variable y_pred.
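Based on this description, cell [27] is presumably just:

y_pred = reg3.predict(X_test)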

[28]:

[28]: array([ 1.08603318e+00, 1.07337683e-02, -5.58083128e-02, 1.08152573e-06,


8.30564742e-02, -1.66417298e-01, -7.01081398e-02, -8.22378238e-02,
-9.70644456e-03, -2.45573293e-06, 2.30485412e-01, 1.55966540e-01,
2.00735596e-02, 2.06820152e-11, 2.31957650e-05, 8.51428473e-08,
-1.06971479e-05, -3.71813308e-01, -8.82797683e-01, -2.91134036e+00,
-1.33446556e+00, 2.40581665e-01])

The above is the array of coefficients for the p predictor variables; each element of the array corresponds to the coefficient of a specific predictor variable.
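In scikit-learn the fitted coefficients are exposed through the coef_ attribute, so cell [28] presumably displays:

reg3.coef_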

[29]:

[29]: -0.03221012598673134

The intercept represents the baseline or offset of the regression line and is a scalar value. The intercept can be positive, negative, or zero, and its value depends on the scale and distribution of the predictor variables and the target variable. In our case, the small negative intercept means that the model's baseline prediction (when all predictors are zero) is slightly below zero.
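Likewise, the intercept shown above is presumably obtained from the intercept_ attribute in cell [29]:

reg3.intercept_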

Model Evaluation using MAE, MSE, and R2_SCORE


[30]:

mean_Absolute-Error 0.5694326466668683

[31]:

mean-squared-error 0.6326205484415302

[32]:

r2score 0.9807987512114504
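The code of cells [30] to [32] was not captured in this extract; a sketch of the metric computations with sklearn.metrics, assuming Y_test and y_pred as above:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score   # assumed imports

print('mean_Absolute-Error', mean_absolute_error(Y_test, y_pred))
print('mean-squared-error', mean_squared_error(Y_test, y_pred))
print('r2score', r2_score(Y_test, y_pred))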

[33]:

[33]: <matplotlib.collections.PathCollection at 0x2486a069430>
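The PathCollection shown above indicates that the cell ended with a scatter plot; a plausible sketch (what exactly was plotted is an assumption, most likely actual versus predicted selling prices):

import matplotlib.pyplot as plt   # assumed import

plt.scatter(Y_test, y_pred)       # actual vs. predicted; the returned PathCollection is echoed as the cell output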

Performing Cross-Validation

Cross-validation is a technique for validating the performance of a machine learning model by dividing the data into training and validation sets. The model is trained on the training set, and its performance is evaluated on the validation set. This process is repeated multiple times with different splits of the data, and the results are averaged to obtain a more reliable estimate of the model's performance. Training and evaluating the model k times in this way gives a better picture of its performance than a single train_test_split.

[34]:
# (The original cell is truncated in this extract; presumably X and Y are taken
#  from df2 along these lines.)
X = df2.drop('Selling_Price', axis=1)
Y = df2['Selling_Price']

[35]:
# k-fold cross-validation for several values of k; the loop is reconstructed to match
# the output below (cross_val_score is assumed to be imported from sklearn.model_selection
# in an earlier cell)
for cv in range(2, 10):
    reg4 = LinearRegression()
    scores = cross_val_score(reg4, X, Y, cv=cv)
    print('Score => k=', cv, ':', scores.mean())

Score => k= 2 : 0.9680500267771892


Score => k= 3 : 0.964759287774896
Score => k= 4 : 0.9678845176300368
Score => k= 5 : 0.9633190268945515
Score => k= 6 : 0.9637140123402181
Score => k= 7 : 0.9624590637372925
Score => k= 8 : 0.9636431260452298
Score => k= 9 : 0.9626845083237989
