
Revolutionising B.Tech

Module 3: Introduction to Supervised Learning Algorithms
Course Name: Fundamentals of Artificial Intelligence [23CSE505]
Total Hours: 05
Table of Contents

 Introduction to Linear and Logistic Regression,

 Classification Algorithms: Naïve Bayes, Decision Tree.


Aim

To enable the students to get insight into the algorithms used in Supervised Learning and its
applications.

Objectives

 Understand the basic idea of Supervised Learning and the types of problems that can be solved.

 Learn the difference between classification algorithms and regression algorithms.

 Understand the Linear and Logistic Regression algorithms and how they are used for prediction
and classification.

 Understand the Naïve Bayes and Decision Tree Induction algorithms used for classification.

 Understand the different types of Supervised Machine Learning algorithms.
Recap: Machine Learning…
• Machine Learning, a branch of Artificial Intelligence, concerns the construction
and study of systems that can learn from data. Machine learning investigates
how computers can learn (or improve their performance) based on data.
• A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data.
• For example, a typical machine learning problem is to program a computer so
that it can automatically recognize handwritten postal codes on mail after
learning from a set of examples.
Recap: Supervised Learning…
• Algorithms are trained on labelled examples, i.e., input where the desired output is known.
• Supervised learning is classified into two categories of algorithms:
• Regression: A regression problem is when the output variable is a real value, such as “dollars” or
“weight”.
• Classification: A classification problem is when the output variable is a category, such as “red” or
“blue”, “disease” or “no disease”.
• Supervised learning deals with or learns with “labeled” data. This implies that some data
is already tagged with the correct answer.
Difference between Classification and Regression
Example: Classification

• Example: Credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings.

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
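As a minimal sketch (not from the slides), the discriminant above can be written directly in Python; the threshold values THETA1 and THETA2 are assumed examples:

```python
# Minimal sketch of the credit-scoring discriminant above.
# THETA1 and THETA2 are assumed example thresholds, not values from the slides.
THETA1 = 50_000   # income threshold (θ1)
THETA2 = 10_000   # savings threshold (θ2)

def credit_risk(income: float, savings: float) -> str:
    """IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk."""
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(60_000, 12_000))  # low-risk
print(credit_risk(30_000, 12_000))  # high-risk
```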
Example: Regression

• Relation between variables where changes in some variables may “explain” or possibly “cause”
changes in other variables.
• Explanatory variables are termed the independent variables, and the variables to be explained are
termed the dependent variables.

y = wx + w0

• Example: Price of a used car
  x : car attributes
  y : price
  y = g(x | θ), where g(·) is the model and θ are its parameters.
Regression model

• Regression model estimates the nature of the relationship between the independent and
dependent variables.
• Change in the dependent variable that results from changes in the independent variables, i.e., the
size of the relationship.
• Strength of the relationship.
• Statistical significance of the relationship.
• Function: a mathematical relationship enabling us to predict what values of one variable
(Y) correspond to given values of another variable (X).
• Y: is referred to as the dependent variable, the response variable or the predicted
variable.
• X: is referred to as the independent variable, the explanatory variable or the predictor
variable.
Linear Regression Algorithm
• Intro
Establishes a relationship between the Independent & Dependent Variables.
Examples of Independent & Dependent Variables:-
• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is sales of goods and y is GDP
Here x is Independent Variable & Y is Dependent Variable

• How it Works
• Regression analysis is used to understand which of the independent variables are related to the dependent variable.
• It attempts to model the relationship between two variables by fitting a line called the Linear Regression Line.
• The case of a single independent variable is called Simple Linear Regression, whereas the case of multiple
independent variables is called Multiple Linear Regression.
Single Linear Regression Vs Multiple Linear Regression
The Linear Regression line is created using the Ordinary Least Squares method.

Simple Linear Regression:    X → Y    (one predictor)

Multiple Linear Regression:  X1, X2, X3, X4 → Y    (multiple predictors)
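As a hedged illustration (not from the slides), both cases can be fit with NumPy's least-squares solver; the data below are made-up example values:

```python
import numpy as np

# Made-up example data: 5 observations, 2 predictors (assumption, not from the slides).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([7.1, 6.9, 14.2, 13.8, 19.0])

# Append a column of ones so the solver also estimates the intercept w0.
X1 = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: minimises the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, w0 = coef[:-1], coef[-1]
print("weights:", w, "intercept:", w0)
print("prediction for x=[2, 2]:", np.array([2.0, 2.0]) @ w + w0)
```

Simple linear regression is just the special case where X has a single column.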
Examples: Single Linear Regression Vs Multiple Linear Regression

Bivariate or simple regression model:
(Education) x → y (Income)

Multivariate or multiple regression model:
(Education) x1, (Sex) x2, (Experience) x3, (Age) x4 → y (Income)
Linear Regression Equation

y = mx + c, where m is the slope/gradient and c is the y-intercept.
Sum of Squared Error
What is Error?

• Error = Actual Value − Predicted Value.
• Here the predicted value is the value predicted by the Linear Regression model.
• Also known as the Residual.

Why is it Important?

• The smaller the residuals, the more accurate the model.
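A minimal sketch (not from the slides) of computing residuals and the sum of squared errors for an assumed line y = 2x + 1 on made-up data:

```python
# Minimal sketch: residuals and sum of squared errors (SSE).
# The line y = 2x + 1 and the data points are assumed examples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.2, 4.8, 7.1, 8.9]          # actual values

m, c = 2.0, 1.0                     # assumed slope and intercept
predicted = [m * x + c for x in xs]

residuals = [y - yhat for y, yhat in zip(ys, predicted)]  # actual - predicted
sse = sum(r ** 2 for r in residuals)

print("residuals:", residuals)
print("SSE:", sse)
```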
Formula for linear regression equation
• The formula for the linear regression equation is given by: y = a + bx
• a and b are given by the following (standard least-squares) formulas:

  b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]
  a = [Σy − b(Σx)] / n

• Where,
x and y are the two variables on the regression line.
b = slope of the line.
a = y-intercept of the line.
x = values of the first data set.
y = values of the second data set.
n = number of data points.
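As a hedged sketch, the closed-form slope and intercept above can be computed directly; the data points are assumed example values:

```python
# Minimal sketch of the closed-form least-squares fit y = a + b*x.
# The data points are assumed example values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = (sum_y - b * sum_x) / n                                    # intercept

print(f"y = {a:.3f} + {b:.3f}x")
```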
Solved Example 1

[Worked example of fitting a linear regression line; shown as images in the original slides.]
Logistic Regression
• Logistic Regression is used when the dependent variable is categorical.

• The predicted values lie strictly in the range of 0 to 1.

• It is used to describe data and to explain the relationship between one dependent binary
variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Logistic regression
Table 2 Age and signs of coronary heart disease (CD)

Age CD Age CD Age CD


22 0 40 0 54 0
23 0 41 1 55 1
24 0 46 0 58 1
27 0 47 0 60 1
28 0 48 0 60 0
30 0 49 1 62 1
30 0 49 0 65 1
32 0 50 1 67 1
33 0 51 0 71 1
35 1 51 1 77 1
38 0 52 0 81 1
How can we analyse these data?
• Compare mean age of diseased and non-diseased

• Non-diseased: 38.6 years


• Diseased: 58.7 years (p<0.0001)

• Linear regression?
Linear Regression - Dot-plot: Data from Table 2

[Dot plot: signs of coronary disease (Yes/No) on the y-axis versus age in years (0–100) on the x-axis.]
Logistic regression
Table 3 Prevalence (%) of signs of CD according to age group

Age group   # in group   # Diseased   % Diseased
20 - 29          5            0             0
30 - 39          6            1            17
40 - 49          7            2            29
50 - 59          7            4            57
60 - 69          5            4            80
70 - 79          2            2           100
80 - 89          1            1           100
Dot-plot: Data from Table 3

[Dot plot: % diseased on the y-axis (0–100) versus age group on the x-axis.]
Logistic function

P(y|x) = e^(α + βx) / (1 + e^(α + βx))

[Plot: the S-shaped logistic curve, with the probability of disease on the y-axis rising from 0.0 to 1.0 as x increases.]
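A minimal sketch (with assumed example coefficients α and β, not values fitted to the Table 2 data) of evaluating the logistic function above:

```python
import math

# Assumed example coefficients, not fitted to the data above.
alpha, beta = -5.3, 0.11

def p_disease(x: float) -> float:
    """Logistic function: P(y|x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return math.exp(z) / (1.0 + math.exp(z))

for age in (25, 40, 50, 65, 80):
    print(f"age {age}: P(disease) = {p_disease(age):.2f}")
```

The output rises smoothly from near 0 towards 1 with age, matching the S-shaped curve.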
What Are the Types of Logistic Regression?
• Binary logistic regression
• Binary logistic regression was mentioned earlier in the case of classifying an object as an animal or
not an animal—it’s an either/or solution. There are just two possible outcome answers. This concept
is typically represented as a 0 or a 1 in coding. Examples include:
• Whether or not to lend to a bank customer (outcomes are yes or no).
• Assessing cancer risk (outcomes are high or low).
• Whether a team will win tomorrow’s game (outcomes are yes or no).
• Multinomial logistic regression
• Multinomial logistic regression is a model where there are multiple classes that an item can be
classified as. There is a set of three or more predefined classes set up prior to running the model.
Examples include:
• Classifying texts into what language they come from.
• Predicting whether a student will go to college, trade school or into the workforce.
• Does your cat prefer wet food, dry food or human food?
What Are the Types of Logistic Regression?
• Ordinal logistic regression
• Ordinal logistic regression is also a model where there are multiple classes that an item can be
classified as; however, in this case an ordering of classes is required. Classes do not need to be
proportionate. The distance between each class can vary. Examples include:
• Ranking restaurants on a scale of 0 to 5 stars.
• Predicting the podium results of an Olympic event.
• Assessing a choice of candidates, specifically in places that institute ranked-choice voting.
Differences Between Linear and Logistic Regression

Linear Regression                                          | Logistic Regression
-----------------------------------------------------------|-----------------------------------------------------------
Used to predict a continuous dependent variable from a     | Used to predict a categorical dependent variable from a
given set of independent variables.                        | given set of independent variables.
Used for solving regression problems.                      | Used for solving classification problems.
Predicts the values of continuous variables.               | Predicts the values of categorical variables.
Finds the best-fit straight line.                          | Finds the S-curve.
Uses the least squares estimation method.                  | Uses the maximum likelihood estimation method.
The output must be a continuous value, such as price,      | The output must be a categorical value such as 0 or 1,
age, etc.                                                  | Yes or No, etc.
Requires a linear relationship between the dependent and   | Does not require a linear relationship.
independent variables.                                     |
There may be collinearity between the independent          | There should not be collinearity between the independent
variables.                                                 | variables.
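As a hedged illustration (not part of the slides), scikit-learn exposes both models with the same fit/predict interface; the toy data below are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (assumed): x = years of experience.
X = np.array([[1], [2], [3], [4], [5], [6]])

# Regression target: a continuous salary-like value -> best-fit line.
y_cont = np.array([30.0, 35.0, 41.0, 44.0, 52.0, 56.0])
lin = LinearRegression().fit(X, y_cont)
print("linear prediction for x=7:", lin.predict([[7]]))

# Classification target: a 0/1 label (e.g., promoted or not) -> S-curve.
y_cat = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(X, y_cat)
print("class for x=7:", log.predict([[7]]))
print("P(class=1 | x=7):", log.predict_proba([[7]])[0, 1])
```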
Naïve Bayes
Naïve Bayes
• The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and
used for solving classification problems.

• It is mainly used in text classification, which involves high-dimensional training datasets.

• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps
in building fast machine learning models that can make quick predictions.

• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and
classifying articles.
Why is it called Naïve Bayes?
• The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described
as:

• Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent
of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape,
and taste, then a red, spherical, and sweet fruit is recognised as an apple. Each feature
individually contributes to identifying it as an apple, without depending on the others.

  Colour – Shape – Taste
  Red – Spherical – Sweet  →  Apple

• Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Probability Basics

• Prior, conditional and joint probability

  – Prior probability: P(X)
  – Conditional probability: P(X1|X2), P(X2|X1)
  – Joint probability: X = (X1, X2), P(X) = P(X1, X2)
  – Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
  – Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)

• Bayesian Rule

  P(C|X) = P(X|C)P(C) / P(X),   i.e.  Posterior = (Likelihood × Prior) / Evidence
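A minimal numeric sketch of the Bayesian rule above, with assumed example probabilities:

```python
# Bayes' rule: P(C|X) = P(X|C) * P(C) / P(X).
# All numbers below are assumed example values.
p_c = 0.01            # prior P(C), e.g. P(spam)
p_x_given_c = 0.9     # likelihood P(X|C), e.g. P(word | spam)
p_x_given_not_c = 0.05

# Evidence P(X) via the law of total probability.
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

posterior = p_x_given_c * p_c / p_x
print(f"P(C|X) = {posterior:.3f}")  # ≈ 0.154
```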
Tennis Example

• Example: Play Tennis

The learning phase for the tennis example:
P(Play=Yes) = 9/14
P(Play=No) = 5/14

We have four variables; for each we calculate the conditional probability table:

Outlook     Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny         2/9       3/5          Hot            2/9       2/5
Overcast      4/9       0/5          Mild           4/9       2/5
Rain          3/9       2/5          Cool           3/9       1/5

Humidity    Play=Yes  Play=No        Wind         Play=Yes  Play=No
High          3/9       4/5          Strong         3/9       3/5
Normal        6/9       1/5          Weak           6/9       2/5
The test phase for the tennis example

• Test Phase
  – Given a new instance of variable values,
    x’ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
  – Look up the calculated tables:
    P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
    P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
    P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
    P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
    P(Play=Yes) = 9/14                     P(Play=No) = 5/14

  – Use the MAP rule to decide Yes or No:

    P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
    P(No|x’)  ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

Given that P(Yes|x’) < P(No|x’), we label x’ as “No”.
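A minimal sketch reproducing the two MAP scores above from the slide's look-up tables:

```python
# Reproduce the tennis-example MAP scores from the look-up tables above.
priors = {"Yes": 9/14, "No": 5/14}
likelihoods = {
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}

x_new = ["Sunny", "Cool", "High", "Strong"]
scores = {}
for label in ("Yes", "No"):
    score = priors[label]
    for value in x_new:
        score *= likelihoods[label][value]
    scores[label] = score

print(scores)  # {'Yes': ~0.0053, 'No': ~0.0206}
print("prediction:", max(scores, key=scores.get))  # No
```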


Advantages of Naïve Bayes
• Naïve Bayes is based on the independence assumption.
• Training is very easy and fast; it just requires considering each attribute in each
class separately.
• Testing is straightforward; it just involves looking up tables or calculating conditional
probabilities with normal distributions.
• A popular generative model.
• Performance is competitive with most state-of-the-art classifiers even when the
independence assumption is violated.
• Many successful applications, e.g., spam mail filtering.
• Apart from classification, Naïve Bayes can do more…
Disadvantages of Naïve Bayes

• The Naïve Bayes algorithm has trouble with the ‘zero-frequency problem’. It occurs when a
categorical variable has a value in the test data that was not present in the training dataset, so the
model assigns it zero probability. A smoothing method (such as Laplace smoothing) is used to
overcome this problem.

• It assumes that all the attributes are independent, which rarely happens in real life. This limits
the applicability of the algorithm in real-world situations.

• Its probability estimates can be poorly calibrated, so its probability outputs should not be taken
too literally.
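A minimal sketch (with assumed counts) of Laplace smoothing, one common fix for the zero-frequency problem:

```python
# Laplace (add-k) smoothing for a conditional probability estimate.
# The counts below are assumed example values.
def smoothed_prob(count_value_in_class: int, count_class: int,
                  n_values: int, k: int = 1) -> float:
    """P(value|class) with add-k smoothing: (count + k) / (class_total + k * n_values)."""
    return (count_value_in_class + k) / (count_class + k * n_values)

# Unseen combination: the raw estimate would be 0/9 = 0
# and would zero out the whole product of likelihoods.
print(smoothed_prob(0, 9, 3))   # 1/12 ≈ 0.083 instead of 0
print(smoothed_prob(4, 9, 3))   # 5/12 ≈ 0.417
```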
Applications that use Naive Bayes
Decision Tree
Decision Tree
• Decision tree induction is a type of supervised algorithm.
• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class
label.
• The topmost node in a tree is the root node.
• A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and
each leaf node represents a decision.
Decision Tree Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Induction: the tree induction algorithm learns a model (a decision tree) from the training set.
Deduction: the learned model is then applied to the test set to predict the missing class labels.
An Example of Decision Tree
To sanction the loan or not?
Whether a person will cheat or not can be predicted from the account data.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A tree that fits this data:

Refund?
├─ Yes → NO
└─ No  → MarSt?
          ├─ Married → NO
          └─ Single, Divorced → TaxInc?
                                 ├─ < 80K → NO
                                 └─ > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after
a leaf node is reached.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the Decision Tree algorithm Work?
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes; the
final node is called a leaf node.
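As a hedged illustration (not from the slides), this recursive procedure is essentially what scikit-learn's DecisionTreeClassifier performs; the encoded toy data below are assumptions (they mirror the a1/a2 example solved later):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy encoded dataset (assumed): two binary features a1, a2 (1 = T, 0 = F).
X = np.array([[1, 1], [1, 1], [1, 0], [0, 0], [0, 1], [0, 1]])
y = np.array([1, 1, 0, 1, 0, 0])  # 1 = '+', 0 = '-'

# criterion="entropy" makes the splits follow information gain, as in the steps above.
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(clf, feature_names=["a1", "a2"]))
print("prediction for (a1=T, a2=F):", clf.predict([[1, 0]]))
```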
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best attribute for
the root node and for the sub-nodes.
• To solve such problems there is a technique called the Attribute Selection Measure, or
ASM.
• By this measurement, we can easily select the best attribute for the nodes of the tree. There are two
popular techniques for ASM, which are:
• Information Gain

• Gini Index
Information Gain:
• Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.

• It calculates how much information a feature provides us about a class.

• According to the value of information gain, we split the node and build the decision tree.

• A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the below
formula:

Information Gain = Entropy(S) − [Weighted Avg × Entropy(each feature)]


Entropy:

Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness
in the data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,
S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
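A minimal sketch of the entropy formula above for a two-class sample:

```python
import math

def entropy(n_yes: int, n_no: int) -> float:
    """Entropy(S) = -P(yes)*log2(P(yes)) - P(no)*log2(P(no)); 0*log(0) treated as 0."""
    total = n_yes + n_no
    result = 0.0
    for count in (n_yes, n_no):
        if count:  # skip empty classes to avoid log2(0)
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(3, 3))  # 1.0 (maximum impurity)
print(entropy(9, 5))  # ≈ 0.940 (the Play-Tennis dataset)
```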
Example
Instance Classification a1 a2
1 + T T
2 + T T
3 - T F
4 + F F
5 - F T
6 - F T

1) What is the entropy of training examples with respect to the target function?
2) What is the information gain of a1,a2 relative to the training examples
3) Draw a decision tree for the given dataset
Solution

[3+, 3−]

Entropy(S) = −(3/6) log2(3/6) − (3/6) log2(3/6)
           = −0.5 × (−1) − 0.5 × (−1)
           = 0.5 + 0.5
           = 1
Solution

For Attribute a1:
I(a1 = T) = −(2/3) log2(2/3) − (1/3) log2(1/3) = 0.9183
I(a1 = F) = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.9183

Gain(S, a1) = E(S) − (3/6) I(a1 = T) − (3/6) I(a1 = F)
            = 1 − (3/6) × 0.9183 − (3/6) × 0.9183
            = 0.0817

For Attribute a2:
I(a2 = T) = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1
I(a2 = F) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1

Gain(S, a2) = E(S) − (4/6) I(a2 = T) − (2/6) I(a2 = F)
            = 1 − (4/6) × 1 − (2/6) × 1
            = 0
Gain(S, a1) = 0.0817
Gain(S, a2) = 0

a1 has the maximum information gain, so select it as the root.

Decision tree:

a1?
├─ T → a2?
│       ├─ T → +
│       └─ F → −
└─ F → a2?
        ├─ T → −
        └─ F → +
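A minimal sketch reproducing the two gains above, using a variant of the entropy function sketched earlier:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result

# Dataset from the example: (a1, a2, class)
data = [("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
        ("F", "F", "+"), ("F", "T", "-"), ("F", "T", "-")]

def gain(data, attr_index):
    """Information gain of the attribute at attr_index."""
    labels = [row[2] for row in data]
    g = entropy(labels)
    for value in ("T", "F"):
        subset = [row[2] for row in data if row[attr_index] == value]
        g -= len(subset) / len(data) * entropy(subset)
    return g

print(f"Gain(S, a1) = {gain(data, 0):.4f}")  # 0.0817
print(f"Gain(S, a2) = {gain(data, 1):.4f}")  # 0.0000
```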
Applications of Supervised learning
• Supervised learning can be used to solve a wide variety of problems, including:
• Spam filtering: Supervised learning algorithms can be trained to identify and classify spam emails
based on their content, helping users avoid unwanted messages.
• Image classification: Supervised learning can automatically classify images into different
categories, such as animals, objects, or scenes, facilitating tasks like image search, content
moderation, and image-based product recommendations.
• Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient data,
such as medical images, test results, and patient history, to identify patterns that suggest specific
diseases or conditions.
• Fraud detection: Supervised learning models can analyze financial transactions and identify
patterns that indicate fraudulent activity, helping financial institutions prevent fraud and protect
their customers.
• Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks,
including sentiment analysis, machine translation, and text summarization, enabling machines to
understand and process human language effectively.
Advantages of Supervised learning
• Supervised learning allows collecting data and produces data output from previous
experiences.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world
computation problems.
• It performs classification and regression tasks.
• It allows estimating or mapping the result to a new sample.
• We have complete control over choosing the number of classes we want in the
training data.
Disadvantages of Supervised learning
• Classifying big data can be challenging.
• Training for supervised learning needs a lot of computation time.
• Supervised learning cannot handle all complex tasks in Machine Learning.
• It requires a labelled data set.
• It requires a training process.
Assessments

QUESTION 1:
A _________ is a decision support tool that uses a tree-like graph or model of
decisions and their possible consequences, including chance event outcomes,
resource costs, and utility.

a) Decision tree
b) Graphs
c) Trees
d) Neural Networks
Assessments

QUESTION 2:
Which of the following statements is not true about the Naïve
Bayes classifier algorithm?
a) It cannot be used for Binary as well as multi-class
classifications
b) It is the most popular choice for text classification
problems
c) It performs well in Multi-class prediction as compared to
other algorithms
d) It is one of the fast and easy machine learning algorithms
to predict a class of test datasets
Did You Know?

• AI is multi-disciplinary.
• AI gets its core from Computer Science.
• AI can be used anywhere and everywhere.
Terminal Questions

1) Explain the difference between the Linear and Logistic Regression algorithms.
2) Explain the concept of the Naïve Bayes Classifier with an example.
3) Explain the decision tree induction algorithm with an example.
4) Differentiate between Classification and Regression.
Thank you
