AML AfterMid Merged


Applied Machine Learning

SE/SS ZG568

Swarna Chaudhary
Assistant Professor – BITS CSIS
swarna.chaudhary@pilani.bits-pilani.ac.in
BITS Pilani, Pilani Campus

Logistic Regression Classifiers


These slides are prepared by the instructor, with grateful
acknowledgement of Tom Mitchell, and many others who made
their course materials freely available online.
BITS Pilani, Pilani Campus
Session Content
• Discriminant Functions
• Probabilistic Discriminative Classifiers
• Logistic regression
• Evaluation Measures

3
BITS Pilani, Pilani Campus
Logistic Regression

4
BITS Pilani, Pilani Campus
Linear Regression versus logistic
regression

• Linear Regression could help us predict the student's test score on a scale of 0–100. Linear regression predictions are continuous (numbers in a range).
• Logistic Regression could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view the probability scores underlying the model's classifications.
5
BITS Pilani, Pilani Campus
Linear Regression versus Logistic Regression

Classification requires discrete values: y = 0 or 1

For linear regression, output values can be much greater than 1 or much less than 0.

Logistic regression constrains the output to 0 ≤ h_θ(x) ≤ 1.

BITS Pilani, Pilani Campus


Sigmoid/Logistic Function

• The sigmoid/logistic function takes a real value as input and outputs a value between 0 and 1.
• This framework is called logistic regression:
  – Logistic: the special mathematical sigmoid function it uses
  – Regression: combines a weight vector with observations to create an answer

7
BITS Pilani, Pilani Campus
Sigmoid Function

• Any function that takes the shape of an 'S' is called a sigmoid function
• It maps the input domain to a bounded, symmetrical output range
• Examples:
  – Logistic function
  – Hyperbolic tangent

BITS Pilani, Pilani Campus


Sigmoid Function
• Known as ‘squashing function’
• Logistic function (a variant of sigmoid) bounds the
output between 0 and 1
• Conditional probability of a random variable can
be expressed as a sigmoid (logistic) function

BITS Pilani, Pilani Campus


Hypothesis representation

• Want 0 ≤ h_θ(x) ≤ 1

• h_θ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)), so h_θ(x) = 1 / (1 + e^(−θᵀx))

• g(z) is called the sigmoid function, or the logistic function
  (Plot: g(z) versus z.)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus
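A minimal NumPy sketch of the sigmoid hypothesis; the θ and x values below are illustrative only (they match the decision-boundary example used later, θ₀ = −3, θ₁ = 1, θ₂ = 1).

import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # Hypothesis h_theta(x) = g(theta^T x), interpreted as P(y = 1 | x; theta)
    return sigmoid(np.dot(theta, x))

theta = np.array([-3.0, 1.0, 1.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 2.0, 2.0])        # x_0 = 1 (bias term), x_1, x_2
print(h(theta, x))                   # about 0.73, since -3 + 2 + 2 = 1 and g(1) ≈ 0.73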


Logistic Regression
(Plot: the sigmoid g(z), rising from 0 to 1 and crossing 0.5 at z = 0.)

Suppose we predict "y = 1" if h_θ(x) ≥ 0.5,
and predict "y = 0" if h_θ(x) < 0.5.

BITS Pilani, Pilani Campus


Logistic regression

h_θ(x) = g(θᵀx),  where g(z) = 1 / (1 + e^(−z)) and z = θᵀx

Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, i.e. z = θᵀx ≥ 0
and predict "y = 0" if h_θ(x) < 0.5, i.e. z = θᵀx < 0

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Learning model parameters

Training set: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), ⋯, (x⁽ᵐ⁾, y⁽ᵐ⁾)} — m examples

How do we choose the parameters (feature weights) θ?

BITS Pilani, Pilani Campus


Cost function for Linear Regression

J(θ) = (1/2m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

where Cost(h_θ(x), y) = ½ (h_θ(x) − y)²

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


MSE Cost Function

Linear regression: squared-error cost J(θ)
(Plots: with the sigmoid hypothesis the squared-error J(θ) is "non-convex"; the logistic cost used instead is "convex".)

BITS Pilani, Pilani Campus


Error (Cost) Function

• Our prediction function is non-linear (due to the sigmoid transform).
• Squaring this prediction, as in MSE, results in a non-convex function with many local minima.
• If the cost function has many local minima, gradient descent may not find the global minimum.
• So instead of Mean Squared Error, we use an error/cost function called Cross-Entropy, also known as Log Loss.
16
BITS Pilani, Pilani Campus
Cross Entropy
• Cross entropy can be used to define a loss function
• The true probability y is the true label, and the given distribution h_θ(x) is the predicted value of the current model.
• Cross-entropy loss, or log loss, measures the performance of a
classification model whose output is a probability value between 0 and 1.
• Cross-entropy loss increases as the predicted probability diverges from the
actual label.
• So predicting a probability of .012 when the actual observation label is 1
would be bad and result in a high loss value.
• A perfect model would have a log loss of 0.
• Cross-entropy loss can be divided into
two separate cost functions:
one for y=1 and
one for y=0.

17
BITS Pilani, Pilani Campus
Logistic regression cost function (cross entropy)

If y = 1: Cost = −log(h_θ(x))

(Plot: cost versus h_θ(x) on [0, 1]; the cost → 0 as h_θ(x) → 1 and → ∞ as h_θ(x) → 0.)

BITS Pilani, Pilani Campus


Logistic regression cost function

If y = 0: Cost = −log(1 − h_θ(x))

Cost = 0 if y = 0 and h_θ(x) = 0

(Plot: cost versus h_θ(x) on [0, 1]; the cost → ∞ as h_θ(x) → 1.)

BITS Pilani, Pilani Campus


Cost function for Logistic Regression

Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
                  −log(1 − h_θ(x))  if y = 0

(Plots: cost versus h_θ(x) on [0, 1] for the y = 1 case and the y = 0 case.)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Logistic regression cost function

• Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1,  −log(1 − h_θ(x)) if y = 0

• Combined into a single expression: Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

• If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
• If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Cost function

To fit the parameters θ: apply the Gradient Descent algorithm to minimize J(θ).

To make a prediction given a new x:

Output h_θ(x) = 1 / (1 + e^(−θᵀx)), interpreted as P(y = 1 | x; θ)

BITS Pilani, Pilani Campus
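A compact sketch of batch gradient descent on the cross-entropy cost; the learning rate, iteration count, and toy data below are made-up illustrations, not values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=10000):
    # X: (m, n) matrix with a leading column of ones; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)        # h_theta(x) for all m examples
        grad = X.T @ (h - y) / m      # gradient of the cross-entropy cost J(theta)
        theta -= lr * grad            # simultaneous update of every theta_j
    return theta

# Toy pass/fail data: one feature (hours studied) plus the bias column
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
print(sigmoid(X @ theta))             # predicted probabilities of y = 1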


Linear Decision boundary

• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)

(Plot: x₁ = Tumor Size, x₂ = Age; the line x₁ + x₂ = 3 is the decision boundary.)

E.g., θ₀ = −3, θ₁ = 1, θ₂ = 1

• Predict "y = 1" if −3 + x₁ + x₂ ≥ 0
• i.e. x₁ + x₂ ≥ 3

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Nonlinear Decision boundary

• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)

E.g., θ₀ = −1, θ₁ = 0, θ₂ = 0, θ₃ = 1, θ₄ = 1
• Predict "y = 1" if −1 + x₁² + x₂² ≥ 0 (a circular decision boundary)

• Higher-order polynomial features give more complex boundaries:
  h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₁²x₂ + θ₅x₁²x₂² + θ₆x₁³x₂ + ⋯)
Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Multi-class classification

• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Multi-class classification

Binary classification Multiclass classification

(Plots: a binary classification data set and a multiclass (three-class) data set, shown in the x₁–x₂ plane.)

BITS Pilani, Pilani Campus


One-vs-all (one-vs-rest)

(Plots: the three-class problem is converted into three binary problems, one per class; classifier i separates class i from the rest.)

Class 1: h_θ⁽¹⁾(x)
Class 2: h_θ⁽²⁾(x)
Class 3: h_θ⁽³⁾(x)

h_θ⁽ⁱ⁾(x) = P(y = i | x; θ)   (i = 1, 2, 3)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


One-vs-all

• Train a logistic regression classifier h_θ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.

• Given a new input x, pick the class i that maximizes h_θ⁽ⁱ⁾(x):

  prediction = argmaxᵢ h_θ⁽ⁱ⁾(x)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Softmax Regression
• Logistic regression can be generalized to multi-class classification, without having to train and combine multiple binary classifiers, using Softmax Regression, a.k.a. Multinomial Logistic Regression.
• For an instance x, softmax regression first computes a score s_k(x) = θ⁽ᵏ⁾ᵀ x for each class k, where each class has its own dedicated parameter vector θ⁽ᵏ⁾.
• It then estimates the probability p̂_k that the instance belongs to class k by applying the softmax function to the scores:

  σ(s(x))_k = exp(s_k(x)) / Σⱼ₌₁ᴷ exp(s_j(x))

  where K is the number of classes, s(x) is a vector containing the scores of each class for the instance x, and σ(s(x))_k is the estimated probability that the instance x belongs to class k given those scores.
• The Softmax Regression classifier predicts the class with the highest estimated probability.

29
BITS Pilani, Pilani Campus
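A small NumPy sketch of the softmax computation; the parameter matrix and feature vector below are arbitrary illustrative values.

import numpy as np

def softmax_probs(Theta, x):
    # Theta: (K, n) matrix with one parameter vector theta^(k) per class; x: (n,) feature vector
    scores = Theta @ x                     # s_k(x) = theta^(k)^T x for every class k
    scores -= scores.max()                 # shift for numerical stability (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # sigma(s(x))_k for k = 1..K

Theta = np.array([[0.5, 1.0, -1.0],
                  [0.0, -0.5, 2.0],
                  [-0.2, 0.3, 0.1]])
x = np.array([1.0, 2.0, 0.5])              # leading 1 acts as the bias term
p = softmax_probs(Theta, x)
print(p, "predicted class:", p.argmax())   # probabilities sum to 1; predict the most probable class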
Logistic regression (Classification)

• Model
  h_θ(x) = P(Y = 1 | X₁, X₂, ⋯, Xₙ) = 1 / (1 + e^(−θᵀx))

• Cost function
  J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾),   Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1;  −log(1 − h_θ(x)) if y = 0

• Learning
  Gradient descent: repeat { θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ }

• Inference
  Ŷ = h_θ(x_test) = 1 / (1 + e^(−θᵀx_test))

30
BITS Pilani, Pilani Campus
Logistic Regression Applications

• Credit card fraud: predicting whether a given credit card transaction is fraudulent or not
• Health: predicting whether a given mass of tissue is benign or malignant
• Marketing: predicting whether a given user will buy an insurance product or not
• Banking: predicting whether a customer will default on a loan
31
BITS Pilani, Pilani Campus
Evaluation Measures

32
BITS Pilani, Pilani Campus
Classifier Evaluation Metrics: Confusion
Matrix
Given m classes, an entry CM_i,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
The matrix may have extra rows/columns to provide totals.

• True Positive (TP): the number of predictions where the classifier correctly predicts the positive class as positive.
• True Negative (TN): the number of predictions where the classifier correctly predicts the negative class as negative.
• False Positive (FP): the number of predictions where the classifier incorrectly predicts the negative class as positive.
• False Negative (FN): the number of predictions where the classifier incorrectly predicts the positive class as negative.

Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

33
BITS Pilani, Pilani Campus
Classifier Evaluation Metrics: Accuracy, Error Rate

Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
It is most effective when the class distribution is relatively balanced.

Classification error / misclassification rate: 1 − accuracy, or
  Error rate = (FP + FN) / All

Actual \ Predicted | C  | ¬C |
C                  | TP | FN | P
¬C                 | FP | TN | N
Total              | P' | N' | All

34
BITS Pilani, Pilani Campus
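A short scikit-learn sketch of the confusion matrix and accuracy; the label vectors below are made up for illustration.

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]    # actual classes (1 = positive)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]    # classifier predictions

# confusion_matrix returns rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

accuracy = (tp + tn) / (tp + tn + fp + fn)          # (TP + TN) / All
print(accuracy, accuracy_score(y_true, y_pred))     # the two values agree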
Example
Given below is a confusion matrix for medical data where the class values are yes and
no for a class label attribute, cancer. Calculate the accuracy of the classifier.

35

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)

36
BITS Pilani, Pilani Campus
Class Imbalance Problem
• The main class of interest is rare.
• the data set distribution reflects a significant majority of the negative class and a
minority positive class.
• For example,
– fraud detection applications, the class of interest (or positive class) is “fraud,”
– medical tests, there may be a rare class, such as “cancer”
• Accuracy might not be a good option for measuring performance in case of class
imbalance problem

37

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Approaches to solve Class imbalance problem
• Generate Synthetic Samples
– new samples based on the distances between the point and its nearest
neighbors
• Change the performance metric
– Use Recall, Precision or ROC curves instead of accuracy
• Try different algorithms
– Some algorithms, such as Support Vector Machines and tree-based algorithms, may work better with imbalanced classes.

38
BITS Pilani, Pilani Campus
Model Evaluation Measures
• True positive rate (TPR) or sensitivity: the fraction of positive examples predicted correctly by the model, TPR = TP / (TP + FN)

• True negative rate (TNR) or specificity: the fraction of negative examples predicted correctly by the model, TNR = TN / (TN + FP)

• False positive rate (FPR): the fraction of negative examples predicted as the positive class, FPR = FP / (FP + TN)

• False negative rate (FNR): the fraction of positive examples predicted as the negative class, FNR = FN / (FN + TP)

39

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Evaluation Measures
Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)

Recall / Sensitivity (completeness): what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)

A perfect score is 1.0. There is an inverse relationship between precision and recall.

F measure (F1 or F-score): the harmonic mean of precision and recall,
  F1 = 2 × Precision × Recall / (Precision + Recall)
A high F1 value ensures that both precision and recall are high.

Fβ: a weighted measure of precision and recall,
  Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
  – assigns β times as much weight to recall as to precision

40
BITS Pilani, Pilani Campus
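A quick sketch computing precision, recall and F1 both by hand and with scikit-learn; the labels are illustrative.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)                          # TP / (TP + FP)
recall = tp / (tp + fn)                             # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))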
Example
Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)

41
BITS Pilani, Pilani Campus
Example: Contingency for Multi-Class
Classifier

42

BITS Pilani, Pilani Campus


Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
– At i-th iteration, use Di as test set and others as training set
– *Stratified cross-validation*: folds are stratified so that class dist. in
each fold is approx. the same as that in the initial data
– The Accuracy of the model is the average of the accuracy of each fold.

43
BITS Pilani, Pilani Campus
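A minimal scikit-learn sketch of the holdout and stratified k-fold procedures, using a bundled toy data set; any classifier could be substituted for logistic regression.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Holdout: e.g. 2/3 of the data for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout accuracy:", clf.fit(X_train, y_train).score(X_test, y_test))

# Stratified 10-fold cross-validation: overall accuracy = average of the 10 fold accuracies
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=skf).mean())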
Cross Validation

44

BITS Pilani, Pilani Campus


Stratified Cross Validation

45

BITS Pilani, Pilani Campus


Receiver Operating Characteristic (ROC)
Curve
• The ROC curve plots TPR against FPR, to graphically represent their trade-off.
• AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes.
  – The higher the AUC, the better the model is at predicting.
• Area Under Curve (AUC) of ROC evaluates model performance on average.
  – AUC of ROC = 1 for a perfect model
  – AUC of ROC = 0.5 if the model is random
• For model comparison, the model with the larger AUC of ROC is the superior or better-performing one.

46

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
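A short sketch of computing an ROC curve and AUC from predicted probabilities; the labels and scores below are invented for illustration, not the ten tuples from the next slide.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.43, 0.42, 0.25]  # P(positive) per tuple

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))        # 1.0 = perfect, 0.5 = random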
Example
The table below shows the probability value (column 3) returned by a probabilistic
classifier for each of the 10 tuples in a test set, sorted by decreasing probability
order. The corresponding ROC is given on right hand side.

47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
AUC-ROC
• Threshold selection
• Performance assessment
• Classifier comparison

48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References

• Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression"
  http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

• Andrew Ng, Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes"
  http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

BITS Pilani, Pilani Campus


References

• http://www.cs.cmu.edu/~tom/NewChapters.html
• http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
• https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c
• https://www.youtube.com/watch?v=-la3q9d7AKQ
• http://www.datasciencesmachinelearning.com/2018/11/handling-outliers-in-python.html
• http://www.cs.cmu.edu/~tom/10601_fall2012/slides/GenDiscr_LR_9-20-2012.pdf
• https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood
• http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
• https://towardsdatascience.com/logistic-regression-explained-9ee73cede081

Interpretability
• https://christophm.github.io/interpretable-ml-book/logistic.html

50
BITS Pilani, Pilani Campus
Naïve Bayesian Classifier

Swarna Chaudhary
BITS Pilani Assistant Professor
Pilani|Dubai|Goa|Hyderabad
swarna.chaudhary@pilani.bits-pilani.ac.in

1
• The slides presented here are obtained from the authors of the books and from
various other contributors. I hereby acknowledge all the contributors for their
material and inputs.
• I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


In this segment
• Probability Basics
• Bayes Theorem
• Naïve Bayesian Classifier
• An illustrative Example
• Naïve Bayes Classifier is a generative model
• Advantages of Naïve Bayes Classifier and when to use Naïve Bayes Classifier?
• Interpretability of Naïve Bayes Classifier
• Python Implementation

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability

Question: You roll a fair die. What is the probability of E = {1, 5}?  (Answer: P(E) = |E| / |S| = 2/6 = 1/3.)

BITS Pilani
Random Variables
• A random variable is a variable whose possible values are numerical outcomes of a random experiment.
• A random variable X is a function from the sample space to the real numbers.
• It is denoted by a capital letter.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


5
Random Variables
1. Discrete Random Variable: one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4.
   • X = sum of values on the roll of two dice; X has to be either 2, 3, 4, …, or 12.
   • X = the number of heads that result from the toss of 2 coins; here X can take the values 0, 1, or 2. X is a discrete random variable.

2. Continuous Random Variable: one which can take any value within a range.
   • Example: all the real numbers between [0, 1].

5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Random Variables

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability Distribution of Discrete Random Variable
• Discrete Probability Distribution: the mathematical definition of a discrete probability function, p(x), is a function that satisfies the following properties. This is referred to as the Probability Mass Function.

• Probability mass function (pmf): P(X = xᵢ)

• Simple facts about a pmf: P(X = xᵢ) ≥ 0 for every xᵢ, and Σᵢ P(X = xᵢ) = 1

5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability Distribution - Example

• Sum of 2 dice

10

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability of Continuous Random Variable
• Continuous Probability Distribution: the mathematical definition of a continuous probability function, f(x), is a function that satisfies the following properties. This is referred to as the Probability Density Function.
• Simple facts about a pdf: f(x) ≥ 0, ∫ f(x) dx = 1, and P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

11

5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Probability Density Function

BITS Pilani
Probability Density Function

BITS Pilani
Gaussian Distribution
• The normal curve is bell-shaped and has a single peak at the exact center of the distribution.
• The arithmetic mean, median, and mode of the distribution are equal and located at the peak.
• Half the area under the curve is above the peak, and the other half is below it.
• The normal distribution is symmetrical about its mean.
• The normal distribution is asymptotic: the curve gets closer and closer to the x-axis but never actually touches it.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gaussian Distribution

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


MATHEMATICAL FUNCTION (pdf)

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
16

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gaussian Distribution- Parameter
§ The normal distribution can be completely specified by two
parameters:

1. Mean
2. Standard deviation

17

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Properties of normal curve
No matter what the value of µ and σ are, area under normal curve
remain in certain fixed proportions within a specified number of
standard deviation on either side of µ. For example the interval
§ µ ± σ will always contain 68.26%
§ µ ± 2σ will always contain 95.44%
§ µ ± 3σ will always contain 99.73%

18

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gaussian Distribution

19

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Joint Probability
• Joint probability is the probability of two events happening together. The two
events are usually designated event A and event B. In probability terminology, it can
be written as:
• P(X and Y) or P(A ∩ B) or P(X, Y)
• Example: The probability that a card is a five and black, p(five and black) = 2/52
= 1/26, (There are two black fives in a deck of 52 cards, the five of spades and the
five of clubs)

20

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Conditional Probability
§ Conditional Probability is a measure of the probability of an event given that (by assumption,
presumption, assertion or evidence) another event has already occurred. If the event of interest
is A and the event B is known or assumed to have occurred.
§ This probability is written P(A|B), notation for the probability of A given B.

§ Example: A pair of dice is thrown. If the sum of the numbers on the dice is 7, find the probability that at least one of the dice shows 2. (Answer: the outcomes with sum 7 are (1,6), (2,5), (3,4), (4,3), (5,2), (6,1); of these, (2,5) and (5,2) show a 2, so the probability is 2/6 = 1/3.)

21

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Independence
• Two events A and B are independent if and only if P(A ∩ B) = P(A)P(B).
• If two events A and B are independent and P(B) ≠ 0, then P(A|B) = P(A).

• In general, for n events A₁, A₂, ⋯, Aₙ to be independent we must have P(A_{i1} ∩ ⋯ ∩ A_{ik}) = P(A_{i1}) ⋯ P(A_{ik}) for every sub-collection of the events.

• Example: If a die is thrown twice, what is the probability that the first throw results in a number greater than 4 and the second throw results in a number less than 3? (Answer: (2/6) × (2/6) = 1/9.)

22

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Law of Total Probability

23

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Law of Total Probability

24

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Bayes Rule
From the definition of conditional probability:

P(A|B) = P(B|A) P(A) / P(B),  where the denominator P(B) can be expanded with the law of total probability

25

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example of Bayes Theorem
Suppose the fraction of undergraduate students who play sport is 15% and the
fraction of graduate students who play sports is 23%. If one-fifth of the college
students are graduate students and the rest are undergraduates, what is the
probability that a student who plays a sport is a graduate student?

26

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
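A worked sketch of the example above in Python; the numbers come straight from the question, and the final figure follows from them.

# P(grad | sport) by Bayes' rule
p_sport_given_grad = 0.23        # fraction of graduate students who play sports
p_sport_given_ug = 0.15          # fraction of undergraduates who play sports
p_grad = 1 / 5                   # one-fifth of the students are graduates
p_ug = 1 - p_grad

p_sport = p_sport_given_grad * p_grad + p_sport_given_ug * p_ug     # law of total probability
p_grad_given_sport = p_sport_given_grad * p_grad / p_sport
print(round(p_grad_given_sport, 3))                                 # about 0.277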


Example of Bayes Theorem

Classify the record:


X = (Home Owner = No, Marital Status = “Married”, Annual Income =
$120k)
27

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Bayesian Classifiers
Let the attributes X1, X2,…,Xn and class labels Y1, Y2,..Ym be random variables

Given a record with attributes (X1, X2,…,Xn)


– Goal is to predict class Y=Yk
– We want to find the value of Y that maximizes P(Y| X1, X2,..,Xn)

Can we estimate P(Y| X1, X2,..,Xn) directly from data?

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Bayesian Classifiers
– Compute the posterior probability P(Y| X1, X2,.., Xn) for all values of Y using
the Bayes theorem

P(Y | X1, X2, .., Xn) = P(X1, X2, .., Xn | Y) P(Y) / P(X1, X2, .., Xn)

– Choose value of Class that maximizes P(Y| X1, X2,.., Xn)

– Equivalent to choosing value of Y that maximizes P(X1, X2,.., Xn | Y)P(Y)

How to estimate P(X1, X2,.., Xn | Y)?

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Conditional independence

30

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Applying conditional
independence
Naïve Bayes assumes the Xᵢ are conditionally independent given Y
e.g., P(X₁ | X₂, Y) = P(X₁ | Y)

P(X₁, X₂ | Y) = P(X₁ | X₂, Y) P(X₂ | Y) = P(X₁ | Y) P(X₂ | Y)

General form: P(X₁, ⋯, Xₙ | Y) = ∏ᵢ₌₁ⁿ P(Xᵢ | Y)

How many parameters are needed to describe P(X₁, ⋯, Xₙ | Y)?

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naïve Bayes Independence
assumption
Assumption:

P(X₁, ⋯, Xₙ | Y) = ∏ᵢ₌₁ⁿ P(Xᵢ | Y)

i.e., Xᵢ and Xⱼ are conditionally independent given Y for i ≠ j

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naïve Bayes Classifier
Assume independence among attributes Xi when the class is given:
– P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)

– Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data.

– A new point is classified as Yj if P(Yj) ∏ᵢ P(Xi | Yj) is maximal.

Slide adopted from “Introduction to Data mining” Vipin Kumar

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Class Conditional Probabilities

34

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Estimate Probabilities from Data?

• Class prior: P(C) = Nc / N
  e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes: P(Ai | Ck) = |Aik| / Nck
  where |Aik| is the number of instances having attribute value Ai and belonging to class Ck

• Examples:
  P(Status = Married | No) = 4/7
  P(Refund = Yes | Yes) = 0

Tid | Refund | Marital Status | Taxable Income | Evade
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Estimate Probabilities from Data?
• For continuous attributes:
• Discretize the range into bins
• one ordinal attribute per bin
• Too many bins – training records are too few to provide reliable probability for
each interval
• Too few bins – some intervals may aggregate from different classes & we may miss
the correct decision boundary
• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, can use it to estimate the conditional
probability P(Ai|Ck)

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Estimate Probabilities from Data?

• Normal distribution:

  P(Ai | cj) = (1 / √(2π σij²)) · e^(−(Ai − µij)² / (2σij²))

• One distribution for each (Ai, cj) pair

• For (Income, Class = No):
  • If Class = No
    • sample mean = 110
    • sample variance = 2975

(Training table: the same ten Refund / Marital Status / Taxable Income / Evade tuples as on the previous slide.)

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Married, Income = 120K)

• P(X | Class = No) = P(Refund = No | Class = No)
                      × P(Married | Class = No)
                      × P(Income = 120K | Class = No)
                    = 4/7 × 4/7 × 0.0072 = 0.0024

• P(X | Class = Yes) = P(Refund = No | Class = Yes)
                       × P(Married | Class = Yes)
                       × P(Income = 120K | Class = Yes)
                     = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), therefore P(No | X) > P(Yes | X)  =>  Class = No

(Training table: the same ten Refund / Marital Status / Taxable Income / Evade tuples as before.)

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naive Bayes Classifier
• If one of the conditional probability is zero, then the entire expression
becomes zero
• Probability estimation is done with Laplacian correction:

Original: P(Ai | C) = Nic / Nc

Laplace:  P(Ai | C) = (Nic + 1) / (Nc + c)

where c is the number of distinct values of the attribute

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example of Naïve Bayes Classifier
Tid | Refund | Marital Status | Taxable Income | Evade
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Data Mining
40
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naïve Bayes for Text Classification

Data Mining
41
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
A Simple Example
Which tag does the sentence "A very close game" belong to? i.e. P(sports | A very close game)

Feature engineering: bag of words, i.e. use word frequencies without considering order.

Using Bayes' theorem:
P(sports | A very close game) = P(A very close game | sports) P(sports) / P(A very close game)

Text                           | Tag
"A great game"                 | Sports
"The election was over"        | Not sports
"Very clean match"             | Sports
"A clean but forgettable game" | Sports
"It was a close election"      | Not sports

We assume that every word in a sentence is independent of the other ones.

"close" doesn't appear in sentences with the Sports tag, so P(close | sports) = 0, which makes the whole product 0.
Data Mining
42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Laplace smoothing

• Laplace smoothing: we add 1 or in general constant k to


every count so it’s never zero.
• To balance this, we add the number of possible words to the
divisor, so the division will never be greater than 1
• In our case, the possible words are ['a', 'great', 'very', 'over',
'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was',
'forgettable', 'match'].

Data Mining

43
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apply Laplace Smoothing
Word  | P(word | Sports) | P(word | Not Sports)
a     | (2+1) / (11+14)  | (1+1) / (9+14)
very  | (1+1) / (11+14)  | (0+1) / (9+14)
close | (0+1) / (11+14)  | (1+1) / (9+14)
game  | (2+1) / (11+14)  | (0+1) / (9+14)

Data Mining

44
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
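The table above can be reproduced with a few lines of Python; this sketch also finishes the classification of "A very close game", using class priors P(Sports) = 3/5 and P(Not sports) = 2/5 taken from the five training sentences.

from collections import Counter

sports = ["a great game", "very clean match", "a clean but forgettable game"]
not_sports = ["the election was over", "it was a close election"]
vocab = set(w for s in sports + not_sports for w in s.split())       # the 14 possible words

def smoothed_probs(sentences):
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    # Laplace smoothing: P(word | tag) = (count + 1) / (total words in tag + |vocab|)
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

p_sports, p_not = smoothed_probs(sports), smoothed_probs(not_sports)

score_sports, score_not = 3 / 5, 2 / 5                               # class priors
for w in "a very close game".split():
    score_sports *= p_sports[w]
    score_not *= p_not[w]

print("Sports" if score_sports > score_not else "Not sports", score_sports, score_not)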
Learning to Classify Text

• Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

• Naive Bayes is among most effective algorithms


• What attributes shall we use to represent text
documents??

Data Mining

45
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naïve Bayes Generative Model
• Naïve Bayes classifier uses likelihood and prior probability to calculate
conditional probability of the class
• Likelihood is based on joint probability, which is the core principle of probabilistic
generative model
• Naïve Bayes simplifies the calculation of likelihood by the assumption of
conditional independence among input parameters
• Each parameter’s likelihood is determined using joint probability of the input
parameter and the output label

BITS Pilani
Naïve Bayes – When to use
• When the training data is small
• When the features are (mostly) conditionally independent
• When there is little missing data
• When we have a large number of features with a minimal data set
  • Ex: text classification

BITS Pilani
Gaussian Naïve Bayes Algorithm
• A Gaussian Naive Bayes algorithm is a special type of NB
algorithm.

• It’s specifically used when the features have continuous values.

• It’s also assumed that all the features are following a Gaussian
distribution i.e, normal distribution.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naïve Bayesian in Sklearn
• GaussianNB: implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

  P(xᵢ | y) = (1 / √(2πσy²)) · exp(−(xᵢ − μy)² / (2σy²))

  The parameters σy and μy are estimated using maximum likelihood.


• MultinomialNB: implements the naive Bayes algorithm for multinomially distributed data, and
is one of the two classic naive Bayes variants used in text classification (where the data are
typically represented as word vector counts, although tf-idf vectors are also known to work well
in practice).

• Bernoulli NB: implements the naive Bayes training and classification algorithms for data that is
distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features
but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class
requires samples to be represented as binary-valued feature vectors;

• Categorical NB: implements the categorical naive Bayes algorithm for categorically distributed
data. It assumes that each feature, which is described by the index i, has its own categorical
distribution.

Reference: https://scikit-learn.org/stable/modules/naive_bayes.html
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
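A minimal usage sketch of GaussianNB on a bundled toy data set; the iris data here is just an example, not part of the slides.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()                       # assumes each feature is Gaussian within each class
model.fit(X_train, y_train)

print("accuracy:", model.score(X_test, y_test))
print("class probabilities for one test sample:", model.predict_proba(X_test[:1]))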
Extra Reading
• Text Classification in NLP using Naïve Bayes
• https://medium.com/@theflyingmantis/text-classification-in-nlp-naive-bayes-a606bf419f8c

• More on Probability
• https://www.probabilitycourse.com/chapter1/1_0_0_introduction.php

50

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


References
• https://towardsdatascience.com/probability-concepts-explained-bayesian-inference-for-parameter-estimation-90e8930e5348
• https://towardsdatascience.com/probability-concepts-explained-introduction-a7c0316de465
• https://medium.com/@theflyingmantis/text-classification-in-nlp-naive-bayes-a606bf419f8c
• https://www.youtube.com/watch?v=5Pck0Cqw-zc
• https://towardsdatascience.com/basic-probability-theory-and-statistics-3105ab637213
• https://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html

51

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Thank You

52

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification: Decision Tree

Swarna Chaudhary
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Topics to be covered
Module 5 : Decision Tree
1. Decision Tree Representation
2. Entropy and Information Gain for an attribute
3. ID3 Algorithm for decision tree learning
4. Alternative measures for selecting attributes
5. Prefer short hypothesis to longer ones, Occam’s razor
6. Overfitting in Decision Tree
7. Reduced Error Pruning and Rule post pruning
8. Interpretability of Decision Trees

BITS Pilani, Pilani Campus


BITS Pilani
Pilani Campus

Decision Tree
Decision trees
§ Decision Trees are one of the most widely used and practical methods of inductive inference
§ Features
  § A method for approximating discrete-valued functions
  § Learned functions are represented as decision trees (or if-then-else rules)
  § Interpretable, not a black box: humans can understand the decisions
  § Good at handling noisy or missing data (low information gain)
  § Fast and compact
  § Greedy (a disadvantage)
  § Overfitting avoided by pruning

BITS Pilani, Pilani Campus


Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• Decision tree is a flowchart-like tree structure,
• internal node (denoted by rectangles) denotes a test on an attribute,
• each branch represents an outcome of the test, and
• each leaf node (or terminal node, denoted by ovals ) holds a class label
• Used for classification.
• Easily converted to classification rules.
• Does not require any domain knowledge.
• Decision tree algorithms: ID3(Iterative Dichotomiser), C4.5 (successor of ID3),
CART(classification and regression trees).
• Adopt greedy approach
• Based on top- down recursive divide and conquer approach.

Data Mining

6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision Tree Induction: An Example
• Training data set: Buys_computer

age         | income | student | credit_rating | buys_computer
Youth       | high   | no      | fair          | no
Youth       | high   | no      | excellent     | no
Middle_aged | high   | no      | fair          | yes
Senior      | medium | no      | fair          | yes
Senior      | low    | yes     | fair          | yes
Senior      | low    | yes     | excellent     | no
Middle_aged | low    | yes     | excellent     | yes
Youth       | medium | no      | fair          | no
Youth       | low    | yes     | fair          | yes
Senior      | medium | yes     | fair          | yes
Youth       | medium | yes     | excellent     | yes
Middle_aged | medium | no      | excellent     | yes
Middle_aged | high   | yes     | fair          | yes
Senior      | medium | no      | excellent     | no

• Resulting tree:
  age?
  ├─ Youth       → student?  (no → no, yes → yes)
  ├─ Middle_aged → yes
  └─ Senior      → credit rating?  (excellent → no, fair → yes)

Data Mining
7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are discretized in advance)
• It uses “Attribute_selection_method” to determine the splitting criteria.
• Attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
• There are no samples left

• Complexity is O(n X |D| X log|D|), where n is the number of attributes describing the
tuples in D, |D| is the number of training tuples.

Data Mining

8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Splitting Criteria
• Splitting criterion tells us which attribute to test at node N by determining the
“best” way to separate or partition the tuples in D into individual classes
• Also, tells us which branches to grow from node N with respect to the outcomes
of the chosen test.
• Splitting criterion indicates the splitting attribute and may also indicate either a
split-point or a splitting subset
• Partition is “Pure” i.e. all the tuples in it belong to the same class.
• Splitting attribute A can be:
• A is discrete valued: outcome correspond to the known values of A.
• A is continuous-valued: the test at node N has two possible outcomes,
corresponding to the conditions A<=split_point and A > split_point,
respectively.
• A is discrete-valued and a binary tree must be produced: the test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A. If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied.

Data Mining

9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Splitting criteria…contd

a) Discrete-valued b) Continuous-valued c) discrete-valued and binary tree

Data Mining

10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision-tree Algorithm

Data Mining

11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Attribute Selection Measures
• Also, known as Splitting rules
• Measure is a heuristic for selecting the splitting criterion that “best” separates a
given data partition D, of class-labeled training tuples into individual classes.
• Partition should be pure (i.e., all the tuples that fall into a given partition would
belong to the same class).
• Provides ranking to each attribute of training tuple, and the attribute having “best”
score is chosen as the splitting attribute.

Data Mining

12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Brief Review of Entropy

m=2
Data Mining
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Selection Measure: Information
Gain (ID3/C4.5)
n Select the attribute with the highest information gain
n This attribute minimizes the expected number of tests needed to classify a given tuple.
n Let pi be the probability that a tuple in D belongs to class Ci, estimated by |Ci, D|/|D|, m is
the number of distinct classes, v is the number of distinct values in an attribute.
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)

• Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σⱼ₌₁ᵛ (|Dⱼ| / |D|) × Info(Dⱼ)

• The smaller the expected information required, the greater the purity of the partitions.

Data Mining

15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Attribute Selection Measure: Information
Gain (ID3/C4.5)
n Information gained by branching on attribute A

Gain(A) = Info(D) - InfoA(D)


n Defined as the difference between the original information requirement (i.e. based on just
the proportion of the classes) and the new requirement (i.e. obtained after partitioning of
attribute A).
n Attribute with highest information gain Gain(A), is chosen as the splitting attribute at
node N.

Data Mining

16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples)
• Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940

age         | pᵢ | nᵢ | I(pᵢ, nᵢ)
Youth       | 2  | 3  | 0.971
Middle_aged | 4  | 0  | 0
Senior      | 3  | 2  | 0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means "age = Youth" covers 5 of the 14 samples, with 2 yes'es and 3 no's:
Info_age(D) = (5/14)(−(2/5)log₂(2/5) − (3/5)log₂(3/5)) + (4/14)(−(4/4)log₂(4/4)) + (5/14)(−(3/5)log₂(3/5) − (2/5)log₂(2/5)) = 0.694

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(Training table: the same 14 buys_computer tuples as before.)

Data Mining
17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
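The 0.940 and 0.246 figures above can be checked with a short script; the age values and class labels are copied from the 14-tuple table.

from collections import Counter
from math import log2

def entropy(labels):
    # Info(D) = -sum_i p_i log2(p_i) over the class proportions in `labels`
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    # Gain(A) = Info(D) - Info_A(D) for a single discrete attribute A
    total = len(labels)
    info_a = sum(
        sum(1 for v in values if v == val) / total
        * entropy([lab for v, lab in zip(values, labels) if v == val])
        for val in set(values)
    )
    return entropy(labels) - info_a

ages = ["Youth", "Youth", "Middle_aged", "Senior", "Senior", "Senior", "Middle_aged",
        "Youth", "Youth", "Senior", "Youth", "Middle_aged", "Middle_aged", "Senior"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
# Prints roughly 0.940 and 0.247 (the slide's 0.246 comes from rounding the intermediate values)
print(entropy(buys), info_gain(ages, buys))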
Attribute Selection: Information Gain

Data Mining

18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision Tree

Decision tree for the concept “buys_computer”

Data Mining

19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Splitting Based on Continuous Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute

• Binary decision: (A < v) or (A ≥ v)


• consider all possible splits and finds the best cut
• can be more compute intensive

Data Mining
20
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Continuous Attributes

Data Mining
21
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Computing Information-Gain for
Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the value A in increasing order
• Typically, the midpoint between each pair of adjacent values is considered as
a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the minimum expected information requirement for A is
selected as the split-point for A
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of
tuples in D satisfying A > split-point

Data Mining

22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Question?
• Compute the information gain for every possible split for the given continuous-valued attribute.
• Sort the values and find the midpoint between each pair of adjacent values (split_point).
• Calculate Info(D).
• Calculate the entropy for each split_point, for "<=" and ">".
• Find the Gain for each split_point.

A1 | A2 | Class
T  | 1  | +
T  | 6  | +
T  | 5  | -
F  | 4  | +
F  | 7  | -
F  | 3  | -
F  | 8  | -
T  | 7  | +
F  | 5  | -

Sorted A2 values: 1, 3, 4, 5, 6, 7, 8 (classes: +, -, +, -, +, -, -)
Candidate split_points: 2, 3.5, 4.5, 5.5, 6.5, 7.5

Data Mining

23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a large number of distinct
values. Eg. Product_ID (unique for every tuple), resulting in large number of partitions
as Infoproduct_ID (D) = 0, Such partitioning is useless.
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem.
• It applies normalization to information gain using a “split information”

  SplitInfo_A(D) = − Σⱼ₌₁ᵛ (|Dⱼ| / |D|) × log₂(|Dⱼ| / |D|)
• GainRatio(A) = Gain(A)/SplitInfoA (D)
• Ex. Gain Ratio of “income” on the given data set.

• gain_ratio(income) = 0.029/1.557 = 0.019


• The attribute with the maximum gain ratio is selected as the splitting attribute

Data Mining

24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index, gini(D), is defined as

  gini(D) = 1 − Σⱼ₌₁ⁿ pⱼ²

  where pⱼ, the probability that a tuple in D belongs to class j, is |Cj,D| / |D|.

• It considers a binary split for each attribute.
• If a data set D is split on A (such that A ∈ SA?) into two subsets D1 and D2, the gini index of the split is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

• Reduction in impurity:

  Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the minimum gini_split(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).
Data Mining
25
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Computation of Gini Index
• Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no"

  gini(D) = 1 − (9/14)² − (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}

  gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

• Gini{low,high} is 0.458 and Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index.

• May need other tools, e.g., clustering, to get the possible split values.

(Training table: the same 14 buys_computer tuples as before.)

Data Mining

26
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
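The same Gini numbers can be verified in a few lines; the per-partition yes/no counts below are derived from the 14-tuple table (income ∈ {low, medium}: 7 yes / 3 no; income = high: 2 yes / 2 no).

def gini(counts):
    # gini(D) = 1 - sum_j p_j^2, given the class counts in D
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

def gini_split(partitions):
    # gini_A(D): size-weighted gini of the partitions produced by a binary split on A
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                     # 0.459 for the whole data set
print(round(gini_split([[7, 3], [2, 2]]), 3))     # 0.443 for the {low, medium} vs {high} split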
Comparing Attribute Selection Measures
• The three measures, in general, return good results but
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much smaller
than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large

Data Mining

27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision Tree Based Classification
• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple
data sets

Data Mining
28
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Inductive bias in decision tree learning
§ Inductive bias is the assumptions made by the model to learn the target
function and to generalize beyond training data.
§ What is the inductive bias of DT learning?
1. Shorter trees are preferred over longer trees.
   This is the bias exhibited by a simple breadth-first algorithm that generates all decision trees and selects the shortest one.
2. Prefer trees that place high-information-gain attributes close to the root.

BITS Pilani, Pilani Campus


Prefer shorter hypotheses: Occam's razor
§ Why prefer shorter hypotheses?
§ Arguments in favor:
  § There are fewer short hypotheses than long ones
  § If a short hypothesis fits the data, it is unlikely to be a coincidence
  § Elegance and aesthetics
§ Arguments against:
  § Not every short hypothesis is a reasonable one
§ Occam's razor says that when presented with competing hypotheses that make the same predictions, one should select the simplest solution.

BITS Pilani, Pilani Campus


Issues in decision trees learning
§ Overfitting
• Building trees that “adapt too much” to the training examples may lead to
“overfitting”.
• May therefore fail to fit additional data or predict future observations
reliably

BITS Pilani, Pilani Campus


Overfitting in decision tree learning

BITS Pilani, Pilani Campus


How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– General stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions (for pre-pruning) :
• Stop if number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).

5/15/22
33
BITS Pilani, Pilani Campus
How to Address Overfitting…
• Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error(i.e. expected error of the model on previously unseen
records) improves after trimming, replace sub-tree by a leaf node.
– Class label of leaf node is determined from majority class of instances in
the sub-tree

5/15/22
34
BITS Pilani, Pilani Campus
Good References
Decision Tree
• https://www.youtube.com/watch?v=eKD5gxPPeY0&list=PLBv09BD7ez_4temBw7vLA19p3tdQH6FYO&index=1
Overfitting
• https://www.youtube.com/watch?time_continue=1&v=t56Nid85Thg
• https://www.youtube.com/watch?v=y6SpA2Wuyt8
Decision tree for regression
• https://www.saedsayad.com/decision_tree_reg.htm

BITS Pilani, Pilani Campus


Textbook/Reference Books
• Tom M. Mitchell: Machine Learning, The McGraw-Hill Companies, Inc..
• Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson Education
• Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline Kamber and
Jian Pei Morgan Kaufmann Publishers

Data Mining
36
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

15 May 2022 BITS Pilani, Pilani Campus


Classification: Ensemble Methods

BITS Pilani
Pilani Campus
BITS Pilani
Pilani Campus

Text Book(s)
R1 An Introduction to Data Mining – Pang-Ning Tan, Michael Steinbach, Anuj
Karpatne, Vipin Kumar - 2005

These slides are prepared by the instructor, with grateful acknowledgement of and many others who made their
course materials freely available online.
Topics to be covered

• Ensemble Method
• Methods for constructing an Ensemble Classifier
• Bagging, Boosting
• Random Forest
• AdaBoost
• eXtreme Gradient Boosting (XGBoost)

BITS Pilani, Pilani Campus


Ensemble Methods
• Ensemble methods use multiple learning algorithms to
obtain better predictive performance than could be
obtained from any of the constituent learning algorithms
alone
• Construct a set of classifiers from the training data
• Predict class label of test records by combining the
predictions made by multiple classifiers
• Tend to reduce problems related to over-fitting of the
training data.
• By combining individual models, the ensemble model
tends to be more flexible (less bias) and less data-
sensitive(less variance).
4
BITS Pilani, Pilani Campus
Why Ensemble Methods work?
• 25 base classifiers
• Each classifier has
error rate, e = 0.35
• If base classifiers are identical, then the
ensemble will misclassify the same
examples predicted incorrectly by the
base classifiers depicted by dotted line
• Assume errors made
by classifiers are uncorrelated
• ensemble makes a wrong prediction
only if more than half of the base
classifiers predict incorrectly
• Probability that the ensemble classifier makes a wrong prediction:

  P(X ≥ 13) = Σᵢ₌₁₃²⁵ C(25, i) eⁱ (1 − e)²⁵⁻ⁱ = 0.06

5
BITS Pilani, Pilani Campus
When does Ensemble work?

• Ensemble classifier performs better than the


base classifiers when each classifier error is
smaller than 0.5
• Necessary conditions for an ensemble classifier
to perform better than a single classifier:
– Base classifiers should be independent of
each other
– Base classifiers should do better than a
classifier that performs random guessing

BITS Pilani, Pilani Campus


General Approach
Original training data: D

Step 1: Create multiple data sets    D1, D2, ...., Dt-1, Dt
Step 2: Build multiple classifiers   C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*

7
BITS Pilani, Pilani Campus
Methods for constructing Ensemble
Classifier
• Using different algorithms
• Using different parameters/hyperparameters
• Using different training sets
• By manipulating input features
• By manipulating the class labels

BITS Pilani, Pilani Campus


Types of Ensemble Methods
• Simple Ensemble methods: Max Voting, Averaging,
Weighted Averaging
• Advanced Ensemble Methods
• Bagging often considers homogeneous weak learners, learns them independently from each other in parallel, and combines them by some kind of averaging process.
• Boosting often considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones), and combines them by some deterministic strategy.
• Stacking often considers heterogeneous weak learners, learns them in parallel, and combines them by training a meta-model to output a prediction based on the different weak models' predictions.
9
BITS Pilani, Pilani Campus
Simple Ensemble Techniques

Max Voting
Ex: movie rating by 5 friends: 5, 4, 5, 4, 4
The result of max voting is the mode of the ratings: 4.

Averaging: (5+4+5+4+4)/5 = 4.4 is the final rating.

Weighted Average
Weights: 0.23, 0.23, 0.18, 0.18, 0.18
The result is calculated as (5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18) = 4.41.

BITS Pilani, Pilani Campus


Bootstrap Sampling
Bootstrap
– Bootstrapping is a sampling technique in which we create subsets of
observations from the original dataset, with replacement.
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
– When you sample with replacement, items are independent: one item does not affect the outcome of the other. You have a 1/7 chance of choosing the first item and a 1/7 chance of choosing the second item.
– If the two items are dependent, or linked to each other, this changes. When you choose the first item, you have a 1/7 probability of picking an item. Assuming you don't replace the item, you only have six items to pick from, which gives you a 1/6 chance of choosing a second item.

11
BITS Pilani, Pilani Campus
Bootstrap Sampling
• There are several bootstrap methods; a common one is the 0.632 bootstrap
  – A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples.
  – About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368)
• Where does the figure 63.2% come from?
  Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, this probability approaches e⁻¹ = 0.368. Thus, 36.8% of tuples will not be selected for training.

12
BITS Pilani, Pilani Campus
Bootstrap Sampling

BITS Pilani, Pilani Campus
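A small NumPy sketch that draws one bootstrap sample and checks the 63.2% / 36.8% figures empirically:

import numpy as np

rng = np.random.default_rng(0)
d = 10000
data = np.arange(d)

# Sample d times with replacement: a tuple can be picked more than once
bootstrap = rng.choice(data, size=d, replace=True)

in_bag = np.unique(bootstrap).size / d
print("fraction in the bootstrap sample:", in_bag)      # close to 1 - e^-1 ≈ 0.632
print("out-of-bag fraction (test set):", 1 - in_bag)    # close to e^-1 ≈ 0.368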


Bagging (Bootstrap Aggregating)
• Bagging stands for Bootstrap Aggregation
• Draw N items from X (initial dataset) with replacement
• Desired learners with High Variance (given different data, they give different
models)
• Decision tree and ANNs
• Use bootstrapping to generate L training sets and train one base learner with each
• Use Voting

BITS Pilani, Pilani Campus


Bagging

• Multiple subsets are created from the original dataset, selecting


observations with replacement.
• A base model (weak model) is created on each of these subsets.
• The models run in parallel and are independent of each other.
• The final predictions are determined by combining the predictions from all
the models.
BITS Pilani, Pilani Campus
Bagging Example

• Consider 1-dimensional data set:


Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

• Classifier is a decision stump


• Decision tree with one internal node (the root) which is immediately connected to the
terminal nodes (its leaves). Decision stump makes a prediction based on the value of just a
single input feature. Sometimes they are also called 1-rules

– Decision rule: x ≤ k
– Split point k is chosen based on entropy
x ≤ k
True False
yleft yright
16
BITS Pilani, Pilani Campus
Bagging Example
Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9    x <= 0.35 → y = 1
y 1 1 1 1 -1 -1 -1 -1 1 1                    x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1          x <= 0.7 → y = 1
y 1 1 1 -1 -1 -1 1 1 1 1                     x > 0.7 → y = 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9    x <= 0.35 → y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1                   x > 0.35 → y = -1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9    x <= 0.3 → y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1                   x > 0.3 → y = -1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1          x <= 0.35 → y = 1
y 1 1 1 -1 -1 -1 -1 1 1 1                    x > 0.35 → y = -1

17
BITS Pilani, Pilani Campus
Bagging Example
Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1      x <= 0.75 → y = -1
y 1 -1 -1 -1 -1 -1 -1 1 1 1                  x > 0.75 → y = 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1      x <= 0.75 → y = -1
y 1 -1 -1 -1 -1 1 1 1 1 1                    x > 0.75 → y = 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1      x <= 0.75 → y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1                   x > 0.75 → y = 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1        x <= 0.75 → y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1                   x > 0.75 → y = 1

Bagging Round 10:
x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9    x <= 0.05 → y = 1
y 1 1 1 1 1 1 1 1 1 1                        x > 0.05 → y = 1

18
BITS Pilani, Pilani Campus
Bagging Example

• Assume test set is the same as the original data


• Use majority vote to determine class of ensemble classifier

Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted Class (Sign of Sum) 1 1 1 -1 -1 -1 -1 1 1 1

19
BITS Pilani, Pilani Campus
Bagging Algorithm

20
BITS Pilani, Pilani Campus
Boosting
• What if a data point is incorrectly predicted by the first
model, and then by the next one (and perhaps by all models)?
Will simply combining the predictions provide better results? Such
situations are taken care of by boosting.
• Boosting is a sequential process, where each
subsequent model attempts to correct the errors of the
previous model.
• The succeeding models are dependent on the previous
model.

BITS Pilani, Pilani Campus


Boosting

• A subset is created from the original dataset.


• Initially, all data points are given equal weights.
• A base model is created on this subset.
• This model is used to make predictions on the whole dataset.
• Errors are calculated using the actual values and predicted values.
• The observations which are incorrectly predicted, are given higher
weights.
• Another model is created and predictions are made on the dataset.
(This model tries to correct the errors from the previous model)

BITS Pilani, Pilani Campus


Boosting
• Similarly, multiple models are created, each correcting the
errors of the previous model.
• The final model (strong learner) is the weighted mean of all
the models (weak learners).
• Individual models would not perform well on the entire
dataset, but they work well for some part of the dataset.
Thus, each model actually boosts the performance of the
ensemble.

BITS Pilani, Pilani Campus


Boosting

• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify


• Its weight is increased, therefore it is more likely
to be chosen again in subsequent rounds

24
BITS Pilani, Pilani Campus
Bagging and Boosting Algorithms
Bagging algorithms:
– Random Forest
Boosting algorithms:
– AdaBoost
– XGBoost

BITS Pilani, Pilani Campus


Random Forest

• Random Forest is an ensemble machine learning
algorithm that follows the bagging technique.
• The base estimators in random forest are
decision trees.
• Random forest randomly selects a set of
features which are used to decide the best split
at each node of the decision tree.

BITS Pilani, Pilani Campus


Random Forest

• Random subsets are created from the original


dataset (bootstrapping).
• At each node in the decision tree, only a random
set of features are considered to decide the best
split.
• A decision tree model is fitted on each of the
subsets.
• The final prediction is calculated by averaging
the predictions from all decision trees.
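A minimal scikit-learn sketch of the random-forest recipe above (dataset and settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators bootstrap samples; max_features limits the random feature subset tried at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())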

BITS Pilani, Pilani Campus


Random Forest
• Combines the predictions
made by multiple decision
trees
• Each tree is generated
based on the values of an
independent set of random
vectors
• Randomness is injected into
the model-building process

28
BITS Pilani, Pilani Campus
Random Forest

29
BITS Pilani, Pilani Campus
Random Vector selection –
Forest RI (Random Input selection)
• Randomly select F input features to split at each node of the
decision tree and then fully grow the tree without pruning
• This helps reduce the bias present in the resulting tree
• The predictions are combined using a majority voting scheme
• To increase randomness, bagging can also be used to generate
bootstrap samples
• The strength and correlation of random forests may depend on
the size of F features
– If F is sufficiently small, then the trees tend to become less
correlated
• The strength of the tree classifier tends to improve with a larger
number of features F
• Optimal number of features: F = log₂(d) + 1 (where d is the number of input
features)
30
BITS Pilani, Pilani Campus
AdaBoost
• Adaptive boosting or AdaBoost is one of the
simplest boosting algorithms. Usually, decision
trees are used for modelling. Multiple sequential
models are created, each correcting the errors
from the last model.
• AdaBoost assigns weights to the observations
which are incorrectly predicted and the
subsequent model works to predict these values
correctly.

BITS Pilani, Pilani Campus


Adaboost Algorithm

• Initially, all observations (n) in the dataset are


given equal weights (1/n).
• A model is built on a subset of data.
• Using this model, predictions are made on the
whole dataset.
• Errors are calculated by comparing the
predictions and actual values.
• While creating the next model, higher weights
are given to the data points which were
predicted incorrectly.
BITS Pilani, Pilani Campus
Adaboost Algorithm

• Weights can be determined using the error


value. For instance, the higher the error, the more
weight is assigned to the observation.
• This process is repeated until the error
function does not change, or the maximum
limit of the number of estimators is
reached.

BITS Pilani, Pilani Campus


AdaBoost Algorithm

Alpha is the weight of the classifier: it determines how much influence this stump will have in the final classification.

Weight Update:

BITS Pilani, Pilani Campus


AdaBoost
• Reduce weight if correctly classified else increase
• If any intermediate rounds produce error rate higher than 50%, the weights
are reverted to 1/n and the resampling procedure is repeated

BITS Pilani, Pilani Campus


Classifier weight

α = ½ ln((1 − ε) / ε)

ε = 0.3: α = ½ ln((1 − 0.3) / 0.3) = 0.42365
ε = 0.7: α = ½ ln((1 − 0.7) / 0.7) = -0.42365
ε = 0.5: α = ½ ln((1 − 0.5) / 0.5) = 0
Notice three interesting observations:
1) a classifier with accuracy higher than 50% results in a positive
weight for the classifier (in other words, α > 0 if ε < 0.5),
2) a classifier with exactly 50% accuracy gets weight 0, and thus does not
contribute to the final prediction, and
3) errors 0.3 and 0.7 lead to classifier weights of equal magnitude but opposite sign.
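A small Python sketch (assumed notation, not the course code) of the classifier weight α = ½ ln((1 − ε)/ε) and the instance-weight update, reproducing the numbers above:

import numpy as np

def classifier_weight(eps):
    return 0.5 * np.log((1 - eps) / eps)

for eps in (0.3, 0.7, 0.5):
    print(eps, round(classifier_weight(eps), 5))   # 0.42365, -0.42365, 0.0

# Instance weights: decrease for correctly classified points, increase for mistakes, then normalize.
w = np.full(5, 0.2)                     # 5 training instances, equal initial weights (illustrative)
correct = np.array([1, 1, 0, 1, 1])     # 1 = predicted correctly (illustrative outcome)
alpha = classifier_weight(0.2)          # weight of a classifier with illustrative error 0.2
w = w * np.exp(np.where(correct == 1, -alpha, alpha))
w = w / w.sum()
print(w)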
36
BITS Pilani, Pilani Campus
Training instance weight update

37
BITS Pilani, Pilani Campus
Adaboost Example
Initial weights for each data point Data points
for training

0.1 0.1 0.1


Original
Data +++ - - - - - ++

B1
0.0094 0.0094 0.4623
Boosting
Round 1 +++ - - - - - - - a = 1.9459

38
BITS Pilani, Pilani Campus
Adaboost Example
B1
0.0094 0.0094 0.4623
Boosting
Round 1 +++ - - - - - - - a = 1.9459

B2
0.3037 0.0009 0.0422
Boosting
Round 2 - - - - - - - - ++ a = 2.9323

B3
0.0276 0.1819 0.0038
Boosting
Round 3 +++ ++ ++ + ++ a = 3.8744

Overall +++ - - - - - ++
39
BITS Pilani, Pilani Campus
Good References

Ensemble methods
https://www.slideshare.net/hustwj/an-introduction-to-ensemble-methodsboosting-
bagging-random-forests-and-more
Bagging and Boosting
• https://www.youtube.com/watch?time_continue=2&v=m-S9Hojj1as

XGBoost
• https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
• https://xgboost.readthedocs.io/en/latest/tutorials/model.html
• https://www.youtube.com/watch?time_continue=71&v=Vly8xGnNiWs
• https://www.slideshare.net/ShangxuanZhang/xgboost-55872323
• https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
• https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7
• https://towardsdatascience.com/machine-learning-for-diabetes-562dd7df4d42

40
BITS Pilani, Pilani Campus
Thank You

41
15 May 2022 BITS Pilani, Pilani Campus
Support Vector Machines
BITS Pilani
Pilani Campus

Slide Credits: Dr. Chetana Gavankar


BITS Pilani
Pilani Campus

Text Book(s)
T1 Christopher Bishop: Pattern Recognition and Machine Learning, Springer
International Edition
T2 Tom M. Mitchell: Machine Learning, The McGraw-Hill Companies, Inc..

These slides are prepared by the instructor, with grateful acknowledgement of Prof. Tom Mitchell, Prof. Burges,
Prof. Andrew Moore and many others who made their course materials freely available online.
Topics to be covered
§ Linear Classifiers
§ Maximum Margin Classification
§ Linear SVM
§ SVM optimization problem
§ Soft Margin SVM

BITS Pilani, Pilani Campus


Linear Classifiers
w x + b>0 f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 How would you
classify this data?

w x + b<0

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 How would you
classify this data?

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 How would you
classify this data?

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 Any of these
would be fine..

..but which is
best?

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1
How would you
classify this data?

Misclassified
to +1 class

BITS Pilani, Pilani Campus


Linear Classifier

BITS Pilani, Pilani Campus


Linear Classifier

Decision Boundary

BITS Pilani, Pilani Campus


Large margin and support vectors

BITS Pilani, Pilani Campus


Support Vectors
• Geometric description of SVM is that the
max-margin hyperplane is completely
determined by those points that lie nearest to
it.
• Points that lie on this margin are the support
vectors.
• The points of our data set which if removed,
would alter the position of the dividing
hyperplane

BITS Pilani, Pilani Campus


Example

BITS Pilani, Pilani Campus


Support Vector Machines

BITS Pilani, Pilani Campus


Weight vector is perpendicular to
the hyperplane
Consider the points xa and xb,
which lie on the decision boundary.
This gives us two equations:
wᵀxa + b = 0
wᵀxb + b = 0
Subtracting these two equations gives us
wᵀ(xa − xb) = 0
Note that the vector xa − xb lies on the decision boundary,
and it is directed from xb to xa.
Since the dot product wᵀ(xa − xb) is zero,
w must be orthogonal to xa − xb and,
in turn, to the decision boundary.
BITS Pilani, Pilani Campus
Linear SVM Mathematically

2 parallel hyperplanes
bi1 and bi2 can be
written as,
bi1: w.x + b = 1,
bi2: w.x + b = -1

Margin of the decision


boundary is given by
the distance between
the 2 hyperplanes:
w.(x1 – x2) = 2
||w||*d = 2
d = 2/||w||

BITS Pilani, Pilani Campus


Linear SVM Mathematically
x+ and x−: points on the two margin hyperplanes; M = margin width

What we know:
• w · x+ + b = +1
• w · x− + b = −1
Distance between the two lines (obtained by solving the linear equations): M = 2 / ||w||
Maximizing the margin M = 2 / ||w|| is equivalent to minimizing ½ ||w||²

BITS Pilani, Pilani Campus


Solving the Optimization Problem

L(w, b, αi) = ½||w||² − Σ αi [yi (wᵀxi + b) − 1]

Find w and b such that
Φ(w) = ½||w||² is minimized,
and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1
(for yi = +1 this reads (wᵀxi + b) ≥ 1, and for yi = −1 it reads (wᵀxi + b) ≤ −1;
both are the same as yi (wᵀxi + b) ≥ 1)

BITS Pilani, Pilani Campus


Solving the Optimization Problem

Find w and b such that
Φ(w) = ½||w||² is minimized,   ← Primal problem
and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1

n Need to optimize a quadratic function subject to linear


inequality constraints.
n All constraints in SVM are linear
n Quadratic optimization problems are a well-known class of
mathematical programming problems, and many (rather
intricate) algorithms exist for solving them.
n The solution involves constructing an unconstrained problem
where a Lagrange multiplier αi is associated with every
constraint in the primal problem:

BITS Pilani, Pilani Campus


Optimization Problem

• Optimization problem is typically written:


Minimize f(x)
subject to
gi(x) = 0, i=1,…,p
hi(x) <= 0, i=1,…,m
• f(x) is called the objective function
• By changing x (the optimization variable) we wish to
find a value x∗ for which f(x) is at its minimum.
• p functions of gi define equality constraints and
• m functions hi define inequality constraints.
• The value we find MUST respect these constraints!

BITS Pilani, Pilani Campus


Solving the Optimization Problem

BITS Pilani, Pilani Campus


Solving the Optimization Problem

BITS Pilani, Pilani Campus


Dataset with noise
denotes +1
denotes -1
n Hard Margin: so far we require all data points to be classified correctly
- No training error
n What if the training set is noisy?

BITS Pilani, Pilani Campus


Soft Margin Classification
Slack variables ξi can be added to allow
misclassification of difficult or noisy examples.

What should our quadratic optimization criterion be?


Minimize
½ w·w + C Σ(k=1..R) εk

Constraints are

BITS Pilani, Pilani Campus


Slack Variable
• Slack variable as giving the classifier some
leniency when it comes to moving around
points near the margin.
• When C is large, larger slacks penalize the
objective function of SVM’s more than when C
is small.

BITS Pilani, Pilani Campus


Soft margin example

26
BITS Pilani, Pilani Campus
Soft Margin

BITS Pilani, Pilani Campus


Hard Margin versus
Soft Margin
n Hard Margin:
Find w and b such that
Φ(w) = ½ wᵀw is minimized, and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1

n Soft Margin incorporating slack variables:
Find w and b such that
Φ(w) = ½ wᵀw + C Σ ξi is minimized, and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

n Parameter C can be viewed as a way to control


overfitting.

BITS Pilani, Pilani Campus


Value of C parameter

• C parameter tells the SVM optimization how


much you want to avoid misclassifying each
training example.
• For large values of C, the optimization will choose
a smaller-margin hyperplane if that hyperplane
does a better job of getting all the training points
classified correctly.
• Conversely, a very small value of C will cause the
optimizer to look for a larger-margin separating
hyperplane, even if that hyperplane misclassifies
more points.
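A minimal scikit-learn sketch (illustrative data) showing how C trades margin width against training errors in a linear SVM:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wider margin, more support vectors; large C: narrower margin, fewer training errors tolerated.
    print(C, clf.n_support_, clf.score(X, y))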

BITS Pilani, Pilani Campus


Effect of Margin size v/s
misclassification cost

BITS Pilani, Pilani Campus


Effect of Margin size v/s
misclassification cost

BITS Pilani, Pilani Campus


Effect of Margin size v/s
misclassification cost

BITS Pilani, Pilani Campus


Linear SVMs: Overview
n The classifier is a separating hyperplane.
n Most “important” training points are support vectors; they
define the hyperplane.
n Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.

f(x) = ΣαiyixiTx + b

BITS Pilani, Pilani Campus


Good Web References for SVM
• Text categorization with Support Vector Machines:
learning with many relevant features - T. Joachims, ECML
• A Tutorial on Support Vector Machines for Pattern Recognition, Kluwer Academic
Publishers - Christopher J.C. Burges
• http://www.cs.utexas.edu/users/mooney/cs391L/
• https://www.coursera.org/learn/machine-learning/home/week/7
• https://towardsdatascience.com/support-vector-machine-introduction-to-machine-
learning-algorithms-934a444fca47
• https://data-flair.training/blogs/svm-kernel-functions/
• MIT 6.034 Artificial Intelligence, Fall 2010
• https://stats.stackexchange.com/questions/30042/neural-networks-vs-support-
vector-machines-are-the-second-definitely-superior
• https://www.sciencedirect.com/science/article/abs/pii/S0893608006002796
• https://medium.com/deep-math-machine-learning-ai/chapter-3-support-vector-
machine-with-math-47d6193c82be
• Radial basis kernel

BITS Pilani, Pilani Campus


Thank You

27 December 2020 BITS Pilani, Pilani Campus


Clustering: Expectation Maximization

Swarna Chaudhary
BITS Pilani Asst. Professor
Pilani Campus WILP Division, BITS-Pilani
•The slides presented here are obtained from the authors of the books and from various other contributors. I
hereby acknowledge all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Unsupervised Learning
Learning from unlabelled data
Let X = {x(1), x(2), x(3), ..., x(N)}
– The points do not carry labels
Supervised vs. Unsupervised Learning
Objective:
– Find patterns / sub-groups among the data
points using data similarity

Unsupervised Learning - Find groupings based on
data similarity

BITS Pilani, Pilani Campus


Clustering

Clustering aims to find groupings in data


– Given a X, find K clusters using
data similarity

Unsupervised Learning - Find groupings based on data similarity

BITS Pilani, Pilani Campus


Clustering
Inter-cluster
distances
Intra-cluster are
distances are
maximized
• Clustering aims to find groupings in minimized

data
• Given a X, find K clusters
using data similarity

Unsupervised Learning - Find groupings based on data similarity

BITS Pilani, Pilani Campus


Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by
the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some
or all of the hidden patterns

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What is not Cluster Analysis?
• Supervised classification
– Have class label information

• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name

• Results of a query
– Groupings are a result of an external specification

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Applications: Customer Segmentation
Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Document Clustering

Document-term frequency matrix:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation System

https://ieeexplore.ieee.org/document/7019655 10

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Other Applications
• Clustering on Gene Expression data
• https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491664/

• Identifying Fake News


• http://snap.stanford.edu/mis2/files/MIS2_paper_2.pdf
• https://ai.intelligentonlinetools.com/ml/k-means-clustering-example-word2vec/

• COVID-19 Cluster Analysis


• https://towardsdatascience.com/covid-19-cluster-analysis-405ebbd10049

• Case Study
• https://medium.com/@msuginoo/three-different-lessons-from-three-different-clustering-
analyses-data-science-capstone-5f2be29cb3b2

11

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Notion of a Cluster can be Ambiguous

How many clusters? Six Clusters

Two Clusters Four Clusters

12

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clustering
• A clustering is a set of clusters

• An important distinction among types of clustering : hierarchical and partitional sets of clusters

• Partitional Clustering
– A division data objects into non-overlapping subsets (clusters) such that each data object is in
exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

• Density based
– identify distinctive groups/clusters in the data, based on the idea that a cluster in a data space
is a contiguous region of high point density, separated from other such clusters by contiguous
regions of low point density.

• Distribution Based
– Idea is data generated from the same distribution, belongs to the same cluster if there exists
several distributions in the dataset.

13

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Partitional Clustering

Original Points A Partitional Clustering

14

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering

p1
p3 p4
p2

p1 p2 p3 p4
Hierarchical Clustering
Dendrogram

15

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Density-Based
Density-based
– A cluster is a dense region of points, which is separated by low-density regions,
from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and
outliers are present.

6 density-based clusters
16

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity

17

BITS Pilani, Pilani Campus


Data Matrix and Dissimilarity Matrix
Data matrix
– n data points with p dimensions; row i holds object i:

  [ x11  ...  x1f  ...  x1p ]
  [ ...  ...  ...  ...  ... ]
  [ xi1  ...  xif  ...  xip ]
  [ ...  ...  ...  ...  ... ]
  [ xn1  ...  xnf  ...  xnp ]

Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix

  [ 0                              ]
  [ d(2,1)  0                      ]
  [ d(3,1)  d(3,2)  0              ]
  [ :       :       :              ]
  [ d(n,1)  d(n,2)  ...  ...  0    ]

18

BITS Pilani, Pilani Campus


Proximity Measure for Nominal
Attributes
Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary
attribute)
Simple matching
– m: # of matches, p: total # of variables
– d(i, j) = (p − m) / p

19

BITS Pilani, Pilani Campus


Proximity Measure for Binary Attributes
Object j
A contingency table for binary data
Object i
Distance measure for symmetric binary variables:
Distance measure for asymmetric binary variables:
Jaccard coefficient (similarity measure for asymmetric binary variables):

20

BITS Pilani, Pilani Campus


Dissimilarity between Binary Variables
Example Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

– Gender is a symmetric attribute


– The remaining attributes are asymmetric binary
– Let the values Y and P be 1, and the value N 0
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

21

BITS Pilani, Pilani Campus


Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h )^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is
the order (the distance so defined is also called the L-h norm)
Properties
– d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
– d(i, j) = d(j, i) (Symmetry)
– d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric

22

BITS Pilani, Pilani Campus


Example: Minkowski Distance
(Dissimilarity Matrices)

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)   x1     x2     x3     x4
x1               0
x2               5      0
x3               3      6      0
x4               6      1      7      0

Euclidean (L2)   x1     x2     x3     x4
x1               0
x2               3.61   0
x3               2.24   5.1    0
x4               4.24   1      5.39   0

Supremum (L∞)    x1     x2     x3     x4
x1               0
x2               3      0
x3               2      5      0
x4               3      1      5      0
23

BITS Pilani, Pilani Campus


Example
The two tables above show the ‘area’ and ‘price’
of the same objects. Only the units of the variables
change.
Calculate Euclidean distance in both the cases.

24
BITS Pilani, Pilani Campus
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
– replace xif by its rank rif ∈ {1, ..., Mf}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  zif = (rif − 1) / (Mf − 1)
– compute the dissimilarity using methods for interval-scaled variables

25

BITS Pilani, Pilani Campus


Attributes of Mixed Type: Gower distance
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal

https://healthcare.ai/clustering-non-continuous-variables/
26
BITS Pilani, Pilani Campus
Example
Based on the information given in the table below, find most similar and most dissimilar persons
among them. Apply min-max normalization on income to obtain [0,1] range. Consider profession
and mother tongue as nominal. Consider native place as ordinal variable with ranking order of
[Village, Small Town, Suburban, Metropolitan]. Give equal weight to each attribute.

Name Income Profession Mother tongue Native Place

Ram 70000 Doctor Bengali Village

Balram 50000 Data Scientist Hindi Small Town

Bharat 60000 Carpenter Hindi Suburban

Kishan 80000 Doctor Bhojpuri Metropolitan

27
BITS Pilani, Pilani Campus
Solution
After normalizing income and quantifying native place, we get

Name Income Profession Mother tongue Native Place


Ram 0.67 Doctor Bengali 1
Balram 0 Data Scientist Hindi 2
Bharat 0.33 Carpenter Hindi 3
Kishan 1 Doctor Bhojpuri 4

d(Ram, Balram) = 0.67+1+1+(2-1)/(4-1)=3 d(Ram, Bharat) = 0.33+1+1+(3-1)/(4-1)=3


d(Ram, Kishan) = 0.33+0+1+(4-1)/(4-1) = 2.33 d(Balram, Bharat) = 0.33+1+0+(3-2)/(4-1)=1.67
d(Balram, Kishan) = 1+1+1+(4-2)/(4-1) = 3.67 d(Bharat, Kishan) = 0.67+1+1+(4-3)/(4-1) = 3

Most similar – Balram and Bharat; Most dissimilar – Balram and Kishan

28
BITS Pilani, Pilani Campus
Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.

Other vector objects: gene features in micro-arrays, …


Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

29

BITS Pilani, Pilani Campus


Example: Cosine Similarity
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 =
6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 =
4.12
cos(d1, d2 ) = 0.94
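A short NumPy check of the cosine similarity computed above:

import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94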
30

BITS Pilani, Pilani Campus


K-Means

31
BITS Pilani, Pilani Campus
K-Means Algorithm
• Works iteratively to find {𝜇k}and {rnk} such that J is minimized

• Iteration involves two key steps


• Find {rnk} , fixing {𝜇k} to minimize J
• Find {𝜇k} , fixing {rnk} to minimize J
• Let us look at each of these steps

BITS Pilani, Pilani Campus


K-Means Algorithm
A sample E-Step

BITS Pilani, Pilani Campus


K-Means Algorithm
A sample E-Step

BITS Pilani, Pilani Campus


K-Means Algorithm
A sample E-Step

BITS Pilani, Pilani Campus


K-Means Algorithm
E-Step:

For all xt ∊ X:

BITS Pilani, Pilani Campus


K-Means Algorithm

M-Step:

For all 𝜇k[where k = 1,2,...,K ] :

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .
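A compact NumPy sketch (illustrative, not the course code) of the E-step / M-step loop above; the initialization simply picks k random points as centroids:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]      # initial centroids
    for _ in range(n_iter):
        # E-step: assign each point to its nearest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels

# e.g. the weight / glucose data from the worked example that follows:
X = np.array([[72, 185], [56, 170], [60, 168], [68, 179],
              [72, 182], [77, 188], [70, 180], [84, 183]], dtype=float)
print(kmeans(X, k=2))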

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

E-Step in the second iteration

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

M-Step in the second iteration

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

M-Step in the second iteration

BITS Pilani, Pilani Campus


Example
Consider the analysis of weights of individuals and their respective blood glucose levels as given
below:
a) Identify the clusters using K-means clustering (k=2) for the given data, assuming candidate
1 and 2 as initial centroids.
b) How many iterations does it take before termination?

Candidate
Glucose
Weight level
1 72 185
2 56 170
3 60 168
4 68 179
5 72 182
6 77 188
7 70 180
8 84 183

42

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example
Iteration 1 – squared distances to the initial centroids C1 = candidate 1 (72, 185) and C2 = candidate 2 (56, 170):

Candidate (weight, glucose)   Cluster1 (72, 185)   Cluster2 (56, 170)
3 (60, 168)                   17² + 12²            2² + 4²
4 (68, 179)                   6² + 4²              9² + 12²
5 (72, 182)                   3² + 0²              12² + 16²
6 (77, 188)                   3² + 5²              18² + 21²
7 (70, 180)                   5² + 2²              10² + 14²
8 (84, 183)                   2² + 12²             13² + 28²

After the 1st iteration, the cluster groups are C1{1, 4, 5, 6, 7, 8} and C2{2, 3}
Re-computing centroids:
C1( [72+68+72+77+70+84]/6, [185+179+182+188+180+183]/6 ) = C1(73.83, 182.83)
C2( [56+60]/2, [170+168]/2 ) = C2(58, 169)

2nd iteration – squared distances (glucose, weight) to the resulting centroids:

Candidate (glucose, weight)   Cluster1 (182.83, 73.83)                        Cluster2 (169, 58)
1 (185, 72)                   (182.83−185)² + (73.83−72)² = 2.17² + 1.83²     (169−185)² + (58−72)² = 16² + 14²
2 (170, 56)                   (182.83−170)² + (73.83−56)² = 12.83² + 17.83²   (169−170)² + (58−56)² = 1² + 2²
3 (168, 60)                   (182.83−168)² + (73.83−60)² = 14.83² + 13.83²   (169−168)² + (58−60)² = 1² + 2²
4 (179, 68)                   (182.83−179)² + (73.83−68)² = 3.83² + 5.83²     (169−179)² + (58−68)² = 10² + 10²
5 (182, 72)                   (182.83−182)² + (73.83−72)² = 0.83² + 1.83²     (169−182)² + (58−72)² = 13² + 14²
6 (188, 77)                   (182.83−188)² + (73.83−77)² = 5.17² + 3.17²     (169−188)² + (58−77)² = 19² + 19²
7 (180, 70)                   (182.83−180)² + (73.83−70)² = 2.17² + 1.83²     (169−180)² + (58−70)² = 11² + 12²
8 (183, 84)                   (182.83−183)² + (73.83−84)² = 0.17² + 10.17²    (169−183)² + (58−84)² = 14² + 26²

After the 2nd iteration, the cluster groups are again C1{1, 4, 5, 6, 7, 8} and C2{2, 3};
there is no change in the cluster groups, hence the algorithm terminates.
43

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids
[Figure: k-means on a 2-D point set; six panels show the cluster assignments and centroid positions at iterations 1–6, converging to well-separated clusters]

44

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids

[Figure: the final clustering obtained at iteration 6 with this choice of initial centroids]

45

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids …
[Figure: a different choice of initial centroids; five panels show iterations 1–5 converging to a poorer, sub-optimal clustering]
46

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids …
[Figure: the final, sub-optimal clustering obtained at iteration 5]

47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Variations of the K-Means Method
Most of the variants of the k-means which differ in

– Selection of the initial k means

– Dissimilarity calculations

– Strategies to calculate cluster means

Handling categorical data: k-modes

– Replacing means of clusters with modes

– Using new dissimilarity measures to deal with categorical objects

– Using a frequency-based method to update modes of clusters

48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data contains outliers.

49

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

50

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

51

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

52

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Discussion on the K-means Method
• Efficiency: O(tkn) where n = # of objects, k = # of clusters, and t = # of iterations
• Normally, k, t << n; thus, an efficient method
• K-means clustering often terminates at a local optimum
• Initialization can be important to find high-quality clusters
• Need to specify k, the number of clusters, in advance
• In practice, one often runs it for a range of k values and selects the “best” k
• Sensitive to noisy data and outliers
• Variations: using k-medians, k-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional space
• Use k-modes for categorical data
• Not suitable for discovering clusters with non-convex shapes
• Use density-based clustering, kernel k-means, etc.

BITS Pilani, Pilani Campus


Drawbacks of K-means Clustering
• Hard Clustering vs Soft Clustering

https://towardsdatascience.com/gaussian-mixture-models-d13a5e915c8e
BITS Pilani, Pilani Campus
Gaussian Distribution
• The normal curve is bell-shaped and has a single peak at the exact center of the distribution.
• The arithmetic mean, median, and mode of the distribution are equal and located at
the peak.
• Half the area under the curve lies above (to the right of) the mean, and the other half lies below it.
• The normal distribution is symmetrical about its mean.

55

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


MATHEMATICAL FUNCTION (Pdf)

56

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Multivariate Gaussian
• Gaussian for a single variable (univariate) x:
  N(x | μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

• If x is a D-dimensional vector and x is Gaussian distributed:
  N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

Where
x: input vector
μ : D-dimensional mean vector
Σ : D × D covariance matrix
|Σ| : determinant of Σ
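A small SciPy sketch (values are illustrative) that evaluates this multivariate Gaussian density:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                      # D-dimensional mean vector
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])     # D x D covariance matrix

p = multivariate_normal(mean=mu, cov=Sigma).pdf([0.5, -0.5])
print(p)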

BITS Pilani, Pilani Campus


Univariate and Multivariate
Gaussian density

BITS Pilani, Pilani Campus


Multimodal Data
• A limitation of the Gaussian distribution is that it is intrinsically unimodal
• A single distribution is not sufficient to fit many real data sets, which are multimodal

Time to next Geyser Eruptions (in Mins) Length of Geyser Eruptions (in
Mins)
Demo of fitting a single gaussian to the data.

BITS Pilani, Pilani Campus


Multimodal Data

• A single distribution is not sufficient to fit


many real data sets

Time to next Geyser Eruptions (in Mins) Length of Geyser Eruptions (in
Mins)
Demo of fitting a k gaussian to the data.

BITS Pilani, Pilani Campus


Mixture Model
n A mixture model assumes that a set of observed objects is a mixture of instances
from multiple probabilistic clusters, and conceptually each observed object is
generated independently
n A probabilistic cluster is a distribution over the data space, which can be mathematically
represented using a probability density function (or distribution function).

n Our task: infer a set of k probabilistic clusters that is most likely to generate D

BITS Pilani, Pilani Campus


BITS Pilani, Pilani Campus
Gaussian Mixture Model
n A Gaussian mixture model is a probabilistic model that assumes all the data points are
generated from a mixture of a finite number of Gaussian distributions with unknown
parameters.
• A Gaussian Mixture is a function that is comprised of several Gaussians, each identified
by k ∈ {1,…, K}, where K is the number of clusters of our dataset. Each Gaussian k in
the mixture is comprised of the following parameters:
• A mean μ that defines its centre.
• A covariance Σ that defines its width. This would be equivalent to the dimensions of an
ellipsoid in a multivariate scenario.
• A mixing probability π that defines how big or small the Gaussian function will be.

https://scikit-learn.org/stable/modules/mixture.html

BITS Pilani, Pilani Campus


Gaussian Mixture Model

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.
Mixing Coefficients

𝙆 - Number of Gaussians

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.

Component Densities

𝙆 - Number of Gaussians

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.

Parameters of MoG:

π : {π1 , . . . , πK },
μ : {μ1 , . . . , μK }
Σ : {Σ1, . . . ΣK }

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Log Likelihood
• Mixture of Gaussians: Component densities are Gaussian.

Parameters of MoG:

π : {π1 , . . . , πK },
μ : {μ1 , . . . , μK }
Σ : {Σ1, . . . ΣK }

Log likelihood

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


γ(zk )

Parameters of MoG:

π : {π1 , . . . , πK },
μ : {μ1 , . . . , μK }
Σ : {Σ1, . . . ΣK }

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm
• Method for finding ML solutions for models with Latent Variables
• Broad applicability for estimating parameters for various models
• Estimating parameters of a MoG is one application
• Task : To Find ML parameters of MoG

● Set the derivative of ln p(X | π, μ, Σ)
○ w.r.t. μk to 0 and solve it for μk
○ w.r.t. Σk to 0 and solve it for Σk
○ w.r.t. πk to 0 and solve it for πk  [constrained optimization using Lagrange multipliers,
since the mixing coefficients must sum to 1]

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm

● We get 𝜇k as

Where

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm

● We get Σk as

Where

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm
● We get 𝝿k as

Where

BITS Pilani, Pilani Campus


The EM (Expectation Maximization) Algorithm
• Powerful method for finding maximum likelihood solutions for models
• The (EM) algorithm: A framework to approach maximum likelihood of parameters in
statistical models.
– E-step assigns objects to clusters according to the current parameters of probabilistic
clusters
– M-step finds the new clustering or parameters that maximize the expected likelihood

• The k-means algorithm has two steps at each iteration:


– Expectation Step (E-step): Given the current cluster centers, each object is assigned to
the cluster whose center is closest to the object: An object is expected to belong to the
closest cluster
– Maximization Step (M-step): Given the cluster assignment, for each cluster, the
algorithm adjusts the center so that the sum of distance from the objects assigned to this
cluster and the new center is minimized

75
75
BITS Pilani, Pilani Campus
EM Algorithm for MoG
1. Start by placing gaussians randomly.
2. Repeat until it converges.
1. E step: With the current means and variances, find the probability of each data point xi
coming from each gaussian.
2. M step: Once these probability assignments are computed, use them to re-
estimate the Gaussians’ means and variances to better fit the data points.

BITS Pilani, Pilani Campus


EM for Gaussian Mixtures
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to
the parameters (comprising the means and covariances of the components and the mixing
coefficients).
1. Initialize the means μk, covariances Σk and mixing coefficients πk, and evaluate the initial
value of the log likelihood.
2. E step. Evaluate the responsibilities using the current parameter values
• For fixed values of μ1, μ2, .. μk and Σ1, Σ2, .. Σk and Π1 ….Πk, Compute soft assignments per
data point (allocating the probability of that data point belonging to each one of the clusters).

• For each observation i, vector γ (aka responsibility vector) is (γi1, γi2,… γiK ), where K
is the total number of clusters, or often referred to as the number of components.
• The cluster responsibilities for a single data point i should sum to 1.

BITS Pilani, Pilani Campus


EM for Gaussian Mixtures
3. M step. Re-estimate the parameters using the current responsibilities

, where

4. Evaluate the log likelihood

and check for convergence of either the parameters or the log likelihood. If the convergence
criterion is not satisfied return to step 2.
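A minimal scikit-learn sketch (illustrative data) that fits a mixture of Gaussians with EM and exposes the quantities named above:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (200, 2)), rng.normal(5, 1.5, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_)                 # mixing coefficients pi_k
print(gmm.means_)                   # means mu_k
print(gmm.predict_proba(X[:3]))     # responsibilities gamma for the first few points (rows sum to 1)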

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm
To Estimate: M-Step
Initialize 𝝿, 𝜇,Σ and
E-Step
also evaluate the log likelihood

Perform E-Step Given 𝛾(zk)

Perform M-Step Given 𝝿, 𝜇,Σ

Repeat Until Convergence - [ Use log likelihood / parameters to decide this]

BITS Pilani, Pilani Campus


BITS Pilani
Pilani Campus

Thank You!
Applied Machine Learning
Swarna Chaudhary
Assistant Professor
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
BITS Pilani
Pilani Campus

NeuralNetwork
These slides are prepared by the instructor, with grateful acknowledgement of
Tom Mitchell, Andrew Ng and many others who made their course materials
freely available online.
Session Content

• Perceptron (Chapter 4 Tom Mitchell)


• Neural Network Architecture (Andrew Ng Notes
and Chapter 4 Tom Mitchell)
• Back propagation Algorithm (Andrew Ng Notes)

3
Artificial Neural Network
Neural Networks
• Origins: Algorithms that try to mimic the brain.
• Very widely used in 80s and early 90s; popularity diminished in late 90s.
• Recent resurgence: State-of-the-art technique for many applications
• Artificial neural networks are not nearly as complex or intricate as the actual
brain structure

Based on slide by Andrew Ng 8


When to use Neural Network
Perceptron

-
Perceptron Training rule
Perceptron Training
Gradient Descent
Gradient Descent
Gradient Descent

12
Perceptron: Sigmoid Function

13
Decision Surface of Perceptron
AND Operation
Multilayer network

• Single perceptrons can only express linear


decision surfaces.
• In contrast, the kind of multilayer networks
learned by the FORWARD and
BACKPROPAGATION algorithm are capable
of expressing a rich variety of nonlinear
decision surfaces
Neural networks

Output units

Hidden units

Input units
Layered feed-forward network

• Neural networks are made up of nodes or units,


connected by links
• Each link has an associated weight and activation level
• Each node has an input function (typically summing over
weighted inputs), an activation function, and an output
Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 9
Neural Network

bias units x0 a(2)


0

hθ(x)

Layer 1 Layer 2 Layer 3


(Input Layer) (Hidden Layer) (Output Layer)

Slide by Andrew Ng 12
Feed-Forward Process
• Input layer units are set by some exterior function
(think of these as sensors), which causes their output
links to be activated at the specified level
• Working forward through the network, the input
function of each unit is applied to compute the input
value
– Usually this is just the weighted sum of the activation on
the links feeding into this node

• The activation function transforms this input


function into a final value
– Typically this is a nonlinear function, often a sigmoid
function corresponding to the “threshold” of that node
Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 13
Neural Network
x0
Multiple Output Units:One-vs-Rest
Multiple Output Units:One-vs-Rest
Neural Network Classification
Given:
{(x1, y1), (x2, y2), ..., (xn, yn)}
s ∈ N^L contains the # of nodes at each layer
– s0 = d (# features)

Binary classification:
y = 0 or 1
1 output unit (sL-1 = 1)

Multi-class classification (K classes):
y ∈ R^K, e.g. pedestrian, car, motorcycle, truck
K output units (sL-1 = K)

Slide by Andrew Ng 19
Layering Representations
x1 ... x20
x21 ... x40
x41 ... x60

.
..
x381 ... x400
20 × 20 pixel images
d = 400, 10 classes

Each image is “unrolled” into a vector x of pixel intensities


24
Cost Function
(9.1 NN video of Andrew Ng)
Optimizing the Neural Network
Forward Propagation
• Given one labeled training instance (x , y):

Forward Propagation
• a(1) = x
• z(2) = Θ(1) a(1)
• a(2) = g(z(2)) [add a0(2)]
• z(3) = Θ(2) a(2)
• a(3) = g(z(3)) [add a0(3)]
• z(4) = Θ(3) a(3)
• a(4) = hΘ(x) = g(z(4))

Based on slide by Andrew Ng 34
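A minimal NumPy sketch (assumed layer sizes, random weights) of the forward-propagation steps listed above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    # thetas[l] has shape (units in layer l+1, units in layer l + 1) so it absorbs the bias unit
    a = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)   # add the bias unit a0 = 1
        a = sigmoid(theta @ a)     # z = Theta a, then a = g(z)
    return a                       # h_theta(x)

# Example: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(4, 4)), rng.normal(size=(4, 5)), rng.normal(size=(1, 5))]
print(forward(np.array([0.2, -0.1, 0.5]), thetas))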


Learning in NN: Backpropagation
• Similar to the perceptron learning algorithm, we cycle
through our examples
– If the output of the network is correct, no changes are made
– If there is an error, weights are adjusted to reduce the error

• The trick is to assess the blame for the error and divide
it among the contributing weights

Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 31


Backpropagation Intuition
• Each hidden node j is “responsible” for some fraction
of the error δj(l) in each of the output nodes to
which it connects

• δj(l) is divided according to the strength of the


connection between hidden node and the output
node

• Then, the “blame” is propagated back to provide the


error values for the hidden layer

Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 35


Backpropagation

32
Han Kamber, 3rd Edition
34
Random Initialization
Training a Neural Network
Pick a network architecture (connectivity pattern between nodes)

• # input units = # of features in dataset


• # output units = # classes

Reasonable default: 1 hidden layer


• or if >1 hidden layer, have same # hidden units in
every layer (usually the more the better)

Based on slide by Andrew Ng 53


Training a Neural Network
1. Randomly initialize weights
2. Implement forward propagation to get hΘ(xi)
for any instance xi
3. Implement code to compute cost function J(Θ)
4. Implement backprop to compute partial derivatives

5. Use gradient descent with backprop to fit the network

Based on slide by Andrew Ng 54


38
Early Stopping
If we let a complex model train long
enough on a given data set, it can
eventually learn the data exactly.

Given data that isn’t represented in the


training set, the model will perform
poorly when analyzing the data
(overfitting).

How is the sweet spot for


training located?

When the error on the training set begins


to deviate from the error on the
validation set, a threshold can be set to
determine the early stopping condition
and the ideal number of epochs to train.
26
39
BITS Pilani, Pilani Campus
Dropout

40
Dropouts

28
41
BITS Pilani, Pilani Campus
Good References for understanding
Neural Network
Andrew Ng videos on neural network

https://www.youtube.com/watch?v=EVeqrPGfuCY&li
st=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=45

Autonomous driving using neural network


https://www.youtube.com/watch?v=ppFyPUx9RIU&li
st=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=57
Applied Machine Learning
Swarna Chaudhary
Assistant Professor
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
BITS Pilani
Pilani Campus

ConvolutionalNeural Network
These slides are prepared by the instructor, with grateful acknowledgement of Andrew
Ng, Tom Mitchell, Mitesh Khapra and many others who made their course materials
freely available online.
Session Content

• Convolutional Neural Network


• Recurrent Neural Network

3
Convolutional Neural Network

Source: https://www.youtube.com/watch?v=40riCqvRoMs
Convolutional Neural Network

5/68
How a computer sees an image.
source: http://cs231n.github.io/classification/

6/68

In CNN, every image is represented in the form of an array of pixel values.


Convolution
Convolution
Convolution
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

ew+fx+iy+jz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

ew+fx+iy+jz fw+gx+jy+kz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

ew+fx+iy+jz fw+gx+jy+kz gw+hx+ky+Az

6/68
Example of kernel: Blur

1 1 1
∗ 1 1 1 =
1 1 1
blurs the image

CS7015 (Deep Mitesh M. Khapra 9/68


Learning) : Lecture 11
Example of kernel: Edge detection

1 1 1
∗ 1 -8 1 =
1 1 1
detects the edges

CS7015 (Deep Mitesh M. Khapra 11/68


Learning) : Lecture 11
Convolution
We just slide the kernel over the input
image
Each time we slide the kernel we get
one value in the output

CS7015 (Deep Mitesh M. Khapra 13/68


Learning) : Lecture 11
Convolution
We just slide the kernel over the input
image
Each time we slide the kernel we get
one value in the output

CS7015 (Deep Mitesh M. Khapra 13/68


Learning) : Lecture 11
2D convolutions applied to images
• We just slide the kernel over the
input image
• Each time we slide the kernel we
get one value in the output
• The resulting output is called a
feature map.
• We can use multiple filters to get
multiple feature maps.
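A small NumPy sketch (stride 1, no padding) of sliding a kernel over an image to produce one feature map:

import numpy as np

def conv2d(img, kernel):
    H, W = img.shape
    f = kernel.shape[0]                       # assume a square f x f kernel
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + f, j:j + f] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
edge = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=float)   # the edge-detection kernel shown earlier
print(conv2d(img, edge))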

CS7015 (Deep Mitesh M. Khapra 13/68


Learning) : Lecture 11
R GB

What would a 3D filter look like?


It will be 3D and we will refer to it as a volume
Once again we will slide the volume over the
3D input and compute the convolution oper-
ation
We will assume that the filter always extends
to the depth of the image

In effect, we are doing a 2D convolution oper-


ation on a 3D input (because the filter moves
along the height and the width but not along
the depth)
As a result the output will be 2D (only width
OUTPUT
and height, no depth)
INPUT
Once again we can apply multiple filters to get
multiple feature maps
CS7015 (Deep Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of
the output
Notice that we can’t place the kernel at the
= corners as it will cross the input boundary

pixel of interest

CS7015 (Deep Mitesh M. Khapra 19/68


Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of
the output
Notice that we can’t place the kernel at the
= corners as it will cross the input boundary
This is true for all the shaded points (the
kernel crosses the input boundary)
This results in an output which is of smaller
dimensions than the input
As the size of the kernel increases, this be-
comes true for even more pixels

CS7015 (Deep Mitesh M. Khapra 20/68


Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of
the output
Notice that we can’t place the kernel at the
= corners as it will cross the input boundary
This is true for all the shaded points (the
kernel crosses the input boundary)
This results in an output which is of smaller
dimensions than the input
pixel of interest
As the size of the kernel increases, this be-
comes true for even more pixels
For example, let’s consider a 5 ×5 kernel
We have an even smaller output now

CS7015 (Deep Mitesh M. Khapra 20/68


Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of the output

Notice that we can’t place the kernel at the
corners as it will cross the input boundary
This is true for all the shaded points (the
kernel crosses the input boundary)
This results in an output which is of smaller
dimensions than the input
As the size of the kernel increases, this becomes true for even more pixels
For example, let’s consider a 5 × 5 kernel: we have an even smaller output now
In general, W2 = W1 − F + 1
H2 = H1 − F + 1
We will refine this formula further

20/68
What if we want the output to be of the same size as the input?
We can use something known as padding
Pad the inputs with an appropriate number of 0 inputs so that you can now apply the kernel
at the corners
Let us use pad P = 1 with a 3 × 3 kernel
This means we will add one row and one column of 0 inputs at the top, bottom, left and right
[Figure: the input bordered by a ring of zeros]

We now have: W2 = W1 − F + 2P + 1, H2 = H1 − F + 2P + 1

21/68
What does the stride S do?
It defines the intervals at which the filter is applied (here S = 2)
Here, we are essentially skipping every 2nd pixel, which will again result in an output
of smaller dimensions
[Figure: the zero-padded input with the kernel applied at every 2nd position]

22/68
What does the stride S do?
It defines the intervals at which the filter is applied (here S = 2)
Here, we are essentially skipping every 2nd pixel, which will again result in an output
of smaller dimensions

So what should our final formula look like?

W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1

22/68
Finally, coming to the depth of the output.
Each filter gives us one 2D output.
K filters will give us K such 2D outputs.
[Figure: an input volume W1 × H1 × D1 mapped to an output volume W2 × H2 × D2]

W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1
D2 = K

23/68
Finally, coming to the depth of the output.
Each filter gives us one 2D output.
K filters will give us K such 2D outputs.
We can think of the resulting output as a K × W2 × H2 volume.
Thus D2 = K.

W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1
D2 = K

23/68
Example

Input: a 227 × 227 × 3 image, convolved with 96 filters of size 11 × 11 × 3
Stride = 4, Padding = 0
W2 = ?, H2 = ? (output depth = 96)
W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1

CS7015 (Deep Mitesh M. Khapra 24/68


Learning) : Lecture 11
Example

Input: a 227 × 227 × 3 image, convolved with 96 filters of size 11 × 11 × 3
Stride = 4, Padding = 0
W2 = (227 − 11 + 0) / 4 + 1 = 55
H2 = (227 − 11 + 0) / 4 + 1 = 55
Output volume: 55 × 55 × 96
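A one-line helper (sketch) for the output-size formula, checked on the numbers above:

def conv_out(W1, F, P, S):
    return (W1 - F + 2 * P) // S + 1

print(conv_out(227, 11, 0, 4))   # 55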

24/68
Applying multiple filters
Features fed to a classifier over {car, bus, monument, flower}:
– Raw pixels
– Edge Detector
– SIFT / HOG

28/68
Input Features Classifier

car, bus,monument, flower

0 0 0 0 0
0 1 1 1 0
0 1 -8 1 0
0 1 1 1 0
0 0 0 0 0

car, bus, monument, flower

[Matrix of classifier weights — learn these weights]

Instead of using handcrafted kernels such as edge detectors, can we learn meaningful kernels/filters in
addition to learning the weights of the classifier?
CS7015 (Deep Mitesh M. Khapra 29/68
Learning) : Lecture 11
Even better: instead of using handcrafted kernels (such as edge detectors), can we learn multiple meaningful kernels/filters in addition to learning the weights of the classifier?
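For intuition, here is a sketch (scipy assumed, toy image of my own) that applies the handcrafted edge-detection kernel from the slide; a CNN's goal is to learn such kernels rather than hard-code them.

```python
import numpy as np
from scipy.signal import convolve2d

# The handcrafted kernel shown above: a Laplacian-style edge detector
# embedded in a 5 x 5 grid.
edge_kernel = np.array([
    [0, 0,  0, 0, 0],
    [0, 1,  1, 1, 0],
    [0, 1, -8, 1, 0],
    [0, 1,  1, 1, 0],
    [0, 0,  0, 0, 0],
], dtype=float)

# Toy image: a bright square on a dark background
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0

edges = convolve2d(image, edge_kernel, mode='same')
# The kernel sums to zero, so flat regions give (near) zero response,
# while the square's boundary gives large responses: it "detects" edges.
print(np.abs(edges).max())          # large, at the boundary
print(edges[0, 0], edges[16, 16])   # ~0 in the flat background and flat interior
```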
Convolutional Neural Network

Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier?
• Yes, we can!
• Simply by treating these kernels as parameters and learning them, in addition to the weights of the classifier, using backpropagation
• Such a network is called a Convolutional Neural Network
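As a concrete, framework-specific illustration (PyTorch is my assumption here, not something the slides prescribe): a convolution layer's kernels are ordinary learnable parameters. The configuration below mirrors the earlier 227 × 227 × 3 example.

```python
import torch
import torch.nn as nn

# 3 input channels (RGB), 96 filters of size 11 x 11, stride 4, no padding
conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=0)

print(conv.weight.shape)          # torch.Size([96, 3, 11, 11]) -> K x D1 x F x F
print(conv.weight.requires_grad)  # True: the kernels are learned by backpropagation

x = torch.randn(1, 3, 227, 227)   # a batch containing one 227 x 227 RGB image
print(conv(x).shape)              # torch.Size([1, 96, 55, 55])
```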
Consider classifying an image with 16 pixels into 10 classes (digits)
• This is what a regular feed-forward neural network would look like
• There are many dense connections here
• For example, all the 16 input neurons contribute to the computation of h11
• Contrast this with what happens in the case of convolution
Sparse Connectivity
• Only a few local neurons participate in the computation of h11
• For example, only pixels 1, 2, 5, 6 contribute to h11
• The connections are much sparser
• In a CNN, by superimposing the kernel over the image, only a few inputs are considered at a time to compute a weighted sum of the selected pixel inputs
• The output h11 is thus calculated using sparse connections rather than connections to all the inputs
• This sparse connectivity reduces the number of parameters in the model
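A back-of-the-envelope count (illustrative numbers of my own, not from the slides) showing how sparse connectivity, together with the weight sharing that convolution also uses, shrinks the parameter count:

```python
# 4 x 4 input (16 pixels) feeding a 3 x 3 grid of hidden units,
# each hidden unit computed from a 2 x 2 patch of the image.

dense  = 16 * 9       # fully connected: every hidden unit sees all 16 pixels -> 144 weights
sparse = (2 * 2) * 9  # sparse connectivity: each hidden unit sees only its 2 x 2 patch -> 36 weights
shared = 2 * 2        # convolution also shares one kernel across all positions -> 4 weights

print(dense, sparse, shared)  # 144 36 4  (biases ignored for simplicity)
```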


An example CNN:

Input: 32 × 32
Convolution Layer 1: F = 5, S = 1, K = 6, P = 0  → 28 × 28 × 6   (Param = 150)
Pooling Layer 1:     F = 2, S = 2, K = 6, P = 0  → 14 × 14 × 6   (Param = 0)
Convolution Layer 2: F = 5, S = 1, K = 16, P = 0 → 10 × 10 × 16  (Param = 2400)
Pooling Layer 2:     F = 2, S = 2, K = 16, P = 0 → 5 × 5 × 16    (Param = 0)
FC 1 (120)   (Param = 48120)
FC 2 (84)    (Param = 10164)
Output (10)  (Param = 850)

It has alternate convolution and pooling layers.
What does a pooling layer do? Let us see.
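A PyTorch sketch of this architecture, as a rough rendering of the table above: the single-channel 32 × 32 input and the ReLU activations are my assumptions (the slide does not specify them), and the class name SmallCNN is made up.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),          # 32 -> 28, 6 feature maps
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 28 -> 14
            nn.Conv2d(6, 16, kernel_size=5),         # 14 -> 10, 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 10 -> 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                            # 16 * 5 * 5 = 400 features
            nn.Linear(400, 120),                     # FC 1
            nn.ReLU(),
            nn.Linear(120, 84),                      # FC 2
            nn.ReLU(),
            nn.Linear(84, 10),                       # Output (10 classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
print(model(torch.randn(1, 1, 32, 32)).shape)        # torch.Size([1, 10])
```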
Pooling

Input (a single 4 × 4 feature map):
1 4 2 1
5 8 3 4
7 6 4 5
1 3 1 2

maxpool with 2 × 2 filters, stride 2:
8 4
7 5

maxpool with 2 × 2 filters, stride 1:
8 8 4
8 8 5
7 6 5

Instead of max pooling we can also do average pooling.
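A small numpy sketch (the function name max_pool2d is mine) that reproduces both results; switching .max() to .mean() gives average pooling.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a single 2D feature map (minimal sketch)."""
    H, W = x.shape
    H2, W2 = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1, 4, 2, 1],
              [5, 8, 3, 4],
              [7, 6, 4, 5],
              [1, 3, 1, 2]])

print(max_pool2d(x, size=2, stride=2))  # [[8. 4.] [7. 5.]]
print(max_pool2d(x, size=2, stride=1))  # [[8. 8. 4.] [8. 8. 5.] [7. 6. 5.]]
```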
Pooling Layer

The pooling layer downsamples each feature map produced by the convolution layer: it slides a small window (e.g. 2 × 2) over the map and keeps a summary value (the maximum or the average). This reduces the spatial dimensions and makes the representation more robust to small shifts in the input.
Training a CNN
• A CNN can be implemented as a feedforward neural network in which only a few weights (those corresponding to the kernel entries) are active
• The rest of the weights are zero
• We can thus train a convolutional neural network using backpropagation, by thinking of it as a feedforward neural network with sparse connections
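A minimal PyTorch training step (the framework and the tiny model are my assumptions, not from the slides) showing that the kernels receive gradients and are updated exactly like the classifier weights:

```python
import torch
import torch.nn as nn

# A tiny CNN, just to show that kernels are trained like any other weight.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),      # learnable 5 x 5 kernels
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6 * 28 * 28, 10),          # classifier weights
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 1, 32, 32)            # toy batch of 8 single-channel 32 x 32 images
y = torch.randint(0, 10, (8,))           # toy labels

loss = loss_fn(model(x), y)
loss.backward()                          # gradients flow into the conv kernels too
print(model[0].weight.grad.shape)        # torch.Size([6, 1, 5, 5])
optimizer.step()                         # kernels updated alongside the classifier weights
```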
Recurrent Networks: Time Series

• Suppose we want to predict next state of world


– and it depends on history of unknown length
– e.g., robot with forward-facing sensors trying to predict next
sensor reading as it moves and turns

• Idea: use hidden layer in network to capture state history


Recurrent Neural Network
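A minimal numpy sketch of the recurrent idea (all sizes and weights are illustrative, not from the slides): the hidden state is updated from both the current input and the previous state, so it summarises the history seen so far.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8                          # e.g. 4 sensor readings per time step
W_xh = 0.1 * rng.normal(size=(n_hidden, n_in))
W_hh = 0.1 * rng.normal(size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

h = np.zeros(n_hidden)                         # initial state: no history yet
for t in range(10):                            # a sequence of 10 sensor readings
    x_t = rng.normal(size=n_in)
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state depends on input AND previous state

# h can now be fed to an output layer to predict the next sensor reading
print(h.shape)                                 # (8,)
```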
Artificial Neural Networks: Summary

• Highly non-linear regression/classification


• Hidden layers learn intermediate representations
• Potentially millions of parameters to estimate
• Stochastic gradient descent, local minima problems

• Deep networks have produced real progress in many fields


– computer vision
– speech recognition
– mapping images to text
– recommender systems
– …
• They learn very useful non-linear representations
Good References for understanding Neural Networks

• Mitesh Khapra: https://www.youtube.com/watch?v=yw8xwS15Pf4
• Visualization of CNN: https://www.youtube.com/watch?v=cNBBNAxC8l4
• Back propagation: https://www.youtube.com/watch?v=G5b4jRBKNxw&list=PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU&index=25
• Convolutional neural network tutorial: https://www.simplilearn.com/tutorials/deep-learning-tutorial/convolutional-neural-network?source=sl_frs_nav_playlist_video_clicked