AML AfterMid Merged


Applied Machine Learning

SE/SS ZG568

Swarna Chaudhary
Assistant Professor – BITS CSIS
swarna.chaudhary@pilani.bits-pilani.ac.in
BITS Pilani, Pilani Campus

Logistic Regression Classifiers


These slides are prepared by the instructor, with grateful
acknowledgement of Tom Mitchell, and many others who made
their course materials freely available online.
BITS Pilani, Pilani Campus
Session Content
• Discriminant Functions
• Probabilistic Discriminative Classifiers
• Logistic regression
• Evaluation Measures

3
BITS Pilani, Pilani Campus
Logistic Regression

4
BITS Pilani, Pilani Campus
Linear Regression versus logistic
regression

• Linear Regression could help us predict the student's test score on a scale of 0–100. Linear regression predictions are continuous (numbers in a range).
• Logistic Regression could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view the probability scores underlying the model's classifications.
5
BITS Pilani, Pilani Campus
Linear Regression versus Logistic Regression

Classification requires discrete values: y = 0 or 1

For linear regression, output values can be much greater than 1 or much less than 0.

Logistic regression constrains the output to 0 ≤ h_θ(x) ≤ 1.

BITS Pilani, Pilani Campus


Sigmoid/Logistic Function

• The sigmoid/logistic function takes a real value as input and outputs a value between 0 and 1.
• This framework is called logistic regression:
  – Logistic: the special mathematical sigmoid function it uses
  – Regression: combines a weight vector with observations to create an answer

7
BITS Pilani, Pilani Campus
Sigmoid Function

• Any function that takes the shape of an 'S' is called a sigmoid function
• It maps the input domain to a bounded, symmetrical output range
• Examples:
  – Logistic function
  – Hyperbolic tangent

BITS Pilani, Pilani Campus


Sigmoid Function
• Known as ‘squashing function’
• Logistic function (a variant of sigmoid) bounds the
output between 0 and 1
• Conditional probability of a random variable can
be expressed as a sigmoid (logistic) function

BITS Pilani, Pilani Campus


Hypothesis representation

• Want 0 ≤ h_θ(x) ≤ 1

• h_θ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)), so h_θ(x) = 1 / (1 + e^(−θᵀx))

• g(z) is called the sigmoid function, or the logistic function
  (Plot: g(z) versus z.)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus
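A minimal NumPy sketch of the sigmoid hypothesis; the θ and x values below are illustrative only (they match the decision-boundary example used later, θ₀ = −3, θ₁ = 1, θ₂ = 1).

import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # Hypothesis h_theta(x) = g(theta^T x), interpreted as P(y = 1 | x; theta)
    return sigmoid(np.dot(theta, x))

theta = np.array([-3.0, 1.0, 1.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 2.0, 2.0])        # x_0 = 1 (bias term), x_1, x_2
print(h(theta, x))                   # about 0.73, since -3 + 2 + 2 = 1 and g(1) ≈ 0.73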


Logistic Regression
(Plot: the sigmoid g(z), rising from 0 to 1 and crossing 0.5 at z = 0.)

Suppose we predict "y = 1" if h_θ(x) ≥ 0.5,
and predict "y = 0" if h_θ(x) < 0.5.

BITS Pilani, Pilani Campus


Logistic regression

h_θ(x) = g(θᵀx),  where g(z) = 1 / (1 + e^(−z)) and z = θᵀx

Suppose we predict "y = 1" if h_θ(x) ≥ 0.5, i.e. z = θᵀx ≥ 0
and predict "y = 0" if h_θ(x) < 0.5, i.e. z = θᵀx < 0

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Learning model parameters

Training set: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), ⋯, (x⁽ᵐ⁾, y⁽ᵐ⁾)} — m examples

How do we choose the parameters (feature weights) θ?

BITS Pilani, Pilani Campus


Cost function for Linear Regression

J(θ) = (1/2m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

where Cost(h_θ(x), y) = ½ (h_θ(x) − y)²

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


MSE Cost Function

Linear regression: squared-error cost J(θ)
(Plots: with the sigmoid hypothesis the squared-error J(θ) is "non-convex"; the logistic cost used instead is "convex".)

BITS Pilani, Pilani Campus


Error (Cost) Function

• Our prediction function is non-linear (due to the sigmoid transform).
• Squaring this prediction, as in MSE, results in a non-convex function with many local minima.
• If the cost function has many local minima, gradient descent may not find the global minimum.
• So instead of Mean Squared Error, we use an error/cost function called Cross-Entropy, also known as Log Loss.
16
BITS Pilani, Pilani Campus
Cross Entropy
• Cross entropy can be used to define a loss function
• The true probability y is the true label, and the given distribution h_θ(x) is the predicted value of the current model.
• Cross-entropy loss, or log loss, measures the performance of a
classification model whose output is a probability value between 0 and 1.
• Cross-entropy loss increases as the predicted probability diverges from the
actual label.
• So predicting a probability of .012 when the actual observation label is 1
would be bad and result in a high loss value.
• A perfect model would have a log loss of 0.
• Cross-entropy loss can be divided into
two separate cost functions:
one for y=1 and
one for y=0.

17
BITS Pilani, Pilani Campus
Logistic regression cost function (cross entropy)

If y = 1: Cost = −log(h_θ(x))

(Plot: cost versus h_θ(x) on [0, 1]; the cost → 0 as h_θ(x) → 1 and → ∞ as h_θ(x) → 0.)

BITS Pilani, Pilani Campus


Logistic regression cost function

If y = 0: Cost = −log(1 − h_θ(x))

Cost = 0 if y = 0 and h_θ(x) = 0

(Plot: cost versus h_θ(x) on [0, 1]; the cost → ∞ as h_θ(x) → 1.)

BITS Pilani, Pilani Campus


Cost function for Logistic Regression

Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
                  −log(1 − h_θ(x))  if y = 0

(Plots: cost versus h_θ(x) on [0, 1] for the y = 1 case and the y = 0 case.)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Logistic regression cost function

• Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1,  −log(1 − h_θ(x)) if y = 0

• Combined into a single expression: Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

• If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
• If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Cost function

To fit the parameters θ: apply the Gradient Descent algorithm to minimize J(θ).

To make a prediction given a new x:

Output h_θ(x) = 1 / (1 + e^(−θᵀx)), interpreted as P(y = 1 | x; θ)

BITS Pilani, Pilani Campus
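A compact sketch of batch gradient descent on the cross-entropy cost; the learning rate, iteration count, and toy data below are made-up illustrations, not values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=10000):
    # X: (m, n) matrix with a leading column of ones; y: (m,) labels in {0, 1}
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)        # h_theta(x) for all m examples
        grad = X.T @ (h - y) / m      # gradient of the cross-entropy cost J(theta)
        theta -= lr * grad            # simultaneous update of every theta_j
    return theta

# Toy pass/fail data: one feature (hours studied) plus the bias column
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
print(sigmoid(X @ theta))             # predicted probabilities of y = 1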


Linear Decision boundary

• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)

(Plot: x₁ = Tumor Size, x₂ = Age; the line x₁ + x₂ = 3 is the decision boundary.)

E.g., θ₀ = −3, θ₁ = 1, θ₂ = 1

• Predict "y = 1" if −3 + x₁ + x₂ ≥ 0
• i.e. x₁ + x₂ ≥ 3

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Nonlinear Decision boundary

• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)

E.g., θ₀ = −1, θ₁ = 0, θ₂ = 0, θ₃ = 1, θ₄ = 1
• Predict "y = 1" if −1 + x₁² + x₂² ≥ 0 (a circular decision boundary)

• Higher-order polynomial features give more complex boundaries:
  h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₁²x₂ + θ₅x₁²x₂² + θ₆x₁³x₂ + ⋯)
Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Multi-class classification

• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Multi-class classification

Binary classification Multiclass classification

(Plots: a binary classification data set and a multiclass (three-class) data set, shown in the x₁–x₂ plane.)

BITS Pilani, Pilani Campus


One-vs-all (one-vs-rest)

(Plots: the three-class problem is converted into three binary problems, one per class; classifier i separates class i from the rest.)

Class 1: h_θ⁽¹⁾(x)
Class 2: h_θ⁽²⁾(x)
Class 3: h_θ⁽³⁾(x)

h_θ⁽ⁱ⁾(x) = P(y = i | x; θ)   (i = 1, 2, 3)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


One-vs-all

• Train a logistic regression classifier h_θ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.

• Given a new input x, pick the class i that maximizes h_θ⁽ⁱ⁾(x):

  prediction = argmaxᵢ h_θ⁽ⁱ⁾(x)

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Softmax Regression
• Logistic regression can be generalized to multi-class classification, without having to train and combine multiple binary classifiers, using Softmax Regression, a.k.a. Multinomial Logistic Regression.
• For an instance x, softmax regression first computes a score s_k(x) = θ⁽ᵏ⁾ᵀ x for each class k, where each class has its own dedicated parameter vector θ⁽ᵏ⁾.
• It then estimates the probability p̂_k that the instance belongs to class k by applying the softmax function to the scores:

  σ(s(x))_k = exp(s_k(x)) / Σⱼ₌₁ᴷ exp(s_j(x))

  where K is the number of classes, s(x) is a vector containing the scores of each class for the instance x, and σ(s(x))_k is the estimated probability that the instance x belongs to class k given those scores.
• The Softmax Regression classifier predicts the class with the highest estimated probability.

29
BITS Pilani, Pilani Campus
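A small NumPy sketch of the softmax computation; the parameter matrix and feature vector below are arbitrary illustrative values.

import numpy as np

def softmax_probs(Theta, x):
    # Theta: (K, n) matrix with one parameter vector theta^(k) per class; x: (n,) feature vector
    scores = Theta @ x                     # s_k(x) = theta^(k)^T x for every class k
    scores -= scores.max()                 # shift for numerical stability (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # sigma(s(x))_k for k = 1..K

Theta = np.array([[0.5, 1.0, -1.0],
                  [0.0, -0.5, 2.0],
                  [-0.2, 0.3, 0.1]])
x = np.array([1.0, 2.0, 0.5])              # leading 1 acts as the bias term
p = softmax_probs(Theta, x)
print(p, "predicted class:", p.argmax())   # probabilities sum to 1; predict the most probable class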
Logistic regression (Classification)

• Model
  h_θ(x) = P(Y = 1 | X₁, X₂, ⋯, Xₙ) = 1 / (1 + e^(−θᵀx))

• Cost function
  J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾),   Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1;  −log(1 − h_θ(x)) if y = 0

• Learning
  Gradient descent: repeat { θⱼ := θⱼ − α (1/m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ }

• Inference
  Ŷ = h_θ(x_test) = 1 / (1 + e^(−θᵀx_test))

30
BITS Pilani, Pilani Campus
Logistic Regression Applications

• Credit card fraud: predicting whether a given credit card transaction is fraudulent or not
• Health: predicting whether a given mass of tissue is benign or malignant
• Marketing: predicting whether a given user will buy an insurance product or not
• Banking: predicting whether a customer will default on a loan
31
BITS Pilani, Pilani Campus
Evaluation Measures

32
BITS Pilani, Pilani Campus
Classifier Evaluation Metrics: Confusion
Matrix
Given m classes, an entry CM_i,j in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
The matrix may have extra rows/columns to provide totals.

• True Positive (TP): the number of predictions where the classifier correctly predicts the positive class as positive.
• True Negative (TN): the number of predictions where the classifier correctly predicts the negative class as negative.
• False Positive (FP): the number of predictions where the classifier incorrectly predicts the negative class as positive.
• False Negative (FN): the number of predictions where the classifier incorrectly predicts the positive class as negative.

Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

33
BITS Pilani, Pilani Campus
Classifier Evaluation Metrics: Accuracy, Error Rate

Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
It is most effective when the class distribution is relatively balanced.

Classification error / misclassification rate: 1 − accuracy, or
  Error rate = (FP + FN) / All

Actual \ Predicted | C  | ¬C |
C                  | TP | FN | P
¬C                 | FP | TN | N
Total              | P' | N' | All

34
BITS Pilani, Pilani Campus
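A short scikit-learn sketch of the confusion matrix and accuracy; the label vectors below are made up for illustration.

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]    # actual classes (1 = positive)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]    # classifier predictions

# confusion_matrix returns rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

accuracy = (tp + tn) / (tp + tn + fp + fn)          # (TP + TN) / All
print(accuracy, accuracy_score(y_true, y_pred))     # the two values agree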
Example
Given below is a confusion matrix for medical data where the class values are yes and
no for a class label attribute, cancer. Calculate the accuracy of the classifier.

35

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)

36
BITS Pilani, Pilani Campus
Class Imbalance Problem
• The main class of interest is rare.
• the data set distribution reflects a significant majority of the negative class and a
minority positive class.
• For example,
– fraud detection applications, the class of interest (or positive class) is “fraud,”
– medical tests, there may be a rare class, such as “cancer”
• Accuracy might not be a good option for measuring performance in case of class
imbalance problem

37

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Approaches to solve Class imbalance problem
• Generate Synthetic Samples
– new samples based on the distances between the point and its nearest
neighbors
• Change the performance metric
– Use Recall, Precision or ROC curves instead of accuracy
• Try different algorithms
– Some algorithms, such as Support Vector Machines and tree-based algorithms, may work better with imbalanced classes.

38
BITS Pilani, Pilani Campus
Model Evaluation Measures
• True positive rate (TPR) or sensitivity: the fraction of positive examples predicted correctly by the model, TPR = TP / (TP + FN)

• True negative rate (TNR) or specificity: the fraction of negative examples predicted correctly by the model, TNR = TN / (TN + FP)

• False positive rate (FPR): the fraction of negative examples predicted as the positive class, FPR = FP / (FP + TN)

• False negative rate (FNR): the fraction of positive examples predicted as the negative class, FNR = FN / (FN + TP)

39

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Evaluation Measures
Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)

Recall / Sensitivity (completeness): what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN)

A perfect score is 1.0. There is an inverse relationship between precision and recall.

F measure (F1 or F-score): the harmonic mean of precision and recall,
  F1 = 2 × Precision × Recall / (Precision + Recall)
A high F1 value ensures that both precision and recall are high.

Fβ: a weighted measure of precision and recall,
  Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
  – assigns β times as much weight to recall as to precision

40
BITS Pilani, Pilani Campus
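A quick sketch computing precision, recall and F1 both by hand and with scikit-learn; the labels are illustrative.

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)                          # TP / (TP + FP)
recall = tp / (tp + fn)                             # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))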
Example
Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.40 (accuracy)

41
BITS Pilani, Pilani Campus
Example: Contingency for Multi-Class
Classifier

42

BITS Pilani, Pilani Campus


Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
– At i-th iteration, use Di as test set and others as training set
– *Stratified cross-validation*: folds are stratified so that class dist. in
each fold is approx. the same as that in the initial data
– The Accuracy of the model is the average of the accuracy of each fold.

43
BITS Pilani, Pilani Campus
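A minimal scikit-learn sketch of the holdout and stratified k-fold procedures, using a bundled toy data set; any classifier could be substituted for logistic regression.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Holdout: e.g. 2/3 of the data for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout accuracy:", clf.fit(X_train, y_train).score(X_test, y_test))

# Stratified 10-fold cross-validation: overall accuracy = average of the 10 fold accuracies
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=skf).mean())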
Cross Validation

44

BITS Pilani, Pilani Campus


Stratified Cross Validation

45

BITS Pilani, Pilani Campus


Receiver Operating Characteristic (ROC)
Curve
• The ROC curve plots TPR against FPR, to graphically represent their trade-off.
• AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes.
  – The higher the AUC, the better the model is at predicting.
• Area Under Curve (AUC) of ROC evaluates model performance on average.
  – AUC of ROC = 1 for a perfect model
  – AUC of ROC = 0.5 if the model is random
• For model comparison, the model with the larger AUC of ROC is the superior or better-performing one.

46

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
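A short sketch of computing an ROC curve and AUC from predicted probabilities; the labels and scores below are invented for illustration, not the ten tuples from the next slide.

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.43, 0.42, 0.25]  # P(positive) per tuple

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(list(zip(thresholds, fpr, tpr)))
print("AUC:", roc_auc_score(y_true, y_score))        # 1.0 = perfect, 0.5 = random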
Example
The table below shows the probability value (column 3) returned by a probabilistic
classifier for each of the 10 tuples in a test set, sorted by decreasing probability
order. The corresponding ROC is given on right hand side.

47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
AUC-ROC
• Threshold selection
• Performance assessment
• Classifier comparison

48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References

• Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression"
  http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

• Andrew Ng, Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes"
  http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

BITS Pilani, Pilani Campus


References

• http://www.cs.cmu.edu/~tom/NewChapters.html
• http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
• https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c
• https://www.youtube.com/watch?v=-la3q9d7AKQ
• http://www.datasciencesmachinelearning.com/2018/11/handling-outliers-in-python.html
• http://www.cs.cmu.edu/~tom/10601_fall2012/slides/GenDiscr_LR_9-20-2012.pdf
• https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood
• http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
• https://towardsdatascience.com/logistic-regression-explained-9ee73cede081

Interpretability
• https://christophm.github.io/interpretable-ml-book/logistic.html

50
BITS Pilani, Pilani Campus
Naïve Bayesian Classifier

Swarna Chaudhary
BITS Pilani Assistant Professor
Pilani|Dubai|Goa|Hyderabad
swarna.chaudhary@pilani.bits-pilani.ac.in

1
• The slides presented here are obtained from the authors of the books and from
various other contributors. I hereby acknowledge all the contributors for their
material and inputs.
• I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


In this segment
• Probability Basics
• Bayes Theorem
• Naïve Bayesian Classifier
• An illustrative Example
• Naïve Bayes Classifier is a generative model
• Advantages of Naïve Bayes Classifier and when to use Naïve Bayes Classifier?
• Interpretability of Naïve Bayes Classifier
• Python Implementation

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability

Question: You roll a fair die. What is the probability of E = {1, 5}?  (Answer: P(E) = |E| / |S| = 2/6 = 1/3.)

BITS Pilani
Random Variables
• A random variable is a variable whose possible values are numerical outcomes of a random experiment.
• A random variable X is a function from the sample space to the real numbers.
• It is denoted by a capital letter.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


5
Random Variables
1. Discrete Random Variable: one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4.
   • X = sum of values on the roll of two dice; X has to be either 2, 3, 4, …, or 12.
   • X = the number of heads that result from the toss of 2 coins; here X can take the values 0, 1, or 2. X is a discrete random variable.

2. Continuous Random Variable: one which can take any value within a range.
   • Example: all the real numbers between [0, 1].

5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Random Variables

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability Distribution of Discrete Random Variable
• Discrete Probability Distribution: the mathematical definition of a discrete probability function, p(x), is a function that satisfies the following properties. This is referred to as the Probability Mass Function.

• Probability mass function (pmf): P(X = xᵢ)

• Simple facts about a pmf: P(X = xᵢ) ≥ 0 for every xᵢ, and Σᵢ P(X = xᵢ) = 1

5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability Distribution - Example

• Sum of 2 dice

10

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Probability of Continuous Random Variable
• Continuous Probability Distribution: the mathematical definition of a continuous probability function, f(x), is a function that satisfies the following properties. This is referred to as the Probability Density Function.
• Simple facts about a pdf: f(x) ≥ 0, ∫ f(x) dx = 1, and P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx

11

5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Probability Density Function

BITS Pilani
Probability Density Function

BITS Pilani
Gaussian Distribution
• The normal curve is bell-shaped and has a single peak at the exact center of the distribution.
• The arithmetic mean, median, and mode of the distribution are equal and located at the peak.
• Half the area under the curve is above the peak, and the other half is below it.
• The normal distribution is symmetrical about its mean.
• The normal distribution is asymptotic: the curve gets closer and closer to the x-axis but never actually touches it.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gaussian Distribution

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


MATHEMATICAL FUNCTION (pdf)

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
16

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gaussian Distribution- Parameter
§ The normal distribution can be completely specified by two
parameters:

1. Mean
2. Standard deviation

17

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Properties of normal curve
No matter what the value of µ and σ are, area under normal curve
remain in certain fixed proportions within a specified number of
standard deviation on either side of µ. For example the interval
§ µ ± σ will always contain 68.26%
§ µ ± 2σ will always contain 95.44%
§ µ ± 3σ will always contain 99.73%

18

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gaussian Distribution

19

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Joint Probability
• Joint probability is the probability of two events happening together. The two
events are usually designated event A and event B. In probability terminology, it can
be written as:
• P(X and Y) or P(A ∩ B) or P(X, Y)
• Example: The probability that a card is a five and black, p(five and black) = 2/52
= 1/26, (There are two black fives in a deck of 52 cards, the five of spades and the
five of clubs)

20

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Conditional Probability
§ Conditional Probability is a measure of the probability of an event given that (by assumption,
presumption, assertion or evidence) another event has already occurred. If the event of interest
is A and the event B is known or assumed to have occurred.
§ This probability is written P(A|B), notation for the probability of A given B.

§ Example: A pair of dice is thrown. If the sum of the numbers on the dice is 7, find the probability that at least one of the dice shows 2. (Answer: the outcomes with sum 7 are (1,6), (2,5), (3,4), (4,3), (5,2), (6,1); of these, (2,5) and (5,2) show a 2, so the probability is 2/6 = 1/3.)

21

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Independence
• Two events A and B are independent if and only if P(A ∩ B) = P(A)P(B).
• If two events A and B are independent and P(B) ≠ 0, then P(A|B) = P(A).

• In general, for n events A₁, A₂, ⋯, Aₙ to be independent we must have P(A_{i1} ∩ ⋯ ∩ A_{ik}) = P(A_{i1}) ⋯ P(A_{ik}) for every sub-collection of the events.

• Example: If a die is thrown twice, what is the probability that the first throw results in a number greater than 4 and the second throw results in a number less than 3? (Answer: (2/6) × (2/6) = 1/9.)

22

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Law of Total Probability

23

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Law of Total Probability

24

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Bayes Rule
From the definition of conditional probability:

P(A|B) = P(B|A) P(A) / P(B),  where the denominator P(B) can be expanded with the law of total probability

25

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example of Bayes Theorem
Suppose the fraction of undergraduate students who play sport is 15% and the
fraction of graduate students who play sports is 23%. If one-fifth of the college
students are graduate students and the rest are undergraduates, what is the
probability that a student who plays a sport is a graduate student?

26

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
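A worked sketch of the example above in Python; the numbers come straight from the question, and the final figure follows from them.

# P(grad | sport) by Bayes' rule
p_sport_given_grad = 0.23        # fraction of graduate students who play sports
p_sport_given_ug = 0.15          # fraction of undergraduates who play sports
p_grad = 1 / 5                   # one-fifth of the students are graduates
p_ug = 1 - p_grad

p_sport = p_sport_given_grad * p_grad + p_sport_given_ug * p_ug     # law of total probability
p_grad_given_sport = p_sport_given_grad * p_grad / p_sport
print(round(p_grad_given_sport, 3))                                 # about 0.277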


Example of Bayes Theorem

Classify the record:


X = (Home Owner = No, Marital Status = “Married”, Annual Income =
$120k)
27

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Bayesian Classifiers
Let the attributes X1, X2,…,Xn and class labels Y1, Y2,..Ym be random variables

Given a record with attributes (X1, X2,…,Xn)


– Goal is to predict class Y=Yk
– We want to find the value of Y that maximizes P(Y| X1, X2,..,Xn)

Can we estimate P(Y| X1, X2,..,Xn) directly from data?

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Bayesian Classifiers
– Compute the posterior probability P(Y| X1, X2,.., Xn) for all values of Y using
the Bayes theorem

P(Y | X1, X2, .., Xn) = P(X1, X2, .., Xn | Y) P(Y) / P(X1, X2, .., Xn)

– Choose value of Class that maximizes P(Y| X1, X2,.., Xn)

– Equivalent to choosing value of Y that maximizes P(X1, X2,.., Xn | Y)P(Y)

How to estimate P(X1, X2,.., Xn | Y)?

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Conditional independence

30

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Applying conditional
independence
Naïve Bayes assumes the Xᵢ are conditionally independent given Y
e.g., P(X₁ | X₂, Y) = P(X₁ | Y)

P(X₁, X₂ | Y) = P(X₁ | X₂, Y) P(X₂ | Y) = P(X₁ | Y) P(X₂ | Y)

General form: P(X₁, ⋯, Xₙ | Y) = ∏ᵢ₌₁ⁿ P(Xᵢ | Y)

How many parameters are needed to describe P(X₁, ⋯, Xₙ | Y)?

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naïve Bayes Independence
assumption
Assumption:

P(X₁, ⋯, Xₙ | Y) = ∏ᵢ₌₁ⁿ P(Xᵢ | Y)

i.e., Xᵢ and Xⱼ are conditionally independent given Y for i ≠ j

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naïve Bayes Classifier
Assume independence among attributes Xi when the class is given:
– P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)

– Now we can estimate P(Xi | Yj) for all Xi and Yj combinations from the training data.

– A new point is classified as Yj if P(Yj) ∏ᵢ P(Xi | Yj) is maximal.

Slide adopted from “Introduction to Data mining” Vipin Kumar

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Class Conditional Probabilities

34

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Estimate Probabilities from Data?

• Class prior: P(C) = Nc / N
  e.g., P(No) = 7/10, P(Yes) = 3/10

• For discrete attributes: P(Ai | Ck) = |Aik| / Nck
  where |Aik| is the number of instances having attribute value Ai and belonging to class Ck

• Examples:
  P(Status = Married | No) = 4/7
  P(Refund = Yes | Yes) = 0

Tid | Refund | Marital Status | Taxable Income | Evade
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Estimate Probabilities from Data?
• For continuous attributes:
• Discretize the range into bins
• one ordinal attribute per bin
• Too many bins – training records are too few to provide reliable probability for
each interval
• Too few bins – some intervals may aggregate from different classes & we may miss
the correct decision boundary
• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, can use it to estimate the conditional
probability P(Ai|Ck)

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Estimate Probabilities from Data?

• Normal distribution:

  P(Ai | cj) = (1 / √(2π σij²)) · e^(−(Ai − µij)² / (2σij²))

• One distribution for each (Ai, cj) pair

• For (Income, Class = No):
  • If Class = No
    • sample mean = 110
    • sample variance = 2975

(Training table: the same ten Refund / Marital Status / Taxable Income / Evade tuples as on the previous slide.)

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Married, Income = 120K)

• P(X | Class = No) = P(Refund = No | Class = No)
                      × P(Married | Class = No)
                      × P(Income = 120K | Class = No)
                    = 4/7 × 4/7 × 0.0072 = 0.0024

• P(X | Class = Yes) = P(Refund = No | Class = Yes)
                       × P(Married | Class = Yes)
                       × P(Income = 120K | Class = Yes)
                     = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), therefore P(No | X) > P(Yes | X)  =>  Class = No

(Training table: the same ten Refund / Marital Status / Taxable Income / Evade tuples as before.)

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naive Bayes Classifier
• If one of the conditional probability is zero, then the entire expression
becomes zero
• Probability estimation is done with Laplacian correction:

Original: P(Ai | C) = Nic / Nc

Laplace:  P(Ai | C) = (Nic + 1) / (Nc + c)

where c is the number of distinct values of the attribute

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example of Naïve Bayes Classifier
Tid | Refund | Marital Status | Taxable Income | Evade
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Data Mining
40
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naïve Bayes for Text Classification

Data Mining
41
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
A Simple Example
Which tag does the sentence "A very close game" belong to? i.e. P(sports | A very close game)

Feature engineering: bag of words, i.e. use word frequencies without considering order.

Using Bayes' theorem:
P(sports | A very close game) = P(A very close game | sports) P(sports) / P(A very close game)

Text                           | Tag
"A great game"                 | Sports
"The election was over"        | Not sports
"Very clean match"             | Sports
"A clean but forgettable game" | Sports
"It was a close election"      | Not sports

We assume that every word in a sentence is independent of the other ones.

"close" doesn't appear in sentences with the Sports tag, so P(close | sports) = 0, which makes the whole product 0.
Data Mining
42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Laplace smoothing

• Laplace smoothing: we add 1 or in general constant k to


every count so it’s never zero.
• To balance this, we add the number of possible words to the
divisor, so the division will never be greater than 1
• In our case, the possible words are ['a', 'great', 'very', 'over',
'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was',
'forgettable', 'match'].

Data Mining

43
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apply Laplace Smoothing
Word  | P(word | Sports) | P(word | Not Sports)
a     | (2+1) / (11+14)  | (1+1) / (9+14)
very  | (1+1) / (11+14)  | (0+1) / (9+14)
close | (0+1) / (11+14)  | (1+1) / (9+14)
game  | (2+1) / (11+14)  | (0+1) / (9+14)

Data Mining

44
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
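The table above can be reproduced with a few lines of Python; this sketch also finishes the classification of "A very close game", using class priors P(Sports) = 3/5 and P(Not sports) = 2/5 taken from the five training sentences.

from collections import Counter

sports = ["a great game", "very clean match", "a clean but forgettable game"]
not_sports = ["the election was over", "it was a close election"]
vocab = set(w for s in sports + not_sports for w in s.split())       # the 14 possible words

def smoothed_probs(sentences):
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    # Laplace smoothing: P(word | tag) = (count + 1) / (total words in tag + |vocab|)
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

p_sports, p_not = smoothed_probs(sports), smoothed_probs(not_sports)

score_sports, score_not = 3 / 5, 2 / 5                               # class priors
for w in "a very close game".split():
    score_sports *= p_sports[w]
    score_not *= p_not[w]

print("Sports" if score_sports > score_not else "Not sports", score_sports, score_not)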
Learning to Classify Text

• Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

• Naive Bayes is among most effective algorithms


• What attributes shall we use to represent text
documents??

Data Mining

45
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naïve Bayes Generative Model
• Naïve Bayes classifier uses likelihood and prior probability to calculate
conditional probability of the class
• Likelihood is based on joint probability, which is the core principle of probabilistic
generative model
• Naïve Bayes simplifies the calculation of likelihood by the assumption of
conditional independence among input parameters
• Each parameter’s likelihood is determined using joint probability of the input
parameter and the output label

BITS Pilani
Naïve Bayes – When to use
• When the training data is small
• When the features are (mostly) conditionally independent
• When there is little missing data
• When we have a large number of features with a minimal data set
  • Ex: text classification

BITS Pilani
Gaussian Naïve Bayes Algorithm
• A Gaussian Naive Bayes algorithm is a special type of NB
algorithm.

• It’s specifically used when the features have continuous values.

• It’s also assumed that all the features are following a Gaussian
distribution i.e, normal distribution.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Naïve Bayesian in Sklearn
• GaussianNB: implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

  P(xᵢ | y) = (1 / √(2πσy²)) · exp(−(xᵢ − μy)² / (2σy²))

  The parameters σy and μy are estimated using maximum likelihood.


• MultinomialNB: implements the naive Bayes algorithm for multinomially distributed data, and
is one of the two classic naive Bayes variants used in text classification (where the data are
typically represented as word vector counts, although tf-idf vectors are also known to work well
in practice).

• Bernoulli NB: implements the naive Bayes training and classification algorithms for data that is
distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features
but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class
requires samples to be represented as binary-valued feature vectors;

• Categorical NB: implements the categorical naive Bayes algorithm for categorically distributed
data. It assumes that each feature, which is described by the index i, has its own categorical
distribution.

Reference: https://scikit-learn.org/stable/modules/naive_bayes.html
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
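A minimal usage sketch of GaussianNB on a bundled toy data set; the iris data here is just an example, not part of the slides.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()                       # assumes each feature is Gaussian within each class
model.fit(X_train, y_train)

print("accuracy:", model.score(X_test, y_test))
print("class probabilities for one test sample:", model.predict_proba(X_test[:1]))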
Extra Reading
• Text Classification in NLP using Naïve Bayes
• https://medium.com/@theflyingmantis/text-classification-in-nlp-naive-bayes-a606bf419f8c

• More on Probability
• https://www.probabilitycourse.com/chapter1/1_0_0_introduction.php

50

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


References
• https://towardsdatascience.com/probability-concepts-explained-bayesian-inference-for-parameter-estimation-90e8930e5348
• https://towardsdatascience.com/probability-concepts-explained-introduction-a7c0316de465
• https://medium.com/@theflyingmantis/text-classification-in-nlp-naive-bayes-a606bf419f8c
• https://www.youtube.com/watch?v=5Pck0Cqw-zc
• https://towardsdatascience.com/basic-probability-theory-and-statistics-3105ab637213
• https://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html

51

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Thank You

52

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification: Decision Tree

Swarna Chaudhary
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Topics to be covered
Module 5 : Decision Tree
1. Decision Tree Representation
2. Entropy and Information Gain for an attribute
3. ID3 Algorithm for decision tree learning
4. Alternative measures for selecting attributes
5. Prefer short hypothesis to longer ones, Occam’s razor
6. Overfitting in Decision Tree
7. Reduced Error Pruning and Rule post pruning
8. Interpretability of Decision Trees

BITS Pilani, Pilani Campus


BITS Pilani
Pilani Campus

Decision Tree
Decision trees
§ Decision Trees are one of the most widely used and practical methods of inductive inference
§ Features
  § A method for approximating discrete-valued functions
  § Learned functions are represented as decision trees (or if-then-else rules)
  § Interpretable, not a black box: humans can understand the decisions
  § Good at handling noisy or missing data (low information gain)
  § Fast and compact
  § Greedy (a disadvantage)
  § Overfitting avoided by pruning

BITS Pilani, Pilani Campus


Decision Tree Induction
• Decision tree induction is the learning of decision trees from class-labeled training tuples.
• Decision tree is a flowchart-like tree structure,
• internal node (denoted by rectangles) denotes a test on an attribute,
• each branch represents an outcome of the test, and
• each leaf node (or terminal node, denoted by ovals ) holds a class label
• Used for classification.
• Easily converted to classification rules.
• Does not require any domain knowledge.
• Decision tree algorithms: ID3(Iterative Dichotomiser), C4.5 (successor of ID3),
CART(classification and regression trees).
• Adopt greedy approach
• Based on top- down recursive divide and conquer approach.

Data Mining

6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision Tree Induction: An Example
• Training data set: Buys_computer

age         | income | student | credit_rating | buys_computer
Youth       | high   | no      | fair          | no
Youth       | high   | no      | excellent     | no
Middle_aged | high   | no      | fair          | yes
Senior      | medium | no      | fair          | yes
Senior      | low    | yes     | fair          | yes
Senior      | low    | yes     | excellent     | no
Middle_aged | low    | yes     | excellent     | yes
Youth       | medium | no      | fair          | no
Youth       | low    | yes     | fair          | yes
Senior      | medium | yes     | fair          | yes
Youth       | medium | yes     | excellent     | yes
Middle_aged | medium | no      | excellent     | yes
Middle_aged | high   | yes     | fair          | yes
Senior      | medium | no      | excellent     | no

• Resulting tree:
  age?
  ├─ Youth       → student?  (no → no, yes → yes)
  ├─ Middle_aged → yes
  └─ Senior      → credit rating?  (excellent → no, fair → yes)

Data Mining
7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are discretized in advance)
• It uses “Attribute_selection_method” to determine the splitting criteria.
• Attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
• There are no samples left

• Complexity is O(n X |D| X log|D|), where n is the number of attributes describing the
tuples in D, |D| is the number of training tuples.

Data Mining

8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Splitting Criteria
• Splitting criterion tells us which attribute to test at node N by determining the
“best” way to separate or partition the tuples in D into individual classes
• Also, tells us which branches to grow from node N with respect to the outcomes
of the chosen test.
• Splitting criterion indicates the splitting attribute and may also indicate either a
split-point or a splitting subset
• Partition is “Pure” i.e. all the tuples in it belong to the same class.
• Splitting attribute A can be:
• A is discrete valued: outcome correspond to the known values of A.
• A is continuous-valued: the test at node N has two possible outcomes,
corresponding to the conditions A<=split_point and A > split_point,
respectively.
• A is discrete-valued and a binary tree must be produced: the test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A. If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied.

Data Mining

9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Splitting criteria…contd

a) Discrete-valued b) Continuous-valued c) discrete-valued and binary tree

Data Mining

10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision-tree Algorithm

Data Mining

11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Attribute Selection Measures
• Also, known as Splitting rules
• Measure is a heuristic for selecting the splitting criterion that “best” separates a
given data partition D, of class-labeled training tuples into individual classes.
• Partition should be pure (i.e., all the tuples that fall into a given partition would
belong to the same class).
• Provides ranking to each attribute of training tuple, and the attribute having “best”
score is chosen as the splitting attribute.

Data Mining

12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Brief Review of Entropy

m=2
Data Mining
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Selection Measure: Information
Gain (ID3/C4.5)
n Select the attribute with the highest information gain
n This attribute minimizes the expected number of tests needed to classify a given tuple.
n Let pi be the probability that a tuple in D belongs to class Ci, estimated by |Ci, D|/|D|, m is
the number of distinct classes, v is the number of distinct values in an attribute.
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)

• Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σⱼ₌₁ᵛ (|Dⱼ| / |D|) × Info(Dⱼ)

• The smaller the expected information required, the greater the purity of the partitions.

Data Mining

15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Attribute Selection Measure: Information
Gain (ID3/C4.5)
n Information gained by branching on attribute A

Gain(A) = Info(D) - InfoA(D)


n Defined as the difference between the original information requirement (i.e. based on just
the proportion of the classes) and the new requirement (i.e. obtained after partitioning of
attribute A).
n Attribute with highest information gain Gain(A), is chosen as the splitting attribute at
node N.

Data Mining

16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples)
• Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940

age         | pᵢ | nᵢ | I(pᵢ, nᵢ)
Youth       | 2  | 3  | 0.971
Middle_aged | 4  | 0  | 0
Senior      | 3  | 2  | 0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means "age = Youth" covers 5 of the 14 samples, with 2 yes'es and 3 no's:
Info_age(D) = (5/14)(−(2/5)log₂(2/5) − (3/5)log₂(3/5)) + (4/14)(−(4/4)log₂(4/4)) + (5/14)(−(3/5)log₂(3/5) − (2/5)log₂(2/5)) = 0.694

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(Training table: the same 14 buys_computer tuples as before.)

Data Mining
17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
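The 0.940 and 0.246 figures above can be checked with a short script; the age values and class labels are copied from the 14-tuple table.

from collections import Counter
from math import log2

def entropy(labels):
    # Info(D) = -sum_i p_i log2(p_i) over the class proportions in `labels`
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    # Gain(A) = Info(D) - Info_A(D) for a single discrete attribute A
    total = len(labels)
    info_a = sum(
        sum(1 for v in values if v == val) / total
        * entropy([lab for v, lab in zip(values, labels) if v == val])
        for val in set(values)
    )
    return entropy(labels) - info_a

ages = ["Youth", "Youth", "Middle_aged", "Senior", "Senior", "Senior", "Middle_aged",
        "Youth", "Youth", "Senior", "Youth", "Middle_aged", "Middle_aged", "Senior"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
# Prints roughly 0.940 and 0.247 (the slide's 0.246 comes from rounding the intermediate values)
print(entropy(buys), info_gain(ages, buys))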
Attribute Selection: Information Gain

Data Mining

18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision Tree

Decision tree for the concept “buys_computer”

Data Mining

19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Splitting Based on Continuous Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute

• Binary decision: (A < v) or (A ≥ v)


• consider all possible splits and finds the best cut
• can be more compute intensive

Data Mining
20
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Continuous Attributes

Data Mining
21
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Computing Information-Gain for
Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the value A in increasing order
• Typically, the midpoint between each pair of adjacent values is considered as
a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the minimum expected information requirement for A is
selected as the split-point for A
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of
tuples in D satisfying A > split-point

Data Mining

22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Question?
• Compute the information gain for every possible split for the given continuous-valued attribute.
• Sort the values and find the midpoint between each pair of adjacent values (split_point).
• Calculate Info(D).
• Calculate the entropy for each split_point, for "<=" and ">".
• Find the Gain for each split_point.

A1 | A2 | Class
T  | 1  | +
T  | 6  | +
T  | 5  | -
F  | 4  | +
F  | 7  | -
F  | 3  | -
F  | 8  | -
T  | 7  | +
F  | 5  | -

Sorted A2 values: 1, 3, 4, 5, 6, 7, 8 (classes: +, -, +, -, +, -, -)
Candidate split_points: 2, 3.5, 4.5, 5.5, 6.5, 7.5

Data Mining

23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a large number of distinct
values. Eg. Product_ID (unique for every tuple), resulting in large number of partitions
as Infoproduct_ID (D) = 0, Such partitioning is useless.
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem.
• It applies normalization to information gain using a “split information”

  SplitInfo_A(D) = − Σⱼ₌₁ᵛ (|Dⱼ| / |D|) × log₂(|Dⱼ| / |D|)
• GainRatio(A) = Gain(A)/SplitInfoA (D)
• Ex. Gain Ratio of “income” on the given data set.

• gain_ratio(income) = 0.029/1.557 = 0.019


• The attribute with the maximum gain ratio is selected as the splitting attribute

Data Mining

24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index, gini(D), is defined as

  gini(D) = 1 − Σⱼ₌₁ⁿ pⱼ²

  where pⱼ, the probability that a tuple in D belongs to class j, is |Cj,D| / |D|.

• It considers a binary split for each attribute.
• If a data set D is split on A (such that A ∈ SA?) into two subsets D1 and D2, the gini index of the split is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

• Reduction in impurity:

  Δgini(A) = gini(D) − gini_A(D)

• The attribute that provides the minimum gini_split(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).
Data Mining
25
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Computation of Gini Index
• Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no"

  gini(D) = 1 − (9/14)² − (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}

  gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

• Gini{low,high} is 0.458 and Gini{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index.

• May need other tools, e.g., clustering, to get the possible split values.

(Training table: the same 14 buys_computer tuples as before.)

Data Mining

26
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
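The same Gini numbers can be verified in a few lines; the per-partition yes/no counts below are derived from the 14-tuple table (income ∈ {low, medium}: 7 yes / 3 no; income = high: 2 yes / 2 no).

def gini(counts):
    # gini(D) = 1 - sum_j p_j^2, given the class counts in D
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

def gini_split(partitions):
    # gini_A(D): size-weighted gini of the partitions produced by a binary split on A
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                     # 0.459 for the whole data set
print(round(gini_split([[7, 3], [2, 2]]), 3))     # 0.443 for the {low, medium} vs {high} split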
Comparing Attribute Selection Measures
• The three measures, in general, return good results but
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much smaller
than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large

Data Mining

27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Decision Tree Based Classification
• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many simple
data sets

Data Mining
28
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Inductive bias in decision tree learning
§ Inductive bias is the assumptions made by the model to learn the target
function and to generalize beyond training data.
§ What is the inductive bias of DT learning?
1. Shorter trees are preferred over longer trees.
   This is the bias exhibited by a simple breadth-first algorithm that generates all decision trees and selects the shortest one.
2. Prefer trees that place high-information-gain attributes close to the root.

BITS Pilani, Pilani Campus


Prefer shorter hypotheses: Occam's razor
§ Why prefer shorter hypotheses?
§ Arguments in favor:
  § There are fewer short hypotheses than long ones
  § If a short hypothesis fits the data, it is unlikely to be a coincidence
  § Elegance and aesthetics
§ Arguments against:
  § Not every short hypothesis is a reasonable one
§ Occam's razor says that when presented with competing hypotheses that make the same predictions, one should select the simplest solution.

BITS Pilani, Pilani Campus


Issues in decision trees learning
§ Overfitting
• Building trees that “adapt too much” to the training examples may lead to
“overfitting”.
• May therefore fail to fit additional data or predict future observations
reliably

BITS Pilani, Pilani Campus


Overfitting in decision tree learning

BITS Pilani, Pilani Campus


How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– General stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions (for pre-pruning) :
• Stop if number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).

5/15/22
33
BITS Pilani, Pilani Campus
How to Address Overfitting…
• Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error(i.e. expected error of the model on previously unseen
records) improves after trimming, replace sub-tree by a leaf node.
– Class label of leaf node is determined from majority class of instances in
the sub-tree

5/15/22
34
BITS Pilani, Pilani Campus
Good References
Decision Tree
• https://www.youtube.com/watch?v=eKD5gxPPeY0&list=PLBv09BD7ez_4temBw7vLA19p3tdQH6FYO&index=1
Overfitting
• https://www.youtube.com/watch?time_continue=1&v=t56Nid85Thg
• https://www.youtube.com/watch?v=y6SpA2Wuyt8
Decision tree for regression
• https://www.saedsayad.com/decision_tree_reg.htm

BITS Pilani, Pilani Campus


Textbook/Reference Books
• Tom M. Mitchell: Machine Learning, The McGraw-Hill Companies, Inc..
• Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson Education
• Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline Kamber and
Jian Pei Morgan Kaufmann Publishers

Data Mining
36
5/15/22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

15 May 2022 BITS Pilani, Pilani Campus


Classification: Ensemble Methods

BITS Pilani
Pilani Campus
BITS Pilani
Pilani Campus

Text Book(s)
R1 An Introduction to Data Mining – Pang-Ning Tan, Michael Steinbach, Anuj
Karpatne, Vipin Kumar - 2005

These slides are prepared by the instructor, with grateful acknowledgement of and many others who made their
course materials freely available online.
Topics to be covered

• Ensemble Method
• Methods for constructing an Ensemble Classifier
• Bagging, Boosting
• Random Forest
• AdaBoost
• eXtreme Gradient Boosting (XGBoost)

BITS Pilani, Pilani Campus


Ensemble Methods
• Ensemble methods use multiple learning algorithms to
obtain better predictive performance than could be
obtained from any of the constituent learning algorithms
alone
• Construct a set of classifiers from the training data
• Predict class label of test records by combining the
predictions made by multiple classifiers
• Tend to reduce problems related to over-fitting of the
training data.
• By combining individual models, the ensemble model
tends to be more flexible (less bias) and less data-
sensitive(less variance).
4
BITS Pilani, Pilani Campus
Why Ensemble Methods work?
• 25 base classifiers
• Each classifier has
error rate, e = 0.35
• If base classifiers are identical, then the
ensemble will misclassify the same
examples predicted incorrectly by the
base classifiers depicted by dotted line
• Assume errors made
by classifiers are uncorrelated
• ensemble makes a wrong prediction
only if more than half of the base
classifiers predict incorrectly
• Probability that the ensemble classifier makes a wrong prediction:

  P(X ≥ 13) = Σᵢ₌₁₃²⁵ C(25, i) eⁱ (1 − e)²⁵⁻ⁱ = 0.06

5
BITS Pilani, Pilani Campus
When does Ensemble work?

• Ensemble classifier performs better than the


base classifiers when each classifier error is
smaller than 0.5
• Necessary conditions for an ensemble classifier
to perform better than a single classifier:
– Base classifiers should be independent of
each other
– Base classifiers should do better than a
classifier that performs random guessing

BITS Pilani, Pilani Campus


General Approach
Original training data: D

Step 1: Create multiple data sets    D1, D2, ...., Dt-1, Dt
Step 2: Build multiple classifiers   C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into C*

7
BITS Pilani, Pilani Campus
Methods for constructing Ensemble
Classifier
• Using different algorithms
• Using different parameters/hyperparameters
• Using different training sets
• By manipulating input features
• By manipulating the class labels

BITS Pilani, Pilani Campus


Types of Ensemble Methods
• Simple Ensemble methods: Max Voting, Averaging,
Weighted Averaging
• Advanced Ensemble Methods
• Bagging often considers homogeneous weak learners, learns them independently from each other in parallel, and combines them by some kind of averaging process.
• Boosting often considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones), and combines them by some deterministic strategy.
• Stacking often considers heterogeneous weak learners, learns them in parallel, and combines them by training a meta-model to output a prediction based on the different weak models' predictions.
9
BITS Pilani, Pilani Campus
Simple Ensemble Techniques

Max Voting
Ex: movie rating by 5 friends: 5, 4, 5, 4, 4
The result of max voting is the mode of the ratings: 4.

Averaging: (5+4+5+4+4)/5 = 4.4 is the final rating.

Weighted Average
Weights: 0.23, 0.23, 0.18, 0.18, 0.18
The result is calculated as (5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18) = 4.41.

BITS Pilani, Pilani Campus


Bootstrap Sampling
Bootstrap
– Bootstrapping is a sampling technique in which we create subsets of
observations from the original dataset, with replacement.
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
– When you sample with replacement, items are independent: one item does not affect the outcome of the other. You have a 1/7 chance of choosing the first item and a 1/7 chance of choosing the second item.
– If the two items are dependent, or linked to each other, this changes. When you choose the first item, you have a 1/7 probability of picking an item. Assuming you don't replace the item, you only have six items to pick from, which gives you a 1/6 chance of choosing a second item.

11
BITS Pilani, Pilani Campus
Bootstrap Sampling
• There are several bootstrap methods; a common one is the 0.632 bootstrap
  – A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples.
  – About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ = 0.368)
• Where does the figure 63.2% come from?
  Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 − 1/d). We select d times, so the probability that a tuple will not be chosen during this whole time is (1 − 1/d)^d. If d is large, this probability approaches e⁻¹ = 0.368. Thus, 36.8% of tuples will not be selected for training.

12
BITS Pilani, Pilani Campus
Bootstrap Sampling

BITS Pilani, Pilani Campus
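A small NumPy sketch that draws one bootstrap sample and checks the 63.2% / 36.8% figures empirically:

import numpy as np

rng = np.random.default_rng(0)
d = 10000
data = np.arange(d)

# Sample d times with replacement: a tuple can be picked more than once
bootstrap = rng.choice(data, size=d, replace=True)

in_bag = np.unique(bootstrap).size / d
print("fraction in the bootstrap sample:", in_bag)      # close to 1 - e^-1 ≈ 0.632
print("out-of-bag fraction (test set):", 1 - in_bag)    # close to e^-1 ≈ 0.368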


Bagging (Bootstrap Aggregating)
• Bagging stands for Bootstrap Aggregation
• Draw N items from X (initial dataset) with replacement
• Desired learners with High Variance (given different data, they give different
models)
• Decision tree and ANNs
• Use bootstrapping to generate L training sets and train one base learner with each
• Use Voting

BITS Pilani, Pilani Campus


Bagging

• Multiple subsets are created from the original dataset, selecting


observations with replacement.
• A base model (weak model) is created on each of these subsets.
• The models run in parallel and are independent of each other.
• The final predictions are determined by combining the predictions from all
the models.
BITS Pilani, Pilani Campus
Bagging Example

• Consider 1-dimensional data set:


Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

• Classifier is a decision stump


• Decision tree with one internal node (the root) which is immediately connected to the
terminal nodes (its leaves). Decision stump makes a prediction based on the value of just a
single input feature. Sometimes they are also called 1-rules

– Decision rule: x ≤ k
– Split point k is chosen based on entropy
x ≤ k
True False
yleft yright
16
BITS Pilani, Pilani Campus
Bagging Example
Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9    x <= 0.35 → y = 1
y 1 1 1 1 -1 -1 -1 -1 1 1                    x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1          x <= 0.7 → y = 1
y 1 1 1 -1 -1 -1 1 1 1 1                     x > 0.7 → y = 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9    x <= 0.35 → y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1                   x > 0.35 → y = -1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9    x <= 0.3 → y = 1
y 1 1 1 -1 -1 -1 -1 -1 1 1                   x > 0.3 → y = -1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1          x <= 0.35 → y = 1
y 1 1 1 -1 -1 -1 -1 1 1 1                    x > 0.35 → y = -1

17
BITS Pilani, Pilani Campus
Bagging Example
Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1      x <= 0.75 → y = -1
y 1 -1 -1 -1 -1 -1 -1 1 1 1                  x > 0.75 → y = 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1      x <= 0.75 → y = -1
y 1 -1 -1 -1 -1 1 1 1 1 1                    x > 0.75 → y = 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1      x <= 0.75 → y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1                   x > 0.75 → y = 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1        x <= 0.75 → y = -1
y 1 1 -1 -1 -1 -1 -1 1 1 1                   x > 0.75 → y = 1

Bagging Round 10:
x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9    x <= 0.05 → y = 1
y 1 1 1 1 1 1 1 1 1 1                        x > 0.05 → y = 1

18
BITS Pilani, Pilani Campus
Bagging Example

• Assume test set is the same as the original data


• Use majority vote to determine class of ensemble classifier

Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted Class (Sign of Sum) 1 1 1 -1 -1 -1 -1 1 1 1

19
BITS Pilani, Pilani Campus
Bagging Algorithm

20
BITS Pilani, Pilani Campus
Boosting
• What if a data point is incorrectly predicted by the first
model, and then by the next one (and perhaps by all models)?
Will simply combining the predictions provide better results? Such
situations are taken care of by boosting.
• Boosting is a sequential process, where each
subsequent model attempts to correct the errors of the
previous model.
• The succeeding models are dependent on the previous
model.

BITS Pilani, Pilani Campus


Boosting

• A subset is created from the original dataset.


• Initially, all data points are given equal weights.
• A base model is created on this subset.
• This model is used to make predictions on the whole dataset.
• Errors are calculated using the actual values and predicted values.
• The observations which are incorrectly predicted, are given higher
weights.
• Another model is created and predictions are made on the dataset.
(This model tries to correct the errors from the previous model)

BITS Pilani, Pilani Campus


Boosting
• Similarly, multiple models are created, each correcting the
errors of the previous model.
• The final model (strong learner) is the weighted mean of all
the models (weak learners).
• Individual models would not perform well on the entire
dataset, but they work well for some part of the dataset.
Thus, each model actually boosts the performance of the
ensemble.

BITS Pilani, Pilani Campus


Boosting

• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify


• Its weight is increased, therefore it is more likely
to be chosen again in subsequent rounds

24
BITS Pilani, Pilani Campus
Bagging and Boosting Algorithms
Bagging algorithms:
– Random Forest
Boosting algorithms:
– AdaBoost
– XGBoost

BITS Pilani, Pilani Campus


Random Forest

• Random Forest is an ensemble machine learning
algorithm that follows the bagging technique.
• The base estimators in random forest are
decision trees.
• Random forest randomly selects a set of
features which are used to decide the best split
at each node of the decision tree.

BITS Pilani, Pilani Campus


Random Forest

• Random subsets are created from the original


dataset (bootstrapping).
• At each node in the decision tree, only a random
set of features are considered to decide the best
split.
• A decision tree model is fitted on each of the
subsets.
• The final prediction is calculated by averaging
the predictions from all decision trees.
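A minimal scikit-learn sketch of the random-forest recipe above (dataset and settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators bootstrap samples; max_features limits the random feature subset tried at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())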

BITS Pilani, Pilani Campus


Random Forest
• Combines the predictions
made by multiple decision
trees
• Each tree is generated
based on the values of an
independent set of random
vectors
• Randomness is injected into
the model-building process

28
BITS Pilani, Pilani Campus
Random Forest

29
BITS Pilani, Pilani Campus
Random Vector selection –
Forest RI (Random Input selection)
• Randomly select F input features to split at each node of the
decision tree and then fully grow the tree without pruning
• This helps reduce the bias present in the resulting tree
• The predictions are combined using a majority voting scheme
• To increase randomness, bagging can also be used to generate
bootstrap samples
• The strength and correlation of random forests may depend on
the size of F features
– If F is sufficiently small, then the trees tend to become less
correlated
• The strength of the tree classifier tends to improve with a larger
number of features F
• Optimal number of features: F = log₂(d) + 1 (where d is the number of input
features)
30
BITS Pilani, Pilani Campus
AdaBoost
• Adaptive boosting or AdaBoost is one of the
simplest boosting algorithms. Usually, decision
trees are used for modelling. Multiple sequential
models are created, each correcting the errors
from the last model.
• AdaBoost assigns weights to the observations
which are incorrectly predicted and the
subsequent model works to predict these values
correctly.

BITS Pilani, Pilani Campus


Adaboost Algorithm

• Initially, all observations (n) in the dataset are


given equal weights (1/n).
• A model is built on a subset of data.
• Using this model, predictions are made on the
whole dataset.
• Errors are calculated by comparing the
predictions and actual values.
• While creating the next model, higher weights
are given to the data points which were
predicted incorrectly.
BITS Pilani, Pilani Campus
Adaboost Algorithm

• Weights can be determined using the error


value. For instance, the higher the error, the more
weight is assigned to the observation.
• This process is repeated until the error
function does not change, or the maximum
limit of the number of estimators is
reached.

BITS Pilani, Pilani Campus


AdaBoost Algorithm

Alpha is the weight of the classifier: it determines how much influence this stump will have in the final classification.

Weight Update:

BITS Pilani, Pilani Campus


AdaBoost
• Reduce weight if correctly classified else increase
• If any intermediate rounds produce error rate higher than 50%, the weights
are reverted to 1/n and the resampling procedure is repeated

BITS Pilani, Pilani Campus


Classifier weight

α = ½ ln((1 − ε) / ε)

ε = 0.3: α = ½ ln((1 − 0.3) / 0.3) = 0.42365
ε = 0.7: α = ½ ln((1 − 0.7) / 0.7) = -0.42365
ε = 0.5: α = ½ ln((1 − 0.5) / 0.5) = 0
Notice three interesting observations:
1) a classifier with accuracy higher than 50% results in a positive
weight for the classifier (in other words, α > 0 if ε < 0.5),
2) a classifier with exactly 50% accuracy gets weight 0, and thus does not
contribute to the final prediction, and
3) errors 0.3 and 0.7 lead to classifier weights of equal magnitude but opposite sign.
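A small Python sketch (assumed notation, not the course code) of the classifier weight α = ½ ln((1 − ε)/ε) and the instance-weight update, reproducing the numbers above:

import numpy as np

def classifier_weight(eps):
    return 0.5 * np.log((1 - eps) / eps)

for eps in (0.3, 0.7, 0.5):
    print(eps, round(classifier_weight(eps), 5))   # 0.42365, -0.42365, 0.0

# Instance weights: decrease for correctly classified points, increase for mistakes, then normalize.
w = np.full(5, 0.2)                     # 5 training instances, equal initial weights (illustrative)
correct = np.array([1, 1, 0, 1, 1])     # 1 = predicted correctly (illustrative outcome)
alpha = classifier_weight(0.2)          # weight of a classifier with illustrative error 0.2
w = w * np.exp(np.where(correct == 1, -alpha, alpha))
w = w / w.sum()
print(w)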
36
BITS Pilani, Pilani Campus
Training instance weight update

37
BITS Pilani, Pilani Campus
Adaboost Example
Initial weights for each data point Data points
for training

0.1 0.1 0.1


Original
Data +++ - - - - - ++

B1
0.0094 0.0094 0.4623
Boosting
Round 1 +++ - - - - - - - a = 1.9459

38
BITS Pilani, Pilani Campus
Adaboost Example
B1
0.0094 0.0094 0.4623
Boosting
Round 1 +++ - - - - - - - a = 1.9459

B2
0.3037 0.0009 0.0422
Boosting
Round 2 - - - - - - - - ++ a = 2.9323

B3
0.0276 0.1819 0.0038
Boosting
Round 3 +++ ++ ++ + ++ a = 3.8744

Overall +++ - - - - - ++
39
BITS Pilani, Pilani Campus
Good References

Ensemble methods
https://www.slideshare.net/hustwj/an-introduction-to-ensemble-methodsboosting-
bagging-random-forests-and-more
Bagging and Boosting
• https://www.youtube.com/watch?time_continue=2&v=m-S9Hojj1as

XGBoost
• https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
• https://xgboost.readthedocs.io/en/latest/tutorials/model.html
• https://www.youtube.com/watch?time_continue=71&v=Vly8xGnNiWs
• https://www.slideshare.net/ShangxuanZhang/xgboost-55872323
• https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
• https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7
• https://towardsdatascience.com/machine-learning-for-diabetes-562dd7df4d42

40
BITS Pilani, Pilani Campus
Thank You

41
15 May 2022 BITS Pilani, Pilani Campus
Support Vector Machines
BITS Pilani
Pilani Campus

Slide Credits: Dr. Chetana Gavankar


BITS Pilani
Pilani Campus

Text Book(s)
T1 Christopher Bishop: Pattern Recognition and Machine Learning, Springer
International Edition
T2 Tom M. Mitchell: Machine Learning, The McGraw-Hill Companies, Inc..

These slides are prepared by the instructor, with grateful acknowledgement of Prof. Tom Mitchell, Prof. Burges,
Prof. Andrew Moore and many others who made their course materials freely available online.
Topics to be covered
§ Linear Classifiers
§ Maximum Margin Classification
§ Linear SVM
§ SVM optimization problem
§ Soft Margin SVM

BITS Pilani, Pilani Campus


Linear Classifiers
w x + b>0 f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 How would you
classify this data?

w x + b<0

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 How would you
classify this data?

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 How would you
classify this data?

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1 Any of these
would be fine..

..but which is
best?

BITS Pilani, Pilani Campus


Linear Classifiers
f(x,w,b) = sign(w x + b)

denotes +1
denotes -1
How would you
classify this data?

Misclassified
to +1 class

BITS Pilani, Pilani Campus


Linear Classifier

BITS Pilani, Pilani Campus


Linear Classifier

Decision Boundary

BITS Pilani, Pilani Campus


Large margin and support vectors

BITS Pilani, Pilani Campus


Support Vectors
• Geometric description of SVM is that the
max-margin hyperplane is completely
determined by those points that lie nearest to
it.
• Points that lie on this margin are the support
vectors.
• The points of our data set which if removed,
would alter the position of the dividing
hyperplane

BITS Pilani, Pilani Campus


Example

BITS Pilani, Pilani Campus


Support Vector Machines

BITS Pilani, Pilani Campus


Weight vector is perpendicular to
the hyperplane
Consider the points xa and xb,
which lie on the decision boundary.
This gives us two equations:
wᵀxa + b = 0
wᵀxb + b = 0
Subtracting these two equations gives us
wᵀ(xa − xb) = 0
Note that the vector xa − xb lies on the decision boundary,
and it is directed from xb to xa.
Since the dot product wᵀ(xa − xb) is zero,
w must be orthogonal to xa − xb and,
in turn, to the decision boundary.
BITS Pilani, Pilani Campus
Linear SVM Mathematically

2 parallel hyperplanes
bi1 and bi2 can be
written as,
bi1: w.x + b = 1,
bi2: w.x + b = -1

Margin of the decision


boundary is given by
the distance between
the 2 hyperplanes:
w.(x1 – x2) = 2
||w||*d = 2
d = 2/||w||

BITS Pilani, Pilani Campus


Linear SVM Mathematically
x+ and x−: points on the two margin hyperplanes; M = margin width

What we know:
• w · x+ + b = +1
• w · x− + b = −1
Distance between the two lines (obtained by solving the linear equations): M = 2 / ||w||
Maximizing the margin M = 2 / ||w|| is equivalent to minimizing ½ ||w||²

BITS Pilani, Pilani Campus


Solving the Optimization Problem

L(w, b, αi) = ½||w||² − Σ αi [yi (wᵀxi + b) − 1]

Find w and b such that
Φ(w) = ½||w||² is minimized,
and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1
(for yi = +1 this reads (wᵀxi + b) ≥ 1, and for yi = −1 it reads (wᵀxi + b) ≤ −1;
both are the same as yi (wᵀxi + b) ≥ 1)

BITS Pilani, Pilani Campus


Solving the Optimization Problem

Find w and b such that
Φ(w) = ½||w||² is minimized,   ← Primal problem
and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1

n Need to optimize a quadratic function subject to linear


inequality constraints.
n All constraints in SVM are linear
n Quadratic optimization problems are a well-known class of
mathematical programming problems, and many (rather
intricate) algorithms exist for solving them.
n The solution involves constructing an unconstrained problem
where a Lagrange multiplier αi is associated with every
constraint in the primal problem:

BITS Pilani, Pilani Campus


Optimization Problem

• Optimization problem is typically written:


Minimize f(x)
subject to
gi(x) = 0, i=1,…,p
hi(x) <= 0, i=1,…,m
• f(x) is called the objective function
• By changing x (the optimization variable) we wish to
find a value x∗ for which f(x) is at its minimum.
• p functions of gi define equality constraints and
• m functions hi define inequality constraints.
• The value we find MUST respect these constraints!

BITS Pilani, Pilani Campus


Solving the Optimization Problem

BITS Pilani, Pilani Campus


Solving the Optimization Problem

BITS Pilani, Pilani Campus


Dataset with noise
denotes +1
denotes -1
n Hard Margin: so far we require all data points to be classified correctly
- No training error
n What if the training set is noisy?

BITS Pilani, Pilani Campus


Soft Margin Classification
Slack variables ξi can be added to allow
misclassification of difficult or noisy examples.

What should our quadratic optimization criterion be?


Minimize
½ w·w + C Σ(k=1..R) εk

Constraints are

BITS Pilani, Pilani Campus


Slack Variable
• Slack variable as giving the classifier some
leniency when it comes to moving around
points near the margin.
• When C is large, larger slacks penalize the
objective function of SVM’s more than when C
is small.

BITS Pilani, Pilani Campus


Soft margin example

26
BITS Pilani, Pilani Campus
Soft Margin

BITS Pilani, Pilani Campus


Hard Margin versus
Soft Margin
n Hard Margin:
Find w and b such that
Φ(w) = ½ wᵀw is minimized, and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1

n Soft Margin incorporating slack variables:
Find w and b such that
Φ(w) = ½ wᵀw + C Σ ξi is minimized, and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i

n Parameter C can be viewed as a way to control


overfitting.

BITS Pilani, Pilani Campus


Value of C parameter

• C parameter tells the SVM optimization how


much you want to avoid misclassifying each
training example.
• For large values of C, the optimization will choose
a smaller-margin hyperplane if that hyperplane
does a better job of getting all the training points
classified correctly.
• Conversely, a very small value of C will cause the
optimizer to look for a larger-margin separating
hyperplane, even if that hyperplane misclassifies
more points.
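A minimal scikit-learn sketch (illustrative data) showing how C trades margin width against training errors in a linear SVM:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wider margin, more support vectors; large C: narrower margin, fewer training errors tolerated.
    print(C, clf.n_support_, clf.score(X, y))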

BITS Pilani, Pilani Campus


Effect of Margin size v/s
misclassification cost

BITS Pilani, Pilani Campus


Effect of Margin size v/s
misclassification cost

BITS Pilani, Pilani Campus


Effect of Margin size v/s
misclassification cost

BITS Pilani, Pilani Campus


Linear SVMs: Overview
n The classifier is a separating hyperplane.
n Most “important” training points are support vectors; they
define the hyperplane.
n Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.

f(x) = ΣαiyixiTx + b

BITS Pilani, Pilani Campus


Good Web References for SVM
• Text categorization with Support Vector Machines:
learning with many relevant features - T. Joachims, ECML
• A Tutorial on Support Vector Machines for Pattern Recognition, Kluwer Academic
Publishers - Christopher J.C. Burges
• http://www.cs.utexas.edu/users/mooney/cs391L/
• https://www.coursera.org/learn/machine-learning/home/week/7
• https://towardsdatascience.com/support-vector-machine-introduction-to-machine-
learning-algorithms-934a444fca47
• https://data-flair.training/blogs/svm-kernel-functions/
• MIT 6.034 Artificial Intelligence, Fall 2010
• https://stats.stackexchange.com/questions/30042/neural-networks-vs-support-
vector-machines-are-the-second-definitely-superior
• https://www.sciencedirect.com/science/article/abs/pii/S0893608006002796
• https://medium.com/deep-math-machine-learning-ai/chapter-3-support-vector-
machine-with-math-47d6193c82be
• Radial basis kernel

BITS Pilani, Pilani Campus


Thank You

27 December 2020 BITS Pilani, Pilani Campus


Clustering: Expectation Maximization

Swarna Chaudhary
BITS Pilani Asst. Professor
Pilani Campus WILP Division, BITS-Pilani
•The slides presented here are obtained from the authors of the books and from various other contributors. I
hereby acknowledge all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Unsupervised Learning
Learning from unlabelled data
Let X = {x(1), x(2), x(3), ..., x(N)}
– The points do not carry labels
Supervised vs. Unsupervised Learning
Objective:
– Find patterns / sub-groups among the data
points using data similarity

Unsupervised Learning - Find groupings based on
data similarity

BITS Pilani, Pilani Campus


Clustering

Clustering aims to find groupings in data


– Given a X, find K clusters using
data similarity

Unsupervised Learning - Find groupings based on data similarity

BITS Pilani, Pilani Campus


Clustering
Inter-cluster
distances
Intra-cluster are
distances are
maximized
• Clustering aims to find groupings in minimized

data
• Given a X, find K clusters
using data similarity

Unsupervised Learning - Find groupings based on data similarity

BITS Pilani, Pilani Campus


Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by
the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some
or all of the hidden patterns

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What is not Cluster Analysis?
• Supervised classification
– Have class label information

• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name

• Results of a query
– Groupings are a result of an external specification

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Applications: Customer Segmentation
Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Document Clustering

Document-term frequency matrix:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation System

https://ieeexplore.ieee.org/document/7019655 10

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Other Applications
• Clustering on Gene Expression data
• https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491664/

• Identifying Fake News


• http://snap.stanford.edu/mis2/files/MIS2_paper_2.pdf
• https://ai.intelligentonlinetools.com/ml/k-means-clustering-example-word2vec/

• COVID-19 Cluster Analysis


• https://towardsdatascience.com/covid-19-cluster-analysis-405ebbd10049

• Case Study
• https://medium.com/@msuginoo/three-different-lessons-from-three-different-clustering-
analyses-data-science-capstone-5f2be29cb3b2

11

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Notion of a Cluster can be Ambiguous

How many clusters? Six Clusters

Two Clusters Four Clusters

12

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clustering
• A clustering is a set of clusters

• An important distinction among types of clustering : hierarchical and partitional sets of clusters

• Partitional Clustering
– A division data objects into non-overlapping subsets (clusters) such that each data object is in
exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

• Density based
– identify distinctive groups/clusters in the data, based on the idea that a cluster in a data space
is a contiguous region of high point density, separated from other such clusters by contiguous
regions of low point density.

• Distribution Based
– Idea is data generated from the same distribution, belongs to the same cluster if there exists
several distributions in the dataset.

13

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Partitional Clustering

Original Points A Partitional Clustering

14

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering

p1
p3 p4
p2

p1 p2 p3 p4
Hierarchical Clustering
Dendrogram

15

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Density-Based
Density-based
– A cluster is a dense region of points, which is separated by low-density regions,
from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and
outliers are present.

6 density-based clusters
16

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity

17

BITS Pilani, Pilani Campus


Data Matrix and Dissimilarity Matrix
Data matrix
– n data points with p dimensions; row i holds object i:

  [ x11  ...  x1f  ...  x1p ]
  [ ...  ...  ...  ...  ... ]
  [ xi1  ...  xif  ...  xip ]
  [ ...  ...  ...  ...  ... ]
  [ xn1  ...  xnf  ...  xnp ]

Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix

  [ 0                              ]
  [ d(2,1)  0                      ]
  [ d(3,1)  d(3,2)  0              ]
  [ :       :       :              ]
  [ d(n,1)  d(n,2)  ...  ...  0    ]

18

BITS Pilani, Pilani Campus


Proximity Measure for Nominal
Attributes
Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary
attribute)
Simple matching
– m: # of matches, p: total # of variables
– d(i, j) = (p − m) / p

19

BITS Pilani, Pilani Campus


Proximity Measure for Binary Attributes
Object j
A contingency table for binary data
Object i
Distance measure for symmetric binary variables:
Distance measure for asymmetric binary variables:
Jaccard coefficient (similarity measure for asymmetric binary variables):

20

BITS Pilani, Pilani Campus


Dissimilarity between Binary Variables
Example Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

– Gender is a symmetric attribute


– The remaining attributes are asymmetric binary
– Let the values Y and P be 1, and the value N 0
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

21

BITS Pilani, Pilani Campus


Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h )^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is
the order (the distance so defined is also called the L-h norm)
Properties
– d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
– d(i, j) = d(j, i) (Symmetry)
– d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric

22

BITS Pilani, Pilani Campus


Example: Minkowski Distance
(Dissimilarity Matrices)

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)   x1     x2     x3     x4
x1               0
x2               5      0
x3               3      6      0
x4               6      1      7      0

Euclidean (L2)   x1     x2     x3     x4
x1               0
x2               3.61   0
x3               2.24   5.1    0
x4               4.24   1      5.39   0

Supremum (L∞)    x1     x2     x3     x4
x1               0
x2               3      0
x3               2      5      0
x4               3      1      5      0
23

BITS Pilani, Pilani Campus


Example
The two tables above show the ‘area’ and ‘price’
of the same objects. Only the units of the variables
change.
Calculate Euclidean distance in both the cases.

24
BITS Pilani, Pilani Campus
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
– replace xif by its rank rif ∈ {1, ..., Mf}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  zif = (rif − 1) / (Mf − 1)
– compute the dissimilarity using methods for interval-scaled variables

25

BITS Pilani, Pilani Campus


Attributes of Mixed Type: Gower distance
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal

https://healthcare.ai/clustering-non-continuous-variables/
26
BITS Pilani, Pilani Campus
Example
Based on the information given in the table below, find most similar and most dissimilar persons
among them. Apply min-max normalization on income to obtain [0,1] range. Consider profession
and mother tongue as nominal. Consider native place as ordinal variable with ranking order of
[Village, Small Town, Suburban, Metropolitan]. Give equal weight to each attribute.

Name Income Profession Mother tongue Native Place

Ram 70000 Doctor Bengali Village

Balram 50000 Data Scientist Hindi Small Town

Bharat 60000 Carpenter Hindi Suburban

Kishan 80000 Doctor Bhojpuri Metropolitan

27
BITS Pilani, Pilani Campus
Solution
After normalizing income and quantifying native place, we get

Name Income Profession Mother tongue Native Place


Ram 0.67 Doctor Bengali 1
Balram 0 Data Scientist Hindi 2
Bharat 0.33 Carpenter Hindi 3
Kishan 1 Doctor Bhojpuri 4

d(Ram, Balram) = 0.67+1+1+(2-1)/(4-1)=3 d(Ram, Bharat) = 0.33+1+1+(3-1)/(4-1)=3


d(Ram, Kishan) = 0.33+0+1+(4-1)/(4-1) = 2.33 d(Balram, Bharat) = 0.33+1+0+(3-2)/(4-1)=1.67
d(Balram, Kishan) = 1+1+1+(4-2)/(4-1) = 3.67 d(Bharat, Kishan) = 0.67+1+1+(4-3)/(4-1) = 3

Most similar – Balram and Bharat; Most dissimilar – Balram and Kishan

28
BITS Pilani, Pilani Campus
Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.

Other vector objects: gene features in micro-arrays, …


Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

29

BITS Pilani, Pilani Campus


Example: Cosine Similarity
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 =
6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 =
4.12
cos(d1, d2 ) = 0.94
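A short NumPy check of the cosine similarity computed above:

import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))   # 0.94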
30

BITS Pilani, Pilani Campus


K-Means

31
BITS Pilani, Pilani Campus
K-Means Algorithm
• Works iteratively to find {𝜇k}and {rnk} such that J is minimized

• Iteration involves two key steps


• Find {rnk} , fixing {𝜇k} to minimize J
• Find {𝜇k} , fixing {rnk} to minimize J
• Let us look at each of these steps

BITS Pilani, Pilani Campus


K-Means Algorithm
A sample E-Step

BITS Pilani, Pilani Campus


K-Means Algorithm
A sample E-Step

BITS Pilani, Pilani Campus


K-Means Algorithm
A sample E-Step

BITS Pilani, Pilani Campus


K-Means Algorithm
E-Step:

For all xt ∊ X:

BITS Pilani, Pilani Campus


K-Means Algorithm

M-Step:

For all 𝜇k[where k = 1,2,...,K ] :

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .
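A compact NumPy sketch (illustrative, not the course code) of the E-step / M-step loop above; the initialization simply picks k random points as centroids:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]      # initial centroids
    for _ in range(n_iter):
        # E-step: assign each point to its nearest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels

# e.g. the weight / glucose data from the worked example that follows:
X = np.array([[72, 185], [56, 170], [60, 168], [68, 179],
              [72, 182], [77, 188], [70, 180], [84, 183]], dtype=float)
print(kmeans(X, k=2))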

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

E-Step in the second iteration

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

M-Step in the second iteration

BITS Pilani, Pilani Campus


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

M-Step in the second iteration

BITS Pilani, Pilani Campus


Example
Consider the analysis of weights of individuals and their respective blood glucose levels as given
below:
a) Identify the clusters using K-means clustering (k=2) for the given data, assuming candidate
1 and 2 as initial centroids.
b) How many iterations does it take before termination?

Candidate
Glucose
Weight level
1 72 185
2 56 170
3 60 168
4 68 179
5 72 182
6 77 188
7 70 180
8 84 183

42

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example
Iteration 1 – squared distances to the initial centroids C1 = candidate 1 (72, 185) and C2 = candidate 2 (56, 170):

Candidate (weight, glucose)   Cluster1 (72, 185)   Cluster2 (56, 170)
3 (60, 168)                   17² + 12²            2² + 4²
4 (68, 179)                   6² + 4²              9² + 12²
5 (72, 182)                   3² + 0²              12² + 16²
6 (77, 188)                   3² + 5²              18² + 21²
7 (70, 180)                   5² + 2²              10² + 14²
8 (84, 183)                   2² + 12²             13² + 28²

After the 1st iteration, the cluster groups are C1{1, 4, 5, 6, 7, 8} and C2{2, 3}
Re-computing centroids:
C1( [72+68+72+77+70+84]/6, [185+179+182+188+180+183]/6 ) = C1(73.83, 182.83)
C2( [56+60]/2, [170+168]/2 ) = C2(58, 169)

2nd iteration – squared distances (glucose, weight) to the resulting centroids:

Candidate (glucose, weight)   Cluster1 (182.83, 73.83)                        Cluster2 (169, 58)
1 (185, 72)                   (182.83−185)² + (73.83−72)² = 2.17² + 1.83²     (169−185)² + (58−72)² = 16² + 14²
2 (170, 56)                   (182.83−170)² + (73.83−56)² = 12.83² + 17.83²   (169−170)² + (58−56)² = 1² + 2²
3 (168, 60)                   (182.83−168)² + (73.83−60)² = 14.83² + 13.83²   (169−168)² + (58−60)² = 1² + 2²
4 (179, 68)                   (182.83−179)² + (73.83−68)² = 3.83² + 5.83²     (169−179)² + (58−68)² = 10² + 10²
5 (182, 72)                   (182.83−182)² + (73.83−72)² = 0.83² + 1.83²     (169−182)² + (58−72)² = 13² + 14²
6 (188, 77)                   (182.83−188)² + (73.83−77)² = 5.17² + 3.17²     (169−188)² + (58−77)² = 19² + 19²
7 (180, 70)                   (182.83−180)² + (73.83−70)² = 2.17² + 1.83²     (169−180)² + (58−70)² = 11² + 12²
8 (183, 84)                   (182.83−183)² + (73.83−84)² = 0.17² + 10.17²    (169−183)² + (58−84)² = 14² + 26²

After the 2nd iteration, the cluster groups are again C1{1, 4, 5, 6, 7, 8} and C2{2, 3};
there is no change in the cluster groups, hence the algorithm terminates.
43

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids
[Figure: k-means on a 2-D point set; six panels show the cluster assignments and centroid positions at iterations 1–6, converging to well-separated clusters]

44

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids

[Figure: the final clustering obtained at iteration 6 with this choice of initial centroids]

45

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids …
[Figure: a different choice of initial centroids; five panels show iterations 1–5 converging to a poorer, sub-optimal clustering]
46

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids …
[Figure: the final, sub-optimal clustering obtained at iteration 5]

47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Variations of the K-Means Method
Most of the variants of the k-means which differ in

– Selection of the initial k means

– Dissimilarity calculations

– Strategies to calculate cluster means

Handling categorical data: k-modes

– Replacing means of clusters with modes

– Using new dissimilarity measures to deal with categorical objects

– Using a frequency-based method to update modes of clusters

48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data contains outliers.

49

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

50

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

51

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

52

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Discussion on the K-means Method
• Efficiency: O(tkn) where n = # of objects, k = # of clusters, and t = # of iterations
• Normally, k, t << n; thus, an efficient method
• K-means clustering often terminates at a local optimum
• Initialization can be important to find high-quality clusters
• Need to specify k, the number of clusters, in advance
• In practice, one often runs it for a range of k values and selects the “best” k
• Sensitive to noisy data and outliers
• Variations: using k-medians, k-medoids, etc.
• K-means is applicable only to objects in a continuous n-dimensional space
• Use k-modes for categorical data
• Not suitable for discovering clusters with non-convex shapes
• Use density-based clustering, kernel k-means, etc.

BITS Pilani, Pilani Campus


Drawbacks of K-means Clustering
• Hard Clustering vs Soft Clustering

https://towardsdatascience.com/gaussian-mixture-models-d13a5e915c8e
BITS Pilani, Pilani Campus
Gaussian Distribution
• The normal curve is bell-shaped and has a single peak at the exact center of the distribution.
• The arithmetic mean, median, and mode of the distribution are equal and located at
the peak.
• Half the area under the curve lies above (to the right of) the mean, and the other half lies below it.
• The normal distribution is symmetrical about its mean.

55

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


MATHEMATICAL FUNCTION (Pdf)

56

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Multivariate Gaussian
• Gaussian for a single variable (univariate) x:
  N(x | μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

• If x is a D-dimensional vector and x is Gaussian distributed:
  N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

Where
x: input vector
μ : D-dimensional mean vector
Σ : D × D covariance matrix
|Σ| : determinant of Σ
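A small SciPy sketch (values are illustrative) that evaluates this multivariate Gaussian density:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                      # D-dimensional mean vector
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])     # D x D covariance matrix

p = multivariate_normal(mean=mu, cov=Sigma).pdf([0.5, -0.5])
print(p)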

BITS Pilani, Pilani Campus


Univariate and Multivariate
Gaussian density

BITS Pilani, Pilani Campus


Multimodal Data
• A limitation of the Gaussian distribution is that it is intrinsically unimodal
• A single distribution is not sufficient to fit many real data sets, which are multimodal

Time to next Geyser Eruptions (in Mins) Length of Geyser Eruptions (in
Mins)
Demo of fitting a single gaussian to the data.

BITS Pilani, Pilani Campus


Multimodal Data

• A single distribution is not sufficient to fit


many real data sets

Time to next Geyser Eruptions (in Mins) Length of Geyser Eruptions (in
Mins)
Demo of fitting a k gaussian to the data.

BITS Pilani, Pilani Campus


Mixture Model
n A mixture model assumes that a set of observed objects is a mixture of instances
from multiple probabilistic clusters, and conceptually each observed object is
generated independently
n A probabilistic cluster is a distribution over the data space, which can be mathematically
represented using a probability density function (or distribution function).

n Our task: infer a set of k probabilistic clusters that is most likely to generate D

BITS Pilani, Pilani Campus


BITS Pilani, Pilani Campus
Gaussian Mixture Model
n A Gaussian mixture model is a probabilistic model that assumes all the data points are
generated from a mixture of a finite number of Gaussian distributions with unknown
parameters.
• A Gaussian Mixture is a function that is comprised of several Gaussians, each identified
by k ∈ {1,…, K}, where K is the number of clusters of our dataset. Each Gaussian k in
the mixture is comprised of the following parameters:
• A mean μ that defines its centre.
• A covariance Σ that defines its width. This would be equivalent to the dimensions of an
ellipsoid in a multivariate scenario.
• A mixing probability π that defines how big or small the Gaussian function will be.

https://scikit-learn.org/stable/modules/mixture.html

BITS Pilani, Pilani Campus


Gaussian Mixture Model

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.
Mixing Coefficients

𝙆 - Number of Gaussians

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.

Component Densities

𝙆 - Number of Gaussians

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Mixture of Gaussians
• Mixture of Gaussians: Component densities are Gaussian.

Parameters of MoG:

π : {π1 , . . . , πK },
μ : {μ1 , . . . , μK }
Σ : {Σ1, . . . ΣK }

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Log Likelihood
• Mixture of Gaussians: Component densities are Gaussian.

Parameters of MoG:

π : {π1 , . . . , πK },
μ : {μ1 , . . . , μK }
Σ : {Σ1, . . . ΣK }

Log likelihood

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


γ(zk )

Parameters of MoG:

π : {π1 , . . . , πK },
μ : {μ1 , . . . , μK }
Σ : {Σ1, . . . ΣK }

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book
PRML.

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm
• Method for finding ML solutions for models with Latent Variables
• Broad applicability for estimating parameters for various models
• Estimating parameters of a MoG is one application
• Task : To Find ML parameters of MoG

● Set the derivative of ln p(X | π, μ, Σ)
○ w.r.t. μk to 0 and solve it for μk
○ w.r.t. Σk to 0 and solve it for Σk
○ w.r.t. πk to 0 and solve it for πk  [constrained optimization using Lagrange multipliers,
since the mixing coefficients must sum to 1]

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm

● We get 𝜇k as

Where

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm

● We get Σk as

Where

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm
● We get 𝝿k as

Where

BITS Pilani, Pilani Campus


The EM (Expectation Maximization) Algorithm
• Powerful method for finding maximum likelihood solutions for models
• The (EM) algorithm: A framework to approach maximum likelihood of parameters in
statistical models.
– E-step assigns objects to clusters according to the current parameters of probabilistic
clusters
– M-step finds the new clustering or parameters that maximize the expected likelihood

• The k-means algorithm has two steps at each iteration:


– Expectation Step (E-step): Given the current cluster centers, each object is assigned to
the cluster whose center is closest to the object: An object is expected to belong to the
closest cluster
– Maximization Step (M-step): Given the cluster assignment, for each cluster, the
algorithm adjusts the center so that the sum of distance from the objects assigned to this
cluster and the new center is minimized

75
75
BITS Pilani, Pilani Campus
EM Algorithm for MoG
1. Start by placing gaussians randomly.
2. Repeat until it converges.
1. E step: With the current means and variances, find the probability of each data point xi
coming from each gaussian.
2. M step: Once these probability assignments are computed, use them to re-
estimate the Gaussians’ means and variances to better fit the data points.

BITS Pilani, Pilani Campus


EM for Gaussian Mixtures
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to
the parameters (comprising the means and covariances of the components and the mixing
coefficients).
1. Initialize the means μk, covariances Σk and mixing coefficients πk, and evaluate the initial
value of the log likelihood.
2. E step. Evaluate the responsibilities using the current parameter values
• For fixed values of μ1, μ2, .. μk and Σ1, Σ2, .. Σk and Π1 ….Πk, Compute soft assignments per
data point (allocating the probability of that data point belonging to each one of the clusters).

• For each observation i, vector γ (aka responsibility vector) is (γi1, γi2,… γiK ), where K
is the total number of clusters, or often referred to as the number of components.
• The cluster responsibilities for a single data point i should sum to 1.

BITS Pilani, Pilani Campus


EM for Gaussian Mixtures
3. M step. Re-estimate the parameters using the current responsibilities

, where

4. Evaluate the log likelihood

and check for convergence of either the parameters or the log likelihood. If the convergence
criterion is not satisfied return to step 2.
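A minimal scikit-learn sketch (illustrative data) that fits a mixture of Gaussians with EM and exposes the quantities named above:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (200, 2)), rng.normal(5, 1.5, (200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_)                 # mixing coefficients pi_k
print(gmm.means_)                   # means mu_k
print(gmm.predict_proba(X[:3]))     # responsibilities gamma for the first few points (rows sum to 1)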

BITS Pilani, Pilani Campus


Expectation-Maximization Algorithm
To Estimate: M-Step
Initialize 𝝿, 𝜇,Σ and
E-Step
also evaluate the log likelihood

Perform E-Step Given 𝛾(zk)

Perform M-Step Given 𝝿, 𝜇,Σ

Repeat Until Convergence - [ Use log likelihood / parameters to decide this]

BITS Pilani, Pilani Campus


BITS Pilani
Pilani Campus

Thank You!
Applied Machine Learning
Swarna Chaudhary
Assistant Professor
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
BITS Pilani
Pilani Campus

NeuralNetwork
These slides are prepared by the instructor, with grateful acknowledgement of
Tom Mitchell, Andrew Ng and many others who made their course materials
freely available online.
Session Content

• Perceptron (Chapter 4 Tom Mitchell)


• Neural Network Architecture (Andrew Ng Notes
and Chapter 4 Tom Mitchell)
• Back propagation Algorithm (Andrew Ng Notes)

3
Artificial Neural Network
Neural Networks
• Origins: Algorithms that try to mimic the brain.
• Very widely used in 80s and early 90s; popularity diminished in late 90s.
• Recent resurgence: State-of-the-art technique for many applications
• Artificial neural networks are not nearly as complex or intricate as the actual
brain structure

Based on slide by Andrew Ng 8


When to use Neural Network
Perceptron

-
Perceptron Training rule
Perceptron Training
Gradient Descent
Gradient Descent
Gradient Descent

12
Perceptron: Sigmoid Function

13
Decision Surface of Perceptron
AND Operation
Multilayer network

• Single perceptrons can only express linear


decision surfaces.
• In contrast, the kind of multilayer networks
learned by the FORWARD and
BACKPROPAGATION algorithm are capable
of expressing a rich variety of nonlinear
decision surfaces
Neural networks

Output units

Hidden units

Input units
Layered feed-forward network

• Neural networks are made up of nodes or units,


connected by links
• Each link has an associated weight and activation level
• Each node has an input function (typically summing over
weighted inputs), an activation function, and an output
Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 9
Neural Network

bias units x0 a(2)


0

hθ(x)

Layer 1 Layer 2 Layer 3


(Input Layer) (Hidden Layer) (Output Layer)

Slide by Andrew Ng 12
Feed-Forward Process
• Input layer units are set by some exterior function
(think of these as sensors), which causes their output
links to be activated at the specified level
• Working forward through the network, the input
function of each unit is applied to compute the input
value
– Usually this is just the weighted sum of the activation on
the links feeding into this node

• The activation function transforms this input


function into a final value
– Typically this is a nonlinear function, often a sigmoid
function corresponding to the “threshold” of that node
Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 13
Neural Network
x0
Multiple Output Units:One-vs-Rest
Multiple Output Units:One-vs-Rest
Neural Network Classification
Given:
{(x1, y1), (x2, y2), ..., (xn, yn)}
s ∈ N^L contains the # of nodes at each layer
– s0 = d (# features)

Binary classification:
y = 0 or 1
1 output unit (sL-1 = 1)

Multi-class classification (K classes):
y ∈ R^K, e.g. pedestrian, car, motorcycle, truck
K output units (sL-1 = K)

Slide by Andrew Ng 19
Layering Representations
x1 ... x20
x21 ... x40
x41 ... x60

.
..
x381 ... x400
20 × 20 pixel images
d = 400, 10 classes

Each image is “unrolled” into a vector x of pixel intensities


24
Cost Function
(9.1 NN video of Andrew Ng)
Optimizing the Neural Network
Forward Propagation
• Given one labeled training instance (x , y):

Forward Propagation
• a(1) = x
• z(2) = Θ(1) a(1)
• a(2) = g(z(2)) [add a0(2)]
• z(3) = Θ(2) a(2)
• a(3) = g(z(3)) [add a0(3)]
• z(4) = Θ(3) a(3)
• a(4) = hΘ(x) = g(z(4))

Based on slide by Andrew Ng 34
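A minimal NumPy sketch (assumed layer sizes, random weights) of the forward-propagation steps listed above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    # thetas[l] has shape (units in layer l+1, units in layer l + 1) so it absorbs the bias unit
    a = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)   # add the bias unit a0 = 1
        a = sigmoid(theta @ a)     # z = Theta a, then a = g(z)
    return a                       # h_theta(x)

# Example: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(4, 4)), rng.normal(size=(4, 5)), rng.normal(size=(1, 5))]
print(forward(np.array([0.2, -0.1, 0.5]), thetas))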


Learning in NN: Backpropagation
• Similar to the perceptron learning algorithm, we cycle
through our examples
– If the output of the network is correct, no changes are made
– If there is an error, weights are adjusted to reduce the error

• The trick is to assess the blame for the error and divide
it among the contributing weights

Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 31


Backpropagation Intuition
• Each hidden node j is “responsible” for some fraction
of the error δj(l) in each of the output nodes to
which it connects

• δj(l) is divided according to the strength of the


connection between hidden node and the output
node

• Then, the “blame” is propagated back to provide the


error values for the hidden layer

Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 35


Backpropagation

32
Han Kamber, 3rd Edition
34
Random Initialization
Training a Neural Network
Pick a network architecture (connectivity pattern between nodes)

• # input units = # of features in dataset


• # output units = # classes

Reasonable default: 1 hidden layer


• or if >1 hidden layer, have same # hidden units in
every layer (usually the more the better)

Based on slide by Andrew Ng 53


Training a Neural Network
1. Randomly initialize weights
2. Implement forward propagation to get hΘ(xi)
for any instance xi
3. Implement code to compute cost function J(Θ)
4. Implement backprop to compute partial derivatives

5. Use gradient descent with backprop to fit the network

Based on slide by Andrew Ng 54


38
Early Stopping
If we let a complex model train long
enough on a given data set, it can
eventually learn the data exactly.

Given data that isn’t represented in the


training set, the model will perform
poorly when analyzing the data
(overfitting).

How is the sweet spot for


training located?

When the error on the training set begins


to deviate from the error on the
validation set, a threshold can be set to
determine the early stopping condition
and the ideal number of epochs to train.
26
39
BITS Pilani, Pilani Campus
Dropout

40
Dropouts

28
41
BITS Pilani, Pilani Campus
Good References for understanding
Neural Network
Andrew Ng videos on neural network

https://www.youtube.com/watch?v=EVeqrPGfuCY&li
st=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=45

Autonomous driving using neural network


https://www.youtube.com/watch?v=ppFyPUx9RIU&li
st=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=57
Applied Machine Learning
Swarna Chaudhary
Assistant Professor
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
BITS Pilani
Pilani Campus

ConvolutionalNeural Network
These slides are prepared by the instructor, with grateful acknowledgement of Andrew
Ng, Tom Mitchell, Mitesh Khapra and many others who made their course materials
freely available online.
Session Content

• Convolutional Neural Network


• Recurrent Neural Network

3
Convolutional Neural Network

Source: https://www.youtube.com/watch?v=40riCqvRoMs
Convolutional Neural Network

5/68
How a computer sees an image.
source: http://cs231n.github.io/classification/

6/68

In CNN, every image is represented in the form of an array of pixel values.


Convolution
Convolution
Convolution
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

ew+fx+iy+jz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

ew+fx+iy+jz fw+gx+jy+kz

6/68
Let us apply this idea to a toy example and see the results

Input
Kernel
a b c d
w x
e f g h
y z
i j k A

Output

aw+bx+ey+fz bw+cx+fy+gz cw+dx+gy+hz

ew+fx+iy+jz fw+gx+jy+kz gw+hx+ky+Az

6/68
Example of kernel: Blur

1 1 1
∗ 1 1 1 =
1 1 1
blurs the image

CS7015 (Deep Mitesh M. Khapra 9/68


Learning) : Lecture 11
Example of kernel: Edge detection

1 1 1
∗ 1 -8 1 =
1 1 1
detects the edges

CS7015 (Deep Mitesh M. Khapra 11/68


Learning) : Lecture 11
Convolution
We just slide the kernel over the input
image
Each time we slide the kernel we get
one value in the output

CS7015 (Deep Mitesh M. Khapra 13/68


Learning) : Lecture 11
Convolution
We just slide the kernel over the input
image
Each time we slide the kernel we get
one value in the output

CS7015 (Deep Mitesh M. Khapra 13/68


Learning) : Lecture 11
2D convolutions applied to images
• We just slide the kernel over the
input image
• Each time we slide the kernel we
get one value in the output
• The resulting output is called a
feature map.
• We can use multiple filters to get
multiple feature maps.
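A small NumPy sketch (stride 1, no padding) of sliding a kernel over an image to produce one feature map:

import numpy as np

def conv2d(img, kernel):
    H, W = img.shape
    f = kernel.shape[0]                       # assume a square f x f kernel
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + f, j:j + f] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
edge = np.array([[1, 1, 1], [1, -8, 1], [1, 1, 1]], dtype=float)   # the edge-detection kernel shown earlier
print(conv2d(img, edge))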

CS7015 (Deep Mitesh M. Khapra 13/68


Learning) : Lecture 11
R GB

What would a 3D filter look like?


It will be 3D and we will refer to it as a volume
Once again we will slide the volume over the
3D input and compute the convolution oper-
ation
We will assume that the filter always extends
to the depth of the image

In effect, we are doing a 2D convolution oper-


ation on a 3D input (because the filter moves
along the height and the width but not along
the depth)
As a result the output will be 2D (only width
OUTPUT
and height, no depth)
INPUT
Once again we can apply multiple filters to get
multiple feature maps
CS7015 (Deep Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of
the output
Notice that we can’t place the kernel at the
= corners as it will cross the input boundary

pixel of interest

CS7015 (Deep Mitesh M. Khapra 19/68


Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of
the output
Notice that we can’t place the kernel at the
= corners as it will cross the input boundary
This is true for all the shaded points (the
kernel crosses the input boundary)
This results in an output which is of smaller
dimensions than the input
As the size of the kernel increases, this be-
comes true for even more pixels

CS7015 (Deep Mitesh M. Khapra 20/68


Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of
the output
Notice that we can’t place the kernel at the
= corners as it will cross the input boundary
This is true for all the shaded points (the
kernel crosses the input boundary)
This results in an output which is of smaller
dimensions than the input
pixel of interest
As the size of the kernel increases, this be-
comes true for even more pixels
For example, let’s consider a 5 ×5 kernel
We have an even smaller output now

CS7015 (Deep Mitesh M. Khapra 20/68


Learning) : Lecture 11
Let us compute the dimension (W 2, H 2) of the output

Notice that we can’t place the kernel at the
corners as it will cross the input boundary
This is true for all the shaded points (the
kernel crosses the input boundary)
This results in an output which is of smaller
dimensions than the input
As the size of the kernel increases, this becomes true for even more pixels
For example, let’s consider a 5 × 5 kernel: we have an even smaller output now
In general, W2 = W1 − F + 1
H2 = H1 − F + 1
We will refine this formula further

20/68
What if we want the output to be of the same size as the input?
We can use something known as padding
Pad the inputs with an appropriate number of 0 inputs so that you can now apply the kernel
at the corners
Let us use pad P = 1 with a 3 × 3 kernel
This means we will add one row and one column of 0 inputs at the top, bottom, left and right
[Figure: the input bordered by a ring of zeros]

We now have: W2 = W1 − F + 2P + 1, H2 = H1 − F + 2P + 1

21/68
What does the stride S do?
It defines the intervals at which the filter is applied (here S = 2)
Here, we are essentially skipping every 2nd pixel, which will again result in an output
of smaller dimensions
[Figure: the zero-padded input with the kernel applied at every 2nd position]

22/68
What does the stride S do?
It defines the intervals at which the filter is applied (here S = 2)
Here, we are essentially skipping every 2nd pixel, which will again result in an output
of smaller dimensions

So what should our final formula look like?

W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1

22/68
Finally, coming to the depth of the output.
Each filter gives us one 2D output.
K filters will give us K such 2D outputs.
[Figure: an input volume W1 × H1 × D1 mapped to an output volume W2 × H2 × D2]

W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1
D2 = K

23/68
Finally, coming to the depth of the output.
Each filter gives us one 2D output.
K filters will give us K such 2D outputs.
We can think of the resulting output as a K × W2 × H2 volume.
Thus D2 = K.

W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1
D2 = K

23/68
Example

Input: a 227 × 227 × 3 image, convolved with 96 filters of size 11 × 11 × 3
Stride = 4, Padding = 0
W2 = ?, H2 = ? (output depth = 96)
W2 = (W1 − F + 2P) / S + 1
H2 = (H1 − F + 2P) / S + 1

CS7015 (Deep Mitesh M. Khapra 24/68


Learning) : Lecture 11
Example

Input: a 227 × 227 × 3 image, convolved with 96 filters of size 11 × 11 × 3
Stride = 4, Padding = 0
W2 = (227 − 11 + 0) / 4 + 1 = 55
H2 = (227 − 11 + 0) / 4 + 1 = 55
Output volume: 55 × 55 × 96
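A one-line helper (sketch) for the output-size formula, checked on the numbers above:

def conv_out(W1, F, P, S):
    return (W1 - F + 2 * P) // S + 1

print(conv_out(227, 11, 0, 4))   # 55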

24/68
Applying multiple filters
Features fed to a classifier over {car, bus, monument, flower}:
– Raw pixels
– Edge Detector
– SIFT / HOG

28/68
Input Features Classifier

car, bus,monument, flower

0 0 0 0 0
0 1 1 1 0
0 1 -8 1 0
0 1 1 1 0
0 0 0 0 0

car, bus, monument, flower

[Matrix of classifier weights — learn these weights]

Instead of using handcrafted kernels such as edge detectors, can we learn meaningful kernels/filters in
addition to learning the weights of the classifier?
CS7015 (Deep Mitesh M. Khapra 29/68
Learning) : Lecture 11
Even better: instead of using handcrafted kernels (such as edge detectors), can we learn multiple meaningful kernels/filters in addition to learning the weights of the classifier?
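For intuition, here is a sketch (scipy assumed, toy image of my own) that applies the handcrafted edge-detection kernel from the slide; a CNN's goal is to learn such kernels rather than hard-code them.

```python
import numpy as np
from scipy.signal import convolve2d

# The handcrafted kernel shown above: a Laplacian-style edge detector
# embedded in a 5 x 5 grid.
edge_kernel = np.array([
    [0, 0,  0, 0, 0],
    [0, 1,  1, 1, 0],
    [0, 1, -8, 1, 0],
    [0, 1,  1, 1, 0],
    [0, 0,  0, 0, 0],
], dtype=float)

# Toy image: a bright square on a dark background
image = np.zeros((32, 32))
image[8:24, 8:24] = 1.0

edges = convolve2d(image, edge_kernel, mode='same')
# The kernel sums to zero, so flat regions give (near) zero response,
# while the square's boundary gives large responses: it "detects" edges.
print(np.abs(edges).max())          # large, at the boundary
print(edges[0, 0], edges[16, 16])   # ~0 in the flat background and flat interior
```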
Convolutional Neural Network

Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier?
• Yes, we can!
• Simply by treating these kernels as parameters and learning them, in addition to the weights of the classifier, using backpropagation
• Such a network is called a Convolutional Neural Network
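As a concrete, framework-specific illustration (PyTorch is my assumption here, not something the slides prescribe): a convolution layer's kernels are ordinary learnable parameters. The configuration below mirrors the earlier 227 × 227 × 3 example.

```python
import torch
import torch.nn as nn

# 3 input channels (RGB), 96 filters of size 11 x 11, stride 4, no padding
conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=0)

print(conv.weight.shape)          # torch.Size([96, 3, 11, 11]) -> K x D1 x F x F
print(conv.weight.requires_grad)  # True: the kernels are learned by backpropagation

x = torch.randn(1, 3, 227, 227)   # a batch containing one 227 x 227 RGB image
print(conv(x).shape)              # torch.Size([1, 96, 55, 55])
```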
Consider classifying an image with 16 pixels into 10 classes (digits)
• This is what a regular feed-forward neural network would look like
• There are many dense connections here
• For example, all the 16 input neurons contribute to the computation of h11
• Contrast this with what happens in the case of convolution
Sparse Connectivity
• Only a few local neurons participate in the computation of h11
• For example, only pixels 1, 2, 5, 6 contribute to h11
• The connections are much sparser
• In a CNN, by superimposing the kernel over the image, only a few inputs are considered at a time to compute a weighted sum of the selected pixel inputs
• The output h11 is thus calculated using sparse connections rather than connections to all the inputs
• This sparse connectivity reduces the number of parameters in the model
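A back-of-the-envelope count (illustrative numbers of my own, not from the slides) showing how sparse connectivity, together with the weight sharing that convolution also uses, shrinks the parameter count:

```python
# 4 x 4 input (16 pixels) feeding a 3 x 3 grid of hidden units,
# each hidden unit computed from a 2 x 2 patch of the image.

dense  = 16 * 9       # fully connected: every hidden unit sees all 16 pixels -> 144 weights
sparse = (2 * 2) * 9  # sparse connectivity: each hidden unit sees only its 2 x 2 patch -> 36 weights
shared = 2 * 2        # convolution also shares one kernel across all positions -> 4 weights

print(dense, sparse, shared)  # 144 36 4  (biases ignored for simplicity)
```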


An example CNN:

Input: 32 × 32
Convolution Layer 1: F = 5, S = 1, K = 6, P = 0  → 28 × 28 × 6   (Param = 150)
Pooling Layer 1:     F = 2, S = 2, K = 6, P = 0  → 14 × 14 × 6   (Param = 0)
Convolution Layer 2: F = 5, S = 1, K = 16, P = 0 → 10 × 10 × 16  (Param = 2400)
Pooling Layer 2:     F = 2, S = 2, K = 16, P = 0 → 5 × 5 × 16    (Param = 0)
FC 1 (120)   (Param = 48120)
FC 2 (84)    (Param = 10164)
Output (10)  (Param = 850)

It has alternate convolution and pooling layers.
What does a pooling layer do? Let us see.
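A PyTorch sketch of this architecture, as a rough rendering of the table above: the single-channel 32 × 32 input and the ReLU activations are my assumptions (the slide does not specify them), and the class name SmallCNN is made up.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),          # 32 -> 28, 6 feature maps
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 28 -> 14
            nn.Conv2d(6, 16, kernel_size=5),         # 14 -> 10, 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # 10 -> 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                            # 16 * 5 * 5 = 400 features
            nn.Linear(400, 120),                     # FC 1
            nn.ReLU(),
            nn.Linear(120, 84),                      # FC 2
            nn.ReLU(),
            nn.Linear(84, 10),                       # Output (10 classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
print(model(torch.randn(1, 1, 32, 32)).shape)        # torch.Size([1, 10])
```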
Pooling

Input (a single 4 × 4 feature map):
1 4 2 1
5 8 3 4
7 6 4 5
1 3 1 2

maxpool with 2 × 2 filters, stride 2:
8 4
7 5

maxpool with 2 × 2 filters, stride 1:
8 8 4
8 8 5
7 6 5

Instead of max pooling we can also do average pooling.
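A small numpy sketch (the function name max_pool2d is mine) that reproduces both results; switching .max() to .mean() gives average pooling.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a single 2D feature map (minimal sketch)."""
    H, W = x.shape
    H2, W2 = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1, 4, 2, 1],
              [5, 8, 3, 4],
              [7, 6, 4, 5],
              [1, 3, 1, 2]])

print(max_pool2d(x, size=2, stride=2))  # [[8. 4.] [7. 5.]]
print(max_pool2d(x, size=2, stride=1))  # [[8. 8. 4.] [8. 8. 5.] [7. 6. 5.]]
```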
Pooling Layer

The pooling layer downsamples each feature map produced by the convolution layer: it slides a small window (e.g. 2 × 2) over the map and keeps a summary value (the maximum or the average). This reduces the spatial dimensions and makes the representation more robust to small shifts in the input.
Training a CNN
• A CNN can be implemented as a feedforward neural network in which only a few weights (those corresponding to the kernel entries) are active
• The rest of the weights are zero
• We can thus train a convolutional neural network using backpropagation, by thinking of it as a feedforward neural network with sparse connections
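A minimal PyTorch training step (the framework and the tiny model are my assumptions, not from the slides) showing that the kernels receive gradients and are updated exactly like the classifier weights:

```python
import torch
import torch.nn as nn

# A tiny CNN, just to show that kernels are trained like any other weight.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),      # learnable 5 x 5 kernels
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6 * 28 * 28, 10),          # classifier weights
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 1, 32, 32)            # toy batch of 8 single-channel 32 x 32 images
y = torch.randint(0, 10, (8,))           # toy labels

loss = loss_fn(model(x), y)
loss.backward()                          # gradients flow into the conv kernels too
print(model[0].weight.grad.shape)        # torch.Size([6, 1, 5, 5])
optimizer.step()                         # kernels updated alongside the classifier weights
```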
Recurrent Networks: Time Series

• Suppose we want to predict next state of world


– and it depends on history of unknown length
– e.g., robot with forward-facing sensors trying to predict next
sensor reading as it moves and turns

• Idea: use hidden layer in network to capture state history


Recurrent Neural Network
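A minimal numpy sketch of the recurrent idea (all sizes and weights are illustrative, not from the slides): the hidden state is updated from both the current input and the previous state, so it summarises the history seen so far.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8                          # e.g. 4 sensor readings per time step
W_xh = 0.1 * rng.normal(size=(n_hidden, n_in))
W_hh = 0.1 * rng.normal(size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

h = np.zeros(n_hidden)                         # initial state: no history yet
for t in range(10):                            # a sequence of 10 sensor readings
    x_t = rng.normal(size=n_in)
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new state depends on input AND previous state

# h can now be fed to an output layer to predict the next sensor reading
print(h.shape)                                 # (8,)
```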
Artificial Neural Networks: Summary

• Highly non-linear regression/classification


• Hidden layers learn intermediate representations
• Potentially millions of parameters to estimate
• Stochastic gradient descent, local minima problems

• Deep networks have produced real progress in many fields


– computer vision
– speech recognition
– mapping images to text
– recommender systems
– …
• They learn very useful non-linear representations
Good References for understanding Neural Networks

• Mitesh Khapra: https://www.youtube.com/watch?v=yw8xwS15Pf4
• Visualization of CNN: https://www.youtube.com/watch?v=cNBBNAxC8l4
• Back propagation: https://www.youtube.com/watch?v=G5b4jRBKNxw&list=PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU&index=25
• Convolutional neural network tutorial: https://www.simplilearn.com/tutorials/deep-learning-tutorial/convolutional-neural-network?source=sl_frs_nav_playlist_video_clicked