
Applied Machine Learning

Decision Trees

BITS Pilani Anita Ramachandran


Pilani | Dubai | Goa | Hyderabad
2024

Decision Trees
• Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the decision tree

Decision Tree Construction:
Hunt’s Algorithm
Let Dt be the set of training records associated with node t, and let y = {y1, y2, ..., yc} be the set of class labels.
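The slide summarises Hunt's algorithm but the recursive steps appear only in the figure. Below is a minimal, hedged sketch of that recursive structure in Python (the record/label representation and the attribute-selection placeholder are illustrative, not the slides' own code; attribute selection criteria are discussed later).

from collections import Counter

def hunts(records, labels, attributes):
    # records: list of dicts, labels: list of class labels, attributes: list of attribute names
    # Case 1: all records in Dt belong to the same class -> leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no attributes left to test -> leaf with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise: choose an attribute test, partition the records, and recurse
    attr = attributes[0]                      # placeholder for the "best" attribute
    tree = {attr: {}}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        tree[attr][value] = hunts([r for r, _ in subset],
                                  [y for _, y in subset],
                                  [a for a in attributes if a != attr])
    return tree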

Decision Tree Construction:
Example

Design Decisions
• How should the training records be split?

• How should the splitting procedure stop?

Design Decisions - Splitting
Methods
• Binary Attributes
• Nominal Attributes
• Ordinal Attributes
• Continuous Attributes

Design Decisions - Selecting
the best split
• p(i|t): fraction of records associated with node t belonging to class i
• Best split is selected based on the degree of impurity of the child
nodes
• Class distribution (0,1) has high purity
• Class distribution (0.5,0.5) has the smallest purity (highest impurity)

• Intuition: high purity → small value of the impurity measure → better split
Entropy(t) = − Σ_{i=1}^{c} p(i|t) log2 p(i|t)

Gini(t) = 1 − Σ_{i=1}^{c} [p(i|t)]^2

Classification error(t) = 1 − max_i [p(i|t)]
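A short sketch of the three impurity measures in Python (not from the slides; the example distributions are the (0,1) and (0.5,0.5) cases mentioned above):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]), classification_error([0.5, 0.5]))  # 1.0 0.5 0.5  (highest impurity)
print(entropy([0.0, 1.0]), gini([0.0, 1.0]), classification_error([0.0, 1.0]))  # 0.0 0.0 0.0  (pure node)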


Design Decisions - Selecting
the best split
Entropy(t) = − Σ_{i=1}^{c} p(i|t) log2 p(i|t)

Gini(t) = 1 − Σ_{i=1}^{c} [p(i|t)]^2

Classification error(t) = 1 − max_i [p(i|t)]

Gini = ?
Entropy = ?
Error = ?

(Figure: comparison among the impurity measures for binary classification problems)

Design Decisions -
Information Gain
• In general the different impurity measures are consistent
• Gain of a test condition: compare the impurity of the
parent node with the impurity of the child nodes
Δ = I(parent) − Σ_{j=1}^{k} [N(vj)/N] · I(vj)

• I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(vj) is the number of records associated with the child node vj

• Maximizing the gain == minimizing the weighted average impurity of the child nodes
• If I() = Entropy(), then Δinfo is called information gain
Design Decisions -
Information Gain

Information gain measures the expected reduction in entropy caused by partitioning the training set according to an attribute
Design Decisions - Example –
Splitting Binary Attributes
With A:
Gini for N1 = 1 – 16/49 – 9/49 = 0.489
Gini for N2 = 1 – 4/25 – 9/25 = 0.48
Weighted avg Gini = 0.489 x 7/12 + 0.48 x 5/12 = 0.485

With B:
Gini for N1 = 1 – 1/25 – 16/25 = 0.32
Gini for N2 = 1 – 25/49 – 4/49 = 0.408
Weighted avg Gini = 0.32 x 5/12 + 0.408 x 7/12 = 0.37

Attribute B gives the lower weighted Gini (0.37 < 0.485), so B is the preferred split.

Design Decisions - Example –
Splitting Nominal Attributes

Design Decisions - Example –
Splitting Continuous Attributes
• Brute force method – high complexity
• Sort training records:
• Based on their annual income - O(N log N) complexity
• Candidate split positions are identified by taking the midpoints
between two adjacent sorted values
• Measure Gini index for each split position, and choose the one that
gives the lowest value
• Further optimization: consider only candidate split positions located
between two adjacent records with different class labels
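A hedged sketch of the procedure described above: sort the records by the continuous attribute, take midpoints of adjacent distinct values as candidate thresholds, and pick the one with the lowest weighted Gini. The income/label values are a small illustrative example, not the slides' own table.

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_continuous_split(values, labels):
    # Returns (threshold, weighted Gini) for the best binary split "value <= threshold"
    order = np.argsort(values)                      # O(N log N) sort
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best = (None, np.inf)
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue
        threshold = (values[i] + values[i + 1]) / 2.0   # candidate split = midpoint
        left, right = labels[:i + 1], labels[i + 1:]
        w_gini = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if w_gini < best[1]:
            best = (threshold, w_gini)
    return best

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # hypothetical annual income values
label  = ['N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N']
print(best_continuous_split(income, label))              # here: split at 97.5, weighted Gini 0.3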

Summary so far - Decision
Tree
• Build a model (based on past feature vectors) in the form of a tree structure that predicts the value of the output variable based on the input variables in the feature vector
• Each decision node of the tree corresponds to a test on one feature
• Root node, Branch node, Leaf node
• Building a Decision Tree - Recursive partitioning
• Splits data into multiple subsets on the basis of feature values
• Root node – entire dataset
• First selects the feature which predicts the target class in the strongest
way
• Splits the dataset into multiple partitions
• Stopping criteria
• All or most of the examples at a particular node have the same class
• All features have been used up in the partitioning
• The tree has grown to a pre-defined threshold limit

Example
• Consider the data set available for a company’s hiring
cycles. A student wants to find out if he may be offered a
job in the company. His parameters are as follows:
• CGPA: High, Communication: Bad, Aptitude: High,
Programming skills: Bad

Example
• Outcome = false for
all the cases where
Aptitude = Low,
irrespective of other
conditions

• So feature Aptitude
can be taken up as
the first node of the
decision tree

Example
• For Aptitude = HIGH, job offer condition is TRUE for all the cases
where Communication = Good.
• For cases where Communication = Bad, job offer condition is TRUE
for all the cases where CGPA = HIGH
• Use the below decision tree to predict outcome for (Aptitude = high,
Communication = Bad and CGPA = High)

Example: Entropy & Information
Gain Calculation - Level 1

Entropy & Information Gain
Calculation - Level 1

Entropy & Information Gain
Calculation - Level 1

• Information gain from CGPA = 0.99-0.69 = 0.3


• Information gain from Communication = 0.99-0.63 =
0.36
• Information gain from Programming skills = 0.04
• Information gain from Aptitude = 0.47 → Aptitude will be the first node of the decision tree formed
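A hedged helper that computes the kind of information-gain numbers quoted above: entropy of the labels before the split, minus the weighted entropy of the partitions induced by a categorical feature. The toy columns below only stand in for the hiring table (which is in the figure), so the printed value is illustrative.

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    # Entropy of the parent minus the weighted entropy of the children
    parent, n = entropy(labels), len(labels)
    weighted_children = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        weighted_children += len(subset) / n * entropy(subset)
    return parent - weighted_children

aptitude = ['High', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High']   # hypothetical column
offer    = ['Yes',  'Yes',  'No',  'No',   'No',  'Yes',  'No',  'Yes']
print(information_gain(aptitude, offer))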
Entropy & Information Gain
Calculation - Level 1
• For Aptitude = Low, entropy is 0
• The result will be the same always regardless of the values of
the other features
• The branch for Aptitude = Low will not continue any further

Entropy & Information Gain
Calculation - Level 2
• We will have only one branch to navigate: Aptitude =
High

Entropy & Information Gain
Calculation - Level 2

Entropy & Information Gain
Calculation - Level 2

Entropy & Information Gain
Calculation - Level 2
Entropy values at the end of Level 2:
• 0.85 before the split
• 0.33 when CGPA is used for split
• 0.30 when Communication is used for split
• 0.80 when Programming skill is used for split

Information Gains
• After split with CGPA = 0.52
• After split with Communication = 0.55
• After split with Programming skill = 0.05

• Highest information gain is from Communication, hence it should be used for the next-level split
• Entropy = 0 for Communication = Good, so that branch will not
continue further

Entropy & Information Gain
Calculation - Level 3
Entropy values at the end of Level 3:
• 0.81 before the split
• 0 when CGPA is used for split
• 0.50 when Programming skill is used for split

Entropy & Information Gain
Calculation - Level 2

Characteristics of Decision
Tree Induction
• Non-parametric approach
• Computationally inexpensive, even with large training set
• What is the worst case complexity of classifying a test record?
• Easy to interpret
• Accuracy is comparable to other classifiers
• Robust to noise, with methods to prevent overfitting
• Immune to presence of redundant or irrelevant attributes
• Suffers from data fragmentation (where have we heard this before?)
• How can we recover from this?
• Splits using single attribute at a time -> rectilinear decision
boundaries
• Limits decision tree representation for modeling complex relationships
among continuous attributes
• Tree pruning strategies have a greater effect on the performance of decision trees than the choice of impurity measure

Issues in Decision Tree
Learning
• Handling training examples with missing data, attributes
with differing costs
• Model overfitting
• Causes of model overfitting
• Estimating generalization error
• Handling overfitting
• Evaluating classifier performance

Issues in Decision Tree
Learning - Model Overfitting
• Training error (resubstitution error, apparent error)
• Generalization error
• Good model - low training error, low generalization error
• Low training error, high generalization error
==>overfitting
• Reasons for overfitting
• Presence of noise
• Lack of representative samples
• Model complexity
• Primary reason for overfitting – still a subject of debate

Example Data Set
Two class problem:
+ : 5200 instances
• 5000 instances generated
from a Gaussian centered at
(10,10)

• 200 noisy instances added

o : 5200 instances
• Generated from a uniform
distribution

10% of the data is used for training and 90% of the data is used for testing



Increasing number of nodes in
Decision Trees



Decision Tree with 4 nodes

Decision Tree

Decision boundaries on Training data



Decision Tree with 50 nodes

Decision Tree

Decision boundaries on Training data



Which tree is better?

Decision Tree with 4 nodes

Which tree is better ?


Decision Tree with 50 nodes



Model Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large



Issues in Decision Tree
Learning - Example
• Overfitting due to the presence of noise
Training set Test set

What are the error rates of the


decision trees on the test set?

Issues in Decision Tree
Learning - Example
• Overfitting due to the lack of representative samples
Training set Test set

What is the error rate of the


decision tree on the test set?

Issues in Decision Tree
Learning - Model Complexity
• The chance for model overfitting increases as the model becomes more
complex.
• Prefer simpler models - Occam's razor or the principle of parsimony
• Given two models with the same generalization errors, the simpler model is preferred
over the more complex model.
• Ideal complexity – produces the lowest generalization error, but only
training data to learn from
• Best possibility – estimate the generalization error
• Methods to estimate generalization error
• Resubstitution estimate
• Selects the model that produces the lowest training error rate as its final model
• Pessimistic error estimate
• Minimum Description Length Principle
• Seek a model that minimizes the overall cost function
• Refer example
• Using a validation set

Issues in Decision Tree
Learning - Handling Overfitting
Two approaches

• Prepruning (early stopping rule)


• Stop growing the tree earlier, before it reaches the
point where it perfectly classifies the training data
• Difficult to estimate when to stop growing the tree

• Post-pruning
• Allow the tree to overfit the data, and then post-
prune the tree

Model Selection for
Decision Trees
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if number of instances is less than some user-specified
threshold
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
• Stop if estimated generalization error falls below certain threshold



Model Selection for
Decision Trees
• Post-pruning
– Grow decision tree to its entirety
– Subtree replacement
• Trim the nodes of the decision tree in a bottom-up fashion
• If generalization error improves after trimming, replace
sub-tree by a leaf node
• Class label of leaf node is determined from majority class
of instances in the sub-tree
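A hedged sketch of both pruning styles using scikit-learn (assuming it is available; this is not the slides' own example). max_depth and min_samples_leaf act as pre-pruning stopping conditions, while ccp_alpha applies cost-complexity post-pruning to a fully grown tree.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early via depth / leaf-size thresholds
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10).fit(X_tr, y_tr)

# Post-pruning: grow fully, then prune with a cost-complexity parameter (ccp_alpha > 0)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]        # one candidate alpha; tune on a validation set
post = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))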



Evaluating the Classifier
Performance
• Holdout method
• Random subsampling
• Cross-validation
• Bootstrap

ID3
• Uses Hunt’s algorithm
• Uses information gain as the measure to select the best
attribute at each step in the growing tree
• Does not have the ability to determine how many
alternative decision trees are consistent with the
available training data
• Performs no backtracking in its search
• Shorter trees are preferred over longer trees. Trees that
place high information gain attributes close to the root
are preferred over those that do not - Inductive bias of
ID3
• Refer: Mitchell

Combining Multiple Learners

S.P.Vimal
Agenda

1. Combining Learners - Getting Started


2. Bagging
3. Boosting (AdaBoost)
Getting Started

● No Free Lunch Theorem: There is no algorithm that is always the most


accurate
● Each learning algorithm dictates a certain model that comes with a set of
assumptions
○ Each algorithm converges to a different solution and fails under different
circumstances
■ The best tuned learners could still miss some examples, and there could be other learners which work better on (maybe only) those!
○ In the absence of a single expert (a superior model), a committee (a combination of models) can do better!
■ A committee can work in many ways ...
Committee of Models

● Committee Members are base


learners !
● Major challenges dealing with this
committee
○ Expertise of each of the
members (Does it help / not?)
○ Combining the results from the
members for better performance
Issue -1 : On the members ( Base Learners )

● It does not help if all learners are


good/bad at roughly same thing
○ Need Diverse Learners
Issue -1 : On the members ( Base Learners )

● Use Different Algorithms


○ Different algorithms make
different assumptions
● Use Different Hyperparameters, that
is ,
○ vary the structure of neural nets
○ different kernel functions in SVM
Issue -1 : On the members ( Base Learners )

● Use different input representations


○ Uttered words + video
information of speakers clips
○ image + text annotations
● Use different training sets
○ Draw different random samples
of data
○ Partition data in the input space
and have learners specialized in
those spaces (mixture of
experts)
Issue -1 : On the members ( Base Learners )

● Diversity Vs. Accuracy


○ Learners must be reasonably
accurate to begin with
○ Base learners must be chosen for their simplicity, not for their accuracy
Issue -2 : Combining Results of Base Learners

A Simple Combination Scheme:


Issue -2 : Combining Results of Base Learners
Issue -2 : Combining Results of Base Learners
Bagging

• Technique that uses subsets (bags) to get a fair idea of the distribution (complete set).
• The size of subsets created for bagging may be less than the original set.
• Bootstrapping is a sampling technique in which we create subsets of observations from the
original dataset, with replacement.
• When you sample with replacement, draws are independent: one draw does not affect the outcome of the next. With 7 items, you have a 1/7 chance of choosing any particular item on the first draw and again a 1/7 chance on the second draw.
• When you sample without replacement, draws are dependent: the first pick has probability 1/7, but since that item is not put back, only six items remain, giving a 1/6 chance for the second pick.
• Multiple subsets are created from the original dataset, selecting observations with
replacement.
• A base model (weak model) is created on each of these subsets.
• The models run in parallel and are independent of each other.
• The final predictions are determined by combining the predictions from all the models.
Bagging

● A voting method whereby


base-learners are made
different by training them
over slightly different
training sets
○ L slightly different samples (X1, X2, …, XL) are drawn from the given sample by bootstrap
○ Use voting / averaging as required for classification / regression
Bagging
Example of dataset used to construct an ensemble of bagging classifiers
Bagging

• Repeatedly samples with replacement


from a dataset.
• Each bootstrap sample has the
same size as the original data.
• Some instances may appear several
times in the same training set, while
others may be omitted.
• On average, a bootstrap sample contains nearly 63% of the original training data – find out why (see the sketch after this list)

• Improves generalization error by


reducing the variance of the base
classifiers
• If base classifier is unstable, bagging
helps to reduce errors associated with
random fluctuations in the training data
• If base classifier is stable, error of the
ensemble is primarily caused by bias in
the classifier
• Less susceptible to model overfitting
when applied to noisy data because it
doesn’t focus on any particular instance
of training data
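A short sketch explaining the ~63% figure mentioned above (not from the slides). The probability that a given record is not picked in one draw is (1 − 1/N); after N draws with replacement it is (1 − 1/N)^N → e^−1, so roughly 1 − e^−1 ≈ 63.2% of the distinct records appear in a bootstrap sample.

import numpy as np

N = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, N, size=N)          # one bootstrap sample of the indices 0..N-1
print(len(np.unique(sample)) / N)            # empirically ~0.632
print(1 - (1 - 1 / N) ** N, 1 - np.exp(-1))  # analytical value vs. the limit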
Random Forests (using bagging)

A hyper parameter

Generalizes well as the number of trees in the forest is large !

Taken from Prof. Sugata’s Slide Deck


Boosting

● Bagging: generating complementary base-learners is left to chance and to the instability of the learning method
● Boosting : Actively try to generate complementary base-learners by
training the next learner on the mistakes of the previous learners
● ( Schapire 1990 )
○ Combines 3 weak learners (error probability < 0.5)
○ Divides data into three sets, X1,X2,X3
○ Uses 3 learners d1, d2 and d3.
■ Use X1 to train d1.
■ Feed X2 and test d1
■ Train d2 with all misclassified by d1 and examples from X2
■ ...
AdaBoost (Freund and Schapire (1996))

● Adaptive Boosting, which uses


○ the same training set over and over again
○ simple classifiers which do not overfit!
● Works by modifying the probabilities of drawing the instances as a
function of the error
● Let pjt denote the probability that the instance pair (xt, rt) is drawn to train the jth base-learner
○ Start by setting all pjt = 1/N
○ Create new base learners which stresses more on the instances
misclassified by the previous one by updating pjt accordingly.
AdaBoost (Freund and Schapire (1996)

From Léon Bottou


AdaBoost (Freund and Schapire (1996)

From Léon Bottou


AdaBoost (Freund and Schapire (1996)

From Léon Bottou


AdaBoost (Freund and Schapire (1996)

From Léon Bottou


AdaBoost (Freund and Schapire (1996)

How do we combine the results now?

From Léon Bottou


AdaBoost (Freund and Schapire (1996)

How do we combine the results now?

From Léon Bottou


Thank You!
Issue -2 : Combining Results of Base Learners
f(x) = 2 sin(1.5x), and noisy samples with N(0, 1) noise
Issue -2 : Combining Results of Base Learners
Order 1, polynomial
Issue -2 : Combining Results of Base Learners
Order 3, polynomial
Issue -2 : Combining Results of Base Learners
Order 5, polynomial
Issue -2 : Combining Results of Base Learners
Order 5, polynomial

Averaging over models with large variance, we get a better fit than those
of the individual models …
Issue -2 : Combining Results of Base Learners
Order 5, polynomial

Idea for Voting:


● Vote over models with high variance and low bias
● After combination, the bias remains small and we reduce the
variance by averaging
● Even if the individual models are biased, the decrease in variance
may offset this bias and still a decrease in error is possible.
AdaBoost (Freund and Schapire (1996)
AdaBoost (Freund and Schapire (1996)

Begin here, initializing all examples assigned same probability


AdaBoost (Freund and Schapire (1996)

Start with L base learners, train one by one


AdaBoost (Freund and Schapire (1996)

With Xj chosen using pjt , train dj and compute yj


AdaBoost (Freund and Schapire (1996)

Weighted sum of all errors -- Errors weighted by the importance of the example
AdaBoost (Freund and Schapire (1996)

If the error rate is > 0.5, the base learner is no better than chance; it is not acceptable and the procedure cannot proceed!
AdaBoost (Freund and Schapire (1996)

Very naive to scale epsilon in the range [0-1].


-> Low beta for correctly classified instances
AdaBoost (Freund and Schapire (1996)
Approach #2:

Very naive to scale epsilon in the range [0-1].


AdaBoost (Freund and Schapire (1996)
Approach #2:

Instead compute 𝛼, a magic quantity?

(Plot: 𝛼 as a function of ε)
AdaBoost (Freund and Schapire (1996)
Approach #2:

Higher 𝛼 → better learner

Instead compute 𝛼, a magic quantity ?


AdaBoost (Freund and Schapire (1996)

Approach #2:

For all incorrectly classified examples, increase weights as
pjt = pjt * e^𝛼

For all correctly classified examples, decrease weights as
pjt = pjt * e^(−𝛼)
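A hedged sketch of this boosting loop in Python (not the slides' own code). It assumes decision stumps from scikit-learn as the simple base learners and labels in {−1, +1}; 𝛼 = ½ ln((1 − ε)/ε) is one standard choice consistent with the e^±𝛼 updates above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier   # stumps as weak learners (an assumption)

def adaboost_fit(X, y, L=10):
    n = len(y)
    p = np.full(n, 1.0 / n)                        # start with uniform instance probabilities
    ensemble = []
    for _ in range(L):
        d = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        pred = d.predict(X)
        eps = np.sum(p[pred != y])                 # weighted error of this base learner
        if eps >= 0.5:                             # no better than chance: stop adding learners
            break
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        p *= np.exp(-alpha * y * pred)             # e^alpha on mistakes, e^-alpha on correct ones
        p /= p.sum()                               # normalise the probabilities for the next round
        ensemble.append((alpha, d))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * d.predict(X) for a, d in ensemble))   # alpha-weighted vote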
AdaBoost (Freund and Schapire (1996)

Normalize the probabilities to use in the next iteration


Unsupervised Learning Algorithms

S.P.Vimal
BITS, Pilani
Topics

• Unsupervised Learning & Clustering


• Hierarchical Agglomerative Clustering
• K-Means - A Quick Overview
• EM Algorithm
Introduction
Unsupervised Learning
• Learning from unlabelled data
• Let X = {x(1),x(2),x(3),...,x(N)}
• The points do not carry labels
• Supervised vs. Unsupervised Learning
• Objective:
• Find patterns / sub-groups among the data points using data similarity

Unsupervised Learning – find groupings using data similarity
Introduction

Hierarchical Clustering
Agglomerative Hierarchical Clustering

• More common than divisive hierarchical clustering


• Starts with each point being a cluster, and at each step, merge the closest pair of clusters
• Displayed graphically using a dendrogram – a tree like structure (dendro "tree",
gramma "drawing")

(Figure: six points and the corresponding dendrogram, with merge heights on the vertical axis)
Agglomerative Hierarchical Clustering
Basic algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
Starting Situation

• Start with clusters of individual points and a proximity matrix


(Figure: each point p1 … p12 starts as its own cluster, with the corresponding proximity matrix)
Intermediate Situation

• After some merging steps, we have some clusters C1 … C5
(Figure: current clusters and the proximity matrix over C1 … C5)
Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: proximity matrix before merging C2 and C5)
After Merging

• The question is: "How do we update the proximity matrix?" after replacing C2 and C5 by the merged cluster C2 ∪ C5
(Figure: proximity matrix with the new cluster C2 ∪ C5; its row and column entries are marked '?')
How to Define Inter-Cluster Similarity
• MIN (single link)
• MAX (complete link)
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error
Cluster Similarity: MAX or Complete Linkage

• Similarity of two clusters is based on the two least similar (most distant) points in the
different clusters

I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Cluster Similarity: Group Average

• Proximity of two clusters is the average of pairwise proximity between points in the two
clusters.

• cluster similarity = average similarity of all pairs

I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Hierarchical Clustering: Comparison

(Figure: dendrograms produced by MIN, MAX, Group Average, and Ward's Method on the same six points)
Example

p1 0
p2 0.24 0
p3 0.22 0.15 0
p4 0.37 0.20 0.15 0
p5 0.34 0.14 0.28 0.29 0
p6 0.23 0.25 0.11 0.22 0.39 0
p1 p2 p3 p4 p5 p6

Single Link or MIN

Single Link or MIN

• The height at which 2 clusters are merged in the Dendrogram reflects the distance of the two
clusters
• The distance between 3 and 6 is .11

• The distance between clusters {3,6} and {2,5} is given by min(d(3,2), d(6,2), d(3,5), d(6,5)) = min(0.15, 0.25, 0.28, 0.39) = 0.15
p1 0
p2 0.24 0
(p3, p6) 0.22 0.15 0
p4 0.37 0.20 0.15 0
p5 0.34 0.14 0.28 0.29 0
   p1   p2   (p3, p6)   p4   p5
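A hedged sketch that reproduces this single-link clustering with SciPy (assuming scipy is available; not the slides' own code). The matrix is the p1–p6 distance matrix given above; the first merge should be (p3, p6) at height 0.11.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

D = np.array([
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
])
Z = linkage(squareform(D), method='single')   # single link = MIN
print(Z)             # each row: (cluster_i, cluster_j, merge distance, new cluster size)
# dendrogram(Z)      # plot the dendrogram if matplotlib is available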
Agglomerative Algorithm
K-Means Introduction

Clustering

• An unsupervised learning task aiming to find groupings in data
• Given a dataset X, find K clusters using data similarity

Unsupervised Learning – find groupings using data similarity
K-Means Introduction
Clustering

Choosing K=2
K-Means Example

Iteration 1:
• C1 = {2,3} C2 = {4,10,11,12,20,25,30}
• μ1 = (2+3)/2 = 2.5
• μ2 = (4+10+11+12+20+25+30)/7 = 16
Iteration 2:
• C1 = {2,3,4} C2 = {10,11,12,20,25,30}
• μ1 = (2+3+4)/3 = 3
• μ2 = (10+11+12+20+25+30)/6 = 18

and so on..
K-Means
What is it about?
• k-means partitions points in K clusters such that
• Distance between points inside the clusters is
smaller than distance between points across
clusters

• Let
• 𝜇k - Mean of all points in cluster k
- a d-dimensional vector

• rnk 𝜖 {0,1}
• rnk = 1, point xn is assigned to cluster k
• rnk = 0, otherwise
K-Means
Distortion Measure
• k-means identifies K clusters such that the sum of the
squares of the distances of each data point to its center
(represented by its mean) is minimized.

Objective function: J = Σn Σk rnk ‖xn − 𝜇k‖²

• Objective of Learning: Find {𝜇k}and {rnk} such that J is


minimized for the given X and K
• K-Means is an iterative algorithm to find such {𝜇k} and
{rnk}
K-Means Algorithm
How does it work?
• Works iteratively to find {𝜇k} and {rnk} such that J is
minimized

Iteration involves two key steps


(1) Find {rnk} , fixing {𝜇k} to minimize J
(2) Find {𝜇k} , fixing {rnk} to minimize J
Let us look at each of these steps
K-Means Algorithm
How does it work?
• Works iteratively to find {𝜇k}and {rnk} such that J is
minimized

Let us determine rnk :


Assume 𝜇1 and 𝜇2 are fixed.
K-Means Algorithm
Understand E-Step
• Works iteratively to find {𝜇k}and {rnk} such that J is
minimized

Let us determine rnk :


Assume 𝜇1 and 𝜇2 are fixed.
J is a linear combination of rnk.
Each rik can be optimized independently.
Assigning rik to the closest 𝜇j minimizes J.
[Called as E-Step]
K-Means Algorithm

E-Step:

For all xn ∈ X: set rnk = 1 if k = argmin_j ‖xn − 𝜇j‖², and rnk = 0 otherwise
K-Means Algorithm

• Works iteratively to find {𝜇k}and {rnk} such that J is


minimized

Let us determine 𝜇k :
Assume {rnk} are determined in the E-Step are fixed.
K-Means Algorithm

• Works iteratively to find {𝜇k}and {rnk} such that J is


minimized

Let us determine 𝜇k :
Assume {rnk} are determined in the E-Step are fixed.
J is a quadratic function of 𝜇k .
To optimize,
Take the derivative of J w.r.t 𝜇k, & set it to 0
Solve for 𝜇k, we get
K-Means Algorithm

• Works iteratively to find {𝜇k}and {rnk} such that J is


minimized

Let us determine 𝜇k :
Assume {rnk} are determined in the E-Step are fixed.
J is a quadratic function of 𝜇k .
To optimize,
Take the derivative of J w.r.t 𝜇k, & set it to 0
Solve for 𝜇k, we get 𝜇k = (Σn rnk xn) / (Σn rnk), i.e., the sum of all the points in a cluster divided by the number of points in the cluster
K-Means Algorithm

M-Step:

For all 𝜇k [where k = 1, 2, ..., K]: 𝜇k = (Σn rnk xn) / (Σn rnk)


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .
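A minimal NumPy sketch of this E-step / M-step loop (not the slides' own code). The 1-D data are the values from the earlier worked example; initialisation here is random, so intermediate iterations may differ from the slide's, even though the final means should be similar.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]            # initialise the K means
    for _ in range(n_iter):
        # E-step: assign each point to its closest mean (this defines r_nk)
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        r = d.argmin(axis=1)
        # M-step: recompute each mean as the average of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                          # convergence of the means
            break
        mu = new_mu
    return mu, r

X = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)
print(kmeans(X, K=2))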
K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

E-Step in the second iteration


K-Means Algorithm

Algorithm:

Initialize 𝜇k[where k = 1,2,...,K ]


Repeat
E-Step [as defined earlier]
M-Step [as defined earlier]
Until convergence of 𝜇k .

M-Step in the second iteration


K-Means Algorithm
Convergence

• E and M step of each iteration


optimizes J. Stop K-Means when
• J no longer changes, or the changes to J are not significant
• There may be an upper limit set on the number of iterations
• The choice of initial values has an impact on the convergence
Multimodal Data
Issues fitting single gaussian to real data set

• The Gaussian distribution is intrinsically unimodal
• A single Gaussian is not sufficient to fit many real data sets, which are multimodal
(Figure: Old Faithful data – time to next geyser eruption vs. length of eruption, in minutes)

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & is used by the text book
PRML.
Mixture Distributions

• Mixture Distributions: Linear combinations of basic distributions (such as Gaussian) to


approximate complex density

Data Set: The data used this module came from ‘Old Faithful Data’ available from https://www.kaggle.com/janithwanni/old-faithful/data for download & used by the text book PRML.
EM Clustering

• The EM algorithm finds


maximum-likelihood
estimates for model
parameters when you have
incomplete data. The "E-
Step" finds probabilities for
the assignment of data
points, based on a set of
hypothesized probability
density functions; The "M-
Step" updates the original
hypothesis with new data.
The cycle repeats until the
parameters stabilize.
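A hedged sketch of EM clustering via a Gaussian mixture in scikit-learn (assuming it is available; not the slides' own code). The synthetic bimodal data only stand in for the Old Faithful measurements.

import numpy as np
from sklearn.mixture import GaussianMixture   # EM-based Gaussian mixture model

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(150, 2)),   # illustrative cluster 1
               rng.normal([4.3, 80.0], [0.4, 6.0], size=(150, 2))])  # illustrative cluster 2

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
print(gmm.means_)                  # fitted component means (M-step result at convergence)
print(gmm.predict_proba(X[:3]))    # E-step responsibilities for the first few points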
EM Clustering - Example
Additional Slides
Mean Shift Algorithm

Outline

The mean shift algorithm seeks modes or local maxima of density in the feature
space
Mean shift
Search
window

Center of
mass

Mean Shift
vector

Slide by Y. Ukrainitz & B. Sarel



Mean shift
Search
window

Center of
mass

Ref: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Slide by Y. Ukrainitz & B. Sarel
Thank You!
ANN and Deep Learning

Dr. Sugata Ghosal


sugata.ghosal@pilani.bits-pilani.ac.in
Linear Classification
Consider the following data (it is in 2D, so let us visualize it):

x1  x2  y
1   9   Green
10  9   Green
4   7   Green
4   5   Red
5   3   Red
8   9   Green
4   2   Red
2   5   Red
7   1   Red
2   10  Green
8   5   Green
1   2   Red
8   2   Red

• Data looks linearly separable
• What is the decision boundary? Many possibilities, such as:
  if (2x1 + 3x2 − 25 > 0) it is green, otherwise red
What about this arrangement?
With chosen decision boundary 2x1 + 3x2 − 25 = 0
(Figure: perceptron diagram — inputs x1, x2 with weights 2 and 3, bias −25, and a sign unit producing the output)

• This illustration is called a perceptron


• Provides a graphical way to represent the linear boundary
• Values 3, 2, -25 are its parameters or weights

Given a data
“How to find appropriate parameters?” is an important issue
Perceptron Training Rule
Different algorithms may converge to different acceptable hypotheses

Algorithm: Perceptron training rule


1 Begin with random weights w
2 repeat
3   for each misclassified example do
4     wi = wi + η(t − o)xi
5 until all training examples are correctly classified;
6 return w

• Why would this strategy converge?


• Weight does not change when classification is correct
• If perceptron outputs -1 when target is +1: weight increases ↑
• If perceptron outputs +1 when target is -1: weight decreases ↓

Convergence of the perceptron training rule is subject to linear separability of the training examples and an appropriate η
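A minimal NumPy sketch of this training rule (not the slides' own code), applied to the green/red data from the earlier slide with green = +1 and red = −1; the learned weights are one of many valid separating boundaries.

import numpy as np

def perceptron_train(X, t, eta=0.1, max_epochs=100):
    Xb = np.hstack([np.ones((len(X), 1)), X])       # prepend bias input x0 = +1
    w = np.random.default_rng(0).normal(size=Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(Xb, t):
            o = 1 if w @ x > 0 else -1
            if o != target:                          # update only on misclassified examples
                w += eta * (target - o) * x
                errors += 1
        if errors == 0:                              # all training examples correctly classified
            break
    return w

X = np.array([[1, 9], [10, 9], [4, 7], [4, 5], [5, 3], [8, 9], [4, 2],
              [2, 5], [7, 1], [2, 10], [8, 5], [1, 2], [8, 2]], dtype=float)
t = np.array([1, 1, 1, -1, -1, 1, -1, -1, -1, 1, 1, -1, -1])
print(perceptron_train(X, t))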
An Example
Consider a perceptron with output 0/1 as shown, with inputs x0 = +1, x1, x2, weights w0 = −30, w1 = +20, w2 = +20, and output σ(wᵀx):

x1 x2 | Output
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1

• Represents a linear decision boundary in the x1–x2 plane (+ve region on one side, −ve region on the other)
• This perceptron computes logical AND
• w0 = −10 gives logical OR
• w0 = 10, w1 = −20 with a single input gives logical NOT
• XOR is not possible
An Example
Design a perceptron for:

x1 x2 | Classification
0  0  | 0
0  1  | 0
1  0  | 1
1  1  | 0

Assume a perceptron with bias input +1 and weights w0, w1, w2, output σ(wᵀx). We have the following four equations:
w0 + w1 × (0) + w2 × (0) < 0   (1)
w0 + w1 × (0) + w2 × (1) < 0   (2)
w0 + w1 × (1) + w2 × (0) ≥ 0   (3)
w0 + w1 × (1) + w2 × (1) < 0   (4)

By (1), w0 < 0, so let w0 = −1
By (2), w0 + w2 < 0, so let w2 = −1
By (3), w0 + w1 ≥ 0, so let w1 = 1.5
By (4), w0 + w1 + w2 < 0, which holds (−1 + 1.5 − 1 = −0.5 < 0)
So (w0, w1, w2) = (−1, 1.5, −1). Other possibilities are also there.
Perceptron Training (delta rule)

• Delta rule converges to a best-fit approximation of the target


• Uses gradient descent
• Consider an un-thresholded perceptron, i.e., output o = w · x
• Training error is defined as E(w) = ½ Σ_{d∈D} (td − od)²
• The gradient ∇E(w) specifies the direction of steepest increase of the error
• Weights can be learned as w ← w + Δw, with Δw = −η ∇E(w)
• It can be seen that ∂E/∂wi = −Σ_{d∈D} (td − od) xid, so Δwi = η Σ_{d∈D} (td − od) xid

Perceptron Training (delta rule)

Algorithm: Gradient Descent (D,η)


1 Initialize wi with random weights
2 repeat
3 For each wi , initialize δw i = 0
4 for each training example d ∈D do
5 Compute output o using model for d whose target is t
6 For each wi , update δw i = δw i + η(t−o)xi
7 For each wi , set wi = wi + δw i
8 until termination condition is met;
9 return w

• A data item d ∈ D is multidimensional: d = (x1, x2, ..., xn, t)


• Algorithm converges toward the minimum error hypothesis.
• Linear programming can also be an approach
Neural Network
When neurons are interconnected in layers

Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer

• Number of layers may differ


• Nodes in each intermediate layers may also differ
• Multiple output neurons are used for different class
• Two levels deep NN can represent any Boolean function
Nonlinear Decision Boundary
Effect of number of hidden nodes

1
https://github.com/dennybritz/nn-from-scratch/blob/master/nn-from-scratch.ipynb
Neuron
Neuron uses nonlinear activation functions (sigmoid, tanh, ReLU, softplus etc.)
at the place of thresholding

(Figure: neuron with bias input +1 and inputs x1, x2, weights w0, w1, w2, computing act(wᵀx); plots of sigmoid(x), tanh(x), and softplus(x))
Neural Network Applicability

• NN is appropriate for problems with the following characteristics:

• Instances are provided by many attribute-value pairs (more data)


• The target function output may be discrete-valued, real-valued, or a vector of several real or
discrete valued attributes
• The training examples may contain errors

• Long training times are acceptable


• Fast evaluation of the target function may be required
• The ability of humans to understand the learned target function is not important
Visualizing Neural Networks

• http://neuralnetworksanddeeplearning.com/chap4.html
Backpropagation
Session Content

• Backpropagation Algorithm
• Behavior of Backpropagation
Learning in NN: Backpropagation

• Computation of weights for hidden nodes is not trivial because it is difficult to assess their error
term (partial derivative) without knowing what their output values should be

• Back propagation: Two phases in each iteration of the algorithm: the forward phase and the
backward phase
• During the forward phase, the weights obtained from the previous iteration are used to compute the
output value of each neuron in the network
• The computation progresses in the forward direction; i.e., outputs of the neurons at level k are computed
prior to computing the outputs at level k + 1
• During the backward phase, the weight update formula is applied in the reverse direction
• In other words, the weights at level k + 1 are updated before the weights at level k are updated
• This back-propagation approach allows us to use the errors for neurons at layer k + 1 to estimate the errors for neurons at layer k
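A minimal NumPy sketch of one forward phase and one backward phase for a two-layer sigmoid network trained with squared error (not the slides' own code; the toy data and layer sizes are illustrative).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # toy inputs
t = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]     # toy non-linear target

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)          # input -> hidden
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)          # hidden -> output
eta = 0.5

for epoch in range(2000):
    # forward phase: compute layer outputs in order (level k before level k+1)
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward phase: propagate error terms from the output layer back to the hidden layer
    delta_out = (y - t) * y * (1 - y)                  # output error term (squared error + sigmoid)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)       # hidden error term estimated from the layer above
    W2 -= eta * h.T @ delta_out / len(X); b2 -= eta * delta_out.mean(axis=0)
    W1 -= eta * X.T @ delta_hid / len(X); b1 -= eta * delta_hid.mean(axis=0)

print(((y > 0.5).astype(float) == t).mean())           # training accuracy from the last forward pass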
Incremental (Stochastic) vs Batch Gradient Descent

Batch gradient descent:


• Computes the gradient using the whole
dataset – run through all samples in the
training set to find the delta
• Slow in very large datasets, because we
have to run through the entire training set
to update the values in every iteration

Stochastic gradient descent:


• Computes the gradient using a single
sample
• Update the parameters according to the
gradient of the error with respect to that
single training example only
• Computationally faster – we use only one
training sample at a time, and it starts
making progress right from the first
sample
Introduction to Deep Learning

Dr. Sugata Ghosal


Session Content

• General machine learning pipeline


• Characteristics of Deep Learning
• Error Surface
• Hyperparameters
General ML Pipeline

input → Quality Check → Enhancement → Feature Extraction → Matching (against a Database) → Decision

Deep Learning can provide end-to-end learning


Key Points
DL is constructing network of parameterized functional modules and training them from
examples using gradient based optimization – Yann Lecun

• The traditional way of hand-engineering features is brittle and not scalable
• Why now? Because we have data (ImageNet …), compute power (GPUs) and software (TensorFlow)
• DL learns underlying features/representations directly from data
• DL has been found useful for many real-life problems and is capable of solving very complex problems
• As the data increases, DL algorithms become better
• Deep learning is NOT just adding more layers to a neural network
• Deep learning is NOT just adding more layers to a neural network.
DL Builds Hierarchy of Features
Higher (or deeper) layers represents abstraction of the increasing complex features
Hyperparameter Tuning
There are many interesting questions for the network
• How many layers?
• How many units in each layer?
• What should be the learning rate?
• What is right activation function? etc.
Difficult to answer in the beginning
• It is an iterative process

Demo
https://playground.tensorflow.org/
Neural Networks II
Convolutional Neural Networks

Dr. Sugata Ghosal


CSIS Off Campus Faculty
BITS Pilani
Topics

• Introduction to Convolutional Networks


• Fully Connected Layer
• Convolution Layer
• Edge Detector
Convolutional Neural Networks (CNN)
• Special kind of neural network for processing data that has
a known, grid-like topology.
• e.g., Time-series data – 1D grid taking samples at
regular time intervals.
• e.g., Image data – 2D grid of pixels.
• Network employs a mathematical operation called
convolution.
• Convolutional networks are simply neural networks that use
convolution in place of general matrix multiplication in some
of their layers.
Convolutional Neural Networks (CNN)
Example – Recognizing 7
Example – Recognizing 7
More filters
As we go deeper into the layers
As we go deeper into the layers
Types of Layers

• Convolution (Conv) Layer


• Pooling (Pool) Layer
• Fully Connected (FC) Layer
Types of Layers
Convolution (Conv) Layer
• Performs the convolution operation
(Figure: input of size nH × nW × nC convolved with a 5 × 5 × 3 kernel to produce the Conv 1 output of size nH(1) × nW(1) × nC(1))
Convolution Operation

• Suppose we want to detect edges from an image.


• The image will have vertical edges and horizontal edges.
• Convolve the image with a kernel or filter to extract both
the vertical edges and horizontal edges.

Input Size : n ×n
Kernel Size : f ×f
Output Size : (n − f + 1) × (n − f + 1)
Convolution Operation

Input Image 6 × 6
3 0 1 2 7 4 Output Image 4 × 4
Kernel 3 × 3
1 5 8 9 3 1
1 0 -1
2 7 2 5 1 3 * =
1 0 -1
0 1 3 1 7 8
1 0 -1
4 2 1 6 2 8
2 4 5 2 3 9
Convolution Operation

Input Image 6 × 6
3 0 1 2 7 4 Output Image 4 × 4
Kernel 3 × 3
1 5 8 9 3 1
1 0 -1
2 7 2 5 1 3 * =
1 0 -1
0 1 3 1 7 8
1 0 -1
4 2 1 6 2 8
2 4 5 2 3 9

value = 3(1) + 1(1) + 2(1) + 0(0) + 0(5) + 7(0) + 1(−1) + 8(−1) + 2(−1)
= −5
Convolution Operation

Input Image 6 × 6
3 0 1 2 7 4 Output Image 4 × 4
Kernel 3 × 3
1 5 8 9 3 1 -5
1 0 -1
2 7 2 5 1 3 * =
1 0 -1
0 1 3 1 7 8
1 0 -1
4 2 1 6 2 8
2 4 5 2 3 9

value = 0(1) + 5(1) + 7(1) + 1(0) + 8(0) + 5(0) + 2(−1) + 9(−1) + 5(−1)
= −4
Convolution Operation

Input Image 6 × 6
3 0 1 2 7 4 Output Image 4 × 4
Kernel 3 × 3
1 5 8 9 3 1 -5 -4 0 8
1 0 -1
2 7 2 5 1 3 * = -10 -2 -2 3
1 0 -1
0 1 3 1 7 8 0 -2 -4 -7
1 0 -1
4 2 1 6 2 8 -3 -2 -3 -16
2 4 5 2 3 9

value = 1(1) + 6(1) + 2(1) + 7(0) + 2(5) + 3(0) + 8(−1) + 8(−1) + 9(−1)
= −16
Vertical Edge Detector

Input Image 6 × 6
10 10 10 0 0 0 Output 4× 4
Kernel 3 × 3
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 * = 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
1 0 -1
10 10 10 0 0 0 0 30 30 0
10 10 10 0 0 0
Horizontal Edge Detector

Input Image 6 × 6
10 10 10 10 10 10 Output 4× 4
Kernel 3 × 3
10 10 10 10 10 10 0 0 0 0
1 1 1
10 10 10 10 10 10 * = 30 30 30 30
0 0 0
0 0 0 0 0 0 30 30 30 30
-1 -1 -1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
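A small NumPy sketch of the "valid" convolution used in these examples (not the slides' own code). Run on the vertical-edge input above, it reproduces the 4 × 4 output with the band of 30s at the edge.

import numpy as np

def conv2d_valid(image, kernel):
    # No padding, stride 1: output is (n - f + 1) x (n - f + 1)
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)   # element-wise product, then sum
    return out

image = np.array([[10, 10, 10, 0, 0, 0]] * 6)        # left half bright, right half dark
vertical_kernel = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])
print(conv2d_valid(image, vertical_kernel))           # each row: [0, 30, 30, 0]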
Edge Detectors
Similarly, edges at 45°, 60°, and other orientations can be detected by appropriately choosing the kernels or filters.
Then stack the different filters to obtain edges.
The aim of the convolution layer is to learn the filter or kernel matrix.
The NN will learn the 9 weights in the filter.

Filter 3 ×3
w1 w2 w3
w4 w5 w6
w7 w8 w9
Fully Connected (FC) Layer
• A stack of neurons in one layer is connected to every neuron in the next layer.

FC1 FC2 LR

yˆ1
Neural Networks II
Pa d d i n g , S t r i d i n g , Po o l i n g

Dr. Sugata Ghosal


CSIS Off Campus Faculty
BITS Pilani
Topics

• Padding
• Striding
• Pooling
• Max Pooling
• Average Pooling
Convolution Operation

Input Image 6 × 6
3 0 1 2 7 4 Output Image 4 × 4
Kernel 3 × 3
1 5 8 9 3 1 -5 -4 0 8
1 0 -1
2 7 2 5 1 3 * = -10 -2 -2 3
1 0 -1
0 1 3 1 7 8 0 -2 -4 -7
1 0 -1
4 2 1 6 2 8 -3 -2 -3 -16
2 4 5 2 3 9
Padding

When the filter is convolved with the input image:
• The image shrinks – the 6 × 6 image became a 4 × 4 image.
• Pixels in the corners are used only once compared with pixels in the middle, so data at the corners is effectively thrown away.
To resolve this, apply padding to the input image before convolution. Padding means appending zeros around the boundary of the input.
Convolution with Padding

Input Image with padding 8 × 8


0 0 0 0 0 0 0 0
0 3 0 1 2 7 4 0
Kernel 3 × 3
0 1 5 8 9 3 1 0
1 0 -1
0 2 7 2 5 1 3 0 *
1 0 -1
0 0 1 3 1 7 8 0
1 0 -1
0 4 2 1 6 2 8 0
0 2 4 5 2 3 9 0
0 0 0 0 0 0 0 0
Padding

Two kinds of padding:
• Valid padding
  – Padding is not applied: p = 0
  – Input size ≠ output size; the output shrinks compared to the input
• Same padding
  – Padding is applied: p = (f − 1)/2
  – Input size = output size; the size of the input and output is maintained
Padding
How much to pad?
Padding : p = (f − 1)/2
Kernel 3 × 3 : p = 1
Kernel 5 × 5 : p = 2
Kernel 7 × 7 : p = 3

Padding preserves the input size.

Input Size : n × n
Padding : p
Kernel Size : f × f
Output Size : (n + 2p − f + 1) × (n + 2p − f + 1)
Striding

• Skip by how many pixels.

Input Size : n × n
Padding : p
Stride : s
Kernel Size : f ×f
Output Size : ⌊(n + 2p − f)/s + 1⌋ × ⌊(n + 2p − f)/s + 1⌋

• The filter must lie completely inside the input.
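A tiny helper (an illustrative sketch, not library code) that evaluates this output-size formula; the floor reflects the requirement that the filter lie completely inside the input.

def conv_output_size(n, f, p=0, s=1):
    # number of positions along one dimension for an n x n input, f x f kernel,
    # padding p and stride s
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))          # 4  (valid convolution, stride 1)
print(conv_output_size(7, 3, p=0, s=2))  # 3  (the strided example below)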


Convolution with Stride

Input Image 7 × 7
2 4 7 4 6 2 9
Kernel 3 × 3 Output 3 × 3
6 6 9 8 7 4 3
Stride = 2 91
3 4 8 3 8 9 7
* 3 4 4 =
7 8 3 6 6 3 4
1 0 2
4 2 1 8 2 4 6
-1 0 3
3 2 4 1 9 8 3
0 1 3 9 2 1 4
Convolution with Stride

Input Image 7 × 7
2 4 7 4 6 2 9
Kernel 3 × 3 Output 3 × 3
6 6 9 8 7 4 3
Stride = 2 91 100
3 4 8 3 8 9 7
* 3 4 4 =
7 8 3 6 6 3 4
1 0 2
4 2 1 8 2 4 6
-1 0 3
3 2 4 1 9 8 3
0 1 3 9 2 1 4
Convolution with Stride

Input Image 7 × 7
2 4 7 4 6 2 9
Kernel 3 × 3
6 6 9 8 7 4 3 Output 3 × 3
Stride = 2
3 4 8 3 8 9 7 91 100 83
* 3 4 4 =
7 8 3 6 6 3 4 69 91 127
1 0 2
4 2 1 8 2 4 6 44 72 74
-1 0 3
3 2 4 1 9 8 3
0 1 3 9 2 1 4
One Layer C N N

Kernel 1 Relu W +
3 ×3 ×3 b

Image Conv
6 ×6 ×3 4 ×4 ×2

Kernel 2 Relu W +
3 ×3 ×3 b
Number of Parameters

• The number of parameters depend on the number of kernels


and the kernel size.

Kernel : f × f × nC
Bias : 1
# Parameters : f ∗ f ∗ nC + 1 per kernel
Example : 3 ∗ 3 ∗ 3 + 1 = 28 per kernel
2 kernels : 28 ∗ 2 = 56
Pooling (Pool) Layer

• Used for subsampling


• Reduce the size of representation
• Speed up computation
Pooling
• Apply a filter of size f and stride s without padding.
• The output dimensions will be
, , , ,
n −f n −f
+1 × +1
s s

• Fixed computation
• Gradient descent is not applied as no learnable parameters
• Pooling is applied on each of the channels.
Pooling Layer

• Two types
• Max Pooling
• Average Pooling
Max Pooling
• Take the maximum of the sub region that is considered.

Input 4 × 4
Output
2 4 7 4
Max pooling 2 ×2
6 6 9 8 → f =2 →
s =2 6 9
3 4 8 3
7 8
7 5 3 6
Average Pooling
• Compute the average of the sub region that is considered.

Input 4 × 4
Output
2 4 7 4
Average pooling 2 ×2
6 6 9 8 → f= 2 →
s =2 4.5 7
3 4 8 3
4.75 5
7 5 3 6
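A short NumPy sketch of pooling (not the slides' own code); on the 4 × 4 input above it reproduces the max-pooling output [[6, 9], [7, 8]] and the average-pooling output [[4.5, 7.0], [4.75, 5.0]].

import numpy as np

def pool2d(x, f=2, s=2, mode='max'):
    # f x f pooling with stride s on a single-channel input, no padding, no learnable parameters
    out_dim = (x.shape[0] - f) // s + 1
    out = np.zeros((out_dim, out_dim))
    reduce = np.max if mode == 'max' else np.mean
    for i in range(out_dim):
        for j in range(out_dim):
            out[i, j] = reduce(x[i*s:i*s+f, j*s:j*s+f])
    return out

x = np.array([[2, 4, 7, 4],
              [6, 6, 9, 8],
              [3, 4, 8, 3],
              [7, 5, 3, 6]])
print(pool2d(x, mode='max'))
print(pool2d(x, mode='avg'))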
Neural Networks II
Architectures for Classification

Dr. Sugata Ghosal


CSIS Off Campus Faculty
BITS Pilani
Topics

• LeNet-5
• AlexNet
• VGG 16
LeNet-5
• Handwritten Character recognition 10-categories
classification problem
• Approx. 60 thousand parameters are learned.
AlexNet
• ImagetNet dataset
• 1000-categories classification problem
• Approx. 60 million parameters are learned.
VGG 16
• 16 Layers
• Approx. 138 million parameters are learned.
Neural Networks II
Re c u r r e n t N e u r a l N e t w o r k
Dr. Sugata Ghosal
CSIS Off Campus Faculty
BITS Pilani
Recurrent Neural Networks

time step t

y y<t>
Networks we used
previously: also called Recurrent Neural
h h<t>
feedforward neural Network (RNN)
networks

x x<t>

Recurrent edge
Overview
y<t> Single layer RNN y<t-1> y<t> y(t+1)

h<t> Unfold h<t-1> h<t> h(t+1)

x<t> x<t-1> x<t> x(t+1)

Multilayer RNN
y<t> y<t-1> y<t> y(t+1)

h<t> h<t-1> h<t> h(t+1)


2 2 2 2

Unfold
h<t> h<t-1> h<t> h(t+1)
1 1 1 1

x<t> x<t-1> x<t> x(t+1)


Applications

• Text classification
• Speech recognition (acoustic modeling)
• language translation
• ...

Stock market predictions

Displays the actual data and the predicted data from the four models for each stock index in Year Shen, Zhen, Wenzheng Bao, and De-Shuang Huang. "Recurrent
1 from 2010.10.01 to 2011.09.30. Neural Network for Predicting Transcription Factor Binding Sites."
Scientific reports 8, no. 1 (2018): 15270.

DNA or (amino acid/protein)


sequence modeling
Machine Learning - Deployment

Anita Ramachandran
March 2024



Agenda

• Model engineering
• Model standardization
• Model compression
• Recent topics



Model Engineering

• ML/AI is rapidly adopted by new applications and industries!

• Goal of a machine learning project is to build a statistical model by using


collected data and applying machine learning algorithms
o Yet building successful ML-based software projects is still difficult

• Every ML-based software needs to manage three main assets:


o Data
o Model
o and Code

• Essential technical methodologies - involved in the development of the


Machine Learning-based software
o Data Engineering: data acquisition & data preparation,
o ML Model Engineering: ML model training & serving, and
o Code Engineering : integrating ML model into the final product.



Model Engineering

• Phase of writing and executing machine learning algorithms to obtain an ML model

• The Model Engineering pipeline includes a number of operations that lead to a final
model:
– Model Training
o The process of applying the machine learning algorithm on training data to train an ML model
o includes feature engineering and the hyperparameter tuning for the model training activity

– Model Evaluation
o Validating the trained model to ensure it meets original codified objectives before serving the ML
model in production to the end-user

– Model Testing
o Performing the final "Model Acceptance Test" by using the held-back test dataset

– Model Packaging
o The process of exporting the final ML model into a specific format
o which describes the model, in order to be consumed by the business application



Model Deployment

• Need to deploy model as part of a business application such as a mobile or desktop


application
o The ML models require various data points (feature vector) to produce predictions
o The final stage of the ML workflow is the integration of the previously engineered ML model into
existing software

• Includes the following operations:


– Model Serving
o The process of addressing the ML model artifact in a production environment

– Model Performance Monitoring


o The process of observing the ML model performance based on live and previously unseen data, such as
prediction or recommendation
o interested in ML-specific signals, such as prediction deviation from previous model performance
o signals might be used as triggers for model re-training

– Model Performance Logging


o Every inference request results in the log-record



Model Deployment

• Need for focus on deployment


• Options
– Deploying machine learning models as web services
• Persist models, serve them up using REST APIs (e.g. using
Flask)
– Deploying machine learning models for batch
prediction
– Deploying machine learning models on edge devices
as embedded models
• Reduced latency, reduced data bandwidth consumption
• Model compression
• Tflite
Ref: https://towardsdatascience.com/3-ways-to-deploy-machine-learning-models-in-production-cdba15b00e
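A hedged sketch of the first option above (serving a persisted model behind a REST API with Flask). It assumes a scikit-learn-style model saved as "model.joblib"; file names and the request format are illustrative.

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")          # load the persisted model once at start-up

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]         # expects {"features": [[...], ...]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

A client would then POST a JSON payload such as {"features": [[5.1, 3.5, 1.4, 0.2]]} to http://localhost:5000/predict and receive the prediction back as JSON.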



• ML deployment techniques
• Key components of model monitoring



Steps in Operationalizing ML
Models



Model Engineering

• Machine learning pipelines vs classical software


engineering pipelines
• MLOps
– Comparison with DevOps
– MLOps tools and pipelines
– Pipeline orchestration
• AutoML
– Hyperparameter tuning
– AutoML frameworks



ML Workflows

• ML architecture patterns
• Model serving patterns



Model Serving Strategies

• Common ways for wrapping trained models as


deployable services, namely deploying ML
models as
o Docker Containers to Cloud Instances
o Serverless Functions



Agenda

• Model engineering
• Model standardization
• Model compression
• Recent topics



Model standardization

• Why is model serving difficult?


– Data scientists and software engineers: two groups with different
goals
– Tool proliferation
• Model representation standards
– Similar to WSDL in the web server world
– PMML and PFA from the Data Mining group
• Support from development libraries
– TensorFlow allows to export models in text & binary for transferring
models between machine learning and model serving
– ONNX
• An open format to represent ML models
• Allows using one tool stack for model training and another for inferencing
• Enables framework interoperability to prevent framework lock-in



Agenda

• Model engineering
• Model standardization
• Model compression
• Recent topics



Model Compression

• Resource constrained environments


• Resource requirements for hosting ML models
• Model compression techniques
– Parameter pruning and quantization, low rank
factorization, transferred/compact convolutional
filters, knowledge distillation
• TinyML
– TFLite, GLOW, PyTorch Mobile, Edge Impulse, AIfES
• Architectural considerations
– Tradeoffs – latency, accuracy, compute requirements,
power consumption
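A hedged sketch related to the TFLite tooling mentioned in the TinyML bullet above: converting a Keras model to TensorFlow Lite with default optimizations (post-training quantization). It assumes TensorFlow is installed; the toy model and file name are illustrative, and exact APIs may vary slightly across TF/Keras versions.

import tensorflow as tf

# a stand-in Keras model; in practice you would use your trained model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:                  # compact artifact for edge deployment
    f.write(tflite_model)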



Agenda

• Model engineering
• Model standardization
• Model compression
• Recent topics
– Data Privacy for Machine Learning
– Distributed Machine Learning
– Federated Machine Learning



Data Privacy for Machine Learning

• Data privacy issues


– What is private and what is not; Who do we want to keep data private
from; Who are the trusted parties
– Protection of Personally identifying information (PII), sensitive data,
quasi-identifying data
– Should users who provide less data receive less accurate predictions
than the users who contribute more data
• Methods to help increase privacy
– Differential Privacy (Local DP, Global DP)
• Federated learning system
• Private Aggregation of Teacher Ensembles (PATE) approach
• TensorFlow Privacy
– Encrypted Machine Learning
• Homomorphic encryption (HE) and Secure Multiparty Computation (SMPC)



Agenda

• Model engineering
• Model standardization
• Model compression
• Recent topics
– Data Privacy for Machine Learning
– Distributed Machine Learning
– Federated Machine Learning



Distributed Machine Learning

• Multinode machine learning algorithms and


systems that are designed to improve
performance, increase accuracy, and scale to
larger input data sizes
• Data parallelism
– Data is split across devices; each node trains on a
subset of data
• Model parallelism
– Model is split across devices; each node trains on the
complete dataset
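
A toy, single-process sketch of data parallelism: each simulated worker computes a gradient on its own data shard, and the averaged gradient updates the shared model. Real systems do this with an all-reduce across devices (e.g. torch.distributed or Horovod); the linear model below is only for illustration.

# Toy sketch of data-parallel gradient averaging (pure NumPy, single process)
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
shards = np.array_split(np.arange(len(X)), 4)    # split the data across 4 "workers"

for step in range(200):
    # each worker computes a gradient on its own shard of the data...
    grads = [2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx) for idx in shards]
    # ...and the averaged gradient updates the shared model (an all-reduce in real systems)
    w -= 0.05 * np.mean(grads, axis=0)

print(np.round(w, 2))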

BITS Pilani, Pilani Campus


Agenda

• Model engineering
• Model standardization
• Model compression
• Recent topics
– Data Privacy for Machine Learning
– Distributed Machine Learning
– Federated Machine Learning

BITS Pilani, Pilani Campus


Federated Machine Learning

• A protocol where the training of a machine learning model is distributed across many different devices and the trained model is combined on a central server
• Model training
• Secure aggregation of weights into the central
model
• DP limits the amount of information that each
user can contribute to the final model
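
A toy sketch of the federated averaging idea: clients train locally on private data, and the server aggregates only the model weights. The linear model, client data, and number of rounds are simplified assumptions for illustration.

# Toy sketch of FedAvg-style weight averaging
import numpy as np

def local_train(w, X, y, lr=0.1, epochs=5):
    # each client refines the global linear model on its own private data
    for _ in range(epochs):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(1)
clients = []
for _ in range(3):                                   # 3 simulated devices
    X = rng.normal(size=(50, 2))
    y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(2)
for round_ in range(10):                             # communication rounds
    local_ws = [local_train(global_w, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    # server aggregates: weighted average of client weights (raw data never leaves the device)
    global_w = np.average(local_ws, axis=0, weights=sizes)

print(np.round(global_w, 2))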

BITS Pilani, Pilani Campus


Federated Machine Learning

Source: https://viso.ai/deep-learning/federated-learning/

• Challenges
– Not all data sources may have collected new data between one training run and the next
– Not all mobile devices are powered on all the time
– Performance issues on devices that train an FL model
– Requires frequent communication across the nodes in the federated network
– Workflows
BITS Pilani, Pilani Campus
Federated Learning vs Distributed
Learning

• FL does not allow direct raw data communication; DL does not have any such restriction.
• FL employs the distributed computing resources of multiple regions or organisations. DL utilises a single server or a cluster in a single region, which belongs to a single organisation.
• FL generally takes advantage of encryption or other defence techniques to safeguard the confidentiality and security of the raw data. There is less focus on this in DL.

BITS Pilani, Pilani Campus


Applications

• TinyML
– Industrial Predictive Maintenance, Healthcare,
Agriculture, Ocean life conservation
• Federated Learning
– Useful in the context of mobile phones with
distributed data, or a user’s browser
– Sharing of sensitive data that is distributed across
multiple data owners
– An example of FL in production is Google’s Gboard
keyboard for Android mobile phones

BITS Pilani, Pilani Campus


FAccT Machine Learning

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Fairness in Machine Learning

• What to do to ensure gender and ethnic fairness in ML models?
Accountability

• Who takes responsibility for failed ML models?
Transparency

• What to do to make ML models transparent and comply with regulations?
Privacy Issues

• How to protect user privacy when exposing data to ML models?
Security Issues

● How do we defend ML models against data poisoning?
FAccT Overview

FAccT spans Fairness, Accountability, Transparency, Privacy and Robustness, drawing on Psychology, Social Science, Public Policy, Statistics, Theory, and Machine Learning.
Fairness and Bias

Dr. Sugata Ghosal


CSIS Off-Campus Faculty
BITS Pilani
Topics

• Sources of Bias
• Real World Examples
• Sensitive Features
• Fairness Criteria
Fairness

• What is Fairness?
• The absence of bias towards an individual or a group (Mehrabi et al, 2019)
• Can ML Models Discriminate?
• Don't machines just follow humans' instructions?
• ML models approximate patterns in the data
• They learn, and can amplify, biases at the same time
Model Sensitivity to Data

Legend: red - biased regression; dashed green - regression for each subgroup; solid green - unbiased regression
Simpson’s Paradox

https://en.wikipedia.org/wiki/Simpson%27s_paradox
Real World Example
Commercial risk assessment software known as COMPAS

• Has assessed more than 1 million offenders since 2000
• Predicts a defendant's risk of committing a misdemeanor or felony
• Uses 137 features

Dressel et al, 2018


Bias in Historical Data

Gard et al, 2018


Fairness Through Unawareness
An ML algorithm achieves fairness through unawareness if
• None of the sensitive features are directly used in the model

Race and Ethnicity | Skills     | Years of Exp | Hired?
Hispanic           | Javascript | 1            | no
Hispanic           | C++        | 5            | yes
White              | Java       | 2            | yes
White              | C++        | 3            | yes

Training → Fair ML Model
Indirect Evidence
Sensitive features may still be used
• Inferred from indirect evidence

Race and Ethnicity | Skills     | Years of Exp | Often Goes to Mexican Markets | Hiring Decision
Hispanic           | Javascript | 1            | yes                           | no
Hispanic           | C++        | 5            | yes                           | yes
White              | Java       | 2            | no                            | yes
White              | C++        | 3            | no                            | yes

Training → Discriminatory ML Model
Common Fairness Criteria

• Demographic Parity
• Equality of Odds
• Equality of Opportunity
Demographic Parity
• Demographic parity is applied to a group of samples
• Does not require features to be masked out

• A predictor Ŷ satisfies demographic parity if
• The probabilities of positive predictions are the same regardless of whether the group is protected
• Protected groups are identified as A = 1
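In symbols, a standard formulation (using A = 1 for the protected group, as above): P(Ŷ = 1 | A = 1) = P(Ŷ = 1 | A = 0)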
Equality of Odds

• Equal probabilities for both qualified and unqualified people across protected groups
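In symbols, a standard formulation: P(Ŷ = 1 | A = 1, Y = y) = P(Ŷ = 1 | A = 0, Y = y) for both y = 1 (qualified) and y = 0 (unqualified)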
Equality of Opportunity

• Equal Probabilities for Qualified People Across Protected Groups
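In symbols, a standard formulation (only the qualified, Y = 1, group is constrained): P(Ŷ = 1 | A = 1, Y = 1) = P(Ŷ = 1 | A = 0, Y = 1)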


Practice Question
Find the fairness criteria that Ŷ1 and Ŷ2 satisfy
• A = {race}, Y = {Hiring Decision}

Race and Ethnicity | Skill      | Years of Exp | Goes to Mexican Markets? | Hiring Decision Y | Ŷ1 | Ŷ2
Hispanic           | Javascript | 1            | yes                      | no                | 0  | 1
Hispanic           | C++        | 5            | yes                      | yes               | 1  | 1
Hispanic           | Python     | 1            | no                       | yes               | 1  | 0
White              | Java       | 2            | no                       | yes               | 0  | 0
White              | C++        | 3            | no                       | yes               | 1  | 1
White              | C++        | 0            | no                       | no                | 1  | 0
Demographic Parity for Predictor Ŷ1 (refer to the table in the Practice Question)

● P(Ŷ1 = 1 | R = H) = 2/3
● P(Ŷ1 = 1 | R = W) = 2/3
→ ✔ Demographic Parity is satisfied

Equality of Opportunity/Odds for Predictor Ŷ1

● P(Ŷ1 = 1 | R = H, Y = yes) = 1
● P(Ŷ1 = 1 | R = W, Y = yes) = 0.5
→ ✗ Equality of Opportunity is not satisfied
● P(Ŷ1 = 1 | R = H, Y = no) = 0
● P(Ŷ1 = 1 | R = W, Y = no) = 1
→ ✗ Equality of Odds is not satisfied
Demographic Parity for Predictor Ŷ2

● P(Ŷ2 = 1 | R = H) = 2/3
● P(Ŷ2 = 1 | R = W) = 1/3
→ ✗ Demographic Parity is not satisfied

Equality of Opportunity/Odds for Predictor Ŷ2

● P(Ŷ2 = 1 | R = H, Y = yes) = 1/2
● P(Ŷ2 = 1 | R = W, Y = yes) = 1/2
→ ✔ Equality of Opportunity is satisfied
● P(Ŷ2 = 1 | R = H, Y = no) = 1
● P(Ŷ2 = 1 | R = W, Y = no) = 0
→ ✗ Equality of Odds is not satisfied
Summary of Fairness Criteria

Fairness Criteria     | Criteria                                                              | Group | Individual
Unawareness           | Excludes A in predictions                                             |       | ✓
Demographic Parity    | Equal positive prediction rates across groups                         | ✓     |
Equalized Odds        | Equal positive rates for qualified and unqualified members across groups | ✓ |
Equalized Opportunity | Equal positive rates for qualified members across groups              | ✓     |
Recap

● Major Fairness Criteria
○ Fairness Through Unawareness
■ The sensitive feature A is excluded when training ML models
○ Demographic Parity
■ Probabilities of distributing the favorable outcome across groups are the same
○ Equal Opportunity
■ Probabilities of distributing the favorable outcome to the qualified members across groups are the same
○ Equal Odds
■ Probabilities of distributing the favorable outcome to both qualified and unqualified members across groups are the same
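
For a quick numerical check of the practice-question answers, a small pandas sketch (data copied from the table in the Practice Question; column names are chosen for the example):

# Re-check demographic parity and equality of odds/opportunity from the table
import pandas as pd

df = pd.DataFrame({
    "race":  ["Hispanic", "Hispanic", "Hispanic", "White", "White", "White"],
    "Y":     [0, 1, 1, 1, 1, 0],          # hiring decision (yes = 1, no = 0)
    "Yhat1": [0, 1, 1, 0, 1, 1],
    "Yhat2": [1, 1, 0, 0, 1, 0],
})

for pred in ["Yhat1", "Yhat2"]:
    print(pred)
    print(df.groupby("race")[pred].mean())          # demographic parity: P(Yhat = 1 | race)
    print(df.groupby(["race", "Y"])[pred].mean())   # odds/opportunity: P(Yhat = 1 | race, Y)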
Thank You!
Applied Machine Learning
Revision

Anita Ramachandran
Question 1
Solution 1
Question 2
Solution 2
Question 3
Solution 3
Question 4
Vijay is a certified Data Scientist and he has applied to two companies - Google and Microsoft. He feels that he has a 60% chance of receiving an offer from Google and a 50% chance of receiving an offer from Microsoft. If he receives an offer from Microsoft, he believes there is an 80% chance of receiving an offer from Google.
• If Vijay receives an offer from Microsoft, what is the probability that
he will not receive an offer from Google?
• What are his chances of getting an offer from Microsoft, considering
he has an offer from Google?
Solution 4

Vijay is a certified Data Scientist and he has applied to two companies - Google and Microsoft. He feels that he has a 60% chance of receiving an offer from Google and a 50% chance of receiving an offer from Microsoft. If he receives an offer from Microsoft, he believes there is an 80% chance of receiving an offer from Google.
• If Vijay receives an offer from Microsoft, what is the probability that he will not receive an offer from Google?
• What are his chances of getting an offer from Microsoft, considering he has an offer from Google?
Ans
• P(G' | M) = 1 - P(G | M) = 1 - 0.8 = 0.2
• By Bayes' rule, P(M | G) = P(G | M) P(M) / P(G) = (0.8 × 0.5) / 0.6 ≈ 0.67
Question 5
• Solve the below and find the equation for the hyperplane using the linear Support Vector Machine method. Positive Points: {(3, 2), (4, 3), (2, 3), (3, -1)} Negative Points: {(1, 0), (1, -1), (0, 2), (-1, 2)}
Solution 5
• Solve the below and find the equation for the hyperplane using the linear Support Vector Machine method.

Positive Points: {(3, 2), (4, 3), (2, 3), (3, -1)}

Negative Points: {(1, 0), (1, -1), (0, 2), (-1, 2)}
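
A quick way to check the hand-worked result numerically (not necessarily the intended manual derivation) is to fit a near hard-margin linear SVM and read off the weight vector and intercept of the hyperplane w·x + b = 0:

# Numerical check of the linear SVM hyperplane for Question 5
import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 2], [4, 3], [2, 3], [3, -1],      # positive points
              [1, 0], [1, -1], [0, 2], [-1, 2]])    # negative points
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)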
Question 6
• Use the kernel trick and find the equation for the hyperplane using a nonlinear SVM. Positive Points: {(2,0), (3,0), (4,0)} Negative Points: {(0,0), (1,0), (5,0), (6,0)}
Solution 6
• Use the kernel trick and find the equation for the hyperplane using a nonlinear SVM. Positive Points: {(2,0), (3,0), (4,0)} Negative Points: {(0,0), (1,0), (5,0), (6,0)}
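
One standard way to see the separation (the original solution may use a different mapping): since all points lie on the x1 axis, map each point to z = (x1 - 3)^2. The positives {2, 3, 4} map to z values {1, 0, 1} and the negatives {0, 1, 5, 6} map to {9, 4, 4, 9}, so a threshold midway between 1 and 4, i.e. (x1 - 3)^2 = 2.5, separates the classes with maximum margin in the mapped space. A quick check:

# Verify the separation under the assumed quadratic feature map z = (x1 - 3)**2
pos = [2, 3, 4]
neg = [0, 1, 5, 6]
print([(x - 3) ** 2 for x in pos])   # [1, 0, 1]    -> all below the threshold 2.5
print([(x - 3) ** 2 for x in neg])   # [9, 4, 4, 9] -> all above the threshold 2.5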
Question 7
• Consider a 2-class classification problem in a 2-dimensional feature space x = [x1, x2] with target variable y = ±1. The training data comprises 7 samples as shown below: 4 black diamonds for the positive class and 3 gray diamonds for the negative class.
• Find the k-means cluster centers for the two (black diamond and gray diamond) classes.
• What is the equation for the linear decision boundary which is equidistant from the two cluster centers?
• What is the training error rate?
Solution 7
• Consider a 2-class classification problem in a 2-dimensional feature space x = [x1, x2] with target variable y = ±1. The training data comprises 7 samples as shown below: 4 black diamonds for the positive class and 3 gray diamonds for the negative class.
• Find the k-means cluster centers for the two (black diamond and gray diamond) classes.
• Black diamond class: (-1.0, 0.0)
• Gray diamond class: (8/3, 0.0)
• What is the equation for the linear decision boundary which is equidistant from the two cluster centers?
• x1 = (-1 + 8/3)/2 = 5/6
• What is the training error rate?
• 1/7 (the gray diamond point at (0,0) will be misclassified)
Question 8
Using K-means Clustering algorithm, create 4 clusters for the
following dataset.
• (2, 5), (2.5, 3), (3.5, 4), (5, 7.9), (8, 11.3), (11, 12), (12, 19), (13, 5), (7,
3), (11, 7), (15,15.1), (1, 2), (2, 20), (10, 10), (13, 1.1), (7, 9), (30, 42),
(18, 21), (55, 39), (32, 68), (30,30), (50, 50.1)
• Choose 4 random initial cluster centers as C1 = (2, 5), C2 = (11, 12), C3
= (18, 21), and C4 = (30, 30)
• Calculate the distance between two points a = (x1, y1) and b = (x2, y2) using the (Manhattan) distance function: P(a, b) = |x2 - x1| + |y2 - y1|
Solution 8 - Iteration 1
• Now, re-compute the new centers of 4 clusters. The new cluster center is computed by taking the mean of all the
data points contained in that cluster.
• For Center of Cluster - 1
• X = (2 + 2.5 + 3.5 + 5 + 7 + 1 + 2) / 7 = 3.28
• Y = (5 + 3 + 4 + 7.9 + 3 + 2 + 20) / 7 = 6.414
• Therefore, C1 = (3.28, 6.414)
• For Center of Cluster - 2
• X = (8 + 11 + 13 + 11 + 15 + 10 + 13 + 7) / 8 = 11
• Y = (11.3 + 12 + 5 + 7 + 15.1 + 10 + 1.1 + 9) / 8 = 8.8125
• Therefore, C2 = (11, 8.8125)
• For Center of Cluster - 3
• X = (12 + 18) / 2 = 15
• Y = (19 + 21) / 2 = 20
• Therefore, C3 = (15, 20)
• For Center of Cluster - 4
• X = (30 + 55 + 32 +30 + 50) / 5 = 39.4
• Y = (42 + 39 + 68 + 30 + 50.1) / 5 = 45.82
• Therefore C4 = (39.4, 45.82)
• This is the completion of Iteration 1.
Solution 8 - Iteration 2
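
The iterations can be reproduced (and Iteration 2 continued) with a small hand-rolled loop; this is a sketch, written by hand because scikit-learn's KMeans uses Euclidean rather than the Manhattan distance specified in the question:

# Reproduce the assignment/update steps of Question 8 with Manhattan (L1) distance
import numpy as np

points = np.array([(2, 5), (2.5, 3), (3.5, 4), (5, 7.9), (8, 11.3), (11, 12), (12, 19),
                   (13, 5), (7, 3), (11, 7), (15, 15.1), (1, 2), (2, 20), (10, 10),
                   (13, 1.1), (7, 9), (30, 42), (18, 21), (55, 39), (32, 68), (30, 30),
                   (50, 50.1)])
centers = np.array([(2, 5), (11, 12), (18, 21), (30, 30)], dtype=float)

for iteration in range(2):
    # assignment step: nearest center under Manhattan distance
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # update step: new center = mean of the points in each cluster
    centers = np.array([points[labels == k].mean(axis=0) for k in range(4)])
    print(f"Iteration {iteration + 1}:", np.round(centers, 3))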
Question 9
• Suppose we train a model to predict whether an email is Spam or Not Spam.
After training the model, we apply it to a test set of 500 new email messages
(also labelled) and the model produces the contingency table below.
• [A] Compute the precision of this model with respect to the Spam class.
• [B] Compute the recall of this model with respect to the Not Spam class.

                           True Class
                           Spam   Not Spam
Predicted Class  Spam        70       30
                 Not Spam    70      330
Question 9
• Suppose we train a model to predict whether an email is Spam or Not Spam. After training the model, we apply it to a test set of 500 new email messages (also labelled) and the model produces the contingency table below.
• [A] Compute the precision of this model with respect to the Spam class.
• [B] Compute the recall of this model with respect to the Spam class.

                           True Class
                           Spam   Not Spam
Predicted Class  Spam        70       30
                 Not Spam    70      330

Answer:
[A] Precision with respect to Spam = # correctly predicted as Spam / # predicted as Spam = 70 / (70 + 30) = 70%
[B] Recall with respect to Spam = # correctly predicted as Spam / # true Spam = 70 / (70 + 70) = 50%
(Recall with respect to the Not Spam class, as asked in the first statement of the question, would be 330 / (330 + 30) ≈ 92%.)
WEBINAR 1 : ANALYSIS ON QUALITY OF WHITE
WINE

BITS Pilani
Pilani Campus
Introduction

● Dataset: The "Wine Quality - White" dataset is a collection of data


related to white wines.

● Purpose: This dataset is often used for quality prediction and


analysis.

● Importance: Wine quality prediction is essential for winemakers to


ensure the production of high-quality wines and meet consumer
preferences.

BITS Pilani, Pilani Campus


Data Overview

List of attributes:


1. Fixed Acidity
2. Volatile Acidity
3. Citric Acid
4. Residual Sugar
5. Chlorides
6. Free Sulfur Dioxide
7. Total Sulfur Dioxide
8. Density
9. pH
10.Sulphates
11.Alcohol
Target Variable: The target variable in this dataset is typically "Quality,"
which represents the quality rating of the wine. Quality ratings are often
integers between 3 and 9, where higher values indicate better quality.
BITS Pilani, Pilani Campus
5-Class problem

• 5 classes - [6, 5, 7, 8, 4] are considered.


• The dataframe is filtered to contain only these 5 classes.
• Decision Tree Classifier is used.
• The confusion matrix for the 5 classes is obtained.
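
A minimal sketch of this setup; the CSV path, separator, and train/test split are assumptions, while variable names follow the slides:

# 5-class filtering, decision tree training and confusion matrix
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

df = pd.read_csv("winequality-white.csv", sep=";")
filtered_df = df[df["quality"].isin([4, 5, 6, 7, 8])]   # keep only the 5 classes

X = filtered_df.drop(columns="quality")
y = filtered_df["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(confusion_matrix(y_test, dt_classifier.predict(X_test)))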

BITS Pilani, Pilani Campus


Confusion Matrix

BITS Pilani, Pilani Campus


Histogram with KDE

KDE is often used as a complement to histograms. While histograms bin the data into discrete intervals, KDE provides a continuous representation. If the KDE plot closely aligns with the histogram, it suggests a robust estimation.

sns.histplot(X[col], kde=True)
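
A runnable version of the snippet above, assuming filtered_df from the 5-class step:

import seaborn as sns
import matplotlib.pyplot as plt

X = filtered_df.drop(columns="quality")   # feature columns of the filtered dataframe
for col in X.columns:
    sns.histplot(X[col], kde=True)        # histogram with a KDE overlay
    plt.title(col)
    plt.show()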

BITS Pilani, Pilani Campus


QQ Plot
Normality / QQ Plot - A normal probability plot, also known as a Q-Q
(quantile-quantile) plot, is a graphical method for assessing the normality
of a distribution. It compares the quantiles of the observed data against
the quantiles of a theoretical normal distribution. In a Q-Q plot, if the
points approximately lie on a straight line, it suggests that the data is
approximately normally distributed.
stats.probplot(filtered_df[col], dist="norm", plot=plt)
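
Likewise, a runnable version with the required imports (again assuming filtered_df):

import matplotlib.pyplot as plt
from scipy import stats

for col in filtered_df.columns.drop("quality"):
    stats.probplot(filtered_df[col], dist="norm", plot=plt)   # Q-Q plot against a normal
    plt.title(col)
    plt.show()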

BITS Pilani, Pilani Campus


Bubble Plot
A bubble plot is a type of scatter plot where a third dimension of the data is shown
through the size of markers (bubbles) in addition to their x and y positions. It is
useful for visualizing relationships between three continuous variables. In a
bubble plot, the size of each bubble represents the value of a third variable.
sns.scatterplot(data=filtered_df, x='alcohol', y='residual sugar',
size='quality', legend=False, sizes=(20, 200))

BITS Pilani, Pilani Campus


Bubble Plot
hue='pH': Colors the bubbles based on the 'pH' column.

sns.scatterplot(data=filtered_df, x='alcohol', y='residual sugar', size='quality',


hue='pH', legend=None, sizes=(20, 200))

BITS Pilani, Pilani Campus


Decision Tree-Based Feature
Selection

Why Use Decision Trees for Feature Selection?


● Decision trees provide a clear and interpretable way to assess
feature importance.
● They rank features based on their contribution to splitting and
information gain.
● Decision trees can handle both categorical and numerical features.

BITS Pilani, Pilani Campus


Decision Tree-Based Feature
Selection

The Decision Tree Process:


● Decision trees are constructed by recursively splitting data based
on features.
● At each node, the feature with the highest information gain is
selected for splitting.
● The process continues until a stopping criterion is met.

BITS Pilani, Pilani Campus


Decision Tree-Based Feature
Selection

Feature Importance Scores:


● Decision trees assign importance scores to features.
● Features used near the top of the tree are typically more important.
● Importance scores reflect the contribution of each feature to the model's predictive power.
● Decision trees provide a feature importance score based on how much each feature reduces impurity across the splits in which it is used.

feature_importances = dt_classifier.feature_importances_
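
A compact end-to-end version of this step, assuming X and y are the wine features and quality labels from the earlier slides:

# Fit a decision tree and rank features by impurity-based importance
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X, y)
feature_importances = dt_classifier.feature_importances_

ranking = pd.Series(feature_importances, index=X.columns).sort_values(ascending=False)
print(ranking)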

BITS Pilani, Pilani Campus


Decision Tree-Based Feature
Selection
Feature Importance
Scores:

BITS Pilani, Pilani Campus


Genetic Algorithms for
Feature Selection

Why Use Genetic Algorithms for Feature Selection?


● GAs are suitable for combinatorial optimization problems like
feature selection.
● They explore a wide search space efficiently and find optimal or
near-optimal solutions.

BITS Pilani, Pilani Campus


Genetic Algorithms for
Feature Selection
The Genetic Algorithm Process: Genetic Algorithms mimic the
process of evolution:
• Initialization: Create an initial population of potential solutions
(chromosomes).
• Selection: Evaluate and select the fittest chromosomes based on a
fitness function.
• Crossover (Recombination): Combine pairs of selected
chromosomes to create offspring.
• Mutation: Introduce small random changes to the offspring.
• Replacement: Replace the old population with the new population of
offspring.
• Termination: Repeat the process for a specified number of
generations or until convergence.

BITS Pilani, Pilani Campus


Genetic Algorithms for
Feature Selection

Applying GAs to Feature Selection:

• In feature selection, each chromosome represents a subset of


features.
• The fitness function evaluates subsets based on their contribution to
model performance.
• GAs iteratively evolve better subsets over generations.

BITS Pilani, Pilani Campus


Genetic Algorithms for
Feature Selection

Benefits of Genetic Algorithm-Based Feature Selection:

• GAs handle a large feature space efficiently.


• They can uncover complex feature interactions.
• GAs are adaptable to different machine learning algorithms.

BITS Pilani, Pilani Campus


Genetic Algorithms for
Feature Selection
• The evaluate function is designed for feature selection using a
binary individual representing feature inclusion or exclusion.
• It creates a boolean mask based on the binary array. The boolean
mask selects the relevant features from the original feature matrix
(X). This step effectively filters out the non-selected features,
creating a subset of features based on the individual's gene
representation.
• Initializes a Random Forest classifier and evaluates its performance
using cross-validation with 5 folds.
• The fitness value returned is the mean accuracy of the classifier
across the folds, serving as the objective for optimization algorithms
seeking to find an optimal set of features for classification.
• The GA parameters are defined.

BITS Pilani, Pilani Campus


Genetic Algorithms for
Feature Selection
creator.create("FitnessMax", base.Fitness, weights=(1.0,)): creates a class named FitnessMax that represents the optimization goal; the weight of 1.0 means the fitness is to be maximized.

creator.create("Individual", list, fitness=creator.FitnessMax): creates


another class named Individual, which is essentially a list of binary
values representing features. Each individual has an associated fitness
attribute, and it's assigned the FitnessMax class.

toolbox = base.Toolbox(): The toolbox is a container for registering


functions that will be used in the genetic algorithm. It's a convenient
way to organize and reuse these functions.

toolbox.register("attr_bool", random.randint, 0, 1): generates random


integers (0 or 1).

toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=num_features): registers a function named individual that builds one candidate solution (described further on the next slide).
BITS Pilani, Pilani Campus
Genetic Algorithms for
Feature Selection
toolbox.register("individual", tools.initRepeat, creator.Individual,
toolbox.attr_bool, n=num_features): It initializes an Individual with
binary values using the attr_bool function num_features times.

toolbox.register("population", tools.initRepeat, list, toolbox.individual): It


initializes a population of individuals using the individual function.

The registered functions (evaluate, mate, mutate, and select) define the
key components of the genetic algorithm: how individuals are
evaluated, how they are combined (crossover), how they can change
(mutation), and how they are selected for the next generation. These
functions will be used by the genetic algorithm to evolve a population of
individuals toward a solution that optimizes the defined evaluation
criteria.

BITS Pilani, Pilani Campus


Genetic Algorithms for
Feature Selection
population = toolbox.population(n=population_size): Creates the initial
population of individuals with a specified size (population_size). Each
individual is a binary string representing a set of features.

hof = tools.HallOfFame(1): Creates the Hall of Fame, which is a


container to store the best individual found during the evolution. Here,
it's set to store the single best individual.

Runs the genetic algorithm using the eaSimple function from the
algorithms module.

The genetic algorithm iteratively evolves a population of individuals,


optimizing the feature selection based on the defined evaluation criteria
(fitness function). The final result is the best individual, representing the
most optimal set of features for the given classification task.

BITS Pilani, Pilani Campus
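
The DEAP pieces described across the slides above can be assembled into a single runnable sketch. This is only a sketch under assumptions: X is taken to be a NumPy feature matrix and y the labels, and the population size, operator choices (two-point crossover, bit-flip mutation, tournament selection), and GA parameters are illustrative values, since the slides only state that "the GA parameters are defined".

# End-to-end GA feature selection sketch with DEAP (illustrative parameters)
import random
import numpy as np
from deap import base, creator, tools, algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: X is a NumPy array (use X.values for a pandas DataFrame) and y the labels
num_features = X.shape[1]

def evaluate(individual):
    mask = np.array(individual, dtype=bool)          # binary gene -> feature mask
    if not mask.any():                               # guard: at least one feature selected
        return (0.0,)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    score = cross_val_score(clf, X[:, mask], y, cv=5).mean()
    return (score,)                                  # DEAP expects a tuple

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_bool, n=num_features)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)           # crossover
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)  # mutation
toolbox.register("select", tools.selTournament, tournsize=3)

population = toolbox.population(n=30)
hof = tools.HallOfFame(1)                            # keeps the best individual found
population, _ = algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2,
                                    ngen=10, halloffame=hof, verbose=False)

best_mask = np.array(hof[0], dtype=bool)
print("Selected feature indices:", np.where(best_mask)[0])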
