AML All Merged PDF Class 9 To 16
Decision Trees
• Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
• Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
• Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
• Test the attribute values of the sample against the decision tree
Decision Tree Construction:
Hunt’s Algorithm
Let Dt be the set of training records associated with node t, and let y = {y1, y2, ..., yc} be the set of class labels.
• If all records in Dt belong to the same class yt, then t is a leaf node labeled yt
• Otherwise, select an attribute test condition to partition Dt into smaller subsets, create a child node for each outcome, and apply the procedure recursively to each child
Decision Tree Construction:
Example
Design Decisions
• How should the training records be split?
Design Decisions - Splitting Methods
• Binary Attributes
• Nominal Attributes
• Ordinal Attributes
• Continuous Attributes
Design Decisions - Selecting the Best Split
• p(i|t): fraction of records associated with node t belonging to class i
• Best split is selected based on the degree of impurity of the child
nodes
• Class distribution (0,1) has high purity
• Class distribution (0.5,0.5) has the smallest purity (highest impurity)
For a node t with c classes:
• Gini(t) = 1 − Σ_{i=1}^{c} p(i|t)²
• Entropy(t) = − Σ_{i=1}^{c} p(i|t) log₂ p(i|t)
• Classification error(t) = 1 − max_i p(i|t)
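As an illustrative sketch (not from the slides), the three impurity measures can be computed directly from a node's class distribution; p below is assumed to be a list of class proportions at the node:

import numpy as np

def gini(p):
    # Gini(t) = 1 - sum_i p(i|t)^2
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy(t) = -sum_i p(i|t) log2 p(i|t), with 0 log 0 taken as 0
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def classification_error(p):
    # Error(t) = 1 - max_i p(i|t)
    return 1.0 - np.max(np.asarray(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]), classification_error([0.5, 0.5]))
# highest impurity: 0.5 1.0 0.5
print(gini([0.0, 1.0]), entropy([0.0, 1.0]), classification_error([0.0, 1.0]))
# pure node: 0.0 0.0 0.0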
Design Decisions - Information Gain
• In general, the different impurity measures are consistent with each other
• Gain of a test condition: compare the impurity of the parent node with the weighted impurity of the child nodes:

Δ = I(parent) − Σ_{j=1}^{k} [N(v_j)/N] · I(v_j)

• I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(v_j) is the number of records associated with the child node v_j
With attribute B:
Gini for N1 = 1 − (1/5)² − (4/5)² = 1 − 1/25 − 16/25 = 0.32
Gini for N2 = 1 − (5/7)² − (2/7)² = 1 − 25/49 − 4/49 ≈ 0.408
Weighted avg Gini = 0.32 × 5/12 + 0.408 × 7/12 ≈ 0.37
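A small sketch reproducing the arithmetic above; the class counts (1, 4) for N1 and (5, 2) for N2 are taken from the example:

def gini_from_counts(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

n1, n2 = [1, 4], [5, 2]                       # class counts in child nodes N1, N2
g1, g2 = gini_from_counts(n1), gini_from_counts(n2)
weighted = g1 * sum(n1) / 12 + g2 * sum(n2) / 12
print(round(g1, 3), round(g2, 3), round(weighted, 2))   # 0.32 0.408 0.37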
Design Decisions - Example: Splitting Nominal Attributes
Design Decisions - Example: Splitting Continuous Attributes
• Brute-force method – high complexity
• Sort the training records based on the continuous attribute (e.g., annual income) – O(N log N) complexity
• Candidate split positions are identified by taking the midpoints between two adjacent sorted values
• Measure the Gini index for each split position and choose the one that gives the lowest value
• Further optimization: consider only candidate split positions located between two adjacent records with different class labels
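A minimal sketch of the optimized search, assuming a list of (value, label) training records; the values are loosely modeled on the textbook annual-income example, and only midpoints between adjacent records with different labels are kept:

def candidate_splits(values, labels):
    # sort records by attribute value: O(N log N)
    pairs = sorted(zip(values, labels))
    cands = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:          # only class-boundary midpoints
            cands.append((v1 + v2) / 2)
    return cands

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ['N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N']
print(candidate_splits(income, labels))    # [80.0, 97.5]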
Summary so far - Decision Tree
• Build a model (based on past feature vectors) in the form of a tree structure that predicts the value of the output variable based on the input variables in the feature vector
• Each internal node (decision node) of a decision tree corresponds to a test on one feature of the feature vector
• Root node, Branch node, Leaf node
• Building a Decision Tree - Recursive partitioning
• Splits data into multiple subsets on the basis of feature values
• Root node – entire dataset
• First selects the feature which predicts the target class in the strongest
way
• Splits the dataset into multiple partitions
• Stopping criteria
• All or most of the examples at a particular node have the same class
• All features have been used up in the partitioning
• The tree has grown to a pre-defined threshold limit
Example
• Consider the data set available for a company’s hiring
cycles. A student wants to find out if he may be offered a
job in the company. His parameters are as follows:
• CGPA: High, Communication: Bad, Aptitude: High,
Programming skills: Bad
Example
• Outcome = false for all the cases where Aptitude = Low, irrespective of other conditions
• So the feature Aptitude can be taken up as the first node of the decision tree
Example
• For Aptitude = HIGH, job offer condition is TRUE for all the cases
where Communication = Good.
• For cases where Communication = Bad, job offer condition is TRUE
for all the cases where CGPA = HIGH
• Use the below decision tree to predict outcome for (Aptitude = high,
Communication = Bad and CGPA = High)
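The three rules above can be written directly as a function; a sketch (the function name and plain-string attribute values are assumptions):

def predict_job_offer(aptitude, communication, cgpa):
    # Rule 1: Aptitude = Low always leads to no offer
    if aptitude == 'Low':
        return False
    # Rule 2: with Aptitude = High, Communication = Good always gets an offer
    if communication == 'Good':
        return True
    # Rule 3: with Communication = Bad, the offer depends on CGPA = High
    return cgpa == 'High'

print(predict_job_offer('High', 'Bad', 'High'))   # True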
Example: Entropy & Information Gain Calculation - Level 1
Entropy & Information Gain Calculation - Level 2
• We will have only one branch to navigate: Aptitude = High
Entropy values at the end of Level 2:
• 0.85 before the split
• 0.33 when CGPA is used for split
• 0.30 when Communication is used for split
• 0.80 when Programming skill is used for split
Information Gains
• After split with CGPA = 0.52
• After split with Communication = 0.55
• After split with Programming skill = 0.05
Entropy & Information Gain Calculation - Level 3
Entropy values at the end of Level 3:
• 0.81 before the split
• 0 when CGPA is used for split
• 0.50 when Programming skill is used for split
Characteristics of Decision Tree Induction
• Non-parametric approach
• Computationally inexpensive, even with large training set
• What is the worst case complexity of classifying a test record?
• Easy to interpret
• Accuracy is comparable to other classifiers
• Robust to noise, with methods to prevent overfitting
• Immune to presence of redundant or irrelevant attributes
• Suffers from data fragmentation (where have we heard this before?)
• How can we recover from this?
• Splits use a single attribute at a time -> rectilinear decision boundaries
• Limits the decision tree representation for modeling complex relationships among continuous attributes
• Tree pruning strategies have a greater effect on the performance of decision trees than the choice of impurity measure
Issues in Decision Tree Learning
• Handling training examples with missing data, attributes
with differing costs
• Model overfitting
• Causes of model overfitting
• Estimating generalization error
• Handling overfitting
• Evaluating classifier performance
Issues in Decision Tree Learning - Model Overfitting
• Training error (resubstitution error, apparent error)
• Generalization error
• Good model - low training error, low generalization error
• Low training error but high generalization error ==> overfitting
• Reasons for overfitting
• Presence of noise
• Lack of representative samples
• Model complexity
• Primary reason for overfitting – still a subject of debate
Example Data Set
Two-class problem:
• + : 5200 instances (5000 instances generated from a Gaussian centered at (10,10))
• o : 5200 instances (generated from a uniform distribution)
[Figure: the data set and two decision trees of different complexity fit to it]
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
Issues in Decision Tree Learning - Example
• Overfitting due to the lack of representative samples
Issues in Decision Tree Learning - Model Complexity
• The chance for model overfitting increases as the model becomes more
complex.
• Prefer simpler models - Occam's razor or the principle of parsimony
• Given two models with the same generalization errors, the simpler model is preferred
over the more complex model.
• Ideal complexity: the one that produces the lowest generalization error, but we have only the training data to learn from
• Best we can do: estimate the generalization error
• Methods to estimate generalization error
• Resubstitution estimate
• Selects the model that produces the lowest training error rate as its final model
• Pessimistic error estimate
• Minimum Description Length Principle
• Seek a model that minimizes the overall cost function
• Refer example
• Using a validation set
Issues in Decision Tree Learning - Handling Overfitting
Two approaches: pre-pruning and post-pruning
• Post-pruning
• Allow the tree to overfit the data, and then post-prune the tree
Model Selection for Decision Trees
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
• Stop if number of instances is less than some user-specified
threshold
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
• Stop if estimated generalization error falls below certain threshold
ID3
• Uses Hunt’s algorithm
• Uses information gain as the measure to select the best
attribute at each step in the growing tree
• Does not have the ability to determine how many
alternative decision trees are consistent with the
available training data
• Performs no backtracking in its search
• Shorter trees are preferred over longer trees. Trees that
place high information gain attributes close to the root
are preferred over those that do not - Inductive bias of
ID3
• Refer: Mitchell
Combining Multiple Learners
S.P.Vimal
• Bagging uses subsets (bags) of the original dataset to get a fair idea of the overall distribution (complete set)
• The size of the subsets created for bagging may be less than that of the original set
• Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement
• When you sample with replacement, draws are independent: one draw does not affect the outcome of the other. With seven items, you have a 1/7 chance of choosing any given item on the first draw and a 1/7 chance on the second draw
• If the draws are dependent (sampling without replacement): you have a 1/7 probability of picking an item on the first draw; since you do not replace it, only six items remain, giving a 1/6 chance of choosing a second item
• Multiple subsets are created from the original dataset, selecting observations with replacement
• A base model (weak model) is created on each of these subsets
• The models run in parallel and are independent of each other
• The final predictions are determined by combining the predictions from all the models
Bagging
• The number of base models (bags) is a hyperparameter
• By averaging over models with large variance, we get a better fit than the individual models
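A hedged sketch of bagging with scikit-learn (synthetic data; parameter values are illustrative only). The default base estimator is a decision tree, each trained on a bootstrap sample:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 base models, each fit on a bootstrap sample (sampling with replacement);
# predictions are combined by majority vote
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print(bag.score(X_te, y_te))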
Issue 2: Combining Results of Base Learners
• Weighted sum of all errors: errors are weighted by the importance of the example
AdaBoost (Freund and Schapire, 1996)
• If the error rate ε is > 0.5, the base learner is no better than random guessing for a two-class problem, and the algorithm cannot proceed
• α: the weight given to a base learner in the final combination; ε: the weighted error rate of that base learner
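A sketch of AdaBoost with scikit-learn (illustrative parameters); each new base learner focuses on the examples the previous ones got wrong, and the fitted model exposes the ε and α values per learner:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# a base learner whose weighted error exceeded 0.5 would stop the boosting round
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.estimator_errors_[:5])   # weighted error (epsilon) of the first few learners
print(clf.estimator_weights_[:5])  # the corresponding alpha values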
S.P.Vimal
BITS, Pilani
Topics
Hierarchical Clustering
Agglomerative Hierarchical Clustering
[Figure: six 2-D points and the corresponding dendrogram; leaf order on the x-axis is 1 3 2 5 4 6, with merge heights between roughly 0.05 and 0.2]
Agglomerative Hierarchical Clustering
Basic algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
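The basic algorithm above is available directly in SciPy; a minimal sketch on random 2-D points (data and parameters are illustrative):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
points = rng.random((6, 2))

# linkage() runs the merge loop above: start with singleton clusters,
# repeatedly merge the two closest, and record each merge and its height
Z = linkage(points, method='single')
print(Z)          # each row: (cluster i, cluster j, merge distance, new size)
# dendrogram(Z)   # plots the tree (requires matplotlib)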
[Figure: starting situation; each point p1, p2, ... is its own cluster, with the pairwise proximity matrix alongside]
Intermediate Situation
[Figure: after some merging steps we have clusters C1, C2, C3, C4, C5, with the proximity matrix between them]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5, with the proximity matrix rows/columns C1, C2, C3, C4, C5]
After Merging
• The question is: “How do we update the proximity matrix?”
[Figure: C2 and C5 merged into C2 U C5; in the proximity matrix with rows/columns C1, C2 U C5, C3, C4, the entries involving C2 U C5 are marked “?”]
How to Define Inter-Cluster Similarity
[Figure: two clusters and the point-level proximity matrix, asking which entries define their similarity]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward’s Method uses squared error
Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
Cluster Similarity: Group Average
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters (using the same similarity matrix as above)
Hierarchical Clustering: Comparison
[Figure: clusterings of the same six points produced by MIN, MAX, Group Average, and Ward’s Method, showing how the linkage choice changes the grouping]
Example
Distance matrix for six points:

      p1   p2   p3   p4   p5   p6
p1   0
p2   0.24 0
p3   0.22 0.15 0
p4   0.37 0.20 0.15 0
p5   0.34 0.14 0.28 0.29 0
p6   0.23 0.25 0.11 0.22 0.39 0
Single Link or MIN
• The height at which two clusters are merged in the dendrogram reflects the distance between the two clusters
• The distance between p3 and p6 is 0.11, so they are merged first

Updated distance matrix after merging (p3, p6); single link keeps the minimum of the merged rows:

           p1   p2   (p3, p6)  p4   p5
p1        0
p2        0.24 0
(p3, p6)  0.22 0.15  0
p4        0.37 0.20  0.15     0
p5        0.34 0.14  0.28     0.29 0
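The merge order for the p1..p6 matrix can be checked with SciPy; squareform converts the symmetric matrix to the condensed form linkage expects:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
])
Z = linkage(squareform(D), method='single')
print(Z[0])   # first merge: indices 2 and 5 (p3 and p6) at distance 0.11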
Agglomerative Algorithm
K-Means Introduction
Clustering
Choosing K=2
K-Means Example
Data: {2, 3, 4, 10, 11, 12, 20, 25, 30}, K = 2
Iteration 1:
• C1 = {2,3}, C2 = {4,10,11,12,20,25,30}
• μ1 = (2+3)/2 = 2.5
• μ2 = (4+10+11+12+20+25+30)/7 = 16
Iteration 2:
• C1 = {2,3,4}, C2 = {10,11,12,20,25,30}
• μ1 = (2+3+4)/3 = 3
• μ2 = (10+11+12+20+25+30)/6 = 18
and so on..
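A quick sketch verifying the two iterations above on the 1-D data:

import numpy as np

data = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float)
mu = np.array([2.5, 16.0])          # means after iteration 1

for it in range(5):
    # E-step: assign each point to the nearest mean
    assign = np.argmin(np.abs(data[:, None] - mu[None, :]), axis=1)
    # M-step: recompute each mean from its assigned points
    mu = np.array([data[assign == k].mean() for k in range(2)])
    print(it + 2, mu)   # iteration 2 prints [3. 18.], matching the example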
K-Means
What is it about?
• k-means partitions points into K clusters such that the distance between points inside a cluster is smaller than the distance between points across clusters
• Let
  • μ_k: mean of all points in cluster k (a d-dimensional vector)
  • r_nk ∈ {0,1}: r_nk = 1 if point x_n is assigned to cluster k, and r_nk = 0 otherwise
K-Means
Distortion Measure
• k-means identifies K clusters such that the sum of the squared distances of each data point to its cluster center (represented by the cluster mean) is minimized.
Objective function:
J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − μ_k‖²
E-Step:
For all x_t ∈ X: assign the point to its nearest cluster center (set r_nk = 1 for the closest k, 0 for the rest)
K-Means Algorithm
Let us determine μ_k:
• Assume the {r_nk} determined in the E-Step are fixed
• J is a quadratic function of μ_k. To optimize, take the derivative of J w.r.t. μ_k and set it to 0:
  ∂J/∂μ_k = −2 Σ_n r_nk (x_n − μ_k) = 0
• Solving for μ_k:
  μ_k = (Σ_n r_nk x_n) / (Σ_n r_nk)
  i.e., the sum of all the points in a cluster divided by the number of points in that cluster: the cluster mean
K-Means Algorithm
M-Step:
Algorithm:
Algorithm:
Algorithm:
Algorithm:
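Putting the E-step and M-step together; a compact numpy sketch (random data, K = 2, all names and parameters assumed):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]      # initial centers
    for _ in range(iters):
        # E-step: r_nk = 1 for the nearest center
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # M-step: mu_k = sum of points in cluster k / number of points in k
        # (a production version would guard against empty clusters)
        new_mu = np.array([X[assign == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

X = np.random.default_rng(1).normal(size=(200, 2))
centers, labels = kmeans(X, K=2)
print(centers)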
Data Set: The data used in this module came from the ‘Old Faithful’ data set, available for download from https://www.kaggle.com/janithwanni/old-faithful/data and used by the textbook PRML.
Multimodal Data
Issues fitting a single Gaussian to a real data set
Mixture Distributions
EM Clustering
Outline
The mean shift algorithm seeks modes, or local maxima of density, in the feature space
Mean shift
[Figure: successive frames of the mean shift procedure; a search window computes its center of mass, the mean shift vector points from the window center to that center of mass, and the window is repeatedly shifted along this vector until it converges on a mode]
Ref: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Slide by Y. Ukrainitz & B. Sarel
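A sketch with scikit-learn's MeanShift (the bandwidth sets the search-window size; data and parameters are illustrative):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])

bw = estimate_bandwidth(X, quantile=0.2)     # search-window size
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)   # one center per density mode found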
Thank You!
ANN and Deep Learning
Training data (x1, x2, label):

x1  x2  Label
1   9   Green
10  9   Green
4   7   Green
4   5   Red
5   3   Red
8   9   Green
4   2   Red
2   5   Red
7   1   Red
2   10  Green
8   5   Green
1   2   Red
8   2   Red

• Data looks linearly separable
• What is the decision boundary?
• Many possibilities, such as: if (2x1 + 3x2 − 25 > 0) it is green, otherwise red
• What about this arrangement?
With the chosen decision boundary 2x1 + 3x2 − 25 = 0, the classifier can be drawn as a unit: inputs x1 and x2 with weights 2 and 3, a bias input weighted −25, and a sign threshold applied to wᵀx
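The unit above, written out; a sketch classifying points from the table with the boundary 2x1 + 3x2 − 25 = 0:

import numpy as np

w = np.array([-25.0, 2.0, 3.0])     # bias w0 = -25, weights for x1, x2

def classify(x1, x2):
    # sign of w . (1, x1, x2): positive -> green, otherwise red
    return 'Green' if w @ np.array([1.0, x1, x2]) > 0 else 'Red'

print(classify(1, 9))    # Green (2*1 + 3*9 - 25 = 4 > 0)
print(classify(4, 5))    # Red   (2*4 + 3*5 - 25 = -2)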
Given a dataset, “how do we find appropriate parameters?” is an important issue
Perceptron Training Rule
• Different algorithms may converge to different acceptable hypotheses
• Convergence with the perceptron training rule is subject to linear separability of the training examples and an appropriately small learning rate η
An Example
Consider a perceptron with output 0/1, bias input x0 = +1 with w0 = −30, and inputs x1, x2 with w1 = +20, w2 = +20, output σ(wᵀx):

x1  x2  Output
0   0   0
0   1   0
1   0   0
1   1   1

This unit computes the AND function.

Now suppose the desired truth table is instead:

x1  x2  Output
0   0   0
0   1   0
1   0   1
1   1   0

Let us assume the same structure (+1 with w0; x1 with w1; x2 with w2; output σ(wᵀx)) and derive the weights from the four rows:
• By (1): w0 < 0, so let w0 = −1
• By (2): w0 + w2 < 0, so let w2 = −1
• By (3): w0 + w1 ≥ 0, so let w1 = 1.5
• By (4): w0 + w1 + w2 < 0, which holds for these choices (−0.5 < 0)
So (w0, w1, w2) = (−1, 1.5, −1). Other possibilities also exist.
Perceptron Training (delta rule)
Reference: https://github.com/dennybritz/nn-from-scratch/blob/master/nn-from-scratch.ipynb
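A sketch of the perceptron training rule, w ← w + η(t − o)x, assuming 0/1 targets and a thresholded output; the learning rate, epoch count, and data are illustrative:

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    X = np.hstack([np.ones((len(X), 1)), X])      # prepend bias input +1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            o = 1.0 if w @ xi >= 0 else 0.0       # thresholded output
            w += eta * (ti - o) * xi              # perceptron update rule
    return w

# learn x1 AND (NOT x2), the second truth table above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 1, 0])
print(train_perceptron(X, t))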
Neuron
A neuron uses a nonlinear activation function (sigmoid, tanh, ReLU, softplus, etc.) in place of thresholding: output = act(wᵀx), with bias input +1 weighted by w0 and inputs x1, x2, ... weighted by w1, w2, ...
[Figure: plots of sigmoid(x), tanh(x), and softplus(x)]
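The named activations, as a quick numpy sketch:

import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0.0, x)
def softplus(x): return np.log1p(np.exp(x))   # smooth approximation of ReLU

x = np.linspace(-3, 3, 7)
print(sigmoid(x))   # squashes into (0, 1)
print(tanh(x))      # squashes into (-1, 1)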
Neural Network Applicability
• http://neuralnetworksanddeeplearning.com/chap4.html
Backpropagation
Session Content
• Backpropagation Algorithm
• Behavior of Backpropagation
Learning in NN: Backpropagation
• Computation of weights for hidden nodes is not trivial because it is difficult to assess their error term (partial derivative) without knowing what their output values should be
• Backpropagation: two phases in each iteration of the algorithm, the forward phase and the backward phase
• During the forward phase, the weights obtained from the previous iteration are used to compute the output value of each neuron in the network; the computation progresses in the forward direction, i.e., outputs of the neurons at level k are computed prior to computing the outputs at level k + 1
• During the backward phase, the weight update formula is applied in the reverse direction; in other words, the weights at level k + 1 are updated before the weights at level k are updated
• This backpropagation approach allows us to use the errors for neurons at layer k + 1 to estimate the errors for neurons at layer k
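A minimal two-phase sketch for one hidden layer (sigmoid units, squared error), illustrating the forward phase followed by the backward phase; shapes, data, and the learning rate are assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                  # 4 examples, 3 inputs
y = np.array([[0.], [1.], [1.], [0.]])
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for step in range(1000):
    # forward phase: level k outputs feed level k+1
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # backward phase: error terms flow from the output layer back
    d_out = (out - y) * out * (1 - out)      # delta at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # hidden-layer delta, from d_out
    W2 -= 0.5 * h.T @ d_out                  # update deeper weights first
    W1 -= 0.5 * X.T @ d_h
print(np.round(out.ravel(), 2))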
Incremental (Stochastic) vs Batch Gradient Descent
[Figure: processing pipeline: input -> quality check -> enhancement -> feature extraction]
Demo
https://playground.tensorflow.org/
Neural Networks II
Convolutional Neural Networks
[Figure: an input image of size nH × nW × nC is passed through Conv 1 to produce an output of size nH(1) × nW(1) × nC(1)]
Convolution Operation
Input Size: n × n
Kernel Size: f × f
Output Size: (n − f + 1) × (n − f + 1)
Convolution Operation

Input Image 6 × 6:
3 0 1 2 7 4
1 5 8 9 3 1
2 7 2 5 1 3
0 1 3 1 7 8
4 2 1 6 2 8
2 4 5 2 3 9

Kernel 3 × 3:
1 0 -1
1 0 -1
1 0 -1

Output Image 4 × 4:
 -5  -4   0   8
-10  -2  -2   3
  0  -2  -4  -7
 -3  -2  -3 -16

Top-left output value (kernel over rows 1–3, columns 1–3):
value = 3(1) + 1(1) + 2(1) + 0(0) + 5(0) + 7(0) + 1(−1) + 8(−1) + 2(−1) = −5
Shifting one column right (rows 1–3, columns 2–4):
value = 0(1) + 5(1) + 7(1) + 1(0) + 8(0) + 2(0) + 2(−1) + 9(−1) + 5(−1) = −4
Bottom-right output value (rows 4–6, columns 4–6):
value = 1(1) + 6(1) + 2(1) + 7(0) + 2(0) + 3(0) + 8(−1) + 8(−1) + 9(−1) = −16
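The sliding-window computation, as a numpy sketch that reproduces the 4 × 4 output above:

import numpy as np

def conv2d_valid(img, kernel):
    n, f = img.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the f x f window with the kernel
            out[i, j] = np.sum(img[i:i+f, j:j+f] * kernel)
    return out

img = np.array([[3,0,1,2,7,4],[1,5,8,9,3,1],[2,7,2,5,1,3],
                [0,1,3,1,7,8],[4,2,1,6,2,8],[2,4,5,2,3,9]])
kernel = np.array([[1,0,-1],[1,0,-1],[1,0,-1]])
print(conv2d_valid(img, kernel))   # top-left -5, bottom-right -16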
Vertical Edge Detector
Input Image 6 × 6 (all six rows identical):
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
10 10 10 0 0 0
Kernel 3 × 3:
1 0 -1
1 0 -1
1 0 -1
Output 4 × 4:
0 30 30 0
0 30 30 0
0 30 30 0
0 30 30 0
Horizontal Edge Detector
Input Image 6 × 6:
10 10 10 10 10 10
10 10 10 10 10 10
10 10 10 10 10 10
 0  0  0  0  0  0
 0  0  0  0  0  0
 0  0  0  0  0  0
Kernel 3 × 3:
 1  1  1
 0  0  0
-1 -1 -1
Output 4 × 4:
 0  0  0  0
30 30 30 30
30 30 30 30
 0  0  0  0
Edge Detectors
Similarly, edges at 45°, 60°, and other orientations can be detected by appropriately choosing the kernels or filters, and the outputs of the different filters can be stacked to obtain the edges.
The aim of the convolution layer is to learn the filter (kernel) matrix: the network learns the 9 weights in the filter.
Filter 3 × 3:
w1 w2 w3
w4 w5 w6
w7 w8 w9
Fully Connected (FC) Layer
• Every neuron in one layer is connected to every neuron in the next layer.
[Figure: FC1 -> FC2 -> LR, producing output ŷ1]
Neural Networks II
Padding, Striding, Pooling
• Padding
• Striding
• Pooling
  • Max Pooling
  • Average Pooling
Convolution Operation
(The same 6 × 6 input, 3 × 3 kernel, and 4 × 4 output as in the example above.)
Padding
Input Size: n × n
Padding: p
Stride: s
Kernel Size: f × f
Output Size: ⌊(n + 2p − f)/s⌋ + 1 per dimension, i.e. (⌊(n + 2p − f)/s⌋ + 1) × (⌊(n + 2p − f)/s⌋ + 1)
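The size formula as a small helper (integer floor division implements the ⌊·⌋):

def conv_output_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1, per dimension
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4 (the 6x6 / 3x3 example above)
print(conv_output_size(7, 3, p=0, s=2))  # 3 (the stride-2 example below)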
Convolution with Stride
Input Image 7 × 7:
2 4 7 4 6 2 9
6 6 9 8 7 4 3
3 4 8 3 8 9 7
7 8 3 6 6 3 4
4 2 1 8 2 4 6
3 2 4 1 9 8 3
0 1 3 9 2 1 4
Kernel 3 × 3:
 3 4 4
 1 0 2
-1 0 3
Stride = 2
Output 3 × 3:
91 100  83
69  91 127
44  72  74
One Layer CNN
[Figure: a 6 × 6 × 3 image is convolved with two kernels of size 3 × 3 × 3; each result passes through ReLU(W·x + b), and the two outputs are stacked to give a 4 × 4 × 2 volume]
Number of Parameters
Kernel: f × f × nC weights, plus 1 bias
# Parameters per kernel: f ∗ f ∗ nC + 1
Here: 3 ∗ 3 ∗ 3 + 1 = 28 per kernel
2 kernels: 28 ∗ 2 = 56
Pooling (Pool) Layer
• Fixed computation: gradient descent is not applied, as there are no learnable parameters
• Pooling is applied to each of the channels independently
• Two types: Max Pooling and Average Pooling
Max Pooling
• Take the maximum of each sub-region considered.
Input 4 × 4:
2 4 7 4
6 6 9 8
3 4 8 3
7 5 3 6
Max pooling with f = 2, s = 2 gives Output 2 × 2:
6 9
7 8
Average Pooling
• Compute the average of each sub-region considered.
Input 4 × 4:
2 4 7 4
6 6 9 8
3 4 8 3
7 5 3 6
Average pooling with f = 2, s = 2 gives Output 2 × 2:
4.5  7
4.75 5
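Both pooling types in one sketch (f = 2, s = 2), reproducing the outputs above:

import numpy as np

def pool2d(img, f=2, s=2, op=np.max):
    n = (img.shape[0] - f) // s + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = op(img[i*s:i*s+f, j*s:j*s+f])   # apply op per sub-region
    return out

img = np.array([[2,4,7,4],[6,6,9,8],[3,4,8,3],[7,5,3,6]])
print(pool2d(img, op=np.max))    # [[6. 9.] [7. 8.]]
print(pool2d(img, op=np.mean))   # [[4.5 7.] [4.75 5.]]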
Neural Networks II
Architectures for Classification
• LeNet-5
• AlexNet
• VGG 16
LeNet-5
• Handwritten character recognition: a 10-category classification problem
• Approx. 60 thousand parameters are learned.
AlexNet
• ImageNet dataset
• A 1000-category classification problem
• Approx. 60 million parameters are learned.
VGG 16
• 16 Layers
• Approx. 138 million parameters are learned.
Neural Networks II
Recurrent Neural Network
Dr. Sugata Ghosal
CSIS Off Campus Faculty
BITS Pilani
Recurrent Neural Networks
• The networks we used previously (input x, hidden units h, output y) are also called feedforward neural networks
• A Recurrent Neural Network (RNN) adds a recurrent edge on the hidden units: at time step t it has input x<t>, hidden state h<t>, and output y<t>
Overview
[Figure: a single-layer RNN and a multilayer RNN, folded and unfolded through time; the unfolded view shows hidden states h<t−1>, h<t>, h<t+1> and outputs y<t−1>, y<t>, y<t+1> across time steps]
• Text classification
• Speech recognition (acoustic modeling)
• language translation
• ...
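One time step of a simple (Elman-style) RNN cell, as a sketch with assumed dimensions; the hidden state carries the recurrent edge:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_xh = rng.normal(size=(d_in, d_h)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(d_h, d_h)) * 0.1    # hidden -> hidden (recurrent edge)
W_hy = rng.normal(size=(d_h, 1)) * 0.1      # hidden -> output

h = np.zeros(d_h)
for t, x_t in enumerate(rng.normal(size=(5, d_in))):   # 5 time steps
    h = np.tanh(x_t @ W_xh + h @ W_hh)      # h<t> depends on x<t> and h<t-1>
    y_t = h @ W_hy                          # y<t>
    print(t, y_t)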
[Figure: actual vs. predicted data from the four models for each stock index in Year 1, from 2010.10.01 to 2011.09.30]
Ref: Shen, Zhen, Wenzheng Bao, and De-Shuang Huang. “Recurrent Neural Network for Predicting Transcription Factor Binding Sites.” Scientific Reports 8, no. 1 (2018): 15270.
Anita Ramachandran
March 2024
• Model engineering
• Model standardization
• Model compression
• Recent topics
• The Model Engineering pipeline includes a number of operations that lead to a final model:
  – Model Training
    o The process of applying the machine learning algorithm on training data to train an ML model
    o Includes feature engineering and hyperparameter tuning for the model training activity
  – Model Evaluation
    o Validating the trained model to ensure it meets the original codified objectives before serving the ML model in production to the end-user
  – Model Testing
    o Performing the final “Model Acceptance Test” using the held-back test dataset
  – Model Packaging
    o The process of exporting the final ML model into a specific format that describes the model, so that it can be consumed by the business application
• ML architecture patterns
• Model serving patterns
• Model engineering
• Model standardization
• Model compression
• Recent topics
  – Data Privacy for Machine Learning
  – Distributed Machine Learning
  – Federated Machine Learning
Source: https://viso.ai/deep-learning/federated-learning/
• Challenges
– Not all data sources may have collected new data between one training run and the
next
– Not all mobile devices are powered on all the time
– Performance issues on devices that train an FL model
– Requires frequent communication across the nodes in the federated network
– Workflows
Federated Learning vs Distributed Learning
• TinyML
– Industrial Predictive Maintenance, Healthcare,
Agriculture, Ocean life conservation
• Federated Learning
– Useful in the context of mobile phones with
distributed data, or a user’s browser
– Sharing of sensitive data that is distributed across
multiple data owners
– An example of FL in production is Google’s Gboard
keyboard for Android mobile phones
• What to do to make ML models transparent and comply with regulations?
Privacy Issues
● How do we defend ML models against data poisoning?
FAccT Overview
[Figure: FAccT (Fairness, Accountability, Transparency), together with Privacy and Robustness, sits at the intersection of Psychology, Social Science, Public Policy, Statistics, Theory, and Machine Learning]
Fairness and Bias
• Sources of Bias
• Real World Examples
• Sensitive Features
• Fairness Criteria
Fairness
• What is Fairness?
• The absence of bias towards an individual or a group (Mehrabi et al, 2019)
• Can ML Models Discriminate?
  • Don’t machines just follow humans’ instructions?
  • ML models approximate patterns in the data
  • They learn and amplify biases at the same time
Model Sensitivity to Data
https://en.wikipedia.org/wiki/Simpson%27s_paradox
Real World Example
Commercial risk assessment software known as COMPAS
• Demographic Parity
• Equality of Odds
• Equality of Opportunity
Demographic Parity
Demographic Parity Is Applied to a Group of Samples
[Table: hiring data with columns Race and Ethnicity, Skill, Years of Exp, Goes to Mexican Markets?, Hiring Decision Y, Predictor Ŷ1, Predictor Ŷ2; sample row: White, C++, 0, no, no, 1, 0]

Demographic Parity for Predictor Ŷ1
• P(Ŷ1 = 1 | R = H) = 2/3
• P(Ŷ1 = 1 | R = W) = 2/3

Equality of Opportunity/Odds for Predictor Ŷ1
• P(Ŷ1 = 1 | R = H, Y = yes) = 1
• P(Ŷ1 = 1 | R = W, Y = yes) = 0.5
• P(Ŷ1 = 1 | R = H, Y = no) = 0
• P(Ŷ1 = 1 | R = W, Y = no) = 1

Demographic Parity for Predictor Ŷ2
• P(Ŷ2 = 1 | R = H) = 2/3
• P(Ŷ2 = 1 | R = W) = 1/3

Equality of Opportunity/Odds for Predictor Ŷ2
Summary of Fairness Criteria
Demographic Parity ✓
Equalized Odds ✓
Equalized Opportunity ✓
Recap
○ Equal Opportunity: probabilities of distributing the favorable outcome to the qualified members across groups are the same
○ Equal Odds: probabilities of distributing the favorable outcome to both qualified and unqualified members across groups are the same
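These criteria reduce to conditional selection rates; a sketch with pandas on a toy table (column names and data are assumptions, not the slides' table):

import pandas as pd

df = pd.DataFrame({
    'group': ['H', 'H', 'H', 'W', 'W', 'W'],
    'qualified': ['yes', 'yes', 'no', 'yes', 'yes', 'no'],
    'pred': [1, 1, 0, 1, 0, 1],
})

# demographic parity: P(pred = 1 | group)
print(df.groupby('group')['pred'].mean())

# equality of opportunity: P(pred = 1 | group, qualified = yes)
print(df[df.qualified == 'yes'].groupby('group')['pred'].mean())

# equalized odds additionally matches rates for qualified = no
print(df[df.qualified == 'no'].groupby('group')['pred'].mean())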
Thank You!
Applied Machine Learning
Revision
Anita Ramachandran
Question 1
Solution 1
Question 2
Solution 2
Question 3
Solution 3
Question 4
Vijay is a certified Data Scientist and he has applied to two companies, Google and Microsoft. He feels that he has a 60% chance of receiving an offer from Google and a 50% chance of receiving an offer from Microsoft. If he receives an offer from Microsoft, he believes there is an 80% chance of also receiving an offer from Google.
• If Vijay receives an offer from Microsoft, what is the probability that he will not receive an offer from Google?
• What are his chances of getting an offer from Microsoft, given that he has an offer from Google?
Solution 4
Let G = offer from Google and M = offer from Microsoft, so P(G) = 0.6, P(M) = 0.5, and P(G|M) = 0.8.
Ans
• P(not G | M) = 1 − P(G|M) = 1 − 0.8 = 0.2
• P(M|G) = P(G|M) · P(M) / P(G) = (0.8 × 0.5) / 0.6 ≈ 0.67
Question 5
• Solve the below and find the equation of the hyperplane using the linear Support Vector Machine method. Positive Points: {(3, 2), (4, 3), (2, 3), (3, -1)}; Negative Points: {(1, 0), (1, -1), (0, 2), (-1, 2)}
Solution 5
• Solve the below and find the equation of the hyperplane using the linear Support Vector Machine method.
Question 9
• Suppose we train a model to predict whether an email is Spam or Not Spam. After training the model, we apply it to a test set of 500 new email messages (also labelled) and the model produces the contingency table below.

                           True Class
                           Spam    Not Spam
Predicted Class  Spam       70       30
                 Not Spam   70      330

• [A] Compute the precision of this model with respect to the Spam class.
• [B] Compute the recall of this model with respect to the Spam class.
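A worked check for [A] and [B], treating Spam as the positive class (a sketch of the arithmetic):

tp, fp = 70, 30    # predicted Spam: 70 truly Spam, 30 Not Spam
fn, tn = 70, 330   # predicted Not Spam: 70 truly Spam, 330 Not Spam

precision = tp / (tp + fp)   # 70 / 100 = 0.70
recall = tp / (tp + fn)      # 70 / 140 = 0.50
print(precision, recall)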
Introduction
# Plot a feature's distribution with a KDE overlay (seaborn; assumes X is a DataFrame)
sns.histplot(X[col], kde=True)
# Per-feature importances from a fitted sklearn DecisionTreeClassifier
feature_importances = dt_classifier.feature_importances_
The registered functions (evaluate, mate, mutate, and select) define the
key components of the genetic algorithm: how individuals are
evaluated, how they are combined (crossover), how they can change
(mutation), and how they are selected for the next generation. These
functions will be used by the genetic algorithm to evolve a population of
individuals toward a solution that optimizes the defined evaluation
criteria.
Runs the genetic algorithm using the eaSimple function from the
algorithms module.
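A minimal DEAP sketch showing those registrations and the eaSimple call; the problem here (the classic OneMax toy task) and all parameter values are illustrative, not from the slides:

import random
from deap import algorithms, base, creator, tools

creator.create('FitnessMax', base.Fitness, weights=(1.0,))
creator.create('Individual', list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register('attr_bool', random.randint, 0, 1)
toolbox.register('individual', tools.initRepeat, creator.Individual,
                 toolbox.attr_bool, n=20)
toolbox.register('population', tools.initRepeat, list, toolbox.individual)

toolbox.register('evaluate', lambda ind: (sum(ind),))     # how individuals are evaluated
toolbox.register('mate', tools.cxTwoPoint)                # how they are combined (crossover)
toolbox.register('mutate', tools.mutFlipBit, indpb=0.05)  # how they can change (mutation)
toolbox.register('select', tools.selTournament, tournsize=3)  # selection for the next generation

pop = toolbox.population(n=50)
pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2,
                               ngen=20, verbose=False)
print(max(sum(ind) for ind in pop))   # best individual approaches the optimum of 20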