AML AfterMid Merged
SE/SS ZG568
Swarna Chaudhary
Assistant Professor – BITS CSIS
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
BITS Pilani
Pilani Campus
Logistic Regression
Linear Regression versus Logistic Regression
Logistic Regression:
Sigmoid Function
• Want 0 ≤ h_θ(x) ≤ 1
• h_θ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z))
• g(z) is called the sigmoid function or logistic function
[Figure: the sigmoid curve g(z) plotted against z]
Slide credit: Andrew Ng
With h_θ(x) = g(θᵀx) and g(z) = 1 / (1 + e^(−z)):
• predict "y = 1" if h_θ(x) ≥ 0.5, i.e. z = θᵀx ≥ 0
• predict "y = 0" if h_θ(x) < 0.5, i.e. z = θᵀx < 0
Slide credit: Andrew Ng
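As a quick illustration of the hypothesis and decision rule above, here is a minimal NumPy sketch; the parameter vector and feature values are made up for the example.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # h_theta(x) = g(theta^T x); predict y = 1 when h_theta(x) >= 0.5
    h = sigmoid(np.dot(theta, x))
    return 1 if h >= 0.5 else 0

theta = np.array([-3.0, 1.0, 1.0])   # hypothetical parameters [theta0, theta1, theta2]
x = np.array([1.0, 2.5, 1.0])        # x0 = 1 (bias term), then x1, x2
print(sigmoid(np.dot(theta, x)), predict(theta, x))   # h ~= 0.62, so predict y = 1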
Training set: m examples {(x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)}
Linear regression cost:
J(θ) = (1/2m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)
Cost(h_θ(x), y) = ½ (h_θ(x) − y)²
With the sigmoid hypothesis, this squared-error cost is "non-convex" in θ; the cross-entropy cost used for logistic regression (next slide) is "convex".
Logistic regression cost function (cross entropy)
[Figure: Cost(h_θ(x), y) plotted against h_θ(x) ∈ [0, 1]; for y = 1 the cost falls to 0 as h_θ(x) → 1, for y = 0 the cost falls to 0 as h_θ(x) → 0]
Cost(h_θ(x), y) = −log(h_θ(x))        if y = 1
                  −log(1 − h_θ(x))    if y = 0
Slide credit: Andrew Ng
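A small sketch of this per-example cross-entropy cost, evaluated on hypothetical predicted probabilities:

import numpy as np

def cost(h, y):
    # cross-entropy cost for a single example: -log(h) if y = 1, -log(1 - h) if y = 0
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost(0.9, 1))   # ~0.105 : confident and correct -> small cost
print(cost(0.1, 1))   # ~2.303 : confident and wrong   -> large cost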
• h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)
• E.g., θ₀ = −3, θ₁ = 1, θ₂ = 1
• Predict "y = 1" if −3 + x₁ + x₂ ≥ 0, i.e. x₁ + x₂ ≥ 3
[Figure: tumour data plotted with x₁ = Tumor Size and x₂ = Age; the line x₁ + x₂ = 3 is the decision boundary]
One-vs-all (multi-class classification)
Class 1, Class 2, Class 3: fit a separate binary classifier for each class
h_θ⁽ⁱ⁾(x) = P(y = i | x; θ),   i = 1, 2, 3
[Figure: three binary sub-problems in the (x₁, x₂) plane, each separating one class from the remaining two]
Slide credit: Andrew Ng
• Train a logistic regression classifier h_θ⁽ⁱ⁾(x) for each class i to predict the probability that y = i, where each class k has its own dedicated parameter vector θ⁽ᵏ⁾.
• Softmax regression generalizes this: compute a score s_k(x) for each class, then the probability p_k that the instance belongs to class k is σ(s(x))_k = exp(s_k(x)) / Σⱼ exp(s_j(x)), where K is the number of classes, s(x) is the vector of class scores for instance x, and σ(s(x))_k is the estimated probability that x belongs to class k given those scores.
• The Softmax Regression classifier predicts the class with the highest probability.
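A minimal sketch of the softmax probability computation described above, with made-up class scores:

import numpy as np

def softmax(scores):
    # sigma(s)_k = exp(s_k) / sum_j exp(s_j); subtract the max for numerical stability
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

s = np.array([2.0, 1.0, 0.1])   # hypothetical scores s_k(x) for K = 3 classes
p = softmax(s)
print(p, p.argmax())            # probabilities sum to 1; class 0 has the highest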
Logistic regression (Classification)
• Model
"
ℎ! 𝑥 = 𝑃 𝑌 = 1 𝑋" , 𝑋# , ⋯ , 𝑋$ = #
"%& !" $
• Cost function
J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾),   where Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1, and −log(1 − h_θ(x)) if y = 0
• Learning
" $
Gradient descent: Repeat {𝜃! ≔ 𝜃! − 𝛼 # ∑#
$%" ℎ& 𝑥
$ −𝑦 $ 𝑥! }
• Inference
Ŷ = h_θ(x_test) = 1 / (1 + e^(−θᵀ x_test))
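Putting the model, cost and update rule together, here is a small NumPy sketch of batch gradient descent for logistic regression; the toy data, learning rate and iteration count are invented for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy training set: X already contains a leading column of 1s for the intercept
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])
alpha, m = 0.1, len(y)

for _ in range(5000):
    h = sigmoid(X @ theta)                    # h_theta(x) for every example
    theta -= alpha * (X.T @ (h - y)) / m      # theta_j := theta_j - alpha/m * sum((h - y) * x_j)

h = sigmoid(X @ theta)
J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))   # cross-entropy cost
print(theta, J, (h >= 0.5).astype(int))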
Logistic Regression Applications
Classifier Evaluation Metrics: Confusion Matrix
Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
The matrix may have extra rows/columns to provide totals.
• True Positive (TP): the number of predictions where the classifier correctly predicts the positive class as positive.
• True Negative (TN): the number of predictions where the classifier correctly predicts the negative class as negative.
• False Positive (FP): the number of predictions where the classifier incorrectly predicts the negative class as positive.
• False Negative (FN): the number of predictions where the classifier incorrectly predicts the positive class as negative.

Actual class \ Predicted class |  C1                    |  ¬C1
C1                             |  True Positives (TP)   |  False Negatives (FN)
¬C1                            |  False Positives (FP)  |  True Negatives (TN)
Classifier Evaluation Metrics: Accuracy and Error Rate
Classifier accuracy (recognition rate): the percentage of test-set tuples that are correctly classified
Accuracy = (TP + TN) / All
Error rate = (FP + FN) / All = 1 − Accuracy
Accuracy is most effective when the class distribution is relatively balanced.

Actual \ Predicted |  C   |  ¬C  |
C                  |  TP  |  FN  |  P
¬C                 |  FP  |  TN  |  N
                   |  P′  |  N′  |  All
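A tiny sketch of these two formulas computed from confusion-matrix counts; the counts are hypothetical here (they match the cancer example a few slides below):

def accuracy_and_error(tp, fn, fp, tn):
    # Accuracy = (TP + TN) / All ; Error rate = (FP + FN) / All
    total = tp + fn + fp + tn
    return (tp + tn) / total, (fp + fn) / total

print(accuracy_and_error(tp=90, fn=210, fp=140, tn=9560))   # (0.965, 0.035)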
Example
Given below is a confusion matrix for medical data where the class values are yes and
no for a class label attribute, cancer. Calculate the accuracy of the classifier.
Class Imbalance Problem
• The main class of interest is rare.
• the data set distribution reflects a significant majority of the negative class and a
minority positive class.
• For example,
– in fraud detection applications, the class of interest (or positive class) is "fraud"
– in medical tests, there may be a rare class, such as "cancer"
• Accuracy might not be a good measure of performance when there is a class imbalance problem.
Model Evaluation Measures
• True positive rate (TPR) or sensitivity (recall): the fraction of positive examples predicted correctly by the model, TPR = TP / (TP + FN)
• True negative rate (TNR) or specificity: the fraction of negative examples predicted correctly by the model, TNR = TN / (TN + FP)
• False positive rate (FPR): the fraction of negative examples predicted as the positive class, FPR = FP / (FP + TN)
• False negative rate (FNR): the fraction of positive examples predicted as the negative class, FNR = FN / (FN + TP)
• Precision = TP / (TP + FP);  F1 = 2 · Precision · Recall / (Precision + Recall)
A high value of the F1 measure ensures that both precision and recall are high.
Example
Precision = 90/230 = 39.13%;  Recall = 90/300 = 30.00%;  F1 = 2 × 0.3913 × 0.30 / (0.3913 + 0.30) = 33.96%
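The same numbers as a small sketch:

def precision_recall_f1(tp, fp, fn):
    # Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of the two
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=140, fn=210))   # (0.3913, 0.3000, 0.3396)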
Example: Contingency Table for a Multi-Class Classifier
Cross Validation
• Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression",
  http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• http://www.cs.cmu.edu/~tom/NewChapters.html
• http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
• https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c
• https://www.youtube.com/watch?v=-la3q9d7AKQ
• http://www.datasciencesmachinelearning.com/2018/11/handling-outliers-in-python.html
• http://www.cs.cmu.edu/~tom/10601_fall2012/slides/GenDiscr_LR_9-20-2012.pdf
• https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood
• http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
• https://towardsdatascience.com/logistic-regression-explained-9ee73cede081
Interpretability
• https://christophm.github.io/interpretable-ml-book/logistic.html
Naïve Bayesian Classifier
Swarna Chaudhary
BITS Pilani Assistant Professor
Pilani|Dubai|Goa|Hyderabad
swarna.chaudhary@pilani.bits-pilani.ac.in
• The slides presented here are obtained from the authors of the books and from
various other contributors. I hereby acknowledge all the contributors for their
material and inputs.
• I have added and modified a few slides to suit the requirements of the course.
Random Variables
• A random variable is a variable whose possible values are numerical outcomes of a random experiment.
• A random variable X is a function from the sample space to the real numbers.
• Denoted by a capital letter (e.g., X).
1. Discrete random variable: takes a countable set of values (e.g., the sum of two dice).
2. Continuous random variable: can take any value within a range (e.g., any real number in [0, 1]).
Example
• Sum of 2 dice
Probability Density Function
Gaussian Distribution
• The normal curve is bell-shaped and has a single peak at the exact center of the distribution.
• The arithmetic mean, median, and mode of the distribution are equal and
located at the peak.
• Half the area under the curve is above the peak, and the other half is
below it.
• The normal distribution is symmetrical about its mean.
• The normal distribution is asymptotic - the curve gets closer and closer to
the x-axis but never actually touches it.
The normal distribution is completely described by two parameters:
1. Mean
2. Standard deviation
§ Example: A pair of dice is thrown. Given that the sum of the numbers on the dice is 7, find the probability that at least one of the dice shows 2.
• Example: If a die is thrown twice, what is the probability that the first throw results in a number greater than 4 and the second throw results in a number less than 3?
– Now we can estimate P(Xᵢ | Yⱼ) for all Xᵢ and Yⱼ combinations from the training data
Original estimate:  P(Aᵢ | C) = N_ic / N_c
Laplace estimate:   P(Aᵢ | C) = (N_ic + 1) / (N_c + c),   where c is the number of distinct values of the attribute
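A tiny sketch of the two estimates above, using made-up counts:

def original_estimate(n_ic, n_c):
    # P(A_i | C) = N_ic / N_c  -- becomes 0 when the value never occurs with the class
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    # P(A_i | C) = (N_ic + 1) / (N_c + c), c = number of distinct attribute values
    return (n_ic + 1) / (n_c + c)

print(original_estimate(0, 7), laplace_estimate(0, 7, 3))   # 0.0 vs. 0.1 -- smoothing avoids zero probabilities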
[Excerpt from the training data table used in the example:]
2   No   Married   100K   No
3   No   Single     70K   No
6   No   Married    60K   No
9   No   Married    75K   No
Naïve Bayes for Text Classification
A Simple Example
Training data:
Text                             | Tag
"A great game"                   | Sports
"The election was over"          | Not sports
"Very clean match"               | Sports
"A clean but forgettable game"   | Sports
"It was a close election"        | Not sports

Which tag does the sentence "A very close game" belong to? i.e., P(sports | A very close game)
Feature engineering: bag of words, i.e., use word frequencies without considering order.
Using Bayes' theorem:
P(sports | A very close game) = P(A very close game | sports) · P(sports) / P(A very close game)
"close" doesn't appear in any sentence with the Sports tag, so P(close | sports) = 0, which makes the whole product 0.
Laplace smoothing
Apply Laplace Smoothing
[Table: P(word | Sports) and P(word | Not Sports) for each word of "A very close game" after Laplace smoothing]
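A minimal sketch of the whole calculation for this example with Laplace smoothing; the word counts come from the five training sentences above, everything else is plain Python:

from collections import Counter

train = [("A great game", "Sports"), ("The election was over", "Not sports"),
         ("Very clean match", "Sports"), ("A clean but forgettable game", "Sports"),
         ("It was a close election", "Not sports")]

# word counts per tag, class priors, and vocabulary size
counts = {"Sports": Counter(), "Not sports": Counter()}
docs = Counter()
for text, tag in train:
    counts[tag].update(text.lower().split())
    docs[tag] += 1
vocab = len({w for c in counts.values() for w in c})

def score(sentence, tag):
    # P(tag) * product of Laplace-smoothed P(word | tag)
    total = sum(counts[tag].values())
    p = docs[tag] / sum(docs.values())
    for w in sentence.lower().split():
        p *= (counts[tag][w] + 1) / (total + vocab)
    return p

query = "A very close game"
print({t: score(query, t) for t in counts})   # Sports ~2.8e-05 > Not sports ~5.7e-06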
Learning to Classify Text
• Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic
Naïve Bayes Generative Model
• Naïve Bayes classifier uses likelihood and prior probability to calculate
conditional probability of the class
• Likelihood is based on joint probability, which is the core principle of probabilistic
generative model
• Naïve Bayes simplifies the calculation of likelihood by the assumption of
conditional independence among input parameters
• Each parameter’s likelihood is determined using joint probability of the input
parameter and the output label
Naïve Bayes – When to use
• When the training data is small
• When the features are (mostly) conditionally independent
• When there is little missing data
• When we have a large number of features with a minimal data set
• Example: text classification
Gaussian Naïve Bayes Algorithm
• Gaussian Naive Bayes is a special type of NB algorithm.
• It assumes that all the (continuous) features follow a Gaussian, i.e. normal, distribution.
• Bernoulli NB: implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors.
• Categorical NB: implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, described by its index i, has its own categorical distribution.
Reference: https://scikit-learn.org/stable/modules/naive_bayes.html
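As a small illustration of the scikit-learn variants referenced above, here is a Gaussian NB sketch; the toy continuous features and labels are invented for the example:

from sklearn.naive_bayes import GaussianNB
import numpy as np

# toy continuous features (e.g. weight, glucose level) with binary labels
X = np.array([[72, 185], [56, 170], [60, 168], [84, 183]], dtype=float)
y = np.array([1, 0, 0, 1])

model = GaussianNB().fit(X, y)          # fits one Gaussian per feature per class
print(model.predict([[70, 180]]))       # predicted class for a new instance
print(model.predict_proba([[70, 180]])) # class membership probabilities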
Extra Reading
• Text Classification in NLP using Naïve Bayes
• https://medium.com/@theflyingmantis/text-classification-in-nlp-naive-bayes-a606bf419f8c
• More on Probability
• https://www.probabilitycourse.com/chapter1/1_0_0_introduction.php
Swarna Chaudhary
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
Decision Trees
§ Decision trees are one of the most widely used and practical methods of inductive inference
§ Features
§ A method for approximating discrete-valued functions
§ Learned functions are represented as decision trees (or as if-then-else rules)
§ Interpretable: not a black box; humans can understand the decisions
§ Good at handling noisy data and missing attribute values
§ Fast and compact
§ Greedy search (a disadvantage)
§ Overfitting is avoided by pruning
q Resulting tree: [Figure: decision tree whose root tests age?, with branches youth, middle_aged and senior]
q Training tuples (age, income, student, credit_rating, buys_computer):
  youth        | high   | no  | excellent | no
  middle_aged  | high   | no  | fair      | yes
  senior       | medium | no  | fair      | yes
  senior       | low    | yes | fair      | yes
  senior       | low    | yes | excellent | no
  middle_aged  | low    | yes | excellent | yes
  youth        | medium | no  | fair      | no
  youth        | low    | yes | fair      | yes
  senior       | medium | yes | fair      | yes
  youth        | medium | yes | excellent | yes
  middle_aged  | medium | no  | excellent | yes
  middle_aged  | high   | yes | fair      | yes
  senior       | medium | no  | excellent | no
• Complexity is O(n × |D| × log|D|), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples.
Attribute Selection Measure: Information Gain (ID3/C4.5)
n Select the attribute with the highest information gain.
n This attribute minimizes the expected number of tests needed to classify a given tuple.
n Let pᵢ be the probability that a tuple in D belongs to class Cᵢ, estimated by |C_{i,D}| / |D|; m is the number of distinct classes, and v is the number of distinct values of an attribute.
n Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)
n Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σⱼ₌₁ᵛ (|Dⱼ| / |D|) × Info(Dⱼ)
n Information gained by branching on A:  Gain(A) = Info(D) − Info_A(D)
n The smaller the expected information required, the greater the purity of the partitions.
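A short sketch of these formulas on a hypothetical split; the class counts (9 "yes" / 5 "no", split into three partitions) are invented for the example:

import math

def entropy(counts):
    # Info(D) = -sum p_i log2 p_i, with p_i = count_i / total
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    # Gain(A) = Info(D) - sum(|D_j|/|D| * Info(D_j))
    total = sum(parent_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - info_a

print(entropy([9, 5]))                                 # ~0.940
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))     # ~0.247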
Splitting Based on Continuous Attributes
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the values of A in increasing order
• Typically, the midpoint between each pair of adjacent values is considered as
a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the minimum expected information requirement for A is
selected as the split-point for A
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of
tuples in D satisfying A > split-point
[Example: a small data set with a binary attribute A1 ∈ {T, F}, a continuous attribute A2, and a class label (+ / −)]
• Sort the A2 values: 3, 4, 5, 6, 7, 8; the candidate split points are the midpoints 3.5, 4.5, 5.5, 6.5, 7.5
• Calculate Info(D)
• Calculate the entropy of each split point for the "≤" and ">" partitions
• Find the Gain for each split point and pick the best one
SplitInfo_A(D) = − Σⱼ₌₁ᵛ (|Dⱼ| / |D|) × log₂(|Dⱼ| / |D|)
• GainRatio(A) = Gain(A)/SplitInfoA (D)
• Ex. Gain Ratio of “income” on the given data set.
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 − Σⱼ₌₁ⁿ pⱼ²
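A tiny sketch of the Gini index, using the same hypothetical class counts as the entropy example above:

def gini(counts):
    # gini(D) = 1 - sum p_j^2
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))    # ~0.459
print(gini([14, 0]))   # 0.0 for a pure node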
Comparing Attribute Selection Measures
• The three measures, in general, return good results but
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much smaller
than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
Inductive bias in decision tree learning
§ Inductive bias is the set of assumptions the model makes in order to learn the target function and to generalize beyond the training data.
§ What is the inductive bias of decision tree learning?
1. Shorter trees are preferred over longer trees.
   This is the bias exhibited by a simple breadth-first algorithm that generates all decision trees and selects the shortest one.
2. Trees that place high information gain attributes close to the root are preferred.
How to Address Overfitting…
• Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If the generalization error (i.e. the expected error of the model on previously unseen records) improves after trimming, replace the sub-tree by a leaf node.
– Class label of leaf node is determined from majority class of instances in
the sub-tree
Good References
Decision Tree
• https://www.youtube.com/watch?v=eKD5gxPPeY0&list=PLBv09BD7ez_4temBw7vLA19p3tdQH6FYO&index=1
Overfitting
• https://www.youtube.com/watch?time_continue=1&v=t56Nid85Thg
• https://www.youtube.com/watch?v=y6SpA2Wuyt8
Decision tree for regression
• https://www.saedsayad.com/decision_tree_reg.htm
Thank You
Text Book(s)
R1 An Introduction to Data Mining – Pang-Ning Tan, Michael Steinbach, Anuj
Karpatne, Vipin Kumar - 2005
These slides are prepared by the instructor, with grateful acknowledgement of the many contributors who made their course materials freely available online.
Topics to be covered
• Ensemble Method
• Methods for constructing an Ensemble Classifier
• Bagging, Boosting
• Random Forest
• AdaBoost
• eXtreme Gradient Boosting (XGBoost)
When does Ensemble work?
Step 1: Create multiple data sets D1, D2, …, Dt−1, Dt
Step 2: Build multiple classifiers C1, C2, …, Ct−1, Ct
Step 3: Combine the classifiers into C*
Methods for constructing an Ensemble Classifier
• Using different algorithms
• Using different parameters/hyperparameters
• Using different training sets
• By manipulating input features
• By manipulating the class labels
Max Voting
Ex: movie rating by 5 friends: 5, 4, 5, 4, 4
– The result of max voting is the most frequent rating: 4
Averaging: (5 + 4 + 5 + 4 + 4) / 5 = 4.4 is the final rating
Weighted Average
– Weights: 0.23, 0.23, 0.18, 0.18, 0.18
– The result is (5 × 0.23) + (4 × 0.23) + (5 × 0.18) + (4 × 0.18) + (4 × 0.18) = 4.41
Bootstrap Sampling
• There are several bootstrap methods; a common one is the 0.632 bootstrap:
  – A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples.
  – About 63.2% of the original tuples end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ ≈ 0.368).
• Where does the figure 63.2% come from?
  Each tuple has a probability of 1/d of being selected at each draw, so the probability of not being chosen in one draw is (1 − 1/d). We draw d times, so the probability that a tuple is never chosen is (1 − 1/d)^d. If d is large, this probability approaches e⁻¹ ≈ 0.368. Thus about 36.8% of tuples are not selected for training.
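A quick numerical sketch of this argument; the data-set size d is arbitrary:

import math, random

d = 1000
print((1 - 1/d) ** d, math.exp(-1))   # ~0.3677 vs. 0.3679

# empirical check: fraction of tuples left out of one bootstrap sample
data = list(range(d))
sample = [random.choice(data) for _ in range(d)]   # sampling with replacement
print(1 - len(set(sample)) / d)                    # ~0.368 on average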
[Figure: each bootstrap sample is used to build a base classifier, e.g. a one-level decision tree (decision stump) with branches True / False leading to leaf predictions y_left and y_right]
Bagging Example
Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9    x <= 0.35 → y = 1
y  1   1   1   1  -1  -1  -1  -1   1   1     x > 0.35 → y = -1
Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1          x <= 0.7 → y = 1
y  1   1   1  -1  -1  -1   1  1 1 1          x > 0.7 → y = 1
Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9    x <= 0.35 → y = 1
y  1   1   1  -1  -1  -1  -1  -1   1   1     x > 0.35 → y = -1
Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9    x <= 0.3 → y = 1
y  1   1   1  -1  -1  -1  -1  -1   1   1     x > 0.3 → y = -1
Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1          x <= 0.35 → y = 1
y  1   1   1  -1  -1  -1  -1  1 1 1          x > 0.35 → y = -1
Bagging Example
Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1      x <= 0.75 → y = -1
y  1  -1  -1  -1  -1  -1  -1   1   1  1      x > 0.75 → y = 1
Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1      x <= 0.75 → y = -1
y  1  -1  -1  -1  -1   1   1   1   1  1      x > 0.75 → y = 1
Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1      x <= 0.75 → y = -1
y  1   1  -1  -1  -1  -1  -1   1   1  1      x > 0.75 → y = 1
Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1        x <= 0.75 → y = -1
y  1   1  -1  -1  -1  -1  -1   1  1 1        x > 0.75 → y = 1
Bagging Example
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted Class (sign of sum)  1  1  1  -1  -1  -1  -1  1  1  1
Bagging Algorithm
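The bagging algorithm itself is short. Here is a minimal sketch that reproduces the procedure above with decision stumps; the toy 1-D data set is the one from the bagging example, and the number of rounds is a placeholder:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [0.9], [1.0]])
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

rng = np.random.default_rng(0)
stumps = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))            # bootstrap sample (with replacement)
    stumps.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))

# combine the base classifiers by majority vote: sign of the summed predictions
votes = np.sum([s.predict(X) for s in stumps], axis=0)
print(np.sign(votes))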
Boosting
• What if a data point is incorrectly predicted by the first model, and then by the next (probably by all models)? Will combining the predictions still give better results? Situations like this are handled by boosting.
• Boosting is a sequential process, where each
subsequent model attempts to correct the errors of the
previous model.
• The succeeding models are dependent on the previous
model.
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
Bagging and Boosting Algorithms
Bagging algorithms:
– Random Forest
Boosting algorithms:
– AdaBoost
– XGBoost
Random Forest
Random Vector Selection – Forest-RI (Random Input selection)
• Randomly select F input features to split at each node of the
decision tree and then fully grow the tree without pruning
• This helps reduce the bias present in the resulting tree
• The predictions are combined using a majority voting scheme
• To increase randomness, bagging can also be used to generate
bootstrap samples
• The strength and correlation of random forests may depend on
the size of F features
– If F is sufficiently small, then the trees tend to become less
correlated
• The strength of an individual tree classifier tends to improve with a larger value of F
• Optimal value of F = log₂(d) + 1 (where d is the number of input features)
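A minimal scikit-learn sketch of a random forest along these lines; the data set and hyperparameter values are placeholders:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees; each split considers a random subset of the features (max_features)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))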
AdaBoost
• Adaptive boosting or AdaBoost is one of the
simplest boosting algorithms. Usually, decision
trees are used for modelling. Multiple sequential
models are created, each correcting the errors
from the last model.
• AdaBoost assigns weights to the observations
which are incorrectly predicted and the
subsequent model works to predict these values
correctly.
Alpha (α) is the weight of the classifier: how much influence this stump will have in the final classification.
α = ½ · ln((1 − ε) / ε), where ε is the weighted error of the classifier.
• ε = 0.3 → α = ½ · ln(0.7 / 0.3) = 0.42365
• ε = 0.7 → α = ½ · ln(0.3 / 0.7) = −0.42365
• ε = 0.5 → α = ½ · ln(0.5 / 0.5) = 0
Notice three interesting observations:
1) a classifier with accuracy higher than 50% gets a positive weight (in other words, α > 0 if ε < 0.5),
2) a classifier with exactly 50% accuracy gets weight 0 and thus does not contribute to the final prediction, and
3) errors of 0.3 and 0.7 lead to classifier weights with opposite signs.
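The α computation as a two-line sketch:

import math

def classifier_weight(error):
    # alpha = 0.5 * ln((1 - error) / error)
    return 0.5 * math.log((1 - error) / error)

print([round(classifier_weight(e), 5) for e in (0.3, 0.5, 0.7)])   # [0.42365, 0.0, -0.42365]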
Training instance weight update
Adaboost Example
Ten data points are used for training; every point starts with the same initial weight.
Boosting Round 1 (B1): highlighted instance weights 0.0094, 0.0094, 0.4623; the round-1 stump predicts + + + - - - - - - -; α = 1.9459
Boosting Round 2 (B2): highlighted instance weights 0.3037, 0.0009, 0.0422; the round-2 stump predicts - - - - - - - - + +; α = 2.9323
Boosting Round 3 (B3): highlighted instance weights 0.0276, 0.1819, 0.0038; the round-3 stump predicts + + + + + + + + + +; α = 3.8744
Overall weighted prediction: + + + - - - - - + +
Good References
Ensemble methods
https://www.slideshare.net/hustwj/an-introduction-to-ensemble-methodsboosting-bagging-random-forests-and-more
Bagging and Boosting
• https://www.youtube.com/watch?time_continue=2&v=m-S9Hojj1as
XGBoost
• https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
• https://xgboost.readthedocs.io/en/latest/tutorials/model.html
• https://www.youtube.com/watch?time_continue=71&v=Vly8xGnNiWs
• https://www.slideshare.net/ShangxuanZhang/xgboost-55872323
• https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
• https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7
• https://towardsdatascience.com/machine-learning-for-diabetes-562dd7df4d42
Thank You
Support Vector Machines
BITS Pilani
Pilani Campus
Text Book(s)
T1 Christopher Bishop: Pattern Recognition and Machine Learning, Springer
International Edition
T2 Tom M. Mitchell: Machine Learning, The McGraw-Hill Companies, Inc..
[Figures: a two-class training set (points labelled +1 and −1). "How would you classify this data?" Several different separating lines are drawn; any of these would be fine, but which is best? A badly placed boundary leaves points misclassified to the +1 class; the half-space with w·x + b < 0 corresponds to the −1 side.]
Decision Boundary
The two parallel hyperplanes b_i1 and b_i2 bounding the margin can be written as:
b_i1: w·x + b = 1
b_i2: w·x + b = −1
The margin optimization problem:
Find w and b such that Φ(w) = ½||w||² is minimized, subject to, for all training pairs {(xᵢ, yᵢ)}: yᵢ(wᵀxᵢ + b) ≥ 1.
The constraints come from requiring +1·(wᵀxᵢ + b) ≥ 1 for positive examples and wᵀxᵢ + b ≤ −1 for negative examples, which together are the same as yᵢ(wᵀxᵢ + b) ≥ 1.
The Lagrangian of this problem is:
L(w, b, αᵢ) = ½||w||² − Σᵢ αᵢ [yᵢ(wᵀxᵢ + b) − 1]
Soft Margin
The resulting decision function is f(x) = Σᵢ αᵢ yᵢ xᵢᵀx + b
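A minimal scikit-learn sketch of a soft-margin linear SVM; the toy data and the value of C (which controls how much slack the soft margin allows) are placeholders:

import numpy as np
from sklearn.svm import SVC

# toy two-class data: label +1 vs. -1
X = np.array([[2.0, 2.0], [1.5, 3.0], [3.0, 3.5], [6.0, 6.0], [7.0, 5.5], [6.5, 7.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)    # soft-margin linear SVM
clf.fit(X, y)
print(clf.support_vectors_)          # the x_i with non-zero alpha_i
print(clf.predict([[4.0, 4.0]]))     # sign of f(x) = sum(alpha_i * y_i * x_i . x) + b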
Swarna Chaudhary
BITS Pilani Asst. Professor
Pilani Campus WILP Division, BITS-Pilani
•The slides presented here are obtained from the authors of the books and from various other contributors. I
hereby acknowledge all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements of the course.
• Given a data set X, find K clusters using data similarity
What clustering is not:
• Simple segmentation
  – dividing students into different registration groups alphabetically, by last name
• Results of a query
  – groupings are the result of an external specification
Example: term-frequency vectors for three documents:
            timeout  season  coach  game  score  team  ball  lost  play  win
Document 1:    3        0      5      0     2      6     0     2     0    2
Document 2:    0        7      0      2     1      0     0     3     0    0
Document 3:    0        1      0      0     1      2     2     0     3    0
https://ieeexplore.ieee.org/document/7019655
• Case Study
• https://medium.com/@msuginoo/three-different-lessons-from-three-different-clustering-analyses-data-science-capstone-5f2be29cb3b2
• An important distinction among types of clustering : hierarchical and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
• Density based
– identify distinctive groups/clusters in the data, based on the idea that a cluster in a data space
is a contiguous region of high point density, separated from other such clusters by contiguous
regions of low point density.
• Distribution Based
– The idea is that data generated from the same distribution belong to the same cluster, if several distributions are present in the dataset.
[Figure: hierarchical clustering of points p1–p4 shown as a set of nested clusters and the corresponding dendrogram]
[Figure: 6 density-based clusters]
Minkowski distance:  d(i, j) = ( Σₖ₌₁ᵖ |x_ik − x_jk|ʰ )^(1/h)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties:
– d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
– d(i, j) = d(j, i) (symmetry)
– d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
A distance that satisfies these properties is a metric
[Figure: four points x1–x4 plotted in two dimensions]
Euclidean (L2) distance matrix:
        x1     x2     x3     x4
x1      0
x2      3.61   0
x3      2.24   5.1    0
x4      4.24   1      5.39   0
Supremum (L∞) distance matrix:
        x1   x2   x3   x4
x1      0
x2      3    0
x3      2    5    0
x4      3    1    5    0
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled variables:
– replace x_if by its rank r_if ∈ {1, …, M_f}
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  z_if = (r_if − 1) / (M_f − 1)
– compute the dissimilarity using methods for interval-scaled variables
https://healthcare.ai/clustering-non-continuous-variables/
Example
Based on the information given in the table below, find most similar and most dissimilar persons
among them. Apply min-max normalization on income to obtain [0,1] range. Consider profession
and mother tongue as nominal. Consider native place as ordinal variable with ranking order of
[Village, Small Town, Suburban, Metropolitan]. Give equal weight to each attribute.
Solution
After normalizing income and quantifying native place, we get
Most similar – Balram and Bharat; Most dissimilar – Balram and Kishan
Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = (42)^0.5 = 6.481
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
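The same computation as a small NumPy sketch:

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a.b / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])
print(cosine_similarity(d1, d2))   # ~0.94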
K-Means Algorithm
• Works iteratively to find the cluster centres {μ_k} and assignments {r_nk} such that the objective J is minimized
• E-step: for all x_t ∈ X, assign x_t to the cluster whose centre is nearest
• M-step: recompute each centre μ_k as the mean of the points currently assigned to cluster k
• Repeat the two steps until the assignments no longer change
Example data (cluster the candidates on Weight and Glucose level):
Candidate | Weight | Glucose level
    1     |   72   |     185
    2     |   56   |     170
    3     |   60   |     168
    4     |   68   |     179
    5     |   72   |     182
    6     |   77   |     188
    7     |   70   |     180
    8     |   84   |     183
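A minimal scikit-learn sketch clustering this table into K = 2 groups; the choice of K is arbitrary here:

import numpy as np
from sklearn.cluster import KMeans

# (weight, glucose level) for the 8 candidates above
X = np.array([[72, 185], [56, 170], [60, 168], [68, 179],
              [72, 182], [77, 188], [70, 180], [84, 183]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment r_nk for each candidate
print(km.cluster_centers_)  # the two centres mu_k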
[Figures: K-means iterations on a 2-D data set, showing how the cluster assignments and the centroids change over successive iterations (iterations 1–6) until convergence]
https://towardsdatascience.com/gaussian-mixture-models-d13a5e915c8e
Gaussian Distribution
• The normal curve is bell-shaped and has a single peak at the exact center of the distribution.
• The arithmetic mean, median, and mode of the distribution are equal and located at
the peak.
• Half the area under the curve is above the peak, and the other half is below it.
• The normal distribution is symmetrical about its mean.
The multivariate Gaussian density:
N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
where
x: input vector
μ: D-dimensional mean vector
Σ: D × D covariance matrix
|Σ|: determinant of Σ
[Figure: Old Faithful data, time to next geyser eruption (min) vs. length of eruption (min); demo of fitting a single Gaussian to the data]
[Figure: the same data with K Gaussians fitted; a single Gaussian fits the two groups poorly, a mixture captures them]
n Our task: infer a set of K probabilistic clusters that is most likely to have generated the data set D
https://scikit-learn.org/stable/modules/mixture.html
K - the number of Gaussians (mixture components); each component contributes one of the component densities.
Parameters of the MoG model:
π : {π₁, …, π_K} (mixing coefficients)
μ : {μ₁, …, μ_K}
Σ : {Σ₁, …, Σ_K}
The parameters are fitted by maximizing the log likelihood of the data.
Data set: the data used in this module is the 'Old Faithful' data, available from https://www.kaggle.com/janithwanni/old-faithful/data and used in the textbook PRML.
● In the M step we get μ_k as
  μ_k = (1 / N_k) Σₙ γ(z_nk) x_n,   where N_k = Σₙ γ(z_nk)
● and Σ_k as
  Σ_k = (1 / N_k) Σₙ γ(z_nk) (x_n − μ_k)(x_n − μ_k)ᵀ
● and the mixing coefficients as π_k = N_k / N, where N is the number of data points.
EM Algorithm for MoG
1. Start by placing gaussians randomly.
2. Repeat until it converges.
1. E step: With the current means and variances, find the probability of each data point xi
coming from each gaussian.
2. M step: Once it computed these probability assignments it will use these numbers to re-
estimate the gaussians’ mean and variance to better fit the data points.
• For each observation i, the responsibility vector γᵢ = (γ_i1, γ_i2, …, γ_iK), where K is the total number of clusters (often referred to as the number of components).
• The cluster responsibilities for a single data point i should sum to 1:
  γ_ik = π_k N(xᵢ | μ_k, Σ_k) / Σⱼ π_j N(xᵢ | μ_j, Σ_j)
• After each iteration, check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
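A minimal scikit-learn sketch of fitting a mixture of Gaussians by EM; the data here is synthetic (two random blobs standing in for the Old Faithful data):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2)),
               rng.normal([4.3, 80.0], [0.4, 6.0], size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_)                  # pi_k
print(gmm.means_)                    # mu_k
print(gmm.predict_proba(X[:3]))      # responsibilities gamma_ik for the first points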
Thank You!
Applied Machine Learning
Swarna Chaudhary
Assistant Professor
BITS Pilani swarna.chaudhary@pilani.bits-pilani.ac.in
Pilani Campus
BITS Pilani
Pilani Campus
NeuralNetwork
These slides are prepared by the instructor, with grateful acknowledgement of
Tom Mitchell, Andrew Ng and many others who made their course materials
freely available online.
Session Content
Artificial Neural Network
Neural Networks
• Origins: Algorithms that try to mimic the brain.
• Very widely used in 80s and early 90s; popularity diminished in late 90s.
• Recent resurgence: State-of-the-art technique for many applications
• Artificial neural networks are not nearly as complex or intricate as the actual
brain structure
Perceptron Training Rule
Gradient Descent
Perceptron: Sigmoid Function
Decision Surface of Perceptron
AND Operation
Multilayer network
Output units
Hidden units
Input units
Layered feed-forward network
hθ(x)
Slide by Andrew Ng
Feed-Forward Process
• Input layer units are set by some exterior function
(think of these as sensors), which causes their output
links to be activated at the specified level
• Working forward through the network, the input
function of each unit is applied to compute the input
value
– Usually this is just the weighted sum of the activation on
the links feeding into this node
Slide by Andrew Ng
Layering Representations
[Figure: a 20 × 20 pixel image unrolled into the input units x1 … x400]
20 × 20 pixel images → d = 400 input features, 10 output classes
Forward Propagation
• a⁽¹⁾ = x
• z⁽²⁾ = Θ⁽¹⁾ a⁽¹⁾
• a⁽²⁾ = g(z⁽²⁾)   [add a₀⁽²⁾]
• z⁽³⁾ = Θ⁽²⁾ a⁽²⁾
• a⁽³⁾ = g(z⁽³⁾)   [add a₀⁽³⁾]
• z⁽⁴⁾ = Θ⁽³⁾ a⁽³⁾
• a⁽⁴⁾ = h_Θ(x) = g(z⁽⁴⁾)
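A minimal NumPy sketch of this forward pass for a 4-layer network; the layer sizes and the random weights are placeholders:

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))          # sigmoid activation

rng = np.random.default_rng(0)
# Theta(l) maps layer l (plus its bias unit) to layer l+1; sizes are arbitrary here
Theta1 = rng.normal(size=(5, 4))              # 3 inputs + bias  -> 5 hidden units
Theta2 = rng.normal(size=(5, 6))              # 5 hidden + bias  -> 5 hidden units
Theta3 = rng.normal(size=(1, 6))              # 5 hidden + bias  -> 1 output unit

x = np.array([0.5, -1.2, 3.0])
a1 = np.concatenate(([1.0], x))               # add a0(1) = 1
a2 = np.concatenate(([1.0], g(Theta1 @ a1)))  # a(2) = g(z(2)), then add a0(2)
a3 = np.concatenate(([1.0], g(Theta2 @ a2)))  # a(3) = g(z(3)), then add a0(3)
a4 = g(Theta3 @ a3)                           # a(4) = h_Theta(x)
print(a4)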
• The trick is to assess the blame for the error and divide it among the contributing weights
(Han & Kamber, 3rd Edition)
Random Initialization
Training a Neural Network
Pick a network architecture (connectivity pattern between nodes)
Dropouts
Good References for understanding
Neural Network
Andrew Ng videos on neural network
https://www.youtube.com/watch?v=EVeqrPGfuCY&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=45
Convolutional Neural Network
These slides are prepared by the instructor, with grateful acknowledgement of Andrew
Ng, Tom Mitchell, Mitesh Khapra and many others who made their course materials
freely available online.
Session Content
Convolutional Neural Network
Source: https://www.youtube.com/watch?v=40riCqvRoMs
Convolutional Neural Network
How a computer sees an image.
source: http://cs231n.github.io/classification/
Let us apply this idea to a toy example and see the results.
Input:            Kernel:
a b c d           w x
e f g h           y z
i j k l
Sliding the 2 × 2 kernel over the 3 × 4 input gives the output:
aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz
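A small NumPy sketch of this 2-D convolution (really a cross-correlation, as on the slides), applied to an arbitrary numeric input and kernel:

import numpy as np

def conv2d(image, kernel):
    # slide the kernel over the image (stride 1, no padding) and take weighted sums
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=float)
kernel = np.array([[1, 0], [0, -1]], dtype=float)   # an arbitrary 2x2 kernel
print(conv2d(image, kernel))    # 2x3 output, matching the a..l pattern above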
Example kernels:
• Convolving the image with
  1 1 1
  1 1 1
  1 1 1
  blurs the image.
• Convolving with
  1  1 1
  1 -8 1
  1  1 1
  detects edges (the centre weight acts on the pixel of interest).
What if we want the output to be of the same size as the input?
We can use something known as padding: pad the input with an appropriate number of 0s so that the kernel can also be applied at the corners.
For example, with a 3 × 3 kernel use pad P = 1: add one row and one column of 0 inputs at the top, bottom, left and right.
We now have:  W₂ = W₁ − F + 2P + 1,   H₂ = H₁ − F + 2P + 1
What does the stride S do?
It defines the intervals at which the filter is applied (here S = 2). We are essentially skipping every 2nd pixel, which again results in an output of smaller dimensions.
With stride, the output dimensions become:
W₂ = (W₁ − F + 2P) / S + 1,   H₂ = (H₁ − F + 2P) / S + 1
Finally, coming to the depth of the output: each filter gives us one 2D output, so K filters give us K such 2D outputs. We can think of the resulting output as a volume of size W₂ × H₂ × K; thus D₂ = K.
W₂ = (W₁ − F + 2P) / S + 1
H₂ = (H₁ − F + 2P) / S + 1
D₂ = K
Example
Input: a 227 × 227 × 3 image, convolved with 96 filters of size 11 × 11, stride = 4, padding = 0.
W₂ = (W₁ − F + 2P) / S + 1 = (227 − 11 + 0) / 4 + 1 = 55
H₂ = (H₁ − F + 2P) / S + 1 = (227 − 11 + 0) / 4 + 1 = 55
D₂ = K = 96
So the output volume is 55 × 55 × 96.
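The same arithmetic as a small helper function; the layer settings are those of the example above:

def conv_output_size(w1, h1, f, p, s, k):
    # W2 = (W1 - F + 2P)/S + 1, H2 likewise, depth D2 = K (number of filters)
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2, k

print(conv_output_size(w1=227, h1=227, f=11, p=0, s=4, k=96))   # (55, 55, 96)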
Applying multiple filters
[Figure: three classification pipelines, each predicting {car, bus, monument, flower}: (1) raw pixels → classifier, (2) edge-detector features → classifier, (3) SIFT/HOG features → classifier]
Input → Features → Classifier
  0 0 0 0 0
  0 1 1 1 0
  0 1 -8 1 0
  0 1 1 1 0
  0 0 0 0 0
[Figure: the handcrafted edge-detector kernel produces a feature vector; "learn these weights" refers to the classifier weights on top of it]
Instead of using handcrafted kernels such as edge detectors, can we learn meaningful kernels/filters in addition to learning the weights of the classifier?
(CS7015 Deep Learning, Lecture 11, Mitesh M. Khapra)
Even better: instead of using handcrafted kernels (such as edge detectors), can we learn multiple meaningful kernels/filters in addition to learning the weights of the classifier?
(CS7015 Deep Learning, Lecture 11, Mitesh M. Khapra)
Convolutional Neural Network
Sparse Connectivity
Only a few local neurons participate in the computation of h₁₁. As the kernel slides over the image, only a few inputs are considered at a time to compute the weighted average of the selected pixel inputs. The output h₁₁ is therefore calculated using sparse connections rather than connections to all of the inputs.
(CS7015 Deep Learning, Lecture 11, Mitesh M. Khapra)
[Figure: an example LeNet-style CNN architecture: 32 × 32 input → convolution (S = 1, F = 5, K = 6, P = 0; 150 params) → 28 × 28 × 6 → pooling (F = 2; 0 params) → 14 × 14 × 6 → convolution (S = 1, F = 5, K = 16, P = 0; 2400 params) → 10 × 10 × 16 → pooling (F = 2; 0 params) → 5 × 5 × 16 → FC1(120) with 48120 params → FC2(84) with 10164 params → Output(10) with 850 params]
Max pooling example. Input (the output of one filter):
5 8 3 4
7 6 4 5
1 3 1 2
1 4 2 1
maxpool with 2 × 2 filters, stride 2 →
8 4
7 5
With stride 1, the same 2 × 2 max pooling gives a 3 × 3 output:
8 8 5
7 6 5
4 4 2
The convolutional filters pick up different parts of the image such as edges and corners; the pooling layer then downsamples these feature maps, keeping the strongest responses.
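A tiny sketch of the max pooling shown above:

import numpy as np

def max_pool(x, size=2, stride=2):
    # take the maximum over each size x size window, moving by `stride`
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[5, 8, 3, 4], [7, 6, 4, 5], [1, 3, 1, 2], [1, 4, 2, 1]])
print(max_pool(x))             # [[8, 4], [7, 5]]
print(max_pool(x, stride=1))   # the 3x3 stride-1 output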
Training CNN
[Figure: a 3 × 3 input (b c d / e f g / h i j) convolved with a 2 × 2 kernel (w x / y z) produces a 2 × 2 output (l, m, n, o); during training, the kernel weights w, x, y, z are learned by backpropagation]
Visualization of CNN
https://www.youtube.com/watch?v=cNBBNAxC8l4
https://www.simplilearn.com/tutorials/deep-learning-tutorial/convolutional-neural-network?source=sl_frs_nav_playlist_video_clicked