
Introduction to Support Vector Machines

Thanks:
Andrew Moore, CMU
and
Martin Law, Michigan State University

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

■ For your end-term exam, the syllabus is slides 1-25.
History of SVM
■ SVM is related to statistical learning theory [3]
■ SVM was first introduced in 1992 [1]
■ SVM became popular because of its success in handwritten digit recognition
  ■ 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4
  ■ See Section 5.11 in [2] or the discussion in [3] for details
■ SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning
■ Note: the meaning of "kernel" here is different from the "kernel" function used in Parzen windows

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.

Linear Classifiers
x → f → yest
f(x,w,b) = sign(w·x − b)
  w: weight vector
  x: data vector
(Figure: one marker denotes +1, the other denotes −1.)
How would you classify this data?

Linear Classifiers
x → f → yest
f(x,w,b) = sign(w·x − b)
(Several slides repeat this setup on different example datasets, each asking: How would you classify this data?)

Linear Classifiers
x → f → yest
f(x,w,b) = sign(w·x − b)
Any of these would be fine..
..but which is best?
Classifier Margin
x → f → yest
f(x,w,b) = sign(w·x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin
x → f → yest
f(x,w,b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Linear SVM

Why Maximum Margin?
f(x,w,b) = sign(w·x − b)
Support Vectors are those datapoints that the margin pushes up against.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
How to calculate the distance from a point to a line?
(Figure: a point x at some distance from the line wx + b = 0.)
  X – vector
  W – normal vector
  b – scale value
■ http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
■ In our case, w1*x1 + w2*x2 + b = 0, thus w = (w1, w2), x = (x1, x2)

Estimate the Margin
■ What is the distance expression for a point x to the line wx + b = 0?
■ It is d(x) = |w·x + b| / ||w||: the absolute value of the discriminant function divided by the norm of the normal vector w.
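As a quick numeric check of the distance expression above, here is a minimal NumPy sketch (the line coefficients and the test point are made-up values for illustration):

```python
import numpy as np

def point_line_distance(x, w, b):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical line 3*x1 + 4*x2 - 10 = 0 and test point (2, 3)
w = np.array([3.0, 4.0])
b = -10.0
x = np.array([2.0, 3.0])
print(point_line_distance(x, w, b))  # |3*2 + 4*3 - 10| / 5 = 8/5 = 1.6
```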

Large-margin Decision Boundary
■ The decision boundary should be as far away from the data of both classes as possible
■ We should maximize the margin, m
■ Distance between the origin and the line wᵀx = −b is b/||w||
(Figure: Class 1 and Class 2 separated by a boundary with margin m.)

Finding the Decision Boundary
■ Let {x1, ..., xn} be our data set and let yi ∈ {1, −1} be the class label of xi
■ The decision boundary should classify all points correctly ⇒ yi(w·xi + b) ≥ 1 for all i
■ To see this: when yi = −1, we wish w·xi + b ≤ −1; when yi = 1, we wish w·xi + b ≥ 1. For support vectors, yi(w·xi + b) = 1.
■ The decision boundary can be found by solving the following constrained optimization problem:
  minimize (1/2)||w||²  subject to  yi(w·xi + b) ≥ 1 for all i
Next step…
■ Converting SVM to a form we can solve
  ■ Dual form
■ Allowing a few errors
  ■ Soft margin
■ Allowing nonlinear boundaries
  ■ Kernel functions

Optional: The Dual Problem (we ignore the derivation)
■ The new objective function is in terms of αi only
■ It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
■ The original problem is known as the primal problem
■ The objective function of the dual problem needs to be maximized!
■ The dual problem is therefore:
  maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi·xj)
  subject to αi ≥ 0 (the properties of αi when we introduce the Lagrange multipliers)
  and Σi αi yi = 0 (the result when we differentiate the original Lagrangian w.r.t. b)

The Dual Problem
■ This is a quadratic programming (QP) problem
  ■ A global maximum of αi can always be found
■ w can be recovered by w = Σi αi yi xi

Characteristics of the Solution
■ Many of the αi are zero (see the next page for an example)
  ■ w is a linear combination of a small number of data points
  ■ This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier
■ xi with non-zero αi are called support vectors (SV)
  ■ The decision boundary is determined only by the SV
  ■ Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σj αtj ytj xtj
■ For testing with a new data point z
  ■ Compute wᵀz + b = Σj αtj ytj (xtj·z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise
  ■ Note: w need not be formed explicitly
A Geometrical Interpretation
(Figure: Class 1 and Class 2 with the fitted boundary; α1=0.8, α2=0, α3=0, α4=0, α5=0, α6=1.4, α7=0, α8=0.6, α9=0, α10=0. Only the points with non-zero αi touch the margin.)

Allowing errors in our solutions
■ We allow "error" ξi in classification; it is based on the output of the discriminant function wᵀx + b
■ ξi approximates the number of misclassified samples
(Figure: Class 1 and Class 2 with some points inside the margin or on the wrong side.)

Soft Margin Hyperplane
■ If we minimize Σi ξi, ξi can be computed by ξi = max(0, 1 − yi(wᵀxi + b))
  ■ ξi are "slack variables" in optimization
  ■ Note that ξi = 0 if there is no error for xi
  ■ ξi is an upper bound on the number of errors
■ We want to minimize (1/2)||w||² + C Σi ξi
  ■ C: tradeoff parameter between error and margin
■ The optimization problem becomes:
  minimize (1/2)||w||² + C Σi ξi  subject to  yi(wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0

Extension to Non-linear Decision Boundary
■ So far, we have only considered large-margin classifiers with a linear decision boundary
■ How to generalize it to become nonlinear?
■ Key idea: transform xi to a higher-dimensional space to "make life easier"
  ■ Input space: the space where the points xi are located
  ■ Feature space: the space of φ(xi) after transformation
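To make the role of C concrete, here is a minimal sketch using scikit-learn's SVC (assuming scikit-learn is available; the toy data is made up). A small C tolerates more margin violations, a large C penalizes them heavily:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, nearly separable 2-D data
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5], [2.5, 2.5]])
y = np.array([-1, -1, -1, 1, 1, 1, -1])  # last point sits near the boundary

for C in (0.1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C -> fewer margin violations tolerated, tighter fit to the data
    print(C, clf.support_.tolist())  # indices of the support vectors
```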
Transforming the Data (c.f. DHS Ch. 5)
(Figure: points in the input space mapped by φ(·) into the feature space.)
  Input space → Feature space
Note: the feature space is of higher dimension than the input space in practice
■ Computation in the feature space can be costly because it is high dimensional
  ■ The feature space is typically infinite-dimensional!
■ The kernel trick comes to the rescue

The Kernel Trick
■ Recall the SVM optimization problem: maximize Σi αi − (1/2) Σi Σj αi αj yi yj (xi·xj)
■ The data points only appear as inner products
■ As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
■ Many common geometric operations (angles, distances) can be expressed by inner products
■ Define the kernel function K by K(xi, xj) = φ(xi)·φ(xj)

An Example for φ(.) and K(.,.)
■ Suppose φ(.) is given as follows (the standard degree-2 example):
  φ(x1, x2) = (1, √2·x1, √2·x2, x1², x2², √2·x1x2)
■ An inner product in the feature space is
  φ(x)·φ(y) = (1 + x1y1 + x2y2)²
■ So, if we define the kernel function as K(x, y) = (1 + x·y)², there is no need to carry out φ(.) explicitly
■ This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
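A quick NumPy sketch to verify this identity numerically (the two test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, x2**2, np.sqrt(2)*x1*x2])

def K(x, y):
    """Polynomial kernel of degree 2: the same inner product, no mapping."""
    return (1 + np.dot(x, y))**2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))  # 4.0
print(K(x, y))                 # 4.0 -- identical, without forming phi
```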
More on Kernel Functions
■ Not all similarity measures can be used as kernel functions, however
■ The kernel function needs to satisfy the Mercer condition, i.e., the function is "positive semi-definite"
■ This implies that the n by n kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite
■ This also means that the optimization problem can be solved in polynomial time!

Examples of Kernel Functions
■ Polynomial kernel with degree d: K(x, y) = (x·y + 1)^d
■ Radial basis function (RBF) kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
  ■ Closely related to radial basis function neural networks
  ■ The feature space is infinite-dimensional
■ Sigmoid with parameters κ and θ: K(x, y) = tanh(κ(x·y) + θ)
  ■ It does not satisfy the Mercer condition for all κ and θ
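A short sketch of these kernels in NumPy, plus an eigenvalue check of the Mercer property on a small made-up dataset (a valid kernel matrix should have no significantly negative eigenvalues):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Made-up data: 4 points in 2-D
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
Kmat = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# Mercer check: eigenvalues of the kernel matrix should be >= 0
print(np.linalg.eigvalsh(Kmat))  # all non-negative for the RBF kernel
```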

Non-linear SVMs: Feature spaces
■ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)

Example
■ Suppose we have 5 one-dimensional data points
  ■ x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 ⇒ y1=1, y2=1, y3=−1, y4=−1, y5=1
■ We use the polynomial kernel of degree 2
  ■ K(x, y) = (xy + 1)²
  ■ C is set to 100
■ We first find αi (i = 1, …, 5) by maximizing Σi αi − (1/2) Σi Σj αi αj yi yj (xixj + 1)², subject to 0 ≤ αi ≤ 100 and Σi αi yi = 0
Example
■ By using a QP solver, we get
  ■ α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
  ■ Note that the constraints are indeed satisfied
  ■ The support vectors are {x2=2, x4=5, x5=6}
■ The discriminant function is
  f(z) = 2.5·(1)·(2z+1)² + 7.333·(−1)·(5z+1)² + 4.833·(1)·(6z+1)² + b = 0.6667z² − 5.333z + b
■ b is recovered by solving f(2)=1, or by f(5)=−1, or by f(6)=1, as x2 and x5 lie on the line wᵀφ(x)+b=1 and x4 lies on the line wᵀφ(x)+b=−1
■ All three give b=9, so f(z) = 0.6667z² − 5.333z + 9

Example
(Figure: value of the discriminant function over the real line; the data points 1, 2, 4, 5, 6 fall into the regions class 1 | class 2 | class 1.)
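This worked example can be reproduced (up to numerical tolerance) with scikit-learn, whose polynomial kernel (gamma·⟨x,y⟩ + coef0)^degree matches (xy+1)² when gamma=1, coef0=1, degree=2. A minimal sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1], [2], [4], [5], [6]], dtype=float)
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=100).fit(X, y)
print(clf.support_vectors_.ravel())  # expect [2. 5. 6.]
print(clf.dual_coef_)                # signed alphas, i.e. y_i * alpha_i
print(clf.intercept_)                # expect a value of about 9
```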

Degree of Polynomial Features
(Figure: decision boundaries obtained with polynomial features of increasing degree, X^1 through X^6.)

Choosing the Kernel Function
■ Probably the most tricky part of using SVM.
Software
■ A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
■ Some implementations (such as LIBSVM) can handle multi-class classification
■ SVMLight is among the earliest implementations of SVM
■ Several Matlab toolboxes for SVM are also available

Summary: Steps for Classification
■ Prepare the pattern matrix
■ Select the kernel function to use
■ Select the parameters of the kernel function and the value of C
  ■ You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
■ Execute the training algorithm and obtain the αi
■ Unseen data can be classified using the αi and the support vectors
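These steps map directly onto a few lines of scikit-learn; a minimal sketch (the dataset and parameter grid are illustrative choices, not recommendations):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step 1: prepare the pattern matrix (made-up data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# Steps 2-3: pick a kernel, then search over its parameter and C,
# using cross-validation in place of a separate validation set
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)  # step 4: training

# Step 5: classify unseen data with the fitted model
print(search.best_params_, search.predict([[0.2, 0.1], [2.0, 2.0]]))
```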

Conclusion
■ SVM is a useful alternative to neural networks
■ Two key concepts of SVM: maximize the margin and the kernel trick
■ Many SVM implementations are available on the web for you to try on your data set!

Resources
■ http://www.kernel-machines.org/
■ http://www.support-vector.net/
■ http://www.support-vector.net/icml-tutorial.pdf
■ http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
■ http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Appendix: Distance from a point to a line
(Figure: points P1, P2 define the line; P3 is the external point and P is the foot of the perpendicular.)
■ Equation for the line: let u be a variable; then any point on the line can be described as:
  ■ P = P1 + u(P2 − P1), i.e., x = x1 + u(x2 − x1), y = y1 + u(y2 − y1), where P1=(x1,y1), P2=(x2,y2), P3=(x3,y3)
■ Let the intersection point be P; the vector (P2 − P1) is orthogonal to (P3 − P):
  ■ That is, (P3 − P) · (P2 − P1) = 0

Distance and margin
■ Then, u can be determined by:
  u = ((x3 − x1)(x2 − x1) + (y3 − y1)(y2 − y1)) / ||P2 − P1||²
■ The distance between the point P3 and the line is therefore the distance between P = (x, y) above and P3
■ Thus, d = |P3 − P|
Linear Regression: Part 2

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for the Introduction to Machine Learning course.

Introduction to Linear Regression (cont.)
• How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line.
• The total error between the data points and the line is obtained by squaring each distance and then summing the squared values.
• The regression equation is designed to produce the minimum sum of squared errors.

Regression
So far, we've been interested in learning P(Y|X) where Y has discrete values (called 'classification')
What if Y is continuous? (called 'regression')
• predict weight from gender, height, age, …
• predict Google stock price today from Google, Yahoo, MSFT prices yesterday
• predict each pixel intensity in a robot's current camera image, from the previous image and previous action

About Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is an approach used for estimating the parameters of a model when you have observed data as input.
• Input: Observed data
• Output: Estimated parameters of the model
The big picture of MLE
• Take a look at the data in detail
• Try to figure out / determine a model of how the data could have been created
• Assign values to the parameters of the model such that the likelihood of the parameters is maximized w.r.t. the data

Now let's delve a bit deeper into what constitutes a model

What is a model?
• A model is a way to represent your beliefs, assumptions, etc. about how some event or process works. It is a formal way to represent how you view that event or process.
• Are models a perfect representation of the real world? Generally not
• Models are usually approximate because real-world scenarios are hard to model perfectly
• You can have simple models to represent an event or a process, OR you can have models of much higher complexity to represent the exact same event or process

An example
• Recall the example of the model where hours studied is the predictor variable and marks obtained is the outcome variable
• Is the model always going to be accurate? Accurate here means:
  – Will the model always be able to correctly predict the student's marks obtained when given the student's number of hours studied as input?
• Now suppose you add the student's years of experience in the topics associated with the course to make your model more complex, in the hope of getting better prediction accuracy
  – Even then, the model will not be able to always predict perfectly

Delving deeper into the example
• Why can't models predict accurately 100% of the time?
• Could be that other additional predictor variables should have been used to improve the model's prediction accuracy
• Possible that you don't have data regarding those other predictor variables, or you are simply unaware of those predictor variables
• Your training data is not infinite in practice
  – There is only so much prediction accuracy that your training data can support
  – Concept of unavoidable errors due to the training data
Delving deeper into the example
• In our example about the student's marks obtained (given the number of hours studied), let us try to understand some of the beliefs and assumptions in our model:
• Is number of hours studied easy to practically quantify? A student could be studying with full focus for 2 hours, while another student could be just staring at the computer screen for the same 2 hours.
  – How to quantify and model focus?
• Was the student mentally or physically healthy during the time of the exam? How to quantify the extent of sickness? Can the extent of sickness always be quantified?

Delving deeper into the example
• Was the student at peak performance form while giving the exam?
  – the student could have been studying till very late at night for an exam starting at 8.30 am the next day
• Did the student arrive 30 minutes late for a 2-hour exam?
• And so many other factors that you can think of….

Bottomline: There are many simplifying assumptions that we make while modeling, because trying to capture all aspects of most real-world scenarios is way too complex in terms of modeling

Example of a die being rolled
• Think about how you would model the roll of a die
• You have to first model the die itself
  – The die has 6 faces, plus color, weight, thickness, length, breadth, width, etc.
• When the die is rolled, there will be some forces on the die based on the laws of physics
  – You would also have to model those forces
  – Remember that the outcome of the rolled die (which of the numbers 1-6 comes up) depends upon how you roll the die, in what direction, with how much force, etc.
• All of this keeps increasing the complexity of the model

Example of a die being rolled
• Instead of going for very complex models, you could adopt a much more simplified model
• The simple model could entail that the outcome of the die being rolled is drawn from some probability distribution
• Note that it is not necessarily true that a more complex model will be more accurate than a simple model
• Observe that fitting the simpler model is obviously going to be far easier than fitting the more complex model

Bottomline: What matters is the usefulness of the model. It is not so critical how simple or how complex the model is; what is most important is how useful the model is
A Linear Model
• Consider the corresponding plot. (Source: towardsdatascience)
• Our goal is to find the best-fit line to the datapoints, i.e., to predict future values of y given x.
• Start with the equation of a line: y = θ₀ + θ₁x

Contd: A Linear Model
• Notice that the data points do not all fit on a single line, i.e., there are deviations. These are errors that arise out of a distribution (suppose it is Gaussian). (Source: towardsdatascience)
• This indicates that the error arises from a Gaussian distribution with mean 0 and variance σ². So, our resulting model is given by the following equation:
  y = θ₀ + θ₁x + ε, where ε ~ N(0, σ²)

Contd: A Linear Model
• In general, we might have any number of parameters, so let's refer to the entire collection of them as θ (theta)
• Knowing what the values of θ must be involves computing the likelihood function of θ, given the data observed. For a continuous distribution with probability density function f(x|θ), we have the likelihood function L(θ) = f(x₁, …, xₙ | θ). (Source: towardsdatascience)
• Given that the data is independent and identically distributed (iid), the likelihood function can be re-written as
  L(θ) = ∏ᵢ f(xᵢ|θ)
• In other words, the likelihood is the product of the probability density for each data point. We need to maximize this likelihood.

Contd: A Linear Model
• Maximising the likelihood is equivalent to maximising the log-likelihood, given below:
  log L(θ) = Σᵢ log f(xᵢ|θ)
• The Maximum Likelihood Estimator takes the Gaussian distribution and slides it over all possible means. The mean which causes the model to fit the data the best is chosen. (Source: towardsdatascience)
• As shown in the picture here, the log-likelihood is maximised at a mean of 2.
• This is the true mean of the distribution!
• The solution provided by the MLE gives the best-fit model for our data.
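A minimal NumPy sketch of this "slide the Gaussian over all possible means" picture, using made-up data whose true mean is 2:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)  # true mean = 2

def log_likelihood(mu, x, sigma=1.0):
    """Gaussian log-likelihood of mean mu given data x."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

candidates = np.linspace(0, 4, 401)
best = candidates[np.argmax([log_likelihood(m, data) for m in candidates])]
print(best, data.mean())  # both close to 2: MLE of the mean = sample mean
```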
Regression
Wish to learn f: X → Y, where Y is real-valued, given {<x1,y1>, …, <xn,yn>}

Approach:
1. Choose some parameterized form for P(Y|X; θ) (θ is the vector of parameters)
2. Derive the learning algorithm as the MLE or MAP estimate for θ

Choose parameterized form for P(Y|X; θ)
Assume Y is some deterministic f(X), plus random noise:
  Y = f(X) + ε, where ε ~ N(0, σ²)
Therefore Y is a random variable that follows the distribution P(Y|X=x) = N(f(x), σ²), and the expected value of y for any given x is f(x)

Example:
Consider developing a model to forecast a company's stock price. You noticed that the stock price rose significantly throughout the previous night. There could be a variety of causes for it. Maximum Likelihood Estimation seeks to determine the probability of the most likely cause. This idea is applied, among other things, to satellite imaging, MRIs, and economics.

MLE can be defined as a method for estimating population parameters (such as the mean and variance for Normal, rate (lambda) for Poisson, etc.) from sample data such that the probability (likelihood) of obtaining the observed data is maximized.

Consider Linear Regression
E.g., assume f(x) is a linear function of x.
Notation: to make our parameters explicit, let's write f(x; W) = w₀ + Σᵢ wᵢxᵢ

Training Linear Regression
How can we learn W from the training data?
Learn the Maximum Conditional Likelihood Estimate!
Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate:
  W ← arg max_W ∏ₗ P(yₗ | xₗ; W)
where
  P(y | x; W) = N(f(x; W), σ²)

Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate:
  W ← arg max_W Σₗ ln P(yₗ | xₗ; W)
so:
  W ← arg min_W Σₗ (yₗ − f(xₗ; W))²
where f(x; W) = w₀ + Σᵢ wᵢxᵢ, i.e., maximizing the conditional likelihood under Gaussian noise is the same as minimizing the sum of squared errors
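Since the conditional MLE reduces to least squares, it can be computed directly; a minimal NumPy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)  # true W = (3, 2) plus noise

# Design matrix with a column of ones for the intercept w0
A = np.column_stack([np.ones_like(x), x])
W, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes sum of squared errors
print(W)  # approximately [3., 2.]
```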

Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate
Can we derive a gradient descent rule for training? Yes: since the objective Σₗ (yₗ − f(xₗ; W))² is differentiable in W, each weight can be updated repeatedly in the direction of the negative gradient, e.g. wᵢ ← wᵢ + η Σₗ (yₗ − f(xₗ; W)) xₗᵢ

MAP (Maximum A Posteriori) instead of MLE?
Like MLE, MAP estimation is a method of estimating the parameters of statistical models.
• MLE is powerful when you have enough data. However, it doesn't work well when the observed data size is small. For example, if Liverpool only had 2 matches and they won both, then the value of θ estimated by MLE is 2/2 = 1. That estimate says Liverpool wins 100% of the time, which is unrealistic. MAP can help deal with this issue.
• Assume that we have prior knowledge that Liverpool's winning percentage for the past few seasons was around 50%. Then, even without the data from this season, we already have some idea of the potential value of θ. Based (only) on the prior knowledge, the value of θ is most likely to be 0.5, and less likely to be 0 or 1.
• In other words, the probability of θ = 0.5 is higher than that of θ = 0 or 1. We call this the prior probability P(θ), and we can visualise it.
• Then, having observed the data D (30 wins out of 38 matches) from this season, we can update this P(θ), which was based only on the prior knowledge. The updated probability of θ given D is expressed as P(θ|D) and is called the posterior probability.

Regression – key points
Under general assumptions:
1. MLE corresponds to minimizing the sum of squared prediction errors (SSE)
2. The MAP estimate minimizes SSE plus the sum of squared weights
3. Again, learning is an optimization problem once we choose our objective function
   • maximize data likelihood
   • maximize posterior probability of W
4. Again, we can use gradient descent as a general learning algorithm
   • as long as our objective function is differentiable with respect to W
   • though we might learn local optima
5. Almost nothing we said here required that f(x) be linear in x

Reference: TowardsDataScience
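The Liverpool example can be made concrete with a Beta prior on θ; a minimal sketch (the prior strength Beta(25, 25), centered at 0.5, is an assumed choice for illustration):

```python
# Prior: Beta(a, b) centered at 0.5, encoding "around 50% in past seasons"
a, b = 25, 25
wins, matches = 30, 38  # observed data D from this season

theta_mle = wins / matches                          # ignores the prior
theta_map = (wins + a - 1) / (matches + a + b - 2)  # mode of the Beta posterior
print(theta_mle, theta_map)  # ~0.789 vs ~0.628: MAP is pulled toward 0.5
```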

Partial Correlation
• A partial correlation measures the relationship between two variables (X and Y) while eliminating the influence of a third variable (Z).
• Partial correlations are used to reveal the real, underlying relationship between two variables when researchers suspect that the apparent relation may be distorted by a third variable.

Partial Correlation
• For example, there probably is no underlying relationship between weight and mathematics skill for elementary school children.
• However, both of these variables are positively related to age: older children weigh more and, because they have spent more years in school, have higher mathematics skills.

Partial Correlation
• As a result, weight and mathematics skill will show a positive correlation for a sample of children that includes several different ages.
• A partial correlation between weight and mathematics skill, holding age constant, would eliminate the influence of age and show the true correlation, which is near zero.

Error Minimization
• "Minimization" and "Maximization" are both forms of "Optimization"
• Given a function f(x) [in our case, the mean-squared-error function obtained while fitting a regression line to points], how do we minimize or maximize it?
  – Least Squares Optimization
  – Lagrange Multipliers (for continuous and partially-differentiable functions)
  – Convex optimization techniques
  – Greedy algorithms like Gradient Descent
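A minimal NumPy sketch of the weight/maths-skill example: one way to compute a partial correlation is to correlate the residuals of X and Y after regressing each on Z (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(6, 12, 300)                # Z: the confounder
weight = 5 * age + rng.normal(0, 4, 300)     # X: driven by age
maths = 10 * age + rng.normal(0, 8, 300)     # Y: driven by age

def residuals(v, z):
    """Residuals of v after a least-squares fit on z (plus intercept)."""
    A = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

raw_r = np.corrcoef(weight, maths)[0, 1]
partial_r = np.corrcoef(residuals(weight, age), residuals(maths, age))[0, 1]
print(raw_r, partial_r)  # raw correlation is large; partial is near zero
```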

Gradient descent
• So far, you have understood that linear regression is used for finding the line of best fit
• Then you can use this regression line to predict a value of y given some value of x
• Now, for finding the line of best fit, you can use gradient descent
• Let's see how you can use gradient descent for computing the line of best fit

How does gradient descent compute the regression line?
• Start with a random line and then keep changing the slope and intercept iteratively until you get to the line of best fit
• Observe that for this, you need to know when you have already reached the line of best fit
• This is based on the sum of squares of the errors

Bottomline: The gradient descent algorithm essentially arrives at the least-squares regression line by doing multiple iterations to minimize the sum of squared errors
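A minimal sketch of that loop in NumPy (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 100)
y = 1.5 + 0.8 * x + rng.normal(0, 0.3, 100)  # true intercept 1.5, slope 0.8

m, c = 0.0, 0.0          # start with a "random" line
lr = 0.01                # learning rate
for _ in range(5000):
    err = y - (m * x + c)        # residuals of the current line
    m += lr * np.mean(err * x)   # gradient step on the slope
    c += lr * np.mean(err)       # gradient step on the intercept
print(m, c)  # converges to the least-squares slope and intercept
```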
NEURAL NETWORKS

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for the Introduction to Machine Learning course.

CONTENTS
• Revisiting Biology
• Intelligence: Biological vs Artificial
• History of AI
• Why Deep Learning
• Neural roots of Deep Learning
• Artificial Neural Network
• Revisiting and understanding Neurons
• Neurons in Artificial Neural Networks
• Activation Functions in Artificial Neural Networks
• An illustrative example

REVISITING BIOLOGY
The human brain is the most complex organ in the human body. It helps us think, understand, and make decisions. The secret behind its power is the neuron.

REVISITING BIOLOGY
Biological Neuron
• A human brain has billions of neurons.
• Neurons are interconnected nerve cells in the human brain that are involved in processing and transmitting chemical and electrical signals.
• Dendrites are branches that receive information from other neurons.
• The cell nucleus, or soma, processes the information received from the dendrites.
• The axon is a cable that is used by neurons to send information.
• A synapse is the connection between an axon and another neuron's dendrites.
INTELLIGENCE: BIOLOGICAL VS ARTIFICIAL
Definition #1: Artificial intelligence (AI) is intelligence – perceiving, synthesizing, and inferring information – demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks are speech recognition, computer vision, planning, natural language processing, robotics, etc.
Definition #2: The theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
AI applications include advanced web search engines (Google, Bing), recommendation systems (YouTube, Amazon, Netflix), understanding human speech (Siri, Alexa), self-driving cars (Tesla), automated decision-making, competing at the highest level in strategic game systems (chess, Go), and many more!

HISTORY OF AI
Maturation of Artificial Intelligence (1943-1952)
• 1943: The first work which is now recognized as AI was done by Warren McCulloch and Walter Pitts in 1943. They published "A Logical Calculus of the Ideas Immanent in Nervous Activity" (1943), laying foundations for artificial neural networks, and proposed a model of artificial neurons.
• 1949: Donald Hebb demonstrated an updating rule for modifying the connection strength between neurons. His rule is now called Hebbian learning.
• 1950: Alan Turing, an English mathematician who pioneered machine learning, published "Computing Machinery and Intelligence" in 1950, in which he proposed a test as a measure of machine intelligence. This test (now called the Turing Test) can check a machine's ability to exhibit intelligent behavior equivalent to human intelligence.

HISTORY OF AI
The birth of Artificial Intelligence (1952-1956)
• 1955: Allen Newell, J.C. Shaw and Herbert A. Simon created the "first AI program", named "Logic Theorist". It proved 38 of the first 52 theorems in Bertrand Russell and Alfred Whitehead's Principia Mathematica, and found new and more elegant proofs for some.
• 1956: The Dartmouth College summer AI conference was organized by John McCarthy, Marvin Minsky, Nathan Rochester of IBM and Claude Shannon. McCarthy coined the term "artificial intelligence" for the conference.
• 1959: The "General Problem Solver (GPS)" was created by Newell, Shaw and Simon while at CMU. John McCarthy and Marvin Minsky founded the MIT AI Lab.

HISTORY OF AI
Golden years: Early enthusiasm & optimism (1956-1972)
• 1956-1960: High-level computer languages such as FORTRAN, LISP, and COBOL were invented in this decade, and the excitement and optimism for AI were very high at that time.
• 1965: Researchers emphasized developing algorithms that could solve mathematical problems. Joseph Weizenbaum created the first natural language processing computer program in 1966, named ELIZA, at the MIT AI lab.
• 1972: The first intelligent anthropomorphic robot, named WABOT-1, was built in Japan. It consisted of a limb-control system, a vision system and a conversation system. The WABOT-1 was able to communicate with a person in Japanese and to measure distances and directions to objects using external receptors, artificial ears and eyes, and an artificial mouth.
HISTORY OF AI
The first AI winter (1972-1980)
• 1969: Marvin Minsky and Seymour Papert published 'Perceptrons', demonstrating previously unrecognized limits of the feed-forward two-layered structure.
• 1970: Seppo Linnainmaa published the reverse mode of automatic differentiation, which later became known as backpropagation and is heavily used to train artificial neural networks.
• late 1970s: The period between 1974 and 1980 was the first AI winter. "AI winter" refers to a time period in which computer scientists dealt with a severe shortage of both confidence and funding from governments for AI research. During AI winters, public interest in artificial intelligence declined significantly.

HISTORY OF AI
AI re-emerges (1980-1987)
• 1980: After the AI winter, AI re-emerged publicly with the development and marketing of Lisp machines and the offering of "Expert System" shells and commercial applications. Expert systems were programmed to emulate the decision-making ability of a human expert. The first national conference of the American Association of Artificial Intelligence was held at Stanford University in the same year.
• mid-1980s: Neural networks became widely used with the backpropagation algorithm, which had been published by Seppo Linnainmaa in 1970 and was applied to neural networks by Paul Werbos.
• 1983: DARPA again began to fund AI research through the Strategic Computing Initiative.
• 1987: Marvin Minsky published The Society of Mind, a theoretical description of the mind as a collection of cooperating agents.

HISTORY OF AI
The second AI winter (1987-1993)
• By the early 1990s: The earliest successful expert systems, such as XCON, proved too expensive to maintain, and the few remaining expert-system shell companies were forced to downsize and search for new markets.
• 1987: Expert systems were dismissed as "clever programming", and DARPA changed its strategy to focus its funding only on those technologies which showed the most promise, believing strongly that AI was not "the next wave".
• late 1980s: Investors and governments stopped funding AI research once again, owing to the high cost and the shortfall against the results promised optimistically in earlier years. Expectations had run much higher than what was actually possible.

HISTORY OF AI
AI re-emerges again with intelligent agents (1993-2009)
• early 1990s: TD-Gammon, a backgammon program written by Gerry Tesauro, demonstrated that reinforcement learning is powerful enough to create a championship-level game-playing program, competing favorably with world-class players.
• 1997: IBM's Deep Blue defeated the world chess champion, Garry Kasparov, and became the first computer to beat a world chess champion.
• late 1990s: Web crawlers and other AI-based information extraction programs became essential in the widespread use of the World Wide Web.
HISTORY OF AI
AI re-emerges again with intelligent agents (1993-2009)
• 2002: AI entered the home in the form of Roomba, a vacuum cleaner which could autonomously vacuum the floor while navigating and avoiding obstacles.
• 2005: Recommendation technology based on tracking web activity or media usage brought AI to marketing.
• 2006: AI found a strong footing in the business world when companies like Facebook, Twitter, and Netflix started significantly leveraging AI.
• 2009: Google built an autonomous car.

HISTORY OF AI
Deep learning, big data and artificial general intelligence (2011-present)
• 2011: IBM's Watson won Jeopardy!, a quiz show in which it had to solve complex questions as well as riddles. Watson proved that it could understand natural language and solve tricky questions quickly.
• 2011-2014: Apple's Siri (2011), Google's Google Now (2012) and Microsoft's Cortana (2014) are smartphone apps that use natural language to answer questions, make recommendations and perform actions.
• 2014: Chatbot "Eugene Goostman" won a competition in the famous "Turing test".

HISTORY OF AI
Deep learning, big data and artificial general intelligence (2011-present)
• 2015: Google DeepMind's AlphaGo defeated three-time European Go champion and professional Fan Hui by 5 games to 0.
• 2018: Alibaba's language processing AI outscored top humans at a Stanford University reading and comprehension test, scoring 82.44 against 82.304 on a set of 100,000 questions.
• 2018: Google announced "Duplex", a service that allows an AI assistant to book appointments over the phone on the user's behalf in a manner indiscernible from that of a human.
• early 2020: Microsoft introduced its Turing Natural Language Generation (T-NLG), which was then the "largest language model ever published at 17 billion parameters."
• mid-2020: OpenAI's GPT-3, a state-of-the-art autoregressive language model that uses deep learning, produces a variety of computer code, poetry and other language output that is exceptionally similar to, and almost indistinguishable from, text written by humans. Its capacity was ten times greater than that of the T-NLG.
WHY DEEP LEARNING?
Neural Roots of Deep Learning
• Deep learning was conceptualized by Geoffrey Hinton in the 1980s; he is widely considered to be the founding father of the field of deep learning.
• More specifically, Hinton developed the concept of a "neural network", a deep learning algorithm structured similarly to the organization of neurons in the brain. Hinton took this approach because the human brain is arguably the most powerful computational engine known today.
• The structure that Hinton created was called an artificial neural network (or artificial neural net for short).

ARTIFICIAL NEURAL NETWORK
(Figure: a layered artificial neural network.)

ARTIFICIAL NEURAL NETWORK
The question that Geoffrey Hinton asked during his seminal research in neural networks was whether we could build computer algorithms that behave similarly to neurons in the brain.
• The hope was that by mimicking the brain's structure, we might capture some of its capability.
• In order to do this, researchers studied the way that neurons behaved in the brain.
• One important observation was that a neuron by itself is useless.
• Instead, you require networks of neurons to generate any meaningful functionality.

ARTIFICIAL NEURAL NETWORK
Here's a brief description of how they function:
• Artificial neural networks are composed of layers of nodes.
• Each node is designed to behave similarly to a neuron in the brain.
• The first layer of a neural net is called the input layer, followed by hidden layers, then finally the output layer.
• Each node in the neural net performs some sort of calculation, which is passed on to other nodes deeper in the neural net.
REVISITING AND UNDERSTANDING NEURONS
• Neurons are a critical component of any deep learning model, biological or artificial!
• Neurons have quite an interesting structure, in that groups of neurons work together inside the human brain to perform the functionality that we require in our day-to-day lives.
• Neurons function by receiving and sending signals. More specifically, a neuron's dendrites receive signals and pass along those signals through the axon. The dendrites of one neuron are connected to the axon of another neuron. These connections are called synapses, a concept that has been generalized to the field of deep learning.

ARTIFICIAL NEURAL NETWORK
Neural nets have pushed the field of deep learning very far very fast. However, it took decades for machine learning (and deep learning) to gain prominence.
• This was because any mature deep learning model requires an abundance of two resources:
  • Data
  • Computing power
• At the time of deep learning's conceptual birth, researchers did not have access to enough of either data or computing power to build and train meaningful deep learning models.
• This has changed over time, which has led to deep learning's prominence today.

NEURONS IN ARTIFICIAL NEURAL NETWORKS
• Neurons in deep learning models are nodes through which data and computations flow.
• The working of computational neurons is as follows:
  • They receive one or more input signals. These input signals can come either from the raw data set or from neurons positioned at a previous layer of the neural net.
  • They perform some calculations.
  • They send some output signals to neurons deeper in the neural net through a synapse.

NEURONS IN ARTIFICIAL NEURAL NETWORKS
• Neurons in a deep learning model are capable of having synapses that connect to more than one neuron in the preceding layer.
• Each synapse has an associated weight, which impacts the preceding neuron's importance in the overall neural network.
• Weights are a very important topic in the field of deep learning, because adjusting a model's weights is the primary way through which deep learning models are trained.
NEURONS IN ARTIFICIAL NEURAL NETWORKS
• Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each signal multiplied by its corresponding weight and passes the sum on to an activation function.
• The activation function calculates the output value for the neuron. This output value is then passed on to the next layer of the neural network through another synapse.

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
• Activation functions allow neurons in a neural network to communicate with each other through their synapses.
• We have established that neurons receive input signals from the preceding layer of a neural network. A weighted sum of these signals is fed into the neuron's activation function. Then the activation function's output is passed on to the next layer of the network.
• There are four main types of activation functions:
  1. Threshold functions
  2. Sigmoid functions
  3. Rectifier functions, or ReLUs
  4. Hyperbolic tangent functions

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
1. Threshold Functions, or unit step functions
• Threshold functions compute a different output signal depending on whether or not their input lies above or below a certain threshold.
• The input value to an activation function is the weighted sum of the input values from the preceding layer in the neural network.
• Threshold functions are similar to Boolean variables in computer programming. Their computed value is either 1 (similar to True) or 0 (equivalent to False).

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
2. Sigmoid Functions
• Mathematical definition of the sigmoid function: σ(x) = 1 / (1 + e^(−x))
• The sigmoid function is also used in logistic regression, one of the core machine learning techniques used to solve classification problems.
• The sigmoid function can accept any value, but always computes a value between 0 and 1.
• One benefit of the sigmoid function over the threshold function is that its curve is smooth. This means that it is possible to calculate derivatives at any point along the curve.
ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
3. Rectifier Functions, or Rectified Linear Units (ReLUs)
• Mathematical definition of ReLU: f(x) = max(0, x)
• The rectifier function does not have the same smoothness property as the sigmoid function from the last section. However, it is still very popular in the field of deep learning.
• The rectifier function is defined as follows:
  • If the input value is less than 0, then the function outputs 0.
  • If not, the function outputs its input value.

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
4. Hyperbolic Tangent Functions
• Mathematical definition of the hyperbolic tangent function: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Out of the 4 discussed activation functions, the hyperbolic tangent function is the only one based on a trigonometric identity.
• The hyperbolic tangent function is similar in appearance to the sigmoid function, but its output values are all shifted downwards (its range is (−1, 1) rather than (0, 1)).
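To tie the last few slides together, here is a minimal NumPy sketch of a single artificial neuron: a weighted sum of inputs fed through each of the four activation functions (the weights, bias, and inputs are made-up values):

```python
import numpy as np

def threshold(z):
    return np.where(z >= 0, 1.0, 0.0)  # unit step

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # smooth, output in (0, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for negative inputs

# np.tanh is the hyperbolic tangent, output in (-1, 1)

inputs = np.array([0.5, -1.0, 2.0])    # signals from the preceding layer
weights = np.array([0.8, 0.2, -0.5])   # one weight per synapse
bias = 0.1

z = np.dot(weights, inputs) + bias     # the weighted sum
for f in (threshold, sigmoid, relu, np.tanh):
    print(f.__name__, f(z))            # the neuron's output under each
```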

REFERENCES
• https://machinelearningmastery.com/what-is-deep-learning/
• https://www.javatpoint.com/history-of-artificial-intelligence
• https://en.wikipedia.org/wiki/Timeline_of_artificial_intelligence
• https://en.wikipedia.org/wiki/AI_winter
• https://www.humanoid.waseda.ac.jp/booklet/kato_2.html
• https://www.freecodecamp.org/news/deep-learning-neural-networks-explained-in-plain-english/
Multinomial and Logistic Regression

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for the Introduction to Machine Learning course.

Recap:
Simple regression considers the relation between a single explanatory variable and a response variable.
Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y.
The intent is to look at the independent effect of each variable while "adjusting out" the influence of potential confounders.

Regression Modeling
• A simple regression model (one independent variable) fits a regression line in 2-dimensional space
• A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
Simple Regression Model
Regression coefficients are estimated by minimizing ∑residuals² (i.e., the sum of the squared residuals) to derive the model ŷ = a + bx.
The standard error of the regression (sY|x) is based on the squared residuals.

Multiple Regression Model
Again, estimates for the multiple slope coefficients are derived by minimizing ∑residuals², to derive the multiple regression model ŷ = a + b₁x₁ + b₂x₂ + ….
Again, the standard error of the regression is based on the ∑residuals².

Multiple Regression Model
• The intercept α predicts where the regression plane crosses the Y axis
• The slope for variable X₁ (β₁) predicts the change in Y per unit X₁, holding X₂ constant
• The slope for variable X₂ (β₂) predicts the change in Y per unit X₂, holding X₁ constant

Multiple Regression Model
A multiple regression model with k independent variables fits a regression "surface" in k + 1 dimensional space (cannot be visualized)
Categorical Explanatory Variables in Regression Models
• Categorical independent variables can be incorporated into a regression model by converting them into 0/1 ("dummy") variables
• For binary variables, code dummies "0" for "no" and "1" for "yes"

Introduction to Logistic Regression
Slides are adapted from SPIA, University of Georgia

• Categorical dependent variables need more complicated models: logistic regression (for binary categorical dependent variables).
• Why can't we just use OLS?
• How does logistic regression work?
• How do we compare logistic models?

But, first an experiment
• I'm going to show you a short video of some students playing basketball.
• There are 6 people; 3 dressed in black shirts and 3 in white shirts.
• I'd like you to count the number of times that the white-shirted students pass the ball to each other, in two different ways.
  • An 'aerial' pass (without touching the ground on the way).
  • A 'bounce' pass (touching the ground on the way).
• Thus after the video has ended you should have two totals, one for aerial passes by white shirts and one for bounce passes by white shirts.
"Gorillas in our midst" (1)
http://viscog.beckman.uiuc.edu/grafs/demos/15.html
"Gorillas in our midst" (2)
• This is a real bit of psychology research by Simons and Chabris (1999) at Harvard.
• They find that the harder the task, the more likely it is that people don't spot the gorilla.
• Only 50% of their subjects spotted the gorilla…
• How is this relevant to us?
• Imagine we wanted to predict whether someone saw the gorilla or not; this is a binary dependent variable.
• We might have independent variables like concentration span, difficulty of the task, time of day and so on.

Predicting gorilla sightings (1)
• Our dependent variable is just like the variables we were using earlier.
• But let's say with this example we want to predict whether the gorilla will be spotted by a person with a particular set of characteristics.
• In this case, let's say with a particular concentration span (measured on a 1-100 scale).
• Since our independent variable is interval-level data we can't use cross-tabs.
Predicting gorilla sightings (2)
• So, what we want to know is the probability that any person will be a gorilla spotter or not, for any value of concentration span.
• Remember, if we know this, we will know the proportion of people that will spot the gorilla at each level of concentration span, on average.
• We could use simple linear regression (SLR) here, with the dependent variable coded as 0 (no gorilla spot) or 1 (gorilla spotted).
• Well, why can't we…?

What's wrong with SLR?
• We want to predict a probability; this can only vary between zero and 1.
• But our SLR may predict values that are below zero or above 1…
• Let's quickly fit a SLR to our example.
  • Our sample here is the 108 subjects that Simons and Chabris used. I've added some extra data on their concentration spans.
• A scatter-plot isn't all that much use here.
Scatter-plot (1)
(Figure: scatter of spotting vs concentration span. More low-concentration people spot the gorilla; more high-concentration people DON'T spot the gorilla. Could add a linear regression line.)

Scatter-plot (2)
(Figure: fitted line over the same scatter. People with CS below 21 have > 1 predicted probability of being a spotter; people with CS above 92 have < 0 predicted probability of being a spotter.)
Other problems
• If you think about it, that's just one problem.
• For linear regression we assumed that the population distribution was normally distributed around the mean, for each value of the X variable.
• That's not going to be the case if we've got a binary response. The distribution around the mean is going to be quite different.
• Looking at our data, when CS=50 we'll have about 60% of cases scoring 1 (being spotters) and 40% of cases scoring 0 (not being spotters). That doesn't sound much like a normal distribution…

What to do (1)
• Instead of linear OLS regression we use something called logistic regression.
• This is a very widely used method, and it's important to understand how it works.
  • Probably more widely used (especially if we include variants) than linear OLS, as interesting dependent variables are often categorical.
  • A randomly selected academic (by the name of Tilley) has used logistic regression in 55.5% of all his sociology and politics articles.
What to do? (2)
• Somehow we need to dump the linear OLS bit of our model for this binary categorical variable.
• So what we want to do is assume a different kind of relationship between the probability of seeing gorillas (or whatever) and concentration span.
• Maybe something like this…

What to do? (3)
(Figure: an S-shaped curve; a more realistic representation of the relationship between the probability of gorilla spotting and CS.)
The logistic transformation (1)
• This type of relationship is described by a special formula.
• Remember, if the relationship was linear then the equation would just be: ŷ = α + βX
• But the relationship on the graph is actually described by:
  log(p / (1 − p)) = α + βX

The logistic transformation (2)
• p / (1 − p) is just the odds. As the probability p increases (from zero to 1), the odds increase from 0 to infinity. The log of the odds then increases from −infinity to +infinity.
• So if β is 'large' then as X increases, the log of the odds will increase steeply. The steepness of the curve will therefore increase as β gets bigger.
Fitting this model (1)
• So that's what we want to do, but how do we do it?
• With SLR we tried to minimize the squares of the residuals, to get the best-fitting line.
• This doesn't really make sense here (remember, the errors won't be normally distributed as there are only two values).
• We use something called maximum likelihood to estimate what β and α are.

Fitting this model (2)
• Maximum likelihood is an iterative process that estimates the best-fitting equation.
• The iterative bit just means that we try lots of models until we get to a situation where tweaking the equation any further doesn't improve the fit.
• The maximum likelihood bit is kind of complicated, although the underlying assumptions are simple to understand, and very intuitive. The basic idea is that we find the coefficient values that make the observed data most likely.
Back to the gorillas
• So pressing the appropriate buttons in STATA or SPSS allows us to fit a logistic regression to our gorilla-spotting data.
• The numbers that we get out are not immediately interpretable, however.
• Remember, for OLS linear regression a change of one unit on the X variable meant that the Y variable would increase by the coefficient for X.
• That's not what the coefficient associated with X in our logistic regression means.
• But what does the -0.07 actually mean?

Gorilla results

Variable      | Coefficient value | Standard error | p-value
Concentration | -0.07             | 0.01           | 0.00
Intercept     | 3.69              | 0.72           | 0.00

• This is how logistic regression results are often reported in articles.
• It's clear that concentration span has a negative (and statistically significant) effect on gorilla sightings.
Interpreting the coefficients (1)
• What we need to do is think about the equation again, and what an increase in X means.
• So an increase in X of 1 unit will decrease our log(odds) by 0.07.
• If we antilog both sides then we can see how the odds change…

Interpreting the coefficients (2)
• Antilog both sides and we get the odds on the LH side:
  odds = p̂ / (1 − p̂) = e^(3.69 − 0.07X)
  (Remember, the 'hat' sign means the predicted value.)
• If we enter a value of X we can work out what the predicted odds will be.
• Thus, at X=30, the odds of spotting the gorilla (as opposed to not spotting the gorilla) are nearly 5. For every 5 spotters there should be one non-spotter.
Interpreting the coefficients (3)
• We can also think about what happens to the odds when we increase X by a certain amount.
• Another way of writing e^(a+bX) is e^a·(e^b)^X. That means that a one-unit increase in X multiplies the odds by e^b (as it's to the power of 1).
• In our case, therefore, a one-unit increase in X multiplies the odds by e^(−0.07), or 0.93.
  • When X increases from 30 to 31, the odds are 4.90×0.93, or 4.56.
  • When X increases from 30 to 40, the odds are 4.90×(0.93)^10, or 2.37.

Yet more coefficient interpretation (1)
• The other way of thinking about things is in terms of probabilities.
• If we rearrange the 'antilogged' equation then we can work out what the probability (for a particular value of X) would be:
  p̂ = e^(α + βX) / (1 + e^(α + βX))
• The probability of a person with CS=30 spotting the gorilla is thus 83%.
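These calculations are easy to reproduce; a minimal NumPy sketch using the fitted coefficients from the table above:

```python
import numpy as np

alpha, beta = 3.69, -0.07  # intercept and concentration coefficient

def odds(x):
    """Predicted odds of spotting the gorilla at concentration span x."""
    return np.exp(alpha + beta * x)

def prob(x):
    """Predicted probability: odds rearranged to e^(a+bX)/(1+e^(a+bX))."""
    return odds(x) / (1 + odds(x))

print(odds(30))             # ~4.90
print(odds(31) / odds(30))  # ~0.93 = e^beta, the one-unit odds multiplier
print(prob(30))             # ~0.83
```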

Yet more coefficient interpretation (2)
(Figure: predicted probability of spotting the gorilla against CS. When CS=30, the probability of spotting the gorilla is 83%.)
• Perhaps the most useful thing to do is to plot the predicted probabilities (it is easiest to do this in STATA).

Adding extra variables (1)
• Including other interval-level independent variables and categorical independent variables is as easy as in multiple linear regression.
• The logic is the same as before: we are examining the effects of one independent variable when the other is held constant.
• The important bit is to understand what the coefficients for the extra independent variables actually mean.
• Since this is less clear-cut than in multiple linear regression, we need to be careful in interpretation.
Adding extra variables (2)
• Let's say we think that people that own monkeys are more adept at spotting the gorilla.
• We could include a dummy variable for monkey owner (1 if you are a monkey owner, and 0 if not).

Variable      | Coefficient value | Standard error | p-value
Concentration | -0.09             | 0.02           | 0.000
Monkey owner  | 3.15              | 0.96           | 0.001
Intercept     | 4.01              | 0.83           | 0.000

Interpreting extra variables (1)
• So owning a monkey (holding concentration span constant) multiplies the odds by e^3.15, or 23.3 times.
• The odds of monkey owners spotting the gorilla are 23 times the odds of non-monkey owners spotting the gorilla.
• The probability of a person with a CS of 50 that owns a monkey being a gorilla spotter is 93%, and the probability of a person with a CS of 50 that does not own a monkey being a gorilla spotter is only 40%.
• With such a simple model we can still display it graphically.
• A linear model would have two parallel lines for each type of person (monkey or none) by CS. Our lines are NOT parallel.
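Extending the earlier sketch to the two-variable model reproduces these numbers:

```python
import numpy as np

def prob(cs, monkey):
    """Predicted P(spot) from the two-variable model above."""
    z = 4.01 - 0.09 * cs + 3.15 * monkey  # monkey is the 0/1 dummy
    return np.exp(z) / (1 + np.exp(z))

print(np.exp(3.15))        # ~23.3: odds multiplier for owning a monkey
print(prob(50, monkey=1))  # ~0.93
print(prob(50, monkey=0))  # ~0.40
```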

Interpreting extra variables (2)
(Figure: predicted probability curves against CS for monkey owners and non-monkey owners; the two curves are not parallel.)
• Generally, we want to present information from a logistic regression in the form of probabilities, as these are easiest to understand.
• If we have lots of variables, then we normally set them to particular values and then examine how the predicted probability of the dependent outcome varies.
• e.g. if I had more independent variables (age, sex, eyesight), I would produce the earlier graph for men of average age with average eyesight not owning a monkey. Then I could see how concentration alone affected the predicted probability of a gorilla sighting.
Interactive monkeys (1)
• We can also include interaction effects. Again, though, we need to be careful interpreting these.

Variable             | Coefficient value | Standard error | p-value
Concentration        | -0.12             | 0.02           | 0.00
Monkey owner         | -1.92             | 2.00           | 0.34
Monkey*concentration | 0.08              | 0.04           | 0.02
Intercept            | 5.07              | 1.14           | 0.00

Interactive monkeys (2)
(Figure: predicted probability curves against CS for monkey owners and non-monkey owners under the interaction model.)
Comparing models (1)
• One of the most important differences between logistic regression and linear regression is in how we compare models.
• Remember, for linear regression we looked at how the adjusted R² changed. If there was a significant increase when we added another variable (or interaction) then we thought the model had improved.
• For logistic regression there are a variety of ways of assessing model improvement.

Comparing models (2)
• The best way of comparing models is to use something called the likelihood-ratio test.
• When we were using OLS regression, we were trying to minimize the sum of squares; for logistic regression we are trying to maximize something called the likelihood function (normally called L).
• To see whether our model has improved by adding a variable (or interaction, or squared term), we can compare the maximum of the likelihood function for each model (just like we compared the R² before for OLS regressions).

Comparing models (3)
• In fact, just to complicate matters, we actually compare the maximised values of −2·log L:
  test statistic = (−2 log L for the first model) − (−2 log L for the second model)
• By logging the Ls and multiplying them by −2, this statistic conveniently ends up with a chi-square distribution. This means we test whether there is a statistically significant improvement with reference to the χ² distribution.
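A minimal sketch of the likelihood-ratio test mechanics with SciPy (the two log-likelihood values are made up for illustration; the degrees of freedom equal the number of added parameters):

```python
from scipy.stats import chi2

# Hypothetical maximised log-likelihoods of two nested models
logL_small = -70.2   # e.g. concentration only
logL_big = -64.5     # e.g. concentration + monkey owner

lr_stat = -2 * logL_small - (-2 * logL_big)  # difference of -2*log L values
df = 1                                       # one extra parameter added
p_value = chi2.sf(lr_stat, df)               # chi-square survival function
print(lr_stat, p_value)  # a small p-value -> the bigger model fits better
```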
Discriminative vs Generative Models

Acknowledgments: The information in the slides in this presentation has been obtained from a wide variety of publicly available Internet sources such as:
https://www.baeldung.com/cs/ml-generative-vs-discriminative
Some of the slides have been modified.

Introduction
• ML models can be:
  – discriminative
  – generative

Discriminative models
• The core goal of discriminative models is to separate between classes
• They try to learn boundaries between classes
• They cannot generate new data
• They use conditional probability
  – hence, discriminative models are sometimes referred to as conditional models
  – they also use maximum likelihood
• Discriminative models are used mostly for supervised ML: classification, regression, etc.
• Types of discriminative models: logistic regression, support vector machine, decision tree, random forest

Generative Models
• Generative models are essentially statistical models
• They can generate new data instances
• They attempt to model how a dataset is populated
• They are used in unsupervised ML
• Types of generative models include the Bayesian network, the hidden Markov model, and the generative adversarial network (GAN)
Discriminative vs Generative models
• In the data space, discriminative models aim at drawing class boundaries
  – Trains the model by learning parameters for maximizing conditional probability
• In the data space, generative models aim at modeling/predicting how the data is distributed throughout the space
  – Learns parameters by maximizing joint probability

An example
• Say you are given a document and you need to find out the language in which the document is written
  – Is it written in English, Spanish, Japanese etc.?
• Option 1: Generative model
  – You can learn each language one-by-one
  – Then you will know which language the document is written in
  – This is the generative model's way of doing things
• Option 2: Discriminative model
  – Learn distinguishing patterns of languages, but no need to learn the whole language as such
  – This means you know enough to discriminate/distinguish among the languages, but don't know any of the languages in depth
  – This is the discriminative model's way of doing things

Source: https://www.baeldung.com/cs/ml-generative-vs-discriminative
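A minimal scikit-learn sketch (ours, not from the slides) contrasting the two model families: Gaussian Naive Bayes is generative (it models P(x|y)P(y)), logistic regression is discriminative (it models P(y|x) directly); the synthetic data is an assumption for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

gen = GaussianNB().fit(X, y)           # learns per-class feature distributions
disc = LogisticRegression().fit(X, y)  # learns only the class boundary

# Both can classify...
print(gen.score(X, y), disc.score(X, y))

# ...but only the generative model gives us enough to sample new data:
# draw a synthetic class-0 point from the fitted per-class Gaussians.
rng = np.random.default_rng(0)
new_point = rng.normal(gen.theta_[0], np.sqrt(gen.var_[0]))
print(new_point)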

Some insights
• Discriminative models try to learn decision boundaries to distinguish cases
• Generative models try to learn the main properties of the data
• The generative approach focuses on modeling, whereas the discriminative approach focuses on a solution. So, we can use generative algorithms to generate new data points. Discriminative algorithms don't serve that purpose.
• Generative algorithms converge faster than discriminative algorithms
• Generative models converge to a higher asymptotic error than discriminative models
  – More training samples means the error rate decreases for discriminative models
• Generative models have more "explainability" than discriminative models → better for applications in anomaly detection, NLP etc.
• Generative models can deduce from a context because they are more comprehensive

Source: https://www.baeldung.com/cs/ml-generative-vs-discriminative

Some applications of generative models
• Data augmentation
• Restoration of old images, automatic image coloring
• Re-constructing missing regions in an image

Source: https://www.baeldung.com/cs/applications-of-generative-models
Feature Selection

Acknowledgments: The slides in this presentation have been obtained from a wide variety of
publicly available Internet sources such as:
https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/. Some of
the slides have been modified.

Why is feature selection important?
• Real-world datasets can have lots of features
  – Features can also sometimes be referred to as attributes, dimensions etc.
• When trying to design an ML model for doing predictions in real-world
  scenarios, some features will be relevant; other features will be irrelevant
• Think of it like filtering out the noise from the data
• Remember that the features you consider irrelevant to your analysis or
  ML model may be considered relevant by others

Why is feature selection important?
• Different stakeholders will need to use different features for building ML models for their respective purposes
• What happens when you use irrelevant features in your dataset?
  – More noise in the dataset w.r.t. the ML model that you are trying to build
  – Unnecessarily increases the complexity of the model
  – Decreases the generalization capability of the model
  – Increases the bias in the model
Bottomline: Why deal with irrelevant features?

Goal of feature selection
• To find the best set of features, which are most relevant towards building your ML model
• Some caveats need to be added here
  • The "best" depends upon domain knowledge, and the decision concerning which feature set constitutes the "best" may differ among experts
  • You need to have adequate data corresponding to each of the features
    – No point saying that a feature is relevant if you do not have enough data concerning the feature
    – Sometimes, the data may exist for a given feature, but you may not have access to it due to reasons such as cost, privacy etc.
Widely used feature selection techniques in ML
• Filter methods
• Wrapper methods
• Embedded methods

Filter methods
• Selects features from the dataset without using any ML algorithm
• Typically used for pre-processing datasets
• Very good for getting rid of duplicate and irrelevant features
• Usually fast and inexpensive computationally
• Each feature is evaluated and selected on an individual basis, but does not account for dependencies among features
• When multiple features together can result in better performance of the ML model, filter methods are generally not good to use

Techniques used in filter methods
• Information Gain
  – How much information is being given by the feature towards building a "good" ML model
  – Measures reduction in the entropy values
  – Compute information gain for each attribute w.r.t. the target values for feature selection
• Chi-square test
  – Evaluates the relationship between categorical variables

Techniques used in filter methods
• Fisher's Score
  – Selects each feature based on the Fisher scores
  – Features are selected on an independent basis
  – Usually gives a suboptimal set of features
  – Higher Fisher score means a better selected feature
• Correlation Coefficient
  – Pearson's Correlation Coefficient
  – How associated are two continuous variables
Techniques used in filter methods
• Variance Threshold
  – Core idea: Higher variance features usually contain more information
  – Sets a threshold for variance and gets rid of features that do not satisfy this variance threshold
• Mean Absolute Difference (MAD)
  – Similar to the variance threshold method
  – Computes the mean absolute difference from the mean value

Techniques used in filter methods
• Dispersion Ratio
  – Computes the ratio of the Arithmetic mean to Geometric mean for a specific feature
  – Higher value of dispersion ratio means that the feature is more relevant from a feature selection perspective
• Mutual Dependence
  – Computes if two variables are mutually dependent
  – If a particular feature is present/absent, how much information does that feature contribute to the prediction that you are trying to do in your ML model
• Relief
  – Measures the quality of attributes by means of random sampling of instances from the dataset
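A hedged scikit-learn sketch (ours) of two of the filter techniques listed above, a variance threshold and a mutual-information score; the dataset and threshold value are assumptions for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop features whose variance is below a cutoff.
vt = VarianceThreshold(threshold=0.2)
X_reduced = vt.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Mutual dependence: score each feature by its mutual information with
# the target, evaluated one feature at a time (no feature interactions).
scores = mutual_info_classif(X, y, random_state=0)
print(scores)  # higher score = individually more informative feature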

Wrapper methods
• These are basically greedy algorithms
• Train by using a subset of features
• Training happens iteratively
• Depending upon the results/outcome of the training, perform addition or removal of features
• Stopping criteria could be based on threshold performance criteria or when the model has a desired number of features
• Wrapper methods generally select a more optimal set of features than filter methods; this means generally better accuracy in case of wrapper methods
• The downside is that wrapper methods involve more computational cost than filter methods

Techniques used in wrapper methods
• Forward selection
  – Start with an empty set of features and keep adding the feature that provides the best improvement to your ML model
  – This is an iterative approach
  – Stop when the addition of a new feature fails to improve the performance of your ML model
• Backward elimination
  – Also an iterative approach
  – Start with all features and keep getting rid of the least significant feature after each iteration
  – Stop when no improvement happens in the performance of your model after removal of the feature
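A hedged sketch (ours) of forward selection using scikit-learn's SequentialFeatureSelector; passing direction="backward" gives backward elimination instead. The dataset and wrapped model are assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedily add features one per iteration, keeping whichever addition
# most improves the cross-validated accuracy of the wrapped model.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features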
Techniques used in wrapper methods
• Bi-directional elimination
  – Uses forward selection and backward elimination in tandem
• Exhaustive selection
  – Basically, a brute force approach
  – Creates all possible subsets of features
  – Builds a learning algorithm for each subset
  – Selects the subset that has the best performance of the ML model
• Recursive elimination
  – Basically, a greedy optimization method
  – Selects features by recursively considering smaller and smaller sets of features
  – Attributes some importance value to each feature
  – Removes least important features from the dataset until you get the target number of features

Embedded methods
• The feature selection algorithm is embedded in the learning algorithm
• Basically, has its own built-in feature selection methods
• Considers a combination of features
• Tries to address the deficiencies in filter and wrapper methods, while attempting to use their advantages
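A hedged sketch (ours) of recursive elimination via scikit-learn's RFE, which repeatedly fits the model, ranks features by importance, and drops the weakest; the estimator and dataset are illustration choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Eliminate one feature per round until two remain.
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=2, step=1)
rfe.fit(X, y)
print(rfe.ranking_)  # 1 = kept; higher numbers were eliminated earlier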

Techniques used in embedded methods
• Regularization
  • Adds a penalty to the ML model's parameters to avoid over-fitting
  • Uses Lasso (L1 regularization) and Elastic nets (L1 and L2 regularization)
  • The penalty is applied over the coefficients
    – This means that some coefficients can become zero
    – The features with zero coefficient can be removed
• Tree-based methods
  – Examples: Random Forest, Gradient Boosting
  – Uses feature importance towards selecting features

Feature selection is different from PCA
• Observe that in feature selection, you are essentially deciding which features to select for arriving at better prediction results of the ML model
• Similar to feature selection, in PCA, you also reduce the number of dimensions (features), but you do so by converting to a lower dimensional space and NOT by getting rid of the features themselves
• This is a very important conceptual point
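A hedged sketch (ours) of embedded selection via L1 regularization: Lasso drives some coefficients exactly to zero, and SelectFromModel keeps only the features with nonzero coefficients. The synthetic regression data and alpha value are assumptions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ != 0), "features kept by the L1 penalty")

selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)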

PAC LEARNING AND VC DIMENSION

Acknowledgments: Information in this presentation has been obtained from a
wide variety of publicly available Internet sources. Slides created by Mr. Saransh
Gupta for academic use only, as part of course material for Introduction to
Machine Learning course.

CONTENTS
• Computational Learning Theory
• PAC Learning
• VC Dimension
• VC Dimension: Learners and Complexity
• VC Dimension continued
• VC Dimension: Shattering
• Using VC Dimension
• Training Error vs Prediction/Test Error
• No-free-lunch Theorem
• Occam's Razor (Principle of Parsimony)

COMPUTATIONAL LEARNING THEORY
Computational learning theory, or statistical learning theory, refers to
mathematical frameworks for quantifying learning tasks and algorithms.
Statistical learning theory can be extended to computational learning theory
by taking the computational complexity of the learner into account.

COMPUTATIONAL LEARNING THEORY
• It is a field of study concerned with the use of formal mathematical methods
  applied to learning systems.
• It seeks to use the tools of theoretical computer science to quantify learning
  problems. This includes characterizing the difficulty of learning specific tasks.
• The division between 'Learning tasks' and 'Learning algorithms' is arbitrary,
  and in practice there is a lot of overlap between the two fields:
  • Computational Learning Theory (CoLT): Formal study of learning tasks.
  • Statistical Learning Theory (SLT): Formal study of learning algorithms.
• CoLT and SLT are largely synonymous in modern usage.
COMPUTATIONAL LEARNING THEORY
• Hypothesis Space: It is the set of all the possible legal hypotheses.
  This is the set from which the ML algorithm would determine the best
  possible (only one) hypothesis which would best describe the target
  function or the outputs.
• Hypothesis: A hypothesis is a function that best describes the target
  in supervised machine learning. The hypothesis that an algorithm would
  come up with depends upon the data and the restrictions and bias that
  we have imposed on the data.
Every learning algorithm requires assumptions about the hypothesis space.
• Eg: "My hypothesis space is
  • …linear"
  • …decision trees with 5 nodes"
  • …a three layer neural network with rectifier hidden units"

COMPUTATIONAL LEARNING THEORY
The main unanswered question in learning is this: How can we be sure that
our learning algorithm has produced a hypothesis that will predict the
correct value for previously unseen inputs?

COMPUTATIONAL LEARNING THEORY COMPUTATIONAL LEARNING THEORY

The focus in computational learning theory is typically on Yet…


supervised learning tasks.
as a machine learning practitioner, it is important to to know
Formal analysis of real problems and real algorithms is very about computational learning theory and some of the main
challenging. As such, it is common to reduce the complexity of areas of investigation.
the analysis by focusing on binary classification tasks and even
simple binary rule-based systems.
The CoLT field provides an immensely useful grounding and a
mathematically rigorous framework for understanding what we
Consequently, the practical application of the theorems may be are trying to achieve when we are trying to fit models on data,
limited or challenging to interpret for real problems and as it may provide both valuable insight and the limitations of
algorithms. the methods.

7 8
COMPUTATIONAL LEARNING THEORY
Questions explored in computational learning theory might include:
• How do we know a model has a good approximation for the target function?
• What hypothesis space should be used?
• How do we know if we have a local or globally good solution?
• How do we avoid overfitting?
• How many data examples are needed?
• … and so on.

The main unanswered question in learning is this: How can we be sure that
our learning algorithm has produced a hypothesis that will predict the
correct value for previously unseen inputs?
— Page 713, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

COMPUTATIONAL LEARNING THEORY
Among many subfields of CoLT, two of the most widely discussed areas of
study are PAC Learning and VC Dimension.
Crudely speaking,
PAC (Probably Approximately Correct) Learning is the theory of machine
learning problems
whereas
VC (Vapnik–Chervonenkis) Dimension is the theory of machine learning
algorithms.

PAC LEARNING
• Probably approximately correct learning, or PAC learning, refers to a
  theoretical machine learning framework developed by Leslie Valiant.
• PAC learning seeks to quantify the difficulty of a learning task and
  might be considered one of the premier sub-fields of computational
  learning theory.
• Consider that in supervised learning, we are trying to approximate an
  unknown underlying mapping function from inputs to outputs. This
  function that maps the input to its correct labels is called the
  target function.
• We don't know what this mapping function (target function) looks like,
  but we suspect it exists in nature, and we have examples of data
  possibly produced by the function.

PAC LEARNING
Think of the target function as a hidden random hypothesis that maps all
the input to its correct labels and could potentially exist. It may have
been decided by 'Nature' but we have no idea about it. Our best bet is to
come up with a model/hypothesis that can perhaps approximate this
Nature-given target function.
PAC learning is concerned with how much computational effort is required
to find a hypothesis (fit model) that is a close match for the possibly
underlying and unknown Nature-given target function.
PAC is a framework under which numerous results on learning theory were
proved, and has the following set of assumptions:
• the training and testing sets follow the same distribution
• the training examples are drawn independently
PAC LEARNING
Consider the following scenario, continuing it, and arriving at an
important result:
[Worked derivation with equations shown on the original slides]
PAC LEARNING
What does the PAC Learning theory say (in simple words)?
• The idea is that a bad hypothesis will be found out based on the
  predictions it makes on new data, i.e. based on its generalization error.
• A hypothesis that gets most or a large number of predictions correct,
  i.e. has a small generalization error, is probably a good approximation
  for the target function.
• This probabilistic language gives the theorem its name: "probably
  approximately correct." That is, a hypothesis seeks to "approximate" a
  target function and is "probably" good if it has a low generalization
  error.

PAC LEARNING
The underlying principle is that any hypothesis that is seriously wrong
will almost certainly be "found out" with high probability after a small
number of examples, because it will make an incorrect prediction. Thus,
any hypothesis that is consistent with a sufficiently large set of
training examples is unlikely to be seriously wrong: that is, it must be
probably approximately correct.
— Page 714, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

PAC LEARNING
What does the PAC Learning theory say (in simple words)?
• A PAC learning algorithm refers to an algorithm that returns a
  hypothesis that is Probably Approximately Correct (PAC).
• Using formal methods, a minimum generalization error can be specified
  for a supervised learning task. The theorem can then be used to estimate
  the expected number of samples from the problem domain that would be
  required to determine whether a hypothesis was PAC or not. That is, it
  provides a way to estimate the number of samples required to find a PAC
  hypothesis.
• Additionally, a hypothesis space (machine learning algorithm) is
  efficient under the PAC framework if an algorithm can find a PAC
  hypothesis (fit model) in polynomial time.

PAC LEARNING
The goal of the PAC framework is to understand how large a data set needs
to be in order to give good generalization. It also gives bounds for the
computational cost of learning …
— Page 344, Pattern Recognition and Machine Learning, 2006.

A hypothesis space is said to be efficiently PAC-learnable if there is a
polynomial time algorithm that can identify a function that is PAC.
— Page 210, Machine Learning: A Probabilistic Perspective, 2012.
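The sample-size estimate itself appears only as an equation image on the original slides. A standard form of this bound, for a finite hypothesis space H in the realizable case (a textbook result, stated here as the likely content rather than recovered from the slide), is:

m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

With at least m independent examples, any hypothesis consistent with all of them has true error at most ε with probability at least 1 − δ.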
PAC LEARNING
ε-Representative and Uniform Convergence
The formal definitions appear as equations on the original slides. In
standard form: a training sample S is ε-representative (w.r.t. a
hypothesis class H and distribution D) if for every h in H,
|L_S(h) − L_D(h)| ≤ ε; and H has the uniform convergence property if
there is a sample size m(ε, δ) such that a sample of at least that size
is ε-representative with probability at least 1 − δ.
PAC LEARNING
Estimating the number of samples required to find a PAC hypothesis: this
is the kind of bound given above.

VC DIMENSION
• Vapnik–Chervonenkis theory, or VC theory for short, refers to a
  theoretical machine learning framework developed by Vladimir Vapnik and
  Alexey Chervonenkis.
• VC theory seeks to quantify the capability of a learning algorithm and
  might be considered another one of the premier sub-fields of statistical
  learning theory.
• VC theory is comprised of many elements, most notably the VC dimension.
• The VC dimension quantifies the complexity of a hypothesis space, e.g.
  the models that could be fit given a representation and learning
  algorithm.
VC DIMENSION
Why do we need VC Dimension?
• One way to consider the complexity of a hypothesis space (space of
  models that could be fit) is based on the number of distinct hypotheses
  it contains and perhaps how the space might be navigated. The VC
  dimension is a clever approach that instead measures the number of
  examples from the target problem that can be discriminated by
  hypotheses in the space.
• The VC dimension estimates the capability or capacity of a
  classification machine learning algorithm for a specific dataset
  (number and dimensionality of examples).
• The VC dimension is used as part of the PAC learning framework.

VC DIMENSION
A key quantity in PAC learning is the Vapnik-Chervonenkis dimension, or
VC dimension, which provides a measure of the complexity of a space of
functions, and which allows the PAC framework to be extended to spaces
containing an infinite number of functions.
— Page 344, Pattern Recognition and Machine Learning, 2006.

VC DIMENSION: LEARNERS AND COMPLEXITY
• We've seen many versions of underfit/overfit trade-off
  – Complexity of the learner
  – "Representational Power"
• Different learners have different power
[Figure, repeated across several slides: measured feature values
x1, x2, …, xn fed with parameters into a classifier that outputs a
predicted class, alongside example decision boundaries of increasing
complexity]

VC DIMENSION: LEARNERS AND COMPLEXITY
• Usual trade-off:
  – More power = represent more complex systems, might overfit
  – Less power = won't overfit, but may not find "best" learner
• How can we quantify representational power?
  – Not easily…
  – One solution is VC (Vapnik-Chervonenkis) dimension
(c) Alexander Ihler

VC DIMENSION
What does VC Dimension mean though?
• Formally, the VC dimension is the largest number of examples from the
  training dataset that the space of hypotheses from the algorithm can
  "shatter."
• Shatter or a shattered set, in the case of a dataset, means points in
  the feature space can be selected or separated from each other using
  hypotheses in the space such that the labels of examples in the
  separate groups are correct (whatever they happen to be).
• Whether a group of points can be shattered by an algorithm depends on
  the hypothesis space and the number of points.
• For example, a line (hypothesis space) can be used to shatter three
  points, but not four points.

VC DIMENSION CONTINUED
What does VC Dimension mean though?
• Any placement of three points on a 2d plane with class labels 0 or 1
  can be "correctly" split by label with a line, e.g. shattered. But
  there exist placements of four points on the plane with binary class
  labels that cannot be correctly split by label with a line, e.g.
  cannot be shattered. Instead, another "algorithm" must be used, such
  as ovals.
VC DIMENSION CONTINUED
What does VC Dimension mean though?
• Therefore, the VC dimension of a machine learning algorithm is the
  largest number of data points in a dataset that a specific
  configuration of the algorithm (hyperparameters) or specific fit model
  can shatter.
• A classifier that predicts the same value in all cases will have a VC
  dimension of 0, no points.
• A large VC dimension indicates that an algorithm is very flexible,
  although the flexibility may come at the cost of additional risk of
  overfitting.

VC DIMENSION: SHATTERING
• We say a classifier f(x) can shatter points x(1)…x(h) iff for all
  labelings y(1)…y(h), f(x) can achieve zero error on training data
  (x(1),y(1)), (x(2),y(2)), …, (x(h),y(h))
  (i.e., there exists some θ that gets zero error)
• Can f(x;θ) = sign(θ0 + θ1x1 + θ2x2) shatter these points?
  [Two example points shown on the slide]

VC DIMENSION: SHATTERING
• Can f(x;θ) = sign(θ0 + θ1x1 + θ2x2) shatter two points?
• Yes: there are 4 possible training sets (labelings), and a line can
  realize each one.

VC DIMENSION: SHATTERING
• Can f(x;θ) = sign(x1² + x2² - θ) shatter these points?
• Nope! (The zero-centered circle cannot realize every labeling of the
  two points shown; the previous example was general.)

VC DIMENSION
• The VC dimension H is defined as the maximum number of points h that
  can be arranged so that f(x) can shatter them
• Example: what's the VC dimension of the (zero-centered) circle,
  f(x;θ) = sign(x1² + x2² - θ)?
• VCdim = 1: can arrange one point, cannot arrange two

VC DIMENSION: SHATTERING
• Example: what's the VC dimension of the two-dimensional line,
  f(x;θ) = sign(θ1x1 + θ2x2 + θ0)?
• VC dim >= 3? Yes: three points (in general position) can be shattered
  by a line.
• VC dim >= 4? No… Any line through these points must split one pair
  (by crossing one of the lines).
• Turns out: for a general linear classifier (perceptron) in d
  dimensions with a constant term: VC dim = d + 1
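A hedged sketch (ours, not from the slides) that checks shattering by brute force: for a small point set, try every labeling and see whether a linear classifier can fit it with zero training error; the use of weakly regularized logistic regression as the "line" is an illustration choice:

import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def can_shatter(points):
    """True if a linear classifier achieves zero error on every labeling."""
    n = len(points)
    for labels in itertools.product([0, 1], repeat=n):
        if len(set(labels)) < 2:   # trivial labelings need no boundary
            continue
        clf = LogisticRegression(C=1e6, max_iter=10000).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # XOR arrangement
print(can_shatter(three))  # True: a line shatters 3 points in general position
print(can_shatter(four))   # False: no line realizes the XOR labeling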

VC DIMENSION
• VC dimension measures the "power" of the learner
• Does *not* necessarily equal the # of parameters!
• Number of parameters does not necessarily equal complexity
  – Can define a classifier with a lot of parameters but not much power (how?)
  – Can define a classifier with one parameter but lots of power (how?)
• It is a lot of work to determine what the VC dimension of various
  learners is…

USING VC DIMENSION
• Use validation / cross-validation to select complexity
[Table sketch: models f1…f6 with # params, train error, and
cross-validation error]
(c) Alexander Ihler
USING VC DIMENSION
• Use validation / cross-validation to select complexity
• Use VC dimension based bound on test error similarly
• "Structural Risk Minimization" (SRM)
[Table sketch: models f1…f6 with # params, train error, VC term, and VC
test bound]

USING VC DIMENSION
• Other alternatives
  – Probabilistic models: likelihood under model (rather than
    classification error)
  – AIC (Akaike Information Criterion)
    • Log-likelihood of training data - # of parameters
  – BIC (Bayesian Information Criterion)
    • Log-likelihood of training data - (# of parameters)*log(m)
  – Similar to VC dimension: performance + penalty
  – BIC conservative; SRM very conservative
  – Also, "true Bayesian" methods (take prob. learning…)
(c) Alexander Ihler


TRAINING ERROR VS PREDICTION/TEST ERROR
[Figure: MSEtrain vs MSEtest as model capacity grows]

NO-FREE-LUNCH THEOREM
• The theorem states that all optimization algorithms perform equally
  well when their performance is averaged across all possible problems.
• It implies that there is no single best optimization algorithm, i.e.
  no classifier is inherently better than any other: you need to make
  assumptions to generalize.
• Because of the close relationship between optimization, search, and
  machine learning, it also implies that there is no single best machine
  learning algorithm for predictive modeling problems such as
  classification and regression.
OCCAM'S RAZOR (PRINCIPLE OF PARSIMONY)
• This philosophical idea in the context of ML suggests that, all else
  being equal, a simpler model is preferred over a more complex model.
• It does not mean that simpler models are universally better than
  complex models, but rather that a model must be complex enough to learn
  the patterns in a dataset and to avoid underfitting, but simple enough
  to avoid overfitting.
• When choosing between two models, we can only say a simpler model is
  better if its generalization error is equal to or less than that of
  the more complex model.

REFERENCES
• https://machinelearningmastery.com/introduction-to-computational-learning-theory
• Artificial Intelligence: A Modern Approach, 3rd edition, 2009 (Book)
• The Nature of Statistical Learning Theory, 1999 (Book)
• Pattern Recognition and Machine Learning, 2006 (Book)
• Machine Learning: A Probabilistic Perspective, 2012 (Book)
• Machine Learning, 1997 (Book)
• Slides of Andrew W. Moore (Associate Professor, School of Computer
  Science, Carnegie Mellon University)
• An Introduction to Computational Learning Theory by Kearns and Vazirani
Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 9 —
Classification: Advanced Methods
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Acknowledgments: The slides in this presentation are mostly the textbook
slides from the data mining textbook "Data Mining: Concepts and Techniques"
by Jiawei Han, Micheline Kamber, and Jian Pei. The slides have been
modified i.e., some new slides have been added and some slides have been
deleted. Information in this presentation has also been obtained from a
wide variety of publicly available Internet sources.

Bayesian Belief Networks
■ Bayesian belief networks (also known as Bayesian networks,
  probabilistic networks): allow class conditional independencies
  between subsets of variables
■ A (directed acyclic) graphical model of causal relationships
■ Represents dependency among the variables
■ Gives a specification of joint probability distribution
  ❑ Nodes: random variables
  ❑ Links: dependency
  ❑ [Example graph: X and Y are the parents of Z, and Y is the parent of P]
  ❑ No dependency between Z and P
  ❑ Has no loops/cycles

Bayesian Belief Network: An Example
[Network diagram: Family History (FH), Smoker (S), LungCancer (LC),
Emphysema, PositiveXRay, and Dyspnea; FH and S are the parents of LC]

CPT: Conditional Probability Table for variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination
of its parents.

Derivation of the probability of a particular combination of values of X,
from CPT:
P(x1, …, xn) = ∏ P(xi | Parents(Xi))

Training Bayesian Networks: Several Scenarios
■ Scenario 1: Given both the network structure and all variables
  observable: compute only the CPT entries
■ Scenario 2: Network structure known, some variables hidden: gradient
  descent (greedy hill-climbing) method, i.e., search for a solution
  along the steepest descent of a criterion function
  ■ Weights are initialized to random probability values
  ■ At each iteration, it moves towards what appears to be the best
    solution at the moment, w.o. backtracking
  ■ Weights are updated at each iteration & converge to local optimum
■ Scenario 3: Network structure unknown, all variables observable:
  search through the model space to reconstruct network topology
■ Scenario 4: Unknown structure, all hidden variables: No good
  algorithms known for this purpose
■ D. Heckerman. A Tutorial on Learning with Bayesian Networks. In
  Learning in Graphical Models, M. Jordan, ed. MIT Press, 1999.
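A hedged Python sketch (ours) of the factorization above, computing the joint probability of one assignment from CPTs stored as plain dictionaries. Only the LungCancer CPT values come from the slide; the priors for FH and S are assumptions for illustration:

p_fh = {True: 0.1, False: 0.9}   # assumed prior, not from the slide
p_s = {True: 0.3, False: 0.7}    # assumed prior, not from the slide
p_lc = {  # P(LC | FH, S) from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)."""
    p_lc_given_parents = p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

print(joint(fh=True, s=True, lc=True))  # 0.1 * 0.3 * 0.8 = 0.024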
Classification by Backpropagation
■ Backpropagation: A neural network learning algorithm
■ Started by psychologists and neurobiologists to develop and test
  computational analogues of neurons
■ A neural network: A set of connected input/output units where each
  connection has a weight associated with it
■ During the learning phase, the network learns by adjusting the weights
  so as to be able to predict the correct class label of the input tuples
■ Also referred to as connectionist learning due to the connections
  between units

Neural Network as a Classifier
■ Weakness
  ■ Long training time
  ■ Require a number of parameters typically best determined
    empirically, e.g., the network topology or "structure."
  ■ Poor interpretability: Difficult to interpret the symbolic meaning
    behind the learned weights and of "hidden units" in the network
■ Strength
  ■ High tolerance to noisy data
  ■ Ability to classify untrained patterns
  ■ Well-suited for continuous-valued inputs and outputs
  ■ Successful on an array of real-world data, e.g., hand-written letters
  ■ Algorithms are inherently parallel
  ■ Techniques have recently been developed for the extraction of rules
    from trained neural networks

A Multi-Layer Feed-Forward Neural Network
[Figure: input vector X feeds an input layer, which connects through
weights wij to a hidden layer and then to an output layer that emits the
output vector]

How A Multi-Layer Neural Network Works
■ The inputs to the network correspond to the attributes measured for
  each training tuple
■ Inputs are fed simultaneously into the units making up the input layer
■ They are then weighted and fed simultaneously to a hidden layer
■ The number of hidden layers is arbitrary, although usually only one
■ The weighted outputs of the last hidden layer are input to units making
  up the output layer, which emits the network's prediction
■ The network is feed-forward: None of the weights cycles back to an
  input unit or to an output unit of a previous layer
■ From a statistical point of view, networks perform nonlinear
  regression: Given enough hidden units and enough training samples, they
  can closely approximate any function
Defining a Network Topology
■ Decide the network topology: Specify # of units in the input layer,
  # of hidden layers (if > 1), # of units in each hidden layer, and
  # of units in the output layer
■ Normalize the input values for each attribute measured in the training
  tuples to [0.0–1.0]
■ One input unit per domain value, each initialized to 0
■ Output, if for classification and more than two classes, one output
  unit per class is used
■ Once a network has been trained and its accuracy is unacceptable,
  repeat the training process with a different network topology or a
  different set of initial weights

Backpropagation
■ Iteratively process a set of training tuples & compare the network's
  prediction with the actual known target value
■ For each training tuple, the weights are modified to minimize the mean
  squared error between the network's prediction and the actual target
  value
■ Modifications are made in the "backwards" direction: from the output
  layer, through each hidden layer down to the first hidden layer, hence
  "backpropagation"
■ Steps
  ■ Initialize weights to small random numbers, associated with biases
  ■ Propagate the inputs forward (by applying activation function)
  ■ Backpropagate the error (by updating weights and biases)
  ■ Terminating condition (when error is very small, etc.)
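A hedged numpy sketch (ours, not from the textbook slides) of the backpropagation steps just listed, on a tiny XOR problem; the network size, learning rate, and epoch count are illustration choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialize weights to small random numbers, with biases
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(5000):
    # Step 2: propagate the inputs forward (apply activation function)
    h = sigmoid(X @ W1 + b1)    # hidden layer
    out = sigmoid(h @ W2 + b2)  # output layer

    # Step 3: backpropagate the error (update weights and biases);
    # deltas come from squared-error loss and the sigmoid derivative
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

# Step 4: terminating condition here is simply a fixed number of epochs
print(out.round(3).ravel())  # should approach [0, 1, 1, 0]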

Neuron: A Hidden/Output Layer Unit
[Figure: input vector x = (x0, …, xn) with weight vector w = (w0, …, wn),
a weighted sum plus bias μk, and an activation function f producing
output y, i.e. y = f(Σi wi·xi + μk)]
■ An n-dimensional input vector x is mapped into variable y by means of
  the scalar product and a nonlinear function mapping
■ The inputs to the unit are outputs from the previous layer. They are
  multiplied by their corresponding weights to form a weighted sum, which
  is added to the bias associated with the unit. Then a nonlinear
  activation function is applied to it.

Efficiency and Interpretability
■ Efficiency of backpropagation: Each epoch (one iteration through the
  training set) takes O(|D| * w), with |D| tuples and w weights, but # of
  epochs can be exponential to n, the number of inputs, in worst case
■ For easier comprehension: Rule extraction by network pruning
  ■ Simplify the network structure by removing weighted links that have
    the least effect on the trained network
  ■ Then perform link, unit, or activation value clustering
  ■ The set of input and activation values are studied to derive rules
    describing the relationship between the input and hidden unit layers
■ Sensitivity analysis: assess the impact that a given input variable
  has on a network output. The knowledge gained from this analysis can
  be represented in rules
Lazy vs. Eager Learning
■ Lazy vs. eager learning
  ■ Lazy learning (e.g., instance-based learning): Simply stores training
    data (or only minor processing) and waits until it is given a test
    tuple
  ■ Eager learning (the above discussed methods): Given a set of training
    tuples, constructs a classification model before receiving new
    (e.g., test) data to classify
■ Lazy: less time in training but more time in predicting
■ Accuracy
  ■ Lazy method effectively uses a richer hypothesis space since it uses
    many local linear functions to form an implicit global approximation
    to the target function
  ■ Eager: must commit to a single hypothesis that covers the entire
    instance space

Lazy Learner: Instance-Based Methods
■ Instance-based learning:
  ■ Store training examples and delay the processing ("lazy evaluation")
    until a new instance must be classified
■ Typical approaches
  ■ k-nearest neighbor approach
    ■ Instances represented as points in a Euclidean space.
  ■ Locally weighted regression
    ■ Constructs local approximation
  ■ Case-based reasoning
    ■ Uses symbolic representations and knowledge-based inference

Case-Based Reasoning (CBR)
■ CBR: Uses a database of problem solutions to solve new problems
■ Store symbolic description (tuples or cases)—not points in a Euclidean
  space
■ Applications: Customer-service (product-related diagnosis), legal ruling
■ Methodology
  ■ Instances represented by rich symbolic descriptions (e.g., function
    graphs)
  ■ Search for similar cases, multiple retrieved cases may be combined
  ■ Tight coupling between case retrieval, knowledge-based reasoning, and
    problem solving
■ Challenges
  ■ Find a good similarity metric
  ■ Indexing based on syntactic similarity measure, and when failure,
    backtracking, and adapting to additional cases

Genetic Algorithms (GA)
■ Genetic Algorithm: based on an analogy to biological evolution
■ An initial population is created consisting of randomly generated rules
  ■ Each rule is represented by a string of bits
  ■ E.g., if A1 and ¬A2 then C2 can be encoded as 100
  ■ If an attribute has k > 2 values, k bits can be used
■ Based on the notion of survival of the fittest, a new population is
  formed to consist of the fittest rules and their offspring
■ The fitness of a rule is represented by its classification accuracy on
  a set of training examples
■ Offspring are generated by crossover and mutation
■ The process continues until a population P evolves when each rule in P
  satisfies a prespecified threshold
■ Slow but easily parallelizable
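A hedged Python sketch (ours) of the two GA operators named above, acting on bit-string encoded rules like the "100" example on the slide; the function names and mutation rate are illustration choices:

import random

random.seed(0)

def crossover(parent_a, parent_b):
    """Single-point crossover: swap the tails of two bit strings."""
    point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(rule, rate=0.1):
    """Flip each bit independently with the given probability."""
    return "".join(b if random.random() > rate else str(1 - int(b))
                   for b in rule)

child1, child2 = crossover("100", "011")
print(child1, child2, mutate(child1))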
Active Learning
■ Class labels are expensive to obtain
■ Active learner: query human (oracle) for labels
■ Pool-based approach: Uses a pool of unlabeled data
  ■ L: a small subset of D is labeled, U: a pool of unlabeled data in D
  ■ Use a query function to carefully select one or more tuples from U
    and request labels from an oracle (a human annotator)
  ■ The newly labeled samples are added to L, and learn a model
  ■ Goal: Achieve high accuracy using as few labeled data as possible
■ Evaluated using learning curves: Accuracy as a function of the number
  of instances queried (# of tuples to be queried should be small)
■ Research issue: How to choose the data tuples to be queried?
  ■ Uncertainty sampling: choose the least certain ones
  ■ Reduce version space, the subset of hypotheses consistent w. the
    training data
  ■ Reduce expected entropy over U: Find the greatest reduction in the
    total number of incorrect predictions

Transfer Learning: Conceptual Framework
■ Transfer learning: Extract knowledge from one or more source tasks and
  apply the knowledge to a target task
■ Traditional learning: Build a new classifier for each new task
■ Transfer learning: Build new classifier by applying existing knowledge
  learned from source tasks
[Figures: Traditional Learning Framework vs Transfer Learning Framework]
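A hedged sketch (ours) of pool-based uncertainty sampling: at each round, query the unlabeled point whose predicted class probability is closest to 0.5; the synthetic data and the use of the hidden labels as a stand-in oracle are assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
labeled = list(range(10))    # L: small labeled subset
pool = list(range(10, 200))  # U: unlabeled pool (labels hidden)

for _ in range(5):           # five querying rounds
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]  # least certain tuple
    labeled.append(query)    # the oracle (here: the hidden y) labels it
    pool.remove(query)

print(model.score(X, y))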

Transfer Learning: Methods and Applications
■ Applications: Especially useful when data is outdated or distribution
  changes, e.g., Web document classification, e-mail spam filtering
■ Instance-based transfer learning: Reweight some of the data from source
  tasks and use it to learn the target task
■ TrAdaBoost (Transfer AdaBoost)
  ■ Assume source and target data each described by the same set of
    attributes (features) & class labels, but with different distributions
  ■ Require only labeling a small amount of target data
  ■ Use source data in training: When a source tuple is misclassified,
    reduce the weight of such tuples so that they will have less effect
    on the subsequent classifier
■ Research issues
  ■ Negative transfer: When it performs worse than no transfer at all
  ■ Heterogeneous transfer learning: Transfer knowledge from different
    feature space or multiple source domains
  ■ Large-scale transfer learning

BIAS VARIANCE TRADE-OFF

Acknowledgments: Information in this presentation has been obtained from a
wide variety of publicly available Internet sources. Slides created by Mr.
Saransh Gupta for academic use only, as part of course material for
Introduction to Machine Learning course.

CONTENTS
• Bias and Variance: Introduction, why and what?
• Bias and Variance: Conceptual Definition
• Bias-Variance: Tradeoff
• Bias and Variance: Graphical Definition
• An Illustrative Example: Voting Intentions
• An Illustrative Example: Voting Intentions Caveats
• Bias and Variance: Mathematical Definition
• Bias-Variance: Tradeoff Revisited
• ML Model Space, Hypothesis Space & Hyperparameters
• Generalization and Bias Variance Tradeoff Intuitions
• Error comparisons, No Free Lunch Theorem and Occam's Razor
• Going beyond theory and Managing Bias and Variance
• Bias-Variance Tradeoff: Overfitting/Underfitting & Summary
• More about Managing Bias and Variance practically
• An Applied Example: Voter Party Registration

BIAS AND VARIANCE: INTRODUCTION
Understanding these two types of error, i.e. the error due to "Bias" and
the error due to "Variance", can help us diagnose model results and avoid
the mistake of over-fitting or under-fitting.

BIAS AND VARIANCE: WHY AND WHAT?
• Think Conceptually
• Think Graphically
• Think Mathematically
BIAS AND VARIANCE: CONCEPTUAL DEFINITION
If we repeat the entire model building process multiple times with new
data every time, the resulting models will have a range of predictions
because of the randomness in underlying data sets. Bias measures how far
off in general these models' predictions are from the correct value.
If you repeat the entire model building process multiple times, the
variance tells us how much the predictions for a given point vary between
different realizations of the model.

BIAS-VARIANCE: TRADEOFF
Remember: There is a tradeoff between a model's ability to minimize bias
and variance.
In statistics and machine learning, the bias–variance tradeoff is the
property of a model that the variance of the parameter estimated across
samples can be reduced by increasing the bias in the estimated parameters.
Note: We will revisit this again in detail in this slide deck.

BIAS AND VARIANCE: GRAPHICAL DEFINITION
• Imagine that the center of the target is a model that perfectly
  predicts the correct values. As we move away from the bulls-eye, our
  predictions get worse and worse.
• Imagine we can repeat our entire model building process to get a
  number of separate hits on the target. Each hit represents an
  individual realization of our model, given the chance variability in
  the training data we gather.
• Sometimes we will get a good distribution of training data so we
  predict very well and we are close to the bulls-eye, while sometimes
  our training data might be full of outliers or non-standard values
  resulting in poorer predictions.
• These different realizations result in a scatter of hits on the target.

BIAS AND VARIANCE: GRAPHICAL DEFINITION
Bias: Bias measures how far off in general models' predictions are from
the correct value.
Variance: The variance is how much the predictions for a given point vary
between different realizations of the model.
AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS

Voting Republican   Voting Democratic   Non-Respondent   Total
13                  16                  21               50

AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS
We put out our press release that the Democrats are going to win by over
10 points; but, when the election comes around, it turns out they
actually lose by 10 points.
That certainly reflects poorly on us!
Where did we go wrong in our model?

AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS
• Multiple issues with the trivial model we built:
  1. We only sample people from the phone book and so only include
     people with listed numbers.
  2. We did not follow up with non-respondents. They may have different
     voting patterns from the respondents.
  3. We do not try to weight responses by likeliness to vote.
  4. We have a very small sample size.
  … and so on.
• Can we classify these sources of errors as bias and variance?

AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS

Sources of Bias
1. Using a phonebook to select participants in our survey – surveying
   only certain classes of people, consistently!
2. Not following up with non-respondents – it consistently changes the
   mixture of responses we get!
3. Not trying to weight responses by likeliness to vote – treating all
   responses with an equal weightage.
Result: On our bulls-eye diagram, these biases move us away from the
center of the target, but they would not result in an increased scatter
of estimates.

Source of Variance
4. Small sample size – increasing sample size will give consistent
   results each time the survey and prediction are repeated.*
Result: On the bulls-eye diagram, the low sample size results in a wide
scatter of estimates. Increasing the sample size would make the estimates
clump closer together, but they still might miss the center of the target.
AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS CAVEATS
For instance, in general the data set used to build the model is provided
prior to model construction and the modeler cannot simply say, "Let's
increase the sample size to reduce variance."
*Note: Even if we increase the sample size and get consistent results,
they still might be highly inaccurate due to our large sources of bias,
but the variance of predictions will be reduced.

BIAS AND VARIANCE: MATHEMATICAL DEFINITION
Bias: error due to incorrect assumptions (or restriction of model class)
Variance: error due to variance of training samples (or randomness of
training data)
[Formal definitions shown as equations on the original slides]

BIAS-VARIANCE: TRADEOFF REVISITED
Three kinds of error:
1. Inherent (because of noise): unavoidable and irreducible
2. Bias: due to over-simplifications
3. Variance: due to inability to perfectly estimate parameters from
   limited data
In practice an explicit tradeoff exists between bias and variance where
decreasing one increases the other.
Minimizing the total error of the model requires a careful balancing of
these two forms of error.
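The equations on these slides are images. The standard decomposition of expected squared error, matching the three kinds of error listed above (a textbook identity, not recovered from the slide itself), is:

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{inherent noise}}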
MACHINE LEARNING MODEL SPACE
[Figure: Venn diagram of all possible models, the models consistent with
the data, a restricted model class, and the correct model]
Restricting the model class can help. Or it can hurt!
Depends on whether restrictions are domain appropriate.

HYPOTHESIS SPACE AND HYPOTHESIS
• Hypothesis Space: It is the set of all the possible legal hypotheses.
  This is the set from which the ML algorithm would determine the best
  possible (only one) hypothesis which would best describe the target
  function or the outputs.
• Hypothesis: A hypothesis is a function that best describes the target
  in supervised machine learning. The hypothesis that an algorithm would
  come up with depends upon the data and the restrictions and bias that
  we have imposed on the data.
Every learning algorithm requires assumptions about the hypothesis space.
• Eg: "My hypothesis space is
  • …linear"
  • …decision trees with 5 nodes"
  • …a three layer neural network with rectifier hidden units"

HYPERPARAMETER SELECTION
• Hyperparameters: Every Machine Learning model has parameters that are
  explicitly defined by the user to control the learning process and need
  to be tuned for an improved learning of the model. Examples:
  • k in k-NN or k-Nearest Neighbors
  • Learning rate for training a neural network
  • Train-test split ratio
  • Batch size
  • Branches in a Decision Tree
  • Number of clusters in a Clustering Algorithm
  • … and so on.
We want to choose hyperparameter values which would give us the best
performance on test data.

GENERALIZATION (ON TEST DATA)
[Equations shown on the original slide]

GENERALIZATION (ON TEST DATA)
• On one hand, while choosing a hypothesis space, we may impose the
  inductive bias too restrictively – leads to Underfitting!
• A fundamental question in learning theory is, over which hypothesis
  classes, learning will not result in overfitting.
• On the other hand, if we minimize the error (loss) with respect to
  training data alone, it does not result in a good expected/predictive
  performance – leads to Overfitting!
• We need to be able to generalize on the test data better, i.e.
  minimize the expected loss (prediction risk).
Intuitively, choosing a more restricted hypothesis class better protects
us against overfitting but at the same time might cause us a stronger
inductive bias. We will get back to this fundamental tradeoff later.

BIAS–VARIANCE TRADEOFF - INTUITION 1
• Model is too simple
• Too few parameters
• Does not fit the data well
• It is a biased solution
• Not enough flexibility
• Inaccurate!

• Model is too complex
• Too many parameters
• Small changes to the data change the solution a lot
• It is a high-variance solution
• Too much sensitivity to the sample
• Inaccurate!
BIAS–VARIANCE TRADEOFF - INTUITION 2
• Bias is the true error (loss) of the best predictor in the hypothesis
  set
• Variance describes how much the best classifier depends on the
  training set
[Figure: Under-fitting (too simple to explain variance),
Appropriate-fitting, Over-fitting (forcefitting – too good to be true)]

BIAS–VARIANCE TRADEOFF - INTUITION 3
[Figure: Under-fitting (high bias), Appropriate-fitting, Over-fitting
(high variance)]
How to reduce bias?
• Increase model size (make complex)
• Change model architecture
• Feature addition and parameter tuning
How to reduce variance?
• Decrease model size (make simple)
• Cross-validate the parameters
• Feature selection and dimensionality reduction
• Ensure the data is truly representative
• Get more training data
GENERALIZATION (ON TEST DATA): UNDERFITTING AND OVERFITTING
• Underfitting occurs when the model is too "simple" to represent all
  the relevant class characteristics.
  • High bias and low variance
  • There can be a high training error and a high test error
  • Arises when the classifier cannot represent the data.
• Overfitting occurs when the model is too "complex" and fits irrelevant
  characteristics (noise) in the data.
  • Low bias and high variance
  • There may be a low training error but a high test error
  • Arises when the learner tries to fit every data point in the
    training set very closely, i.e. overfits.


TRAINING ERROR VS PREDICTION/TEST ERROR
[Figure: MSEtrain vs MSEtest as model capacity grows]

NO-FREE-LUNCH THEOREM
• The theorem states that all optimization algorithms perform equally
  well when their performance is averaged across all possible problems.
• It implies that there is no single best optimization algorithm, i.e.
  no classifier is inherently better than any other: you need to make
  assumptions to generalize.
• Because of the close relationship between optimization, search, and
  machine learning, it also implies that there is no single best machine
  learning algorithm for predictive modeling problems such as
  classification and regression.

OCCAM'S RAZOR (PRINCIPLE OF PARSIMONY)
• This philosophical idea in the context of ML suggests that, all else
  being equal, a simpler model is preferred over a more complex model.
• It does not mean that simpler models are universally better than
  complex models, but rather that a model must be complex enough to learn
  the patterns in a dataset and to avoid underfitting, but simple enough
  to avoid overfitting.
• When choosing between two models, we can only say a simpler model is
  better if its generalization error is equal to or less than that of
  the more complex model.

GOING BEYOND THEORY: PRACTICAL CONSIDERATIONS
• There are practical situations wherein we may choose a simpler model
  even if its generalization error is higher than that of the more
  complex models.
• In fact simpler models may provide the following advantages:
  • Less memory usage.
  • Faster inference times.
  • Better explainability.
• Ex: Consider a fraud detection problem where a decision tree achieves
  an accuracy of 98% and a neural network achieves an accuracy of 99%.
• Consider other restrictions/limitations of the problem:
  • Requires quick inference times
  • Our deployment server has memory limitations
  • The model's predictions need to be explained to a regulatory body
    within the company that is working on this use case.
• Clearly, a decision tree is a much better model than a neural network
  here as it is likely a smaller model with faster inference times and is
  much easier to explain than a neural network.
• Unless the 1% drop in accuracy is immensely significant in such a
  problem, the practical decision would be to choose the decision tree
  over the neural network.

MANAGING BIAS AND VARIANCE
Fight your instincts: A gut feeling of minimizing bias even at the
expense of variance is not a good idea. This is mistaken logic!
• It is common to assume the presence of bias indicates something
  basically wrong with the model and algorithm. It is wrongly inferred
  that while variance is also bad, a model with high variance could at
  least predict well on average, i.e. at least it is not fundamentally
  wrong.
• It is true that a high variance and low bias model can perform well in
  some sort of long-run average sense. However, in practice modelers are
  always dealing with a single realization of the data set. In these
  cases, long run averages are irrelevant; what is important is the
  performance of the model on the data you actually have, and in this
  case bias and variance are equally important and one should not be
  improved at an excessive expense to the other.

MANAGING BIAS AND VARIANCE
Bagging and Resampling Techniques: Can be used to reduce the variance in
model predictions.
• In Bagging (Bootstrap Aggregating), numerous replicates of the original
  data set are created using random selection with replacement. Each
  derivative data set is then used to construct a new model and the
  models are gathered together into an ensemble. To make a prediction,
  all of the models in the ensemble are polled and their results are
  averaged.
• One powerful modeling algorithm that makes good use of bagging is
  Random Forests (RF). It works by training numerous decision trees each
  based on a different resampling of the original training data. In RF,
  the bias of the full model is equivalent to the bias of a single
  decision tree (which itself has high variance).
• By creating many of these trees, in effect a "forest", and then
  averaging them, the variance of the final model can be greatly reduced
  over that of a single tree. In practice the only limitation on the size
  of the forest is computing time, as an infinite number of trees could
  be trained without ever increasing bias and with a continual (if
  asymptotically declining) decrease in the variance.
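A hedged scikit-learn sketch (ours) of the variance reduction described above: compare a single deep decision tree with a bagged ensemble of such trees on synthetic data (the dataset and ensemble size are illustration choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)  # low bias, high variance
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=100, random_state=0)  # bootstrap replicates

print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())  # typically higher and stabler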

MANAGING BIAS AND VARIANCE
Asymptotic Properties of Algorithms: Theoretically speaking, as your
training sample size grows towards infinity, your model's bias will fall
to 0 (asymptotic consistency) and your model will have a variance that is
no worse than any other potential model you could have used (asymptotic
efficiency).
• However, we do not live in a world of infinite sample sizes so
  asymptotic properties generally have very little practical use.
• An algorithm that may have close to no bias when you have a million
  points, may have very significant bias when you only have a few hundred
  data points.
• More importantly, an asymptotically consistent and efficient algorithm
  may actually perform worse on small sample size data sets than an
  algorithm that is neither asymptotically consistent nor efficient.
When working with real data, it is best to leave aside theoretical
properties of algorithms and to instead focus on their actual accuracy in
a given scenario.

MANAGING BIAS AND VARIANCE
Understanding Over- and Under-Fitting: At its root, dealing with bias and
variance is really about dealing with over- and under-fitting. Bias is
reduced and variance is increased in relation to model complexity. As
more and more parameters are added to a model, the complexity of the
model rises and variance becomes our primary concern while bias steadily
falls.
Understanding bias and variance is critical for understanding the
behavior of prediction models, but in general what you really care about
is overall error, not the specific decomposition.
The sweet spot for any model is the level of complexity at which the
increase in bias is equivalent to the reduction in variance.

BIAS-VARIANCE TRADEOFF: OVERFITTING AND UNDERFITTING

• Tradeoff between bias and variance:
• Simple Models: High Bias, Low Variance
• Complex Models: Low Bias, High Variance
• If our model complexity exceeds this sweet spot, we are in effect
over-fitting our model; while if our complexity falls short of the
sweet spot, we are under-fitting the model.
• In practice, there is no analytical way to find this location.
• Instead we must use an accurate measure of prediction error and
explore differing levels of model complexity and then choose the
complexity level that minimizes the overall error. 35

BIAS-VARIANCE TRADEOFF: SUMMARY

[Figure: Mean Squared Error (true risk) plotted against model complexity,
decomposed into estimation error (variance) and approximation error (bias²).] 36
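To make the "explore differing levels of model complexity" step concrete, here is a minimal sketch (the synthetic data and the polynomial-degree range are assumptions for illustration) that sweeps one complexity knob and keeps the degree with the lowest held-out error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy nonlinear target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

errors = {}
for degree in range(1, 13):  # complexity knob: polynomial degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = mean_squared_error(y_val, model.predict(X_val))

best = min(errors, key=errors.get)
print(f"best degree = {best}, validation MSE = {errors[best]:.3f}")
# Low degrees under-fit (high bias); high degrees over-fit (high variance).
```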
MORE ABOUT MANAGING BIAS AND VARIANCE, PRACTICALLY

• Ensemble methods reduce variance
• Multiple classifiers are combined
• E.g.: Bagging, boosting
• Decision trees of a given depth
• Increasing depth decreases bias, increases variance
• SVMs
• Higher-degree polynomial kernels decrease bias, increase variance
• Stronger regularization increases bias, decreases variance
• Neural networks
• Deeper models can increase variance, but decrease bias
• K-Nearest Neighbors
• Increasing k generally increases bias, reduces variance 37

AN APPLIED EXAMPLE: VOTER PARTY REGISTRATION

Consider a bit more realistic example:
Assume we have a training data set of voters, each tagged with
three properties: voter party registration, voter wealth, and a
quantitative measure of voter religiousness.
We want to predict voter registration using wealth and
religiousness as predictors.
Red circles: Republican voters
Blue circles: Democratic voters 38

AN APPLIED EXAMPLE: THE K-NEAREST NEIGHBOR ALGORITHM

• There are many ways to go about this modeling task.
• For binary data like ours, logistic regressions are often used.
• However, if we think there are non-linearities in the
relationships between the variables, a more flexible,
data-adaptive approach might be desired.
• One such very flexible machine-learning technique is
k-Nearest Neighbors or k-NN.
• In k-NN, the party registration of a given voter will be found by
plotting him or her on the plane with the other voters.
• The nearest k other voters to him or her will be found using a
geographic measure of distance and the average of their
registrations will be used to predict his or her registration.
• So if the nearest voter to him/her (in terms of wealth and
religiousness) is a Democrat, s/he will also be predicted to be a
Democrat. 39

THE K-NEAREST NEIGHBOR ALGORITHM

• The following figure shows the nearest neighborhoods for each of
the original voters. If k was specified as 1, a new voter's party
registration would be determined by whether they fall within a
red or blue region. 40

[Figure slides: nearest-neighborhood regions for the training voters.
Red circles: Republican voters; Blue circles: Democratic voters.] 41 42
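A minimal sketch of this k-NN setup (the feature values, labels, and the synthetic decision rule are assumptions for illustration, not the slides' actual data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Columns: [wealth, religiousness]; labels: 1 = Republican, 0 = Democrat.
# Purely synthetic: wealthier/more religious voters skew Republican here.
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=200) > 1.0).astype(int)

knn = KNeighborsClassifier(n_neighbors=1)  # k = 1: copy the single nearest voter
knn.fit(X, y)

new_voter = [[0.8, 0.7]]  # high wealth, fairly religious
print("predicted party:", "Republican" if knn.predict(new_voter)[0] else "Democrat")
```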

THE K-NEAREST NEIGHBOR ALGORITHM

• If we sample new voters, we can use our existing training data to predict their
registration.
• The following figure plots the wealth and religiousness for these new voters
and uses the k-NN algorithm to predict their registration.
Red circles: Republican voters
Blue circles: Democratic voters 43

KNN: BEST K?

• We can also plot the full prediction regions for where individuals will be
classified as either Democrats or Republicans.
• Depending upon the choice of k in kNN (1 to the number of data points in the
training set), our accuracy of predictions/placement of new data points varies.
• What is the best value of k?
• Let us try experimenting with the value of k to find the best prediction
algorithm that matches up well with the black boundary line.
• Visit here to try this example out yourself.
Note that the red region represents Republican voters, the blue region
represents Democratic voters, and the black dotted line represents the
actual model's split. 44

COMPARING RESULTS FOR VARIOUS VALUES OF K

[Figure slides comparing prediction regions for several values of k.
Red circles: Republican voters; Blue circles: Democratic voters.] 45

K AND BIAS/VARIANCE

• Increasing k results in the averaging of more voters in each prediction. This
results in smoother prediction curves.
• With a k of 1, the separation between Democrats and Republicans is very
rough. Furthermore, there are "islands" of Democrats in generally Republican
territory and vice versa. 46

K AND BIAS/VARIANCE

• As k is increased to, say, 20, the transition becomes smoother, the islands disappear,
and the split between Democrats and Republicans does a good job of following the
boundary line.
• At small k's the roughness and islands are signs of variance. The locations of the islands
and the exact curves of the boundaries will change radically as new data is gathered.
Red circles: Republican voters
Blue circles: Democratic voters 47

K AND BIAS/VARIANCE

• As k becomes very large, say, 80, the distinction between the two categories becomes
more blurred and the boundary prediction line is not matched very well at all.
• On the other hand, at large k's the transition is very smooth, so there isn't much variance,
but the lack of a match to the boundary line is a sign of high bias.
Red circles: Republican voters
Blue circles: Democratic voters 48

KNN AND BIAS/VARIANCE TRADEOFF

Results and Inferences (k vs Bias/Variance):
• Increasing k will decrease variance and increase bias.
• Decreasing k will increase variance and decrease bias.
• See how variable the predictions are for different data sets at
low k. As k increases, this variability is reduced.
• However, if we increase k too much, then we no longer follow the
true boundary line and we observe high bias.
This is the nature of the Bias-Variance Tradeoff.

KNN AND ANALYTICAL BIAS/VARIANCE

For k-NN the tradeoff can also be written in closed form (the equation appeared
here as a figure; this is the decomposition from the Fortmann-Roe essay cited in
the references):
Err(x0) = [f(x0) − (1/k) Σ i=1..k f(x(i))]² + σε²/k + σε²
where x(1), …, x(k) are the k nearest neighbors of x0 and σε² is the irreducible
noise. The first term (bias²) grows with k as ever more distant neighbors are
averaged in, while the variance term σε²/k shrinks as k grows.
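A minimal sketch of the k sweep (reusing the synthetic voter data from the earlier sketch; all values are assumptions for illustration): watch training accuracy (a variance symptom) and held-out accuracy move as k grows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 2))  # [wealth, religiousness]
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=400) > 1.0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 20, 80):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train acc={knn.score(X_tr, y_tr):.2f}  test acc={knn.score(X_te, y_te):.2f}")
# Expect: k=1 fits the training set perfectly (high variance);
# very large k over-smooths the boundary (high bias).
```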

REFERENCES

• Huang, Catherine. "A Blog About Lunch and Data Science — How There is No Such a Thing As
Free Lunch." Medium, 27 Sept. 2018,
towardsdatascience.com/a-blog-about-lunch-and-data-science-how-there-is-no-such-a-thing-as-free-lunch-e46fd57c7f27.
• "No Free Lunch Theorem for Machine Learning." Machine Learning Mastery,
machinelearningmastery.com/no-free-lunch-theorem-for-machine-learning/.
• Mavuduru, Amol. "What Occam's Razor Means in Machine Learning." Medium, 9 Aug. 2022,
towardsdatascience.com/what-occams-razor-means-in-machine-learning-53f07effc97c.
• "ML | Underfitting and Overfitting." GeeksforGeeks, 22 Aug. 2022,
www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/.
• "ML | Understanding Hypothesis." GeeksforGeeks, 21 Nov. 2019,
www.geeksforgeeks.org/ml-understanding-hypothesis/.
• Shah, Abhishek. "Generalization Error in Machine Learning (Bias Vs. Variance)." Medium,
19 July 2022,
medium.com/@jwbtmf/generalization-error-in-machine-learning-4617141932b7.
• Fortmann-Roe, Scott. "Understanding the Bias-Variance Tradeoff."
scott.fortmann-roe.com/docs/BiasVariance.html.
• www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf. 49
Data Mining
Ensemble Techniques

Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar

Acknowledgments: The slides in this presentation are mostly the textbook slides from the
data mining textbook "Introduction to Data Mining (2nd edition)" by Tan, Steinbach,
Karpatne, Kumar. The slides have been modified, i.e., some new slides have been added and
some slides have been deleted. Information in this presentation has also been obtained from
a wide variety of publicly available Internet sources.
10/11/2021 Introduction to Data Mining, 2nd Edition 1

Ensemble Methods

● Construct a set of base classifiers learned from the training data
● Predict class label of test records by combining the predictions made by
multiple classifiers (e.g., by taking majority vote)
10/11/2021 Introduction to Data Mining, 2nd Edition 2

Example: Why Do Ensemble Methods Work?

[Figure: classification error for an ensemble of 25 base classifiers,
assuming their errors are uncorrelated.]
10/11/2021 Introduction to Data Mining, 2nd Edition 3

Necessary Conditions for Ensemble Methods

● Ensemble Methods work better than a single base classifier if:
1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
(error rate < 0.5 for binary classification)
10/11/2021 Introduction to Data Mining, 2nd Edition 4
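The missing figure can be recomputed: with 25 independent base classifiers, each with error rate ε, the majority vote errs only when 13 or more classifiers are wrong, i.e. e_ensemble = Σ i=13..25 C(25, i) ε^i (1 − ε)^(25 − i). A minimal sketch of this binomial-tail computation (an illustration, not the textbook's code):

```python
from math import comb

def ensemble_error(eps: float, n: int = 25) -> float:
    """Majority vote of n independent classifiers, each with error rate eps."""
    k = n // 2 + 1  # minimum number of wrong votes for the ensemble to err
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(k, n + 1))

for eps in (0.35, 0.5, 0.55):
    print(f"base error {eps:.2f} -> ensemble error {ensemble_error(eps):.3f}")
# base error 0.35 -> ensemble error 0.060  (ensemble helps when eps < 0.5)
# base error 0.50 -> ensemble error 0.500  (no gain at coin-flip accuracy)
# base error 0.55 -> ensemble error 0.692  (ensemble hurts when eps > 0.5)
```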
Rationale for Ensemble Learning

● Ensemble Methods work best with unstable base classifiers
– Classifiers that are sensitive to minor perturbations in the
training set, due to high model complexity
– Examples: Unpruned decision trees, ANNs, …
10/11/2021 Introduction to Data Mining, 2nd Edition 5

Bias-Variance Decomposition

● [Equation shown as a figure in the original slide.]
10/11/2021 Introduction to Data Mining, 2nd Edition 6

Bias-Variance Trade-off and Overfitting

[Figure: generalization error vs. model complexity, with the
underfitting and overfitting regimes marked.]
● Ensemble methods try to reduce the variance of complex
models (with low bias) by aggregating responses of
multiple base classifiers
10/11/2021 Introduction to Data Mining, 2nd Edition 7

General Approach of Ensemble Learning

[Figure: multiple base classifiers are built and their predictions combined]
using majority vote or weighted majority vote
(weighted according to their accuracy or relevance)
10/11/2021 Introduction to Data Mining, 2nd Edition 8
Constructing Ensemble Classifiers

● By manipulating training set
– Example: bagging, boosting, random forests
● By manipulating input features
– Example: random forests
● By manipulating class labels
– Example: error-correcting output coding
● By manipulating learning algorithm
– Example: injecting randomness in the initial weights of ANN
10/11/2021 Introduction to Data Mining, 2nd Edition 9

Ensemble Learning

• The core idea of Ensemble learning is to improve ML
predictions/results by amalgamating/combining different
models to provide better predictions than if you had used
any of those models on an individual or standalone basis
• Remember that ensemble in English means a group of people
playing music together, some kind of collection: hence, the
name "Ensemble learning"
• Core idea is to learn multiple classifiers and then use a voting
mechanism
• Types of ensemble learning
– Bagging
– Boosting
10

Bagging

• Bootstrap Aggregating
• ML ensemble algorithm for improving stability and accuracy
• Decreases the variance → avoids overfitting
• Used in classification and regression, decision tree methods
• Special case of model averaging approach
11

Bagging (Bootstrap AGGregatING)

● Bootstrap sampling: sampling with replacement
● Build classifier on each bootstrap sample
● Probability of a training instance being selected in a bootstrap sample is:
1 – (1 – 1/n)^n   (n: number of training instances)
≈ 0.632 when n is large
● Each classifier provides its class prediction
● The ensemble bagging classifier counts votes and assigns the
class that got the highest votes to the test sample
10/11/2021 Introduction to Data Mining, 2nd Edition 12
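A quick numerical check of the 0.632 figure (a sketch, not from the slides): the selection probability 1 − (1 − 1/n)^n approaches 1 − 1/e ≈ 0.632 as n grows.

```python
import math

for n in (10, 100, 1000, 1_000_000):
    p = 1 - (1 - 1 / n) ** n  # P(instance appears in one bootstrap sample of size n)
    print(f"n={n:>9,}  p={p:.4f}")
print(f"limit 1 - 1/e = {1 - 1 / math.e:.4f}")
```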
Big picture of Bagging

• From the dataset, create multiple subsets (samples)
with an equal number of tuples (with replacement)
• Build a classifier model on each sample
– Observe how each model is learned in parallel and independently
• Combine predictions from all models based on a voting mechanism
13

Bagging Algorithm

[Pseudocode shown as a figure in the original slide.]
10/11/2021 Introduction to Data Mining, 2nd Edition 14

Boosting

• The core idea is to build a model using the training data, and
then build another model that attempts to rectify the errors of
the first model
• This is an iterative process with stopping conditions such as:
– stop when you have already added the maximum number of models, or
– stop when the complete training set has been correctly predicted
• Observe how the learning in the case of boosting is sequential
(not parallel) and adaptive
– Observe how this is different from bagging, where the
learning happens in parallel and independently
15

Boosting

● An iterative procedure to adaptively change the
distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal
weights (for being selected for training)
– Unlike bagging, weights may change at the
end of each boosting round
10/11/2021 Introduction to Data Mining, 2nd Edition 16
Boosting

● Records that are wrongly classified will have their
weights increased in the next round
● Records that are classified correctly will have
their weights decreased in the next round
10/11/2021 Introduction to Data Mining, 2nd Edition 17

Boosting

• There are many boosting algorithms
• AdaBoost algorithm by Robert Schapire and Yoav Freund
• The Adaptive Boosting (AdaBoost) algorithm is among the most
popular boosting algorithms
• Combines multiple weak classifiers to create one strong classifier
Source: https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/
18

Big picture of AdaBoost

• Initialize the dataset and assign equal weight to each of the data points
• Provide this as input to the model and identify the wrongly
classified data points
• Increase the weights of the wrongly classified data points and
decrease the weights of correctly classified data points. Then
normalize the weights of all data points.
• If (required results have been obtained) then end, otherwise
continue iteratively
Source: https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/
(Adapted from the above source with some minor modifications)
19

AdaBoost

● Base classifiers: C1, C2, …, CT
● Error rate of a base classifier (the equation appeared as a figure;
in the textbook's notation):
ε_i = (1/N) Σ j w_j δ(C_i(x_j) ≠ y_j)
● Importance of a classifier:
α_i = (1/2) ln((1 − ε_i) / ε_i)
10/11/2021 Introduction to Data Mining, 2nd Edition 20
AdaBoost Algorithm

● Weight update (the equation appeared as a figure; in the textbook's
notation, with Z_i a normalization factor so the weights sum to 1):
w_j^(i+1) = (w_j^(i) / Z_i) · exp(−α_i) if C_i(x_j) = y_j,
and (w_j^(i) / Z_i) · exp(+α_i) if C_i(x_j) ≠ y_j
● If any intermediate round produces an error rate
higher than 50%, the weights are reverted back to
1/n and the resampling procedure is repeated
● Classification:
C*(x) = arg max_y Σ i α_i δ(C_i(x) = y)
10/11/2021 Introduction to Data Mining, 2nd Edition 21 10/11/2021 Introduction to Data Mining, 2nd Edition 22
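A minimal runnable sketch of the boosting loop just described (an illustration under assumptions, not the textbook's code): decision stumps as weak base classifiers via scikit-learn's AdaBoostClassifier; the estimator keyword assumes scikit-learn ≥ 1.2 (older versions call it base_estimator).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)          # a weak learner: one split
boost = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=0)
boost.fit(X_tr, y_tr)                                # reweights records each round

print(f"single stump: {stump.fit(X_tr, y_tr).score(X_te, y_te):.3f}")
print(f"50 boosted stumps: {boost.score(X_te, y_te):.3f}")
```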

Random Forest Algorithm

● Construct an ensemble of decision trees by
manipulating the training set as well as the features
– Use a bootstrap sample to train every decision
tree (similar to Bagging)
– Use the following tree induction algorithm:
◆ At every internal node of the decision tree, randomly
sample p attributes for selecting the split criterion
◆ Repeat this procedure until all leaves are pure
(unpruned tree)
10/11/2021 Introduction to Data Mining, 2nd Edition 23

Characteristics of Random Forest

[Slide content shown as a figure in the original.]
10/11/2021 Introduction to Data Mining, 2nd Edition 24
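The two randomization knobs just described (bootstrap sampling of rows, a random feature subset at each split) map directly onto scikit-learn's RandomForestClassifier; a minimal sketch on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    bootstrap=True,        # each tree trains on a bootstrap sample of the rows
    max_features="sqrt",   # p = sqrt(#features) attributes sampled at each split
    random_state=0,
)
print(f"5-fold CV accuracy: {cross_val_score(rf, X, y, cv=5).mean():.3f}")
```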
Gradient Boosting

● Constructs a series of models
– Models can be any predictive model that has a
differentiable loss function
– Commonly, trees are the chosen model
◆ XGBoost (eXtreme Gradient Boosting) is a popular
package because of its impressive performance
● Boosting can be viewed as optimizing the loss
function by iterative functional gradient descent.
● Implementations of various boosted algorithms
are available in Python, R, Matlab, and more.
10/11/2021 Introduction to Data Mining, 2nd Edition 25
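A minimal sketch of iterative functional gradient descent for squared loss (an illustration, not any package's internals; data, depth, and learning rate are assumptions): each round fits a small tree to the current residuals, i.e. the negative gradient, and adds it with a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

lr, trees = 0.1, []
F = np.full_like(y, y.mean())             # start from a constant model
for _ in range(100):
    residual = y - F                      # negative gradient of squared loss
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    F += lr * t.predict(X)                # functional gradient step

print(f"training MSE after 100 rounds: {np.mean((y - F) ** 2):.4f}")
```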


Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 4
Classification_Part2

Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar

Acknowledgments: The slides in this presentation are mostly the textbook slides from the
textbook "Introduction to Data Mining" by Tan, Steinbach, Karpatne, Kumar. The slides have
been modified, i.e., some new slides have been added and some slides have been deleted.
Information in this presentation has also been obtained from a wide variety of publicly
available Internet sources.
2/10/2021 Introduction to Data Mining, 2nd Edition 1

Instance-Based Learning

Nearest Neighbor Classifiers

● Basic idea:
– If it walks like a duck, quacks like a duck, then
it's probably a duck
[Figure: compute the distance from the test record to the training
records, then choose the k "nearest" records.]
2/10/2021 Introduction to Data Mining, 2nd Edition 3

Nearest-Neighbor Classifiers

● Requires the following:
– A set of labeled records
– A proximity metric to compute the distance/similarity
between a pair of records
– e.g., Euclidean distance
– The value of k, the number of nearest neighbors to retrieve
– A method for using the class labels of the k nearest neighbors
to determine the class label of the unknown record (e.g., by
taking a majority vote)
2/10/2021 Introduction to Data Mining, 2nd Edition 4
How to Determine the class label of a Test Sample?

[Slide content shown as a figure in the original.]
2/10/2021 Introduction to Data Mining, 2nd Edition 5

Choice of proximity measure matters

● For documents, cosine is better than correlation or Euclidean
111111111110 vs 011111111111
000000000001 vs 100000000000
Euclidean distance = 1.4142 for both pairs, but
the cosine similarity measure has different
values for these pairs.
2/10/2021 Introduction to Data Mining, 2nd Edition 6
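A quick check of the claim (a sketch, not from the slides): both pairs sit at Euclidean distance √2 ≈ 1.4142, yet their cosine similarities differ sharply.

```python
import numpy as np

a = np.array([1]*11 + [0]); b = np.array([0] + [1]*11)   # 111111111110 vs 011111111111
c = np.array([0]*11 + [1]); d = np.array([1] + [0]*11)   # 000000000001 vs 100000000000

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.linalg.norm(a - b), cosine(a, b))  # 1.4142..., 0.909 (10 of 11 terms shared)
print(np.linalg.norm(c - d), cosine(c, d))  # 1.4142..., 0.0   (no terms shared)
```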

Nearest Neighbor Classification…

● Data preprocessing is often required
– Attributes may have to be scaled to prevent distance
measures from being dominated by one of the attributes
◆ Example:
– height of a person may vary from 1.5m to 1.8m
– weight of a person may vary from 90lb to 300lb
– income of a person may vary from $10K to $1M
– Time series are often standardized to have zero
mean and a standard deviation of 1
2/10/2021 Introduction to Data Mining, 2nd Edition 7

Nearest Neighbor Classification…

● Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, the neighborhood may include points from
other classes
2/10/2021 Introduction to Data Mining, 2nd Edition 8
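A minimal sketch of the scaling fix (illustrative numbers throughout): without standardization, income in dollars swamps height in meters in the Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows: [height in m, income in $]; income dominates the raw Euclidean distance.
X = np.array([[1.5, 10_000.0],
              [1.8, 12_000.0],
              [1.6, 950_000.0]])

raw = np.linalg.norm(X[0] - X[1])            # ~2000: driven entirely by income
Xs = StandardScaler().fit_transform(X)       # each column: zero mean, unit variance
scaled = np.linalg.norm(Xs[0] - Xs[1])       # height now matters too
print(f"raw distance: {raw:.1f}   standardized distance: {scaled:.3f}")
```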
Nearest-neighbor classifiers

● Nearest neighbor classifiers are local classifiers
● They can produce decision boundaries of arbitrary shapes.
● The 1-NN decision boundary is a Voronoi Diagram
2/10/2021 Introduction to Data Mining, 2nd Edition 9

Nearest Neighbor Classification…

● How to handle missing values in training and test sets?
– Proximity computations normally require the
presence of all attributes
– Some approaches use the subset of attributes
present in two instances
◆ This may not produce good results since it
effectively uses different proximity measures for
each pair of instances
◆ Thus, proximities are not comparable
2/10/2021 Introduction to Data Mining, 2nd Edition 10

K-NN Classifiers… Handling Irrelevant and Redundant Attributes

– Irrelevant attributes add noise to the proximity measure
– Redundant attributes bias the proximity measure towards certain attributes
2/10/2021 Introduction to Data Mining, 2nd Edition 11

K-NN Classifiers: Handling attributes that are interacting

[Slide content shown as a figure in the original.]
2/10/2021 Introduction to Data Mining, 2nd Edition 12
Improving KNN Efficiency

● Avoid having to compute distance to all objects in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
● Condensing
– Determine a smaller set of objects that give the same performance
● Editing
– Remove objects to improve efficiency
2/10/2021 Introduction to Data Mining, 2nd Edition 13

Rule-based Classification
2/10/2021 Introduction to Data Mining, 2nd Edition 14

Data Mining
Rule-Based Classifier
Classification: Alternative Techniques
● Classify records by using a collection of
Lecture Notes for Chapter 4 “if…then…” rules
● Rule: (Condition) → y
Rule-Based – where
◆ Condition is a conjunction of tests on attributes
◆ y is the class label
Introduction to Data Mining , 2nd Edition – Examples of classification rules:
by ◆ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
Tan, Steinbach, Karpatne, Kumar ◆ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

2/10/2021 Introduction to Data Mining, 2nd Edition 15 2/10/2021 Introduction to Data Mining, 2nd Edition 16
Rule-based Classifier (Example) Application of Rule-Based Classifier

● A rule r covers an instance x if the attributes of


the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
2/10/2021 Introduction to Data Mining, 2nd Edition 17 2/10/2021 Introduction to Data Mining, 2nd Edition 18

Rule Coverage and Accuracy

● Coverage of a rule:
– Fraction of records that satisfy the antecedent of a rule
● Accuracy of a rule:
– Fraction of records that satisfy the antecedent that
also satisfy the consequent of the rule
Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50%
2/10/2021 Introduction to Data Mining, 2nd Edition 19

How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
2/10/2021 Introduction to Data Mining, 2nd Edition 20
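A minimal sketch of computing rule coverage and accuracy over a toy record set (the records are invented for illustration, chosen so the numbers match the 40%/50% example above):

```python
records = [  # (status, defaulted) — toy data, 10 records
    ("Single", "No"), ("Single", "No"), ("Single", "Yes"), ("Single", "Yes"),
    ("Married", "No"), ("Married", "No"), ("Married", "No"),
    ("Divorced", "No"), ("Divorced", "Yes"), ("Married", "No"),
]

# Rule: (Status = Single) -> No
covered = [r for r in records if r[0] == "Single"]
correct = [r for r in covered if r[1] == "No"]

coverage = len(covered) / len(records)   # fraction satisfying the antecedent
accuracy = len(correct) / len(covered)   # fraction of covered satisfying the consequent
print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")  # 40%, 50%
```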
Characteristics of Rule Sets: Strategy 1 Characteristics of Rule Sets: Strategy 2

● Mutually exclusive rules ● Rules are not mutually exclusive


– Classifier contains mutually exclusive rules if – A record may trigger more than one rule
the rules are independent of each other – Solution?
– Every record is covered by at most one rule ◆ Ordered rule set
◆ Unordered rule set – use voting schemes
● Exhaustive rules
– Classifier has exhaustive coverage if it ● Rules are not exhaustive
accounts for every possible combination of – A record may not trigger any rules
attribute values – Solution?
– Each record is covered by at least one rule ◆ Use a default class

2/10/2021 Introduction to Data Mining, 2nd Edition 21 2/10/2021 Introduction to Data Mining, 2nd Edition 22

Ordered Rule Set Rule Ordering Schemes

● Rules are rank ordered according to their priority ● Rule-based ordering


– An ordered rule set is known as a decision list – Individual rules are ranked based on their quality
● When a test record is presented to the classifier ● Class-based ordering
– It is assigned to the class label of the highest ranked rule it has – Rules that belong to the same class appear together
triggered
– If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) →
Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

2/10/2021 Introduction to Data Mining, 2nd Edition 23 2/10/2021 Introduction to Data Mining, 2nd Edition 24
Building Classification Rules

● Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte's 1R
● Indirect Method:
◆ Extract rules from other classification models (e.g.
decision trees, neural networks, etc.).
◆ Examples: C4.5rules
2/10/2021 Introduction to Data Mining, 2nd Edition 25

Direct Method: Sequential Covering

1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion
is met (a small illustrative sketch of this loop follows below)
2/10/2021 Introduction to Data Mining, 2nd Edition 26
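A minimal sketch of the sequential-covering loop (illustrative throughout: the toy records, the greedy one-condition stand-in for Learn-One-Rule, and the stopping criterion are all assumptions, not the textbook's algorithm):

```python
# Toy records: (attribute dict, class label)
records = [
    ({"fly": "yes", "birth": "no"}, "bird"),
    ({"fly": "yes", "birth": "no"}, "bird"),
    ({"fly": "no", "birth": "yes"}, "mammal"),
    ({"fly": "no", "birth": "yes"}, "mammal"),
    ({"fly": "no", "birth": "no"}, "reptile"),
]

def learn_one_rule(data, target):
    """Greedy stand-in for Learn-One-Rule: pick the single (attr, val)
    condition with the highest accuracy for the target class."""
    best = None
    for attr in data[0][0]:
        for val in {r[0][attr] for r in data}:
            covered = [r for r in data if r[0][attr] == val]
            acc = sum(r[1] == target for r in covered) / len(covered)
            if best is None or acc > best[0]:
                best = (acc, attr, val)
    return best[1], best[2]

rules, remaining, target = [], list(records), "bird"
while any(r[1] == target for r in remaining):        # until stopping criterion
    attr, val = learn_one_rule(remaining, target)    # step 2: grow a rule
    rules.append((attr, val, target))
    remaining = [r for r in remaining if r[0][attr] != val]  # step 3: remove covered
print(rules)  # e.g. [('fly', 'yes', 'bird')]
```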

Example of Sequential Covering Example of Sequential Covering…

2/10/2021 Introduction to Data Mining, 2nd Edition 27 2/10/2021 Introduction to Data Mining, 2nd Edition 28
Rule Growing

● Two common strategies
2/10/2021 Introduction to Data Mining, 2nd Edition 29

Rule Evaluation

● FOIL (First Order Inductive Learner) – an early
rule-based learning algorithm
2/10/2021 Introduction to Data Mining, 2nd Edition 30

Direct Method: RIPPER Direct Method: RIPPER

● For 2-class problem, choose one of the classes as ● Growing a rule:


positive class, and the other as negative class – Start from empty rule
– Learn rules for positive class – Add conjuncts as long as they improve FOIL’s
– Negative class will be default class information gain
– Stop when rule no longer covers negative examples
● For multi-class problem
– Prune the rule immediately using incremental reduced
– Order the classes according to increasing class error pruning
prevalence (fraction of instances that belong to a
– Measure for pruning: v = (p-n)/(p+n)
particular class)
◆ p: number of positive examples covered by the rule in
– Learn the rule set for smallest class first, treat the rest the validation set
as negative class ◆ n: number of negative examples covered by the rule in
the validation set
– Repeat with next smallest class as positive class
– Pruning method: delete any final sequence of
conditions that maximizes v
2/10/2021 Introduction to Data Mining, 2nd Edition 31 2/10/2021 Introduction to Data Mining, 2nd Edition 32
Direct Method: RIPPER Direct Method: RIPPER

● Building a Rule Set: ● Optimize the rule set:


– Use sequential covering algorithm – For each rule r in the rule set R
◆ Finds the best rule that covers the current set of ◆ Consider 2 alternative rules:
positive examples – Replacement rule (r*): grow new rule from scratch
◆ Eliminate both positive and negative examples – Revised rule(r′): add conjuncts to extend the rule r
covered by the rule ◆ Compare the rule set for r against the rule set for r*
and r′
– Each time a rule is added to the rule set,
◆ Choose rule set that minimizes MDL principle (see
compute the new description length
next slide)
◆ Stop adding new rules when the new description
length is d bits longer than the smallest description – Repeat rule generation and rule optimization
length obtained so far for the remaining positive examples

2/10/2021 Introduction to Data Mining, 2nd Edition 33 2/10/2021 Introduction to Data Mining, 2nd Edition 34

What is MDL (Minimum Description Length)? Indirect Methods

● The minimum description length (MDL) principle


– Given a limited set of observed data, the best
explanation is the one that allows for the maximum
compression of the data.
● The minimum description length (MDL) principle is
very important in ML, pattern recognition, inductive
inference and statistical modeling.
● Used in scenarios where the models are very
complex, and overfitting is a serious issue
Source:
https://mitpress.mit.edu/9780262529631/the-minimum-description-length-principle/#:~:text=T
he%20minimum%20description%20length%20(MDL)%20principle%20is%20a%20powerful
%20method,greatest%20compression%20of%20the%20data.
2/10/2021 Introduction to Data Mining, 2nd Edition 35 2/10/2021 Introduction to Data Mining, 2nd Edition 36
Indirect Method: C4.5rules Indirect Method: C4.5rules

● Extract rules from an unpruned decision tree ● Instead of ordering the rules, order subsets of
● For each rule, r: A → y, rules (class ordering)
– consider an alternative rule r′: A′ → y where A′ – Each subset is a collection of rules with the
is obtained by removing one of the conjuncts same rule consequent (class)
in A
– Compare the pessimistic error rate for r
against all r’s
– Prune if one of the alternative rules has lower
pessimistic error rate
– Repeat until we can no longer improve
generalization error

2/10/2021 Introduction to Data Mining, 2nd Edition 37 2/10/2021 Introduction to Data Mining, 2nd Edition 38

Example C4.5 versus C4.5rules versus RIPPER


C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No)
→ Reptiles
(Can Fly=Yes,Give Birth=No) → Birds
() → Mammals

2/10/2021 Introduction to Data Mining, 2nd Edition 39 2/10/2021 Introduction to Data Mining, 2nd Edition 40
C4.5 versus C4.5rules versus RIPPER Advantages of Rule-Based Classifiers

C4.5 and C4.5rules: ● Has characteristics quite similar to decision trees


– As highly expressive as decision trees
– Easy to interpret (if rules are ordered by class)
– Performance comparable to decision trees
◆Can handle redundant and irrelevant attributes
RIPPER: ◆ Variable interaction can cause issues (e.g., X-OR problem)
● Better suited for handling imbalanced classes
● Harder to handle missing values in the test set

2/10/2021 Introduction to Data Mining, 2nd Edition 41 2/10/2021 Introduction to Data Mining, 2nd Edition 42

Data Mining
Classification: Alternative Techniques

Imbalanced Class Problem


Imbalanced Class Problem
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar

2/10/2021 Introduction to Data Mining, 2nd Edition 43 2/10/2021 Introduction to Data Mining, 2nd Edition 44
Class Imbalance Problem

● Lots of classification problems where the classes are skewed
(more records from one class than another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
– COVID-19 test results on a random sample
● Key Challenge:
– Evaluation measures such as accuracy are not
well-suited for imbalanced classes
2/10/2021 Introduction to Data Mining, 2nd Edition 45

Confusion Matrix

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     a         b
CLASS  Class=No      c         d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
2/10/2021 Introduction to Data Mining, 2nd Edition 46

Accuracy

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes   a (TP)    b (FN)
CLASS  Class=No    c (FP)    d (TN)

● Most widely-used metric (the formula appeared as a figure):
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + FN + FP + TN)
2/10/2021 Introduction to Data Mining, 2nd Edition 47

Problem with Accuracy

● Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
● If a model predicts everything to be class NO, accuracy is
990/1000 = 99%
– This is misleading because this trivial model does not detect any class
YES example
– Detecting the rare class is usually more interesting (e.g., frauds,
intrusions, defects, etc.)

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     0        10
CLASS  Class=No      0       990
2/10/2021 Introduction to Data Mining, 2nd Edition 48
Which model is better?

A:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes     0        10
       Class=No      0       990
Accuracy: 99%

B:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
       Class=No    500       490
Accuracy: 50%
2/10/2021 Introduction to Data Mining, 2nd Edition 49

Which model is better?

A:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes     5         5
       Class=No      0       990

B:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
       Class=No    500       490
2/10/2021 Introduction to Data Mining, 2nd Edition 50
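To see why accuracy alone cannot rank the two models on the first slide above, here is a minimal sketch (counts taken from that slide; the precision/recall/F1 definitions are the standard ones, added for illustration):

```python
def metrics(tp, fn, fp, tn):
    acc = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, precision, recall, f1

for name, cm in {"A": (0, 10, 0, 990), "B": (10, 0, 500, 490)}.items():
    acc, p, r, f1 = metrics(*cm)
    print(f"{name}: accuracy={acc:.2f} precision={p:.3f} recall={r:.2f} F1={f1:.3f}")
# A: accuracy=0.99 precision=0.000 recall=0.00 F1=0.000  (never finds the rare class)
# B: accuracy=0.50 precision=0.020 recall=1.00 F1=0.038  (finds every rare case)
```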

Alternative Measures

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     a         b
CLASS  Class=No      c         d
2/10/2021 Introduction to Data Mining, 2nd Edition 51

Alternative Measures

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
CLASS  Class=No     10       980
2/10/2021 Introduction to Data Mining, 2nd Edition 52
Alternative Measures

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
CLASS  Class=No     10       980

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     1         9
CLASS  Class=No      0       990
2/10/2021 Introduction to Data Mining, 2nd Edition 53

Which of these classifiers is better?

A:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No     10        40

B:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No   1000      4000
2/10/2021 Introduction to Data Mining, 2nd Edition 54

Measures of Classification Performance

            PREDICTED CLASS
              Yes     No
ACTUAL  Yes   TP      FN
CLASS   No    FP      TN

α is the probability that we reject the null hypothesis when it is
true. This is a Type I error or a false positive (FP).
β is the probability that we accept the null hypothesis when it is
false. This is a Type II error or a false negative (FN).
2/10/2021 Introduction to Data Mining, 2nd Edition 55

Alternative Measures

A:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No     10        40

B:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No   1000      4000
2/10/2021 Introduction to Data Mining, 2nd Edition 56
Which of these classifiers is better?

A:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    10        40
CLASS  Class=No     10        40

B:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    25        25
CLASS  Class=No     25        25

C:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No     40        10
2/10/2021 Introduction to Data Mining, 2nd Edition 57


Clustering_Part2
Quick recap on clustering and its
applications

Acknowledgments: The slides in this presentation are mostly the textbook slides
from the textbook “Data Mining: Concepts and Techniques” by Jiawei Han,
Micheline Kamber, and Jian Pei. The slides have been modified i.e., some new
slides have been added and some slides have been deleted. Information in this
presentation has also been obtained from a wide variety of publicly available
Internet sources.
1
Han/Eick: Clustering II 2

General Applications of Clustering

■ Pattern Recognition
■ Spatial Data Analysis
■ create thematic maps in GIS by clustering feature spaces
■ detect spatial clusters and explain them in spatial data mining
■ Image Processing
■ Economic Science (especially market research)
■ WWW
■ Document classification
■ Cluster Weblog data to discover groups of similar access patterns
Han/Eick: Clustering II 3

Examples of Clustering Applications

■ Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
■ Land use: Identification of areas of similar land use in an
earth observation database
■ Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
■ City-planning: Identifying groups of houses according to
their house type, value, and geographical location
■ Earthquake studies: Observed earthquake epicenters
should be clustered along continent faults
Han/Eick: Clustering II 4


What Is Good Clustering?

■ A good clustering method will produce high-quality clusters with
■ high intra-class similarity
■ low inter-class similarity
■ The quality of a clustering result depends on both the
similarity measure used by the method and its implementation.
■ The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Han/Eick: Clustering II 5

Requirements of Clustering in Data Mining

■ Scalability
■ Ability to deal with different types of attributes
■ Discovery of clusters with arbitrary shape
■ Minimal requirements for domain knowledge to determine input parameters
■ Able to deal with noise and outliers
■ Insensitive to order of input records
■ High dimensionality
■ Incorporation of user-specified constraints
■ Interpretability and usability
Han/Eick: Clustering II 6

Conceptual note on REALLY learning


algorithms

■ In this course, you have seen several


algorithms for clustering & classification
Conceptual notes ■ What is more important?
A) Memorizing the key steps of an algorithm
B) Understanding the core idea of an algorithm

Han/Eick: Clustering II 7 Han/Eick: Clustering II 8


Conceptual note on REALLY learning algorithms

■ In this course, you have seen several algorithms for clustering & classification
■ What is more important?
A) Memorizing the key steps of an algorithm
B) Understanding the core idea of an algorithm
■ B) is the correct answer.
■ If you memorize just the steps of an algorithm, you won't remember them for very long
■ And more importantly, you won't know how and where to apply it
■ Focus on understanding, not memorizing

Han/Eick: Clustering II 9 Han/Eick: Clustering II 10

■ You should understand that most of these algorithms
were developed to solve specific problems
■ Hence, when you see a new problem, simply using one
algorithm may not be effective
■ Need to combine ideas from different algorithms
■ As technology progresses and technological
environments keep changing, you would have to make
some modifications to these existing algorithms to
make them applicable to solving current problems

■ Example: The classical algorithms you studied were
proposed more than a decade ago
■ At that time, main memory, computational power, etc.
were not as good as they are now
■ But now a lot of data management work, including ML, will
occur on mobile platforms, whereas those algorithms
were designed for fixed computing infrastructures such
as centralized systems and clusters
■ Mobile platforms mean energy constraints, mobility
issues, network partitioning issues, connectivity,
distributed and autonomous settings
■ Hence, you would need to make changes to all these
algorithms to apply them to these new settings

Han/Eick: Clustering II 11 Han/Eick: Clustering II 12


Scalability issue

■ For solving real-world ML problems, you need ■ As you all know by now, datasets can be VERY
to understand which mix of ideas from different LARGE
algorithms you want to use ■ The usual techniques used for dealing in a
■ This is a judgment call, hence some amount scalable manner with VERY LARGE datasets
of thinking is required ■ Sampling

■ Partitioning (divide and conquer)

Han/Eick: Clustering II 13 Han/Eick: Clustering II 14

Examples for how to combine ideas from Examples for how to combine ideas from
different algorithms different algorithms

■ Suppose you need to cluster a very large ■ Suppose you need to cluster a very large
spatial dataset spatial dataset
■ You could first use a grid-based approach by
imposing a grid structure on the dataset
■ Then run any clustering algorithm within
each grid
■ You can define an epsilon factor to take
“fringe” objects into consideration
■ You could cluster for different grids in
parallel → faster execution time
■ This is essentially divide and conquer.

Han/Eick: Clustering II 15 Han/Eick: Clustering II 16


Examples for how to combine ideas from Examples for how to combine ideas from
different algorithms different algorithms

■ Suppose you need to cluster a very large ■ Instead of random sampling, can you use some
spatial dataset other sampling approach?
■ Yes!
■ You could use sampling

■ Instead of clustering the whole dataset, you


■ Better way of selecting samples would likely
lead to better clustering results
can select a sample of k representative
■ Can use domain knowledge to select “good”
points and then run any clustering algorithm
samples
only on these k points
■ Can use stratified sampling
■ Larger sample size generally leads to better
■ Can select the sample points as far away
accuracy from each other as possible
Han/Eick: Clustering II 17 Han/Eick: Clustering II 18

Examples for how to combine ideas from Examples for how to combine ideas from
different algorithms different algorithms

■ Suppose you need to cluster a very large ■ Suppose you need to cluster a very large
spatial dataset spatial dataset
■ You could simply use domain knowledge to
■ You could use hierarchical agglomerative

clustering on a sample of representative figure out which areas are dense and which
areas are sparse
points
■ Then use any clustering algorithm in each
■ Then use any clustering algorithm at the
dense area
desired level of the hierarchy ■ The sparse areas could be combined (if
appropriate) and then you can use any
clustering algorithm on the merged areas

Han/Eick: Clustering II 19 Han/Eick: Clustering II 20


How to cluster a dataset that has plenty Can you combine ideas from clustering
of dimensions? with association rule mining algorithms?

■ First figure out which dimensions are most relevant to ■ Suppose you need to do some data analysis on
your analysis
40,000 different items in a supermarket
■ Look at the question that you are trying to answer,
and you will know which dimensions are most ■ You could first run a clustering algorithm to
relevant to that question break those items into clusters
■ Can use any dimensionality reduction technique
■ Now on each cluster, you can do your
■ Doing clustering in high-dimensional space may give
you results, whose significance is hard to interpret analysis (this is more like divide and
■ At very high dimensions, similarity/dissimilarity
conquer)
among the points (objects) may get blurred
■ Bottomline: Do the clustering only on the dimensions
that are relevant to your analysis

Han/Eick: Clustering II 21 Han/Eick: Clustering II 22

Can you combine ideas from clustering


with association rule mining algorithms?

■ Suppose you need to do some data analysis on


40,000 different items in a supermarket
■ You could select a set of representative items
Broad Overview of clustering
first (could be based on domain knowledge algorithms
or could be found by clustering)
■ Then run your data analysis algorithm(s) only

on the representative items


■ Note that this approach is conceptually similar
to the approach on the previous slide.

Han/Eick: Clustering II 23 Han/Eick: Clustering II 24


Major Clustering Approaches

■ Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
■ Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
■ Grid-based: based on a multiple-level granularity structure
■ Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of that model to
each other
■ Density-based: based on connectivity and density functions
Han/Eick: Clustering II 25

Major clustering approaches

■ Note: The first three approaches to clustering
(and especially using a permutation and
combination of ideas from these approaches)
are more than adequate to deal with almost all
kinds of clustering problems you can think of
■ Can you think of any real-world clustering
problem that cannot be solved by applying
ideas from the first three approaches?
■ In fact, most clustering problems can be solved
by using partitioning approaches like k-means.
■ K-means is among the most popular
clustering algorithms!
Han/Eick: Clustering II 26

Partitioning algorithms

Partitioning Algorithms: Basic Concept
■ Quick recap
■ Heuristic methods: k-means and k-medoids algorithms
■ k-means: Each cluster is represented by the center of the cluster
■ k-medoids or PAM (Partition around medoids): Each
cluster is represented by one of the objects in the cluster

Han/Eick: Clustering II 27 Han/Eick: Clustering II 28


The K-Medoids Clustering Method

■ Find representative objects, called medoids, in clusters
■ PAM (Partitioning Around Medoids, 1987)
■ starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the
non-medoids if it improves the total distance of the
resulting clustering
■ PAM works effectively for small data sets, but does not
scale well for large data sets
■ CLARA (Kaufmann & Rousseeuw, 1990)
■ CLARANS (Ng & Han, 1994): Randomized sampling
Han/Eick: Clustering II 29

Using sampling

■ PAM does not scale well to large data sets
■ Can you use sampling?
■ YES!
■ Now we will discuss two variations of PAM
which use sampling
Han/Eick: Clustering II 30

CLARA (Clustering Large Applications) (1990)

■ CLARA (Kaufmann and Rousseeuw in 1990)
■ Built into statistical analysis packages, such as S+
■ It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
■ Strength: deals with larger data sets than PAM
■ Weakness:
■ Efficiency depends on the sample size
■ A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
Han/Eick: Clustering II 31

CLARANS ("Randomized" CLARA) (1994)

■ CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han'94)
■ CLARANS draws a sample of neighbors dynamically
■ The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
■ If the local optimum is found, CLARANS starts with a new
randomly selected node in search for a new local optimum
■ It is more efficient and scalable than both PAM and CLARA
Han/Eick: Clustering II 32
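A minimal CLARA-style sketch of the sample-then-cluster idea (an illustration only, not the original algorithm: it uses k-means as a stand-in partitioner since PAM is not in scikit-learn, and the sample size and number of samples are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)
rng = np.random.default_rng(0)

best_model, best_cost = None, np.inf
for _ in range(5):                                   # CLARA: several independent samples
    sample = X[rng.choice(len(X), size=1000, replace=False)]
    model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample)
    # Score each sample's clustering on the FULL data set; keep the best.
    cost = ((X - model.cluster_centers_[model.predict(X)]) ** 2).sum()
    if cost < best_cost:
        best_model, best_cost = model, cost

print(f"best total cost over 5 samples: {best_cost:.3e}")
```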
Hierarchical algorithms

Hierarchical Clustering
■ Quick recap of hierarchical clustering
■ Use distance matrix as clustering criteria.
■ This method does not require the number of clusters k
as an input, but needs a termination condition
■ Please revise from the lecture slides on hierarchical
clustering, especially using the hierarchical clustering
animation for the agglomerative case.

Han/Eick: Clustering II 33 Han/Eick: Clustering II 34

Remarks on Hierarchical Clustering Methods

■ Major weakness of agglomerative clustering methods
■ do not scale well: time complexity of at least O(n²),
where n is the number of total objects
■ can never undo what was done previously
■ Integration of hierarchical with distance-based clustering
■ BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
■ CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of the
cluster by a specified fraction
■ CHAMELEON (1999): hierarchical clustering using
dynamic modeling
Han/Eick: Clustering II 35

Remarks on Hierarchical Clustering Methods

■ Use hierarchical clustering to obtain a "rough cut"
■ That is, don't apply hierarchical clustering on a
very LARGE dataset because hierarchical
clustering algorithms are generally not very scalable
■ Look at the hierarchical clustering animation and you
will understand the reason for this
■ Since hierarchical clustering is such that undo is
not possible, your initial set on which you want to
do the clustering must ensure that no undo would
actually be required
Han/Eick: Clustering II 36
Remarks on Hierarchical Clustering Methods

■ Once you have obtained a "rough cut" by doing
hierarchical clustering, you can use any
partitioning approach to do clustering at a finer
granularity

Grid-based clustering algorithms

Han/Eick: Clustering II 37 Han/Eick: Clustering II 38

Steps of Grid-based Clustering Algorithms

Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and
compute the density of each cell.
3. Eliminate cells whose density is below a
certain threshold τ.
4. Form clusters from contiguous (adjacent)
groups of dense cells (usually minimizing a
given objective function)
Han/Eick: Clustering II 39

Advantages of Grid-based Clustering Algorithms

■ fast:
■ No distance computations
■ Clustering is performed on summaries and not
individual objects; complexity is usually
O(#-populated-grid-cells) and not O(#objects)
■ Easy to determine which clusters are neighboring
■ Shapes are limited to unions of grid-cells
Han/Eick: Clustering II 40
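A minimal sketch of the four steps on 2-D data (illustrative assumptions: the grid resolution, the threshold τ = 5, and 4-connectivity as the meaning of "contiguous"), using scipy's connected-component labeling to merge adjacent dense cells:

```python
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(0)
# Two blobs plus background noise.
pts = np.vstack([rng.normal((2, 2), 0.4, (200, 2)),
                 rng.normal((7, 7), 0.4, (200, 2)),
                 rng.uniform(0, 10, (50, 2))])

# Steps 1-2: define grid cells and compute per-cell density (counts).
density, xe, ye = np.histogram2d(pts[:, 0], pts[:, 1], bins=20, range=[[0, 10], [0, 10]])

# Step 3: eliminate cells below threshold tau.
tau = 5
dense = density >= tau

# Step 4: form clusters from contiguous groups of dense cells (4-connectivity).
clusters, n_clusters = label(dense)
print(f"found {n_clusters} grid-based clusters")  # likely 2 for this synthetic data
```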


Grid-Based Clustering Methods

■ Using multi-resolution grid data structure
■ Clustering complexity depends on the number of
populated grid cells and not on the number of objects in
the dataset
■ Several interesting methods (in addition to the basic
grid-based algorithm)
■ STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
Han/Eick: Clustering II 41

STING: A Statistical Information Grid Approach

■ Wang, Yang and Muntz (VLDB'97)
■ The spatial area is divided into rectangular cells
■ There are several levels of cells corresponding to different
levels of resolution
Han/Eick: Clustering II 42

STING: A Statistical Information Grid Approach (2)

■ Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
■ Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
■ Parameters of higher-level cells can be easily calculated from
parameters of lower-level cells
■ count, mean, s (standard deviation), min, max
■ type of distribution—normal, uniform, etc.
■ Use a top-down approach to answer spatial data queries
Han/Eick: Clustering II 43

STING: A Statistical Information Grid Approach (3)

■ Advantages:
■ Query-independent, easy to parallelize, incremental update
■ O(K), where K is the number of grid cells at the lowest level
■ Disadvantages:
■ All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
Han/Eick: Clustering II 44


Model-based clustering algorithms

Self-organizing feature maps (SOMs)
■ Clustering is also performed by having several units
competing for the current object
■ The unit whose weight vector is closest to the current object wins
■ The winner and its neighbors learn by having their weights adjusted
■ SOMs are believed to resemble processing that can occur in the brain
■ Useful for visualizing high-dimensional data in 2- or 3-D space

Han/Eick: Clustering II 45 Han/Eick: Clustering II 46

Model-Based Clustering Methods COBWEB Clustering Method


■ Attempt to optimize the fit between the data and some
mathematical model A classification tree
■ Statistical and AI approach
■ Conceptual clustering

■ A form of clustering in machine learning


■ Produces a classification scheme for a set of unlabeled objects
■ Finds characteristic description for each concept (class)
■ COBWEB (Fisher’87)
■ A popular and simple method of incremental conceptual learning
■ Creates a hierarchical clustering in the form of a classification
tree
■ Each node refers to a concept and contains a probabilistic
description of that concept

Han/Eick: Clustering II 47 Han/Eick: Clustering II 48


More on Statistical-Based Clustering

■ Limitations of COBWEB
■ The assumption that the attributes are independent
of each other is often too strong because correlation
may exist
■ Not suitable for clustering large database data –
skewed tree and expensive probability distributions
Han/Eick: Clustering II 49

Outlier discovery
Han/Eick: Clustering II 50

What Is Outlier Discovery?

■ What are outliers?
■ The set of objects that are considerably dissimilar from
the remainder of the data
■ Example: Sports: Michael Jordan, Wayne Gretzky, ...
■ Problem
■ Find top n outlier points
■ Applications:
■ Credit card fraud detection
■ Telecom fraud detection
■ Customer segmentation
■ Medical analysis
Han/Eick: Clustering II 51

Outlier Discovery: Statistical Approaches

● Assume a model of the underlying distribution that generates the
data set (e.g. normal distribution)
■ Use discordancy tests depending on
■ data distribution
■ distribution parameter (e.g., mean, variance)
■ number of expected outliers
■ Drawbacks
■ most tests are for a single attribute
■ In many cases, the data distribution may not be known
Han/Eick: Clustering II 52
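A minimal sketch of a statistical discordancy test under an assumed normal model (the 3σ threshold and the planted data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 1000), [95.0, 8.0]])  # two planted outliers

z = (x - x.mean()) / x.std()          # standardize under the assumed normal model
outliers = x[np.abs(z) > 3]           # discordancy test: |z| > 3
print("flagged outliers:", np.sort(outliers))
```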


Outlier Discovery: Deviation-Based Approach

■ Identifies outliers by examining the main characteristics
of objects in a group
■ Objects that "deviate" from this description are considered outliers
■ Sequential exception technique
■ simulates the way in which humans can distinguish
unusual objects from among a series of supposedly like objects
■ OLAP data cube technique
■ uses data cubes to identify regions of anomalies in
large multidimensional data
Han/Eick: Clustering II 53

Summary

■ Cluster analysis groups objects based on their similarity
and has wide applications
■ Measure of similarity can be computed for various types of data
■ Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
■ Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
■ There are still lots of research issues on cluster analysis,
such as constraint-based clustering
Han/Eick: Clustering II 54

Problems and Challenges

■ Considerable progress has been made in scalable clustering methods
■ Partitioning: k-means, k-medoids, CLARANS
■ Hierarchical: BIRCH, CURE
■ Density-based: DBSCAN, CLIQUE, OPTICS
■ Grid-based: STING, WaveCluster
■ Model-based: Autoclass, Denclue, Cobweb
■ Current clustering techniques do not address all the
requirements adequately
■ Constraint-based clustering analysis: Constraints exist in
data space (bridges and highways) or in user queries
Han/Eick: Clustering II 55

References (1)

■ R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98.
■ M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
■ M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure. SIGMOD'99.
■ P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
■ M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
■ M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
■ D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB'98.
■ S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
■ A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
Han/Eick: Clustering II 56


References (2)
■ L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
■ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
■ G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
■ P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
■ R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
■ E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
■ G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
■ W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial
Data Mining, VLDB’97.
■ T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method
for very large databases. SIGMOD'96.

Han/Eick: Clustering II 57
