
Introduction to Support Vector Machines

Thanks:
Andrew Moore, CMU
and
Martin Law, Michigan State University

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

■ For your end-term exam, the syllabus is slides 1-25.
History of SVM
■ SVM is related to statistical learning theory [3]
■ SVM was first introduced in 1992 [1]
■ SVM became popular because of its success in handwritten digit recognition
  ■ 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4
  ■ See Section 5.11 in [2] or the discussion in [3] for details
■ SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning
■ Note: the meaning of "kernel" here is different from the "kernel" function used in Parzen windows

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.

Linear Classifiers
x → f → yest
f(x,w,b) = sign(w·x − b)
  w: weight vector
  x: data vector
(Figure: one marker denotes +1, the other denotes −1.)
How would you classify this data?

Linear Classifiers
x → f → yest
f(x,w,b) = sign(w·x − b)
(Several slides repeat this setup on different example datasets, each asking: How would you classify this data?)

Linear Classifiers
x → f → yest
f(x,w,b) = sign(w·x − b)
Any of these would be fine..
..but which is best?
Classifier Margin
x → f → yest
f(x,w,b) = sign(w·x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin
x → f → yest
f(x,w,b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
Linear SVM

Why Maximum Margin?
f(x,w,b) = sign(w·x − b)
Support Vectors are those datapoints that the margin pushes up against.
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
How to calculate the distance from a point to a line?
(Figure: a point x at some distance from the line wx + b = 0.)
  X – vector
  W – normal vector
  b – scale value
■ http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
■ In our case, w1*x1 + w2*x2 + b = 0, thus w = (w1, w2), x = (x1, x2)

Estimate the Margin
■ What is the distance expression for a point x to the line wx + b = 0?
■ It is d(x) = |w·x + b| / ||w||: the absolute value of the discriminant function divided by the norm of the normal vector w.
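As a quick numeric check of the distance expression above, here is a minimal NumPy sketch (the line coefficients and the test point are made-up values for illustration):

```python
import numpy as np

def point_line_distance(x, w, b):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical line 3*x1 + 4*x2 - 10 = 0 and test point (2, 3)
w = np.array([3.0, 4.0])
b = -10.0
x = np.array([2.0, 3.0])
print(point_line_distance(x, w, b))  # |3*2 + 4*3 - 10| / 5 = 8/5 = 1.6
```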

Large-margin Decision Boundary
■ The decision boundary should be as far away from the data of both classes as possible
■ We should maximize the margin, m
■ Distance between the origin and the line wᵀx = −b is b/||w||
(Figure: Class 1 and Class 2 separated by a boundary with margin m.)

Finding the Decision Boundary
■ Let {x1, ..., xn} be our data set and let yi ∈ {1, −1} be the class label of xi
■ The decision boundary should classify all points correctly ⇒ yi(w·xi + b) ≥ 1 for all i
■ To see this: when yi = −1, we wish w·xi + b ≤ −1; when yi = 1, we wish w·xi + b ≥ 1. For support vectors, yi(w·xi + b) = 1.
■ The decision boundary can be found by solving the following constrained optimization problem:
  minimize (1/2)||w||²  subject to  yi(w·xi + b) ≥ 1 for all i
Next step…
■ Converting SVM to a form we can solve
  ■ Dual form
■ Allowing a few errors
  ■ Soft margin
■ Allowing nonlinear boundaries
  ■ Kernel functions

Optional: The Dual Problem (we ignore the derivation)
■ The new objective function is in terms of αi only
■ It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
■ The original problem is known as the primal problem
■ The objective function of the dual problem needs to be maximized!
■ The dual problem is therefore:
  maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi·xj)
  subject to αi ≥ 0 (the properties of αi when we introduce the Lagrange multipliers)
  and Σi αi yi = 0 (the result when we differentiate the original Lagrangian w.r.t. b)

The Dual Problem
■ This is a quadratic programming (QP) problem
  ■ A global maximum of αi can always be found
■ w can be recovered by w = Σi αi yi xi

Characteristics of the Solution
■ Many of the αi are zero (see the next page for an example)
  ■ w is a linear combination of a small number of data points
  ■ This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier
■ xi with non-zero αi are called support vectors (SV)
  ■ The decision boundary is determined only by the SV
  ■ Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write w = Σj αtj ytj xtj
■ For testing with a new data point z
  ■ Compute wᵀz + b = Σj αtj ytj (xtj·z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise
  ■ Note: w need not be formed explicitly
A Geometrical Interpretation
(Figure: Class 1 and Class 2 with the fitted boundary; α1=0.8, α2=0, α3=0, α4=0, α5=0, α6=1.4, α7=0, α8=0.6, α9=0, α10=0. Only the points with non-zero αi touch the margin.)

Allowing errors in our solutions
■ We allow "error" ξi in classification; it is based on the output of the discriminant function wᵀx + b
■ ξi approximates the number of misclassified samples
(Figure: Class 1 and Class 2 with some points inside the margin or on the wrong side.)

Soft Margin Hyperplane
■ If we minimize Σi ξi, ξi can be computed by ξi = max(0, 1 − yi(wᵀxi + b))
  ■ ξi are "slack variables" in optimization
  ■ Note that ξi = 0 if there is no error for xi
  ■ ξi is an upper bound on the number of errors
■ We want to minimize (1/2)||w||² + C Σi ξi
  ■ C: tradeoff parameter between error and margin
■ The optimization problem becomes:
  minimize (1/2)||w||² + C Σi ξi  subject to  yi(wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0

Extension to Non-linear Decision Boundary
■ So far, we have only considered large-margin classifiers with a linear decision boundary
■ How to generalize it to become nonlinear?
■ Key idea: transform xi to a higher-dimensional space to "make life easier"
  ■ Input space: the space where the points xi are located
  ■ Feature space: the space of φ(xi) after transformation
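To make the role of C concrete, here is a minimal sketch using scikit-learn's SVC (assuming scikit-learn is available; the toy data is made up). A small C tolerates more margin violations, a large C penalizes them heavily:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, nearly separable 2-D data
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5], [2.5, 2.5]])
y = np.array([-1, -1, -1, 1, 1, 1, -1])  # last point sits near the boundary

for C in (0.1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C -> fewer margin violations tolerated, tighter fit to the data
    print(C, clf.support_.tolist())  # indices of the support vectors
```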
Transforming the Data (c.f. DHS Ch. 5)
(Figure: points in the input space mapped by φ(·) into the feature space.)
  Input space → Feature space
Note: the feature space is of higher dimension than the input space in practice
■ Computation in the feature space can be costly because it is high dimensional
  ■ The feature space is typically infinite-dimensional!
■ The kernel trick comes to the rescue

The Kernel Trick
■ Recall the SVM optimization problem: maximize Σi αi − (1/2) Σi Σj αi αj yi yj (xi·xj)
■ The data points only appear as inner products
■ As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
■ Many common geometric operations (angles, distances) can be expressed by inner products
■ Define the kernel function K by K(xi, xj) = φ(xi)·φ(xj)

An Example for φ(.) and K(.,.)
■ Suppose φ(.) is given as follows (the standard degree-2 example):
  φ(x1, x2) = (1, √2·x1, √2·x2, x1², x2², √2·x1x2)
■ An inner product in the feature space is
  φ(x)·φ(y) = (1 + x1y1 + x2y2)²
■ So, if we define the kernel function as K(x, y) = (1 + x·y)², there is no need to carry out φ(.) explicitly
■ This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick
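A quick NumPy sketch to verify this identity numerically (the two test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, x2**2, np.sqrt(2)*x1*x2])

def K(x, y):
    """Polynomial kernel of degree 2: the same inner product, no mapping."""
    return (1 + np.dot(x, y))**2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))  # 4.0
print(K(x, y))                 # 4.0 -- identical, without forming phi
```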
More on Kernel Functions
■ Not all similarity measures can be used as kernel functions, however
■ The kernel function needs to satisfy the Mercer condition, i.e., the function is "positive semi-definite"
■ This implies that the n by n kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite
■ This also means that the optimization problem can be solved in polynomial time!

Examples of Kernel Functions
■ Polynomial kernel with degree d: K(x, y) = (x·y + 1)^d
■ Radial basis function (RBF) kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
  ■ Closely related to radial basis function neural networks
  ■ The feature space is infinite-dimensional
■ Sigmoid with parameters κ and θ: K(x, y) = tanh(κ(x·y) + θ)
  ■ It does not satisfy the Mercer condition for all κ and θ
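A short sketch of these kernels in NumPy, plus an eigenvalue check of the Mercer property on a small made-up dataset (a valid kernel matrix should have no significantly negative eigenvalues):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Made-up data: 4 points in 2-D
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
Kmat = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# Mercer check: eigenvalues of the kernel matrix should be >= 0
print(np.linalg.eigvalsh(Kmat))  # all non-negative for the RBF kernel
```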

Non-linear SVMs: Feature spaces
■ General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)

Example
■ Suppose we have 5 one-dimensional data points
  ■ x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2 ⇒ y1=1, y2=1, y3=−1, y4=−1, y5=1
■ We use the polynomial kernel of degree 2
  ■ K(x, y) = (xy + 1)²
  ■ C is set to 100
■ We first find αi (i = 1, …, 5) by maximizing Σi αi − (1/2) Σi Σj αi αj yi yj (xixj + 1)², subject to 0 ≤ αi ≤ 100 and Σi αi yi = 0
Example
■ By using a QP solver, we get
  ■ α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
  ■ Note that the constraints are indeed satisfied
  ■ The support vectors are {x2=2, x4=5, x5=6}
■ The discriminant function is
  f(z) = 2.5·(1)·(2z+1)² + 7.333·(−1)·(5z+1)² + 4.833·(1)·(6z+1)² + b = 0.6667z² − 5.333z + b
■ b is recovered by solving f(2)=1, or by f(5)=−1, or by f(6)=1, as x2 and x5 lie on the line wᵀφ(x)+b=1 and x4 lies on the line wᵀφ(x)+b=−1
■ All three give b=9, so f(z) = 0.6667z² − 5.333z + 9

Example
(Figure: value of the discriminant function over the real line; the data points 1, 2, 4, 5, 6 fall into the regions class 1 | class 2 | class 1.)
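This worked example can be reproduced (up to numerical tolerance) with scikit-learn, whose polynomial kernel (gamma·⟨x,y⟩ + coef0)^degree matches (xy+1)² when gamma=1, coef0=1, degree=2. A minimal sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1], [2], [4], [5], [6]], dtype=float)
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=100).fit(X, y)
print(clf.support_vectors_.ravel())  # expect [2. 5. 6.]
print(clf.dual_coef_)                # signed alphas, i.e. y_i * alpha_i
print(clf.intercept_)                # expect a value of about 9
```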

Degree of Polynomial Features
(Figure: decision boundaries obtained with polynomial features of increasing degree, X^1 through X^6.)

Choosing the Kernel Function
■ Probably the most tricky part of using SVM.
Software
■ A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
■ Some implementations (such as LIBSVM) can handle multi-class classification
■ SVMLight is among the earliest implementations of SVM
■ Several Matlab toolboxes for SVM are also available

Summary: Steps for Classification
■ Prepare the pattern matrix
■ Select the kernel function to use
■ Select the parameters of the kernel function and the value of C
  ■ You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
■ Execute the training algorithm and obtain the αi
■ Unseen data can be classified using the αi and the support vectors
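These steps map directly onto a few lines of scikit-learn; a minimal sketch (the dataset and parameter grid are illustrative choices, not recommendations):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step 1: prepare the pattern matrix (made-up data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# Steps 2-3: pick a kernel, then search over its parameter and C,
# using cross-validation in place of a separate validation set
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)  # step 4: training

# Step 5: classify unseen data with the fitted model
print(search.best_params_, search.predict([[0.2, 0.1], [2.0, 2.0]]))
```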

Conclusion
■ SVM is a useful alternative to neural networks
■ Two key concepts of SVM: maximize the margin and the kernel trick
■ Many SVM implementations are available on the web for you to try on your data set!

Resources
■ http://www.kernel-machines.org/
■ http://www.support-vector.net/
■ http://www.support-vector.net/icml-tutorial.pdf
■ http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
■ http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Appendix: Distance from a point to a line
(Figure: points P1, P2 define the line; P3 is the external point and P is the foot of the perpendicular.)
■ Equation for the line: let u be a variable; then any point on the line can be described as:
  ■ P = P1 + u(P2 − P1), i.e., x = x1 + u(x2 − x1), y = y1 + u(y2 − y1), where P1=(x1,y1), P2=(x2,y2), P3=(x3,y3)
■ Let the intersection point be P; the vector (P2 − P1) is orthogonal to (P3 − P):
  ■ That is, (P3 − P) · (P2 − P1) = 0

Distance and margin
■ Then, u can be determined by:
  u = ((x3 − x1)(x2 − x1) + (y3 − y1)(y2 − y1)) / ||P2 − P1||²
■ The distance between the point P3 and the line is therefore the distance between P = (x, y) above and P3
■ Thus, d = |P3 − P|
Linear Regression: Part 2

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for the Introduction to Machine Learning course.

Introduction to Linear Regression (cont.)
• How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line.
• The total error between the data points and the line is obtained by squaring each distance and then summing the squared values.
• The regression equation is designed to produce the minimum sum of squared errors.

Regression
So far, we've been interested in learning P(Y|X) where Y has discrete values (called 'classification')
What if Y is continuous? (called 'regression')
• predict weight from gender, height, age, …
• predict Google stock price today from Google, Yahoo, MSFT prices yesterday
• predict each pixel intensity in a robot's current camera image, from the previous image and previous action

About Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is an approach used for estimating the parameters of a model when you have observed data as input.
• Input: Observed data
• Output: Estimated parameters of the model
The big picture of MLE
• Take a look at the data in detail
• Try to figure out / determine a model of how the data could have been created
• Assign values to the parameters of the model such that the likelihood of the parameters is maximized w.r.t. the data

Now let's delve a bit deeper into what constitutes a model

What is a model?
• A model is a way to represent your beliefs, assumptions, etc. about how some event or process works. It is a formal way to represent how you view that event or process.
• Are models a perfect representation of the real world? Generally not
• Models are usually approximate because real-world scenarios are hard to model perfectly
• You can have simple models to represent an event or a process, OR you can have models of much higher complexity to represent the exact same event or process

An example
• Recall the example of the model where hours studied is the predictor variable and marks obtained is the outcome variable
• Is the model always going to be accurate? Accurate here means:
  – Will the model always be able to correctly predict the student's marks obtained when given the student's number of hours studied as input?
• Now suppose you add the student's years of experience in the topics associated with the course to make your model more complex, in the hope of getting better prediction accuracy
  – Even then, the model will not be able to always predict perfectly

Delving deeper into the example
• Why can't models predict accurately 100% of the time?
• Could be that other additional predictor variables should have been used to improve the model's prediction accuracy
• Possible that you don't have data regarding those other predictor variables, or you are simply unaware of those predictor variables
• Your training data is not infinite in practice
  – There is only so much prediction accuracy that your training data can support
  – Concept of unavoidable errors due to the training data
Delving deeper into the example
• In our example about the student's marks obtained (given the number of hours studied), let us try to understand some of the beliefs and assumptions in our model:
• Is number of hours studied easy to practically quantify? A student could be studying with full focus for 2 hours, while another student could be just staring at the computer screen for the same 2 hours.
  – How to quantify and model focus?
• Was the student mentally or physically healthy during the time of the exam? How to quantify the extent of sickness? Can the extent of sickness always be quantified?

Delving deeper into the example
• Was the student at peak performance form while giving the exam?
  – the student could have been studying till very late at night for an exam starting at 8.30 am the next day
• Did the student arrive 30 minutes late for a 2-hour exam?
• And so many other factors that you can think of….

Bottomline: There are many simplifying assumptions that we make while modeling, because trying to capture all aspects of most real-world scenarios is way too complex in terms of modeling

Example of a die being rolled
• Think about how you would model the roll of a die
• You have to first model the die itself
  – The die has 6 faces, plus color, weight, thickness, length, breadth, width, etc.
• When the die is rolled, there will be some forces on the die based on the laws of physics
  – You would also have to model those forces
  – Remember that the outcome of the rolled die (which of the numbers 1-6 comes up) depends upon how you roll the die, in what direction, with how much force, etc.
• All of this keeps increasing the complexity of the model

Example of a die being rolled
• Instead of going for very complex models, you could adopt a much more simplified model
• The simple model could entail that the outcome of the die being rolled is drawn from some probability distribution
• Note that it is not necessarily true that a more complex model will be more accurate than a simple model
• Observe that fitting the simpler model is obviously going to be far easier than fitting the more complex model

Bottomline: What matters is the usefulness of the model. It is not so critical how simple or how complex the model is; what is most important is how useful the model is
A Linear Model
• Consider the corresponding plot. (Source: towardsdatascience)
• Our goal is to find the best-fit line to the datapoints, i.e., to predict future values of y given x.
• Start with the equation of a line: y = θ₀ + θ₁x

Contd: A Linear Model
• Notice that the data points do not all fit on a single line, i.e., there are deviations. These are errors that arise out of a distribution (suppose it is Gaussian). (Source: towardsdatascience)
• This indicates that the error arises from a Gaussian distribution with mean 0 and variance σ². So, our resulting model is given by the following equation:
  y = θ₀ + θ₁x + ε, where ε ~ N(0, σ²)

Contd: A Linear Model
• In general, we might have any number of parameters, so let's refer to the entire collection of them as θ (theta)
• Knowing what the values of θ must be involves computing the likelihood function of θ, given the data observed. For a continuous distribution with probability density function f(x|θ), we have the likelihood function L(θ) = f(x₁, …, xₙ | θ). (Source: towardsdatascience)
• Given that the data is independent and identically distributed (iid), the likelihood function can be re-written as
  L(θ) = ∏ᵢ f(xᵢ|θ)
• In other words, the likelihood is the product of the probability density for each data point. We need to maximize this likelihood.

Contd: A Linear Model
• Maximising the likelihood is equivalent to maximising the log-likelihood, given below:
  log L(θ) = Σᵢ log f(xᵢ|θ)
• The Maximum Likelihood Estimator takes the Gaussian distribution and slides it over all possible means. The mean which causes the model to fit the data the best is chosen. (Source: towardsdatascience)
• As shown in the picture here, the log-likelihood is maximised at a mean of 2.
• This is the true mean of the distribution!
• The solution provided by the MLE gives the best-fit model for our data.
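A minimal NumPy sketch of this "slide the Gaussian over all possible means" picture, using made-up data whose true mean is 2:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)  # true mean = 2

def log_likelihood(mu, x, sigma=1.0):
    """Gaussian log-likelihood of mean mu given data x."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

candidates = np.linspace(0, 4, 401)
best = candidates[np.argmax([log_likelihood(m, data) for m in candidates])]
print(best, data.mean())  # both close to 2: MLE of the mean = sample mean
```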
Regression
Wish to learn f: X → Y, where Y is real-valued, given {<x1,y1>, …, <xn,yn>}

Approach:
1. Choose some parameterized form for P(Y|X; θ) (θ is the vector of parameters)
2. Derive the learning algorithm as the MLE or MAP estimate for θ

Choose parameterized form for P(Y|X; θ)
Assume Y is some deterministic f(X), plus random noise:
  Y = f(X) + ε, where ε ~ N(0, σ²)
Therefore Y is a random variable that follows the distribution P(Y|X=x) = N(f(x), σ²), and the expected value of y for any given x is f(x)

Example:
Consider developing a model to forecast a company's stock price. You noticed that the stock price rose significantly throughout the previous night. There could be a variety of causes for it. Maximum Likelihood Estimation seeks to determine the probability of the most likely cause. This idea is applied, among other things, to satellite imaging, MRIs, and economics.

MLE can be defined as a method for estimating population parameters (such as the mean and variance for Normal, rate (lambda) for Poisson, etc.) from sample data such that the probability (likelihood) of obtaining the observed data is maximized.

Consider Linear Regression
E.g., assume f(x) is a linear function of x.
Notation: to make our parameters explicit, let's write f(x; W) = w₀ + Σᵢ wᵢxᵢ

Training Linear Regression
How can we learn W from the training data?
Learn the Maximum Conditional Likelihood Estimate!
Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate:
  W ← arg max_W ∏ₗ P(yₗ | xₗ; W)
where
  P(y | x; W) = N(f(x; W), σ²)

Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate:
  W ← arg max_W Σₗ ln P(yₗ | xₗ; W)
so:
  W ← arg min_W Σₗ (yₗ − f(xₗ; W))²
where f(x; W) = w₀ + Σᵢ wᵢxᵢ, i.e., maximizing the conditional likelihood under Gaussian noise is the same as minimizing the sum of squared errors
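Since the conditional MLE reduces to least squares, it can be computed directly; a minimal NumPy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)  # true W = (3, 2) plus noise

# Design matrix with a column of ones for the intercept w0
A = np.column_stack([np.ones_like(x), x])
W, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes sum of squared errors
print(W)  # approximately [3., 2.]
```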

Training Linear Regression
Learn the Maximum Conditional Likelihood Estimate
Can we derive a gradient descent rule for training? Yes: since the objective Σₗ (yₗ − f(xₗ; W))² is differentiable in W, each weight can be updated repeatedly in the direction of the negative gradient, e.g. wᵢ ← wᵢ + η Σₗ (yₗ − f(xₗ; W)) xₗᵢ

MAP (Maximum A Posteriori) instead of MLE?
Like MLE, MAP estimation is a method of estimating the parameters of statistical models.
• MLE is powerful when you have enough data. However, it doesn't work well when the observed data size is small. For example, if Liverpool only had 2 matches and they won both, then the value of θ estimated by MLE is 2/2 = 1. That estimate says Liverpool wins 100% of the time, which is unrealistic. MAP can help deal with this issue.
• Assume that we have prior knowledge that Liverpool's winning percentage for the past few seasons was around 50%. Then, even without the data from this season, we already have some idea of the potential value of θ. Based (only) on the prior knowledge, the value of θ is most likely to be 0.5, and less likely to be 0 or 1.
• In other words, the probability of θ = 0.5 is higher than that of θ = 0 or 1. We call this the prior probability P(θ), and we can visualise it.
• Then, having observed the data D (30 wins out of 38 matches) from this season, we can update this P(θ), which was based only on the prior knowledge. The updated probability of θ given D is expressed as P(θ|D) and is called the posterior probability.

Regression – key points
Under general assumptions:
1. MLE corresponds to minimizing the sum of squared prediction errors (SSE)
2. The MAP estimate minimizes SSE plus the sum of squared weights
3. Again, learning is an optimization problem once we choose our objective function
   • maximize data likelihood
   • maximize posterior probability of W
4. Again, we can use gradient descent as a general learning algorithm
   • as long as our objective function is differentiable with respect to W
   • though we might learn local optima
5. Almost nothing we said here required that f(x) be linear in x

Reference: TowardsDataScience
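The Liverpool example can be made concrete with a Beta prior on θ; a minimal sketch (the prior strength Beta(25, 25), centered at 0.5, is an assumed choice for illustration):

```python
# Prior: Beta(a, b) centered at 0.5, encoding "around 50% in past seasons"
a, b = 25, 25
wins, matches = 30, 38  # observed data D from this season

theta_mle = wins / matches                          # ignores the prior
theta_map = (wins + a - 1) / (matches + a + b - 2)  # mode of the Beta posterior
print(theta_mle, theta_map)  # ~0.789 vs ~0.628: MAP is pulled toward 0.5
```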

Partial Correlation
• A partial correlation measures the relationship between two variables (X and Y) while eliminating the influence of a third variable (Z).
• Partial correlations are used to reveal the real, underlying relationship between two variables when researchers suspect that the apparent relation may be distorted by a third variable.

Partial Correlation
• For example, there probably is no underlying relationship between weight and mathematics skill for elementary school children.
• However, both of these variables are positively related to age: older children weigh more and, because they have spent more years in school, have higher mathematics skills.

Partial Correlation
• As a result, weight and mathematics skill will show a positive correlation for a sample of children that includes several different ages.
• A partial correlation between weight and mathematics skill, holding age constant, would eliminate the influence of age and show the true correlation, which is near zero.

Error Minimization
• "Minimization" and "Maximization" are both forms of "Optimization"
• Given a function f(x) [in our case, the mean-squared-error function obtained while fitting a regression line to points], how do we minimize or maximize it?
  – Least Squares Optimization
  – Lagrange Multipliers (for continuous and partially-differentiable functions)
  – Convex optimization techniques
  – Greedy algorithms like Gradient Descent
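A minimal NumPy sketch of the weight/maths-skill example: one way to compute a partial correlation is to correlate the residuals of X and Y after regressing each on Z (all data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(6, 12, 300)                # Z: the confounder
weight = 5 * age + rng.normal(0, 4, 300)     # X: driven by age
maths = 10 * age + rng.normal(0, 8, 300)     # Y: driven by age

def residuals(v, z):
    """Residuals of v after a least-squares fit on z (plus intercept)."""
    A = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

raw_r = np.corrcoef(weight, maths)[0, 1]
partial_r = np.corrcoef(residuals(weight, age), residuals(maths, age))[0, 1]
print(raw_r, partial_r)  # raw correlation is large; partial is near zero
```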

Gradient descent
• So far, you have understood that linear regression is used for finding the line of best fit
• Then you can use this regression line to predict a value of y given some value of x
• Now, for finding the line of best fit, you can use gradient descent
• Let's see how you can use gradient descent for computing the line of best fit

How does gradient descent compute the regression line?
• Start with a random line and then keep changing the slope and intercept iteratively until you get to the line of best fit
• Observe that for this, you need to know when you have already reached the line of best fit
• This is based on the sum of squares of the errors

Bottomline: The gradient descent algorithm essentially arrives at the least-squares regression line by doing multiple iterations to minimize the sum of squared errors
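A minimal sketch of that loop in NumPy (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 100)
y = 1.5 + 0.8 * x + rng.normal(0, 0.3, 100)  # true intercept 1.5, slope 0.8

m, c = 0.0, 0.0          # start with a "random" line
lr = 0.01                # learning rate
for _ in range(5000):
    err = y - (m * x + c)        # residuals of the current line
    m += lr * np.mean(err * x)   # gradient step on the slope
    c += lr * np.mean(err)       # gradient step on the intercept
print(m, c)  # converges to the least-squares slope and intercept
```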
NEURAL NETWORKS

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for the Introduction to Machine Learning course.

CONTENTS
• Revisiting Biology
• Intelligence: Biological vs Artificial
• History of AI
• Why Deep Learning
• Neural roots of Deep Learning
• Artificial Neural Network
• Revisiting and understanding Neurons
• Neurons in Artificial Neural Networks
• Activation Functions in Artificial Neural Networks
• An illustrative example

REVISITING BIOLOGY
The human brain is the most complex organ in the human body. It helps us think, understand, and make decisions. The secret behind its power is the neuron.

REVISITING BIOLOGY
Biological Neuron
• A human brain has billions of neurons.
• Neurons are interconnected nerve cells in the human brain that are involved in processing and transmitting chemical and electrical signals.
• Dendrites are branches that receive information from other neurons.
• The cell nucleus, or soma, processes the information received from the dendrites.
• The axon is a cable that is used by neurons to send information.
• A synapse is the connection between an axon and another neuron's dendrites.
INTELLIGENCE: BIOLOGICAL VS ARTIFICIAL
Definition #1: Artificial intelligence (AI) is intelligence – perceiving, synthesizing, and inferring information – demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks are speech recognition, computer vision, planning, natural language processing, robotics, etc.
Definition #2: The theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
AI applications include advanced web search engines (Google, Bing), recommendation systems (YouTube, Amazon, Netflix), understanding human speech (Siri, Alexa), self-driving cars (Tesla), automated decision-making, competing at the highest level in strategic game systems (chess, Go), and many more!

HISTORY OF AI
Maturation of Artificial Intelligence (1943-1952)
• 1943: The first work which is now recognized as AI was done by Warren McCulloch and Walter Pitts in 1943. They published "A Logical Calculus of the Ideas Immanent in Nervous Activity" (1943), laying foundations for artificial neural networks, and proposed a model of artificial neurons.
• 1949: Donald Hebb demonstrated an updating rule for modifying the connection strength between neurons. His rule is now called Hebbian learning.
• 1950: Alan Turing, an English mathematician who pioneered machine learning, published "Computing Machinery and Intelligence" in 1950, in which he proposed a test as a measure of machine intelligence. This test (now called the Turing Test) can check a machine's ability to exhibit intelligent behavior equivalent to human intelligence.

HISTORY OF AI
The birth of Artificial Intelligence (1952-1956)
• 1955: Allen Newell, J.C. Shaw and Herbert A. Simon created the "first AI program", named "Logic Theorist". It proved 38 of the first 52 theorems in Bertrand Russell and Alfred Whitehead's Principia Mathematica, and found new and more elegant proofs for some.
• 1956: The Dartmouth College summer AI conference was organized by John McCarthy, Marvin Minsky, Nathan Rochester of IBM and Claude Shannon. McCarthy coined the term "artificial intelligence" for the conference.
• 1959: The "General Problem Solver (GPS)" was created by Newell, Shaw and Simon while at CMU. John McCarthy and Marvin Minsky founded the MIT AI Lab.

HISTORY OF AI
Golden years: Early enthusiasm & optimism (1956-1972)
• 1956-1960: High-level computer languages such as FORTRAN, LISP, and COBOL were invented in this decade, and the excitement and optimism for AI were very high at that time.
• 1965: Researchers emphasized developing algorithms that could solve mathematical problems. Joseph Weizenbaum created the first natural language processing computer program in 1966, named ELIZA, at the MIT AI lab.
• 1972: The first intelligent anthropomorphic robot, named WABOT-1, was built in Japan. It consisted of a limb-control system, a vision system and a conversation system. The WABOT-1 was able to communicate with a person in Japanese and to measure distances and directions to objects using external receptors, artificial ears and eyes, and an artificial mouth.
HISTORY OF AI
The first AI winter (1972-1980)
• 1969: Marvin Minsky and Seymour Papert published 'Perceptrons', demonstrating previously unrecognized limits of the feed-forward two-layered structure.
• 1970: Seppo Linnainmaa published the reverse mode of automatic differentiation, which later became known as backpropagation and is heavily used to train artificial neural networks.
• late 1970s: The period between 1974 and 1980 was the first AI winter. "AI winter" refers to a time period in which computer scientists dealt with a severe shortage of both confidence and funding from governments for AI research. During AI winters, public interest in artificial intelligence declined significantly.

HISTORY OF AI
AI re-emerges (1980-1987)
• 1980: After the AI winter, AI re-emerged publicly with the development and marketing of Lisp machines and the offering of "Expert System" shells and commercial applications. Expert systems were programmed to emulate the decision-making ability of a human expert. The first national conference of the American Association of Artificial Intelligence was held at Stanford University in the same year.
• mid-1980s: Neural networks became widely used with the backpropagation algorithm, which had been published by Seppo Linnainmaa in 1970 and was applied to neural networks by Paul Werbos.
• 1983: DARPA again began to fund AI research through the Strategic Computing Initiative.
• 1987: Marvin Minsky published The Society of Mind, a theoretical description of the mind as a collection of cooperating agents.

HISTORY OF AI
The second AI winter (1987-1993)
• By the early 1990s: The earliest successful expert systems, such as XCON, proved too expensive to maintain, and the few remaining expert-system shell companies were forced to downsize and search for new markets.
• 1987: Expert systems were dismissed as "clever programming", and DARPA changed its strategy to focus its funding only on those technologies which showed the most promise, believing strongly that AI was not "the next wave".
• late 1980s: Investors and governments stopped funding AI research once again, owing to the high cost and the shortfall against the results promised optimistically in earlier years. Expectations had run much higher than what was actually possible.

HISTORY OF AI
AI re-emerges again with intelligent agents (1993-2009)
• early 1990s: TD-Gammon, a backgammon program written by Gerry Tesauro, demonstrated that reinforcement learning is powerful enough to create a championship-level game-playing program, competing favorably with world-class players.
• 1997: IBM's Deep Blue defeated the world chess champion, Garry Kasparov, and became the first computer to beat a world chess champion.
• late 1990s: Web crawlers and other AI-based information extraction programs became essential in the widespread use of the World Wide Web.
HISTORY OF AI
AI re-emerges again with intelligent agents (1993-2009)
• 2002: AI entered the home in the form of Roomba, a vacuum cleaner which could autonomously vacuum the floor while navigating and avoiding obstacles.
• 2005: Recommendation technology based on tracking web activity or media usage brought AI to marketing.
• 2006: AI found a strong footing in the business world when companies like Facebook, Twitter, and Netflix started significantly leveraging AI.
• 2009: Google built an autonomous car.

HISTORY OF AI
Deep learning, big data and artificial general intelligence (2011-present)
• 2011: IBM's Watson won Jeopardy!, a quiz show in which it had to solve complex questions as well as riddles. Watson proved that it could understand natural language and solve tricky questions quickly.
• 2011-2014: Apple's Siri (2011), Google's Google Now (2012) and Microsoft's Cortana (2014) are smartphone apps that use natural language to answer questions, make recommendations and perform actions.
• 2014: Chatbot "Eugene Goostman" won a competition in the famous "Turing test".

HISTORY OF AI
Deep learning, big data and artificial general intelligence (2011-present)
• 2015: Google DeepMind's AlphaGo defeated three-time European Go champion and professional Fan Hui by 5 games to 0.
• 2018: Alibaba's language processing AI outscored top humans at a Stanford University reading and comprehension test, scoring 82.44 against 82.304 on a set of 100,000 questions.
• 2018: Google announced "Duplex", a service that allows an AI assistant to book appointments over the phone on the user's behalf in a manner indiscernible from that of a human.
• early 2020: Microsoft introduced its Turing Natural Language Generation (T-NLG), which was then the "largest language model ever published at 17 billion parameters."
• mid-2020: OpenAI's GPT-3, a state-of-the-art autoregressive language model that uses deep learning, produces a variety of computer code, poetry and other language output that is exceptionally similar to, and almost indistinguishable from, text written by humans. Its capacity was ten times greater than that of the T-NLG.
WHY DEEP LEARNING?
Neural Roots of Deep Learning
• Deep learning was conceptualized by Geoffrey Hinton in the 1980s; he is widely considered to be the founding father of the field of deep learning.
• More specifically, Hinton developed the concept of a "neural network", a deep learning algorithm structured similarly to the organization of neurons in the brain. Hinton took this approach because the human brain is arguably the most powerful computational engine known today.
• The structure that Hinton created was called an artificial neural network (or artificial neural net for short).

ARTIFICIAL NEURAL NETWORK
(Figure: a layered artificial neural network.)

ARTIFICIAL NEURAL NETWORK
The question that Geoffrey Hinton asked during his seminal research in neural networks was whether we could build computer algorithms that behave similarly to neurons in the brain.
• The hope was that by mimicking the brain's structure, we might capture some of its capability.
• In order to do this, researchers studied the way that neurons behaved in the brain.
• One important observation was that a neuron by itself is useless.
• Instead, you require networks of neurons to generate any meaningful functionality.

ARTIFICIAL NEURAL NETWORK
Here's a brief description of how they function:
• Artificial neural networks are composed of layers of nodes.
• Each node is designed to behave similarly to a neuron in the brain.
• The first layer of a neural net is called the input layer, followed by hidden layers, then finally the output layer.
• Each node in the neural net performs some sort of calculation, which is passed on to other nodes deeper in the neural net.
REVISITING AND UNDERSTANDING NEURONS
• Neurons are a critical component of any deep learning model, biological or artificial!
• Neurons have quite an interesting structure, in that groups of neurons work together inside the human brain to perform the functionality that we require in our day-to-day lives.
• Neurons function by receiving and sending signals. More specifically, a neuron's dendrites receive signals and pass along those signals through the axon. The dendrites of one neuron are connected to the axon of another neuron. These connections are called synapses, a concept that has been generalized to the field of deep learning.

ARTIFICIAL NEURAL NETWORK
Neural nets have pushed the field of deep learning very far very fast. However, it took decades for machine learning (and deep learning) to gain prominence.
• This was because any mature deep learning model requires an abundance of two resources:
  • Data
  • Computing power
• At the time of deep learning's conceptual birth, researchers did not have access to enough of either data or computing power to build and train meaningful deep learning models.
• This has changed over time, which has led to deep learning's prominence today.

NEURONS IN ARTIFICIAL NEURAL NETWORKS
• Neurons in deep learning models are nodes through which data and computations flow.
• The working of computational neurons is as follows:
  • They receive one or more input signals. These input signals can come either from the raw data set or from neurons positioned at a previous layer of the neural net.
  • They perform some calculations.
  • They send some output signals to neurons deeper in the neural net through a synapse.

NEURONS IN ARTIFICIAL NEURAL NETWORKS
• Neurons in a deep learning model are capable of having synapses that connect to more than one neuron in the preceding layer.
• Each synapse has an associated weight, which impacts the preceding neuron's importance in the overall neural network.
• Weights are a very important topic in the field of deep learning, because adjusting a model's weights is the primary way through which deep learning models are trained.
NEURONS IN ARTIFICIAL NEURAL NETWORKS
• Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each signal multiplied by its corresponding weight and passes the sum on to an activation function.
• The activation function calculates the output value for the neuron. This output value is then passed on to the next layer of the neural network through another synapse.

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
• Activation functions allow neurons in a neural network to communicate with each other through their synapses.
• We have established that neurons receive input signals from the preceding layer of a neural network. A weighted sum of these signals is fed into the neuron's activation function. Then the activation function's output is passed on to the next layer of the network.
• There are four main types of activation functions:
  1. Threshold functions
  2. Sigmoid functions
  3. Rectifier functions, or ReLUs
  4. Hyperbolic tangent functions

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
1. Threshold Functions, or unit step functions
• Threshold functions compute a different output signal depending on whether or not their input lies above or below a certain threshold.
• The input value to an activation function is the weighted sum of the input values from the preceding layer in the neural network.
• Threshold functions are similar to Boolean variables in computer programming. Their computed value is either 1 (similar to True) or 0 (equivalent to False).

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
2. Sigmoid Functions
• Mathematical definition of the sigmoid function: σ(x) = 1 / (1 + e^(−x))
• The sigmoid function is also used in logistic regression, one of the core machine learning techniques used to solve classification problems.
• The sigmoid function can accept any value, but always computes a value between 0 and 1.
• One benefit of the sigmoid function over the threshold function is that its curve is smooth. This means that it is possible to calculate derivatives at any point along the curve.
ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
3. Rectifier Functions, or Rectified Linear Units (ReLUs)
• Mathematical definition of ReLU: f(x) = max(0, x)
• The rectifier function does not have the same smoothness property as the sigmoid function from the last section. However, it is still very popular in the field of deep learning.
• The rectifier function is defined as follows:
  • If the input value is less than 0, then the function outputs 0.
  • If not, the function outputs its input value.

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
4. Hyperbolic Tangent Functions
• Mathematical definition of the hyperbolic tangent function: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Out of the 4 discussed activation functions, the hyperbolic tangent function is the only one based on a trigonometric identity.
• The hyperbolic tangent function is similar in appearance to the sigmoid function, but its output values are all shifted downwards (its range is (−1, 1) rather than (0, 1)).
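To tie the last few slides together, here is a minimal NumPy sketch of a single artificial neuron: a weighted sum of inputs fed through each of the four activation functions (the weights, bias, and inputs are made-up values):

```python
import numpy as np

def threshold(z):
    return np.where(z >= 0, 1.0, 0.0)  # unit step

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # smooth, output in (0, 1)

def relu(z):
    return np.maximum(0.0, z)          # 0 for negative inputs

# np.tanh is the hyperbolic tangent, output in (-1, 1)

inputs = np.array([0.5, -1.0, 2.0])    # signals from the preceding layer
weights = np.array([0.8, 0.2, -0.5])   # one weight per synapse
bias = 0.1

z = np.dot(weights, inputs) + bias     # the weighted sum
for f in (threshold, sigmoid, relu, np.tanh):
    print(f.__name__, f(z))            # the neuron's output under each
```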

REFERENCES
• https://machinelearningmastery.com/what-is-deep-learning/
• https://www.javatpoint.com/history-of-artificial-intelligence
• https://en.wikipedia.org/wiki/Timeline_of_artificial_intelligence
• https://en.wikipedia.org/wiki/AI_winter
• https://www.humanoid.waseda.ac.jp/booklet/kato_2.html
• https://www.freecodecamp.org/news/deep-learning-neural-networks-explained-in-plain-english/
Multinomial and Logistic Regression

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for the Introduction to Machine Learning course.

Recap:
Simple regression considers the relation between a single explanatory variable and a response variable.
Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y.
The intent is to look at the independent effect of each variable while "adjusting out" the influence of potential confounders.

Regression Modeling
• A simple regression model (one independent variable) fits a regression line in 2-dimensional space
• A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
Simple Regression Model
Regression coefficients are estimated by minimizing ∑residuals² (i.e., the sum of the squared residuals) to derive the model ŷ = a + bx.
The standard error of the regression (sY|x) is based on the squared residuals.

Multiple Regression Model
Again, estimates for the multiple slope coefficients are derived by minimizing ∑residuals², to derive the multiple regression model ŷ = a + b₁x₁ + b₂x₂ + ….
Again, the standard error of the regression is based on the ∑residuals².

Multiple Regression Model
• The intercept α predicts where the regression plane crosses the Y axis
• The slope for variable X₁ (β₁) predicts the change in Y per unit X₁, holding X₂ constant
• The slope for variable X₂ (β₂) predicts the change in Y per unit X₂, holding X₁ constant

Multiple Regression Model
A multiple regression model with k independent variables fits a regression "surface" in k + 1 dimensional space (cannot be visualized)
Categorical Explanatory Variables in Regression Models
• Categorical independent variables can be incorporated into a regression model by converting them into 0/1 ("dummy") variables
• For binary variables, code dummies "0" for "no" and "1" for "yes"

Introduction to Logistic Regression
Slides are adapted from SPIA, University of Georgia

• Categorical dependent variables need more complicated models: logistic regression (for binary categorical dependent variables).
• Why can't we just use OLS?
• How does logistic regression work?
• How do we compare logistic models?

But, first an experiment
• I'm going to show you a short video of some students playing basketball.
• There are 6 people; 3 dressed in black shirts and 3 in white shirts.
• I'd like you to count the number of times that the white-shirted students pass the ball to each other, in two different ways.
  • An 'aerial' pass (without touching the ground on the way).
  • A 'bounce' pass (touching the ground on the way).
• Thus after the video has ended you should have two totals, one for aerial passes by white shirts and one for bounce passes by white shirts.
"Gorillas in our midst" (1)
http://viscog.beckman.uiuc.edu/grafs/demos/15.html
"Gorillas in our midst" (2)
• This is a real bit of psychology research by Simons and Chabris (1999) at Harvard.
• They find that the harder the task, the more likely it is that people don't spot the gorilla.
• Only 50% of their subjects spotted the gorilla…
• How is this relevant to us?
• Imagine we wanted to predict whether someone saw the gorilla or not; this is a binary dependent variable.
• We might have independent variables like concentration span, difficulty of the task, time of day and so on.

Predicting gorilla sightings (1)
• Our dependent variable is just like the variables we were using earlier.
• But let's say with this example we want to predict whether the gorilla will be spotted by a person with a particular set of characteristics.
• In this case, let's say with a particular concentration span (measured on a 1-100 scale).
• Since our independent variable is interval-level data we can't use cross-tabs.
Predicting gorilla sightings (2)
• So, what we want to know is the probability that any person will be a gorilla spotter or not, for any value of concentration span.
• Remember, if we know this, we will know the proportion of people that will spot the gorilla at each level of concentration span, on average.
• We could use simple linear regression (SLR) here, with the dependent variable coded as 0 (no gorilla spot) or 1 (gorilla spotted).
• Well, why can't we…?

What's wrong with SLR?
• We want to predict a probability; this can only vary between zero and 1.
• But our SLR may predict values that are below zero or above 1…
• Let's quickly fit a SLR to our example.
  • Our sample here is the 108 subjects that Simons and Chabris used. I've added some extra data on their concentration spans.
• A scatter-plot isn't all that much use here.
Scatter-plot (1)
(Figure: scatter of spotting vs concentration span. More low-concentration people spot the gorilla; more high-concentration people DON'T spot the gorilla. Could add a linear regression line.)

Scatter-plot (2)
(Figure: fitted line over the same scatter. People with CS below 21 have > 1 predicted probability of being a spotter; people with CS above 92 have < 0 predicted probability of being a spotter.)
Other problems
• If you think about it, that's just one problem.
• For linear regression we assumed that the population distribution was normally distributed around the mean, for each value of the X variable.
• That's not going to be the case if we've got a binary response. The distribution around the mean is going to be quite different.
• Looking at our data, when CS=50 we'll have about 60% of cases scoring 1 (being spotters) and 40% of cases scoring 0 (not being spotters). That doesn't sound much like a normal distribution…

What to do (1)
• Instead of linear OLS regression we use something called logistic regression.
• This is a very widely used method, and it's important to understand how it works.
  • Probably more widely used (especially if we include variants) than linear OLS, as interesting dependent variables are often categorical.
  • A randomly selected academic (by the name of Tilley) has used logistic regression in 55.5% of all his sociology and politics articles.
What to do? (2)
• Somehow we need to dump the linear OLS bit of our model for this binary categorical variable.
• So what we want to do is assume a different kind of relationship between the probability of seeing gorillas (or whatever) and concentration span.
• Maybe something like this…

What to do? (3)
(Figure: an S-shaped curve; a more realistic representation of the relationship between the probability of gorilla spotting and CS.)
The logistic transformation (1)
• This type of relationship is described by a special formula.
• Remember, if the relationship was linear then the equation would just be: ŷ = α + βX
• But the relationship on the graph is actually described by:
  log(p / (1 − p)) = α + βX

The logistic transformation (2)
• p / (1 − p) is just the odds. As the probability p increases (from zero to 1), the odds increase from 0 to infinity. The log of the odds then increases from −infinity to +infinity.
• So if β is 'large' then as X increases, the log of the odds will increase steeply. The steepness of the curve will therefore increase as β gets bigger.
Fitting this model (1)
• So that's what we want to do, but how do we do it?
• With SLR we tried to minimize the squares of the residuals, to get the best-fitting line.
• This doesn't really make sense here (remember, the errors won't be normally distributed as there are only two values).
• We use something called maximum likelihood to estimate what β and α are.

Fitting this model (2)
• Maximum likelihood is an iterative process that estimates the best-fitting equation.
• The iterative bit just means that we try lots of models until we get to a situation where tweaking the equation any further doesn't improve the fit.
• The maximum likelihood bit is kind of complicated, although the underlying assumptions are simple to understand, and very intuitive. The basic idea is that we find the coefficient values that make the observed data most likely.
Back to the gorillas
• So pressing the appropriate buttons in STATA or SPSS allows us to fit a logistic regression to our gorilla-spotting data.
• The numbers that we get out are not immediately interpretable, however.
• Remember, for OLS linear regression a change of one unit on the X variable meant that the Y variable would increase by the coefficient for X.
• That's not what the coefficient associated with X in our logistic regression means.
• But what does the -0.07 actually mean?

Gorilla results

Variable      | Coefficient value | Standard error | p-value
Concentration | -0.07             | 0.01           | 0.00
Intercept     | 3.69              | 0.72           | 0.00

• This is how logistic regression results are often reported in articles.
• It's clear that concentration span has a negative (and statistically significant) effect on gorilla sightings.
Interpreting the coefficients (1)
• What we need to do is think about the equation again, and what an increase in X means.
• So an increase in X of 1 unit will decrease our log(odds) by 0.07.
• If we antilog both sides then we can see how the odds change…

Interpreting the coefficients (2)
• Antilog both sides and we get the odds on the LH side:
  odds = p̂ / (1 − p̂) = e^(3.69 − 0.07X)
  (Remember, the 'hat' sign means the predicted value.)
• If we enter a value of X we can work out what the predicted odds will be.
• Thus, at X=30, the odds of spotting the gorilla (as opposed to not spotting the gorilla) are nearly 5. For every 5 spotters there should be one non-spotter.
Interpreting the coefficients (3)
• We can also think about what happens to the odds when we increase X by a certain amount.
• Another way of writing e^(a+bX) is e^a·(e^b)^X. That means that a one-unit increase in X multiplies the odds by e^b (as it's to the power of 1).
• In our case, therefore, a one-unit increase in X multiplies the odds by e^(−0.07), or 0.93.
  • When X increases from 30 to 31, the odds are 4.90×0.93, or 4.56.
  • When X increases from 30 to 40, the odds are 4.90×(0.93)^10, or 2.37.

Yet more coefficient interpretation (1)
• The other way of thinking about things is in terms of probabilities.
• If we rearrange the 'antilogged' equation then we can work out what the probability (for a particular value of X) would be:
  p̂ = e^(α + βX) / (1 + e^(α + βX))
• The probability of a person with CS=30 spotting the gorilla is thus 83%.
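These calculations are easy to reproduce; a minimal NumPy sketch using the fitted coefficients from the table above:

```python
import numpy as np

alpha, beta = 3.69, -0.07  # intercept and concentration coefficient

def odds(x):
    """Predicted odds of spotting the gorilla at concentration span x."""
    return np.exp(alpha + beta * x)

def prob(x):
    """Predicted probability: odds rearranged to e^(a+bX)/(1+e^(a+bX))."""
    return odds(x) / (1 + odds(x))

print(odds(30))             # ~4.90
print(odds(31) / odds(30))  # ~0.93 = e^beta, the one-unit odds multiplier
print(prob(30))             # ~0.83
```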

Yet more coefficient interpretation (2)
(Figure: predicted probability of spotting the gorilla against CS. When CS=30, the probability of spotting the gorilla is 83%.)
• Perhaps the most useful thing to do is to plot the predicted probabilities (it is easiest to do this in STATA).

Adding extra variables (1)
• Including other interval-level independent variables and categorical independent variables is as easy as in multiple linear regression.
• The logic is the same as before: we are examining the effects of one independent variable when the other is held constant.
• The important bit is to understand what the coefficients for the extra independent variables actually mean.
• Since this is less clear-cut than in multiple linear regression, we need to be careful in interpretation.
Adding extra variables (2)
• Let's say we think that people that own monkeys are more adept at spotting the gorilla.
• We could include a dummy variable for monkey owner (1 if you are a monkey owner, and 0 if not).

Variable      | Coefficient value | Standard error | p-value
Concentration | -0.09             | 0.02           | 0.000
Monkey owner  | 3.15              | 0.96           | 0.001
Intercept     | 4.01              | 0.83           | 0.000

Interpreting extra variables (1)
• So owning a monkey (holding concentration span constant) multiplies the odds by e^3.15, or 23.3 times.
• The odds of monkey owners spotting the gorilla are 23 times the odds of non-monkey owners spotting the gorilla.
• The probability of a person with a CS of 50 that owns a monkey being a gorilla spotter is 93%, and the probability of a person with a CS of 50 that does not own a monkey being a gorilla spotter is only 40%.
• With such a simple model we can still display it graphically.
• A linear model would have two parallel lines for each type of person (monkey or none) by CS. Our lines are NOT parallel.
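Extending the earlier sketch to the two-variable model reproduces these numbers:

```python
import numpy as np

def prob(cs, monkey):
    """Predicted P(spot) from the two-variable model above."""
    z = 4.01 - 0.09 * cs + 3.15 * monkey  # monkey is the 0/1 dummy
    return np.exp(z) / (1 + np.exp(z))

print(np.exp(3.15))        # ~23.3: odds multiplier for owning a monkey
print(prob(50, monkey=1))  # ~0.93
print(prob(50, monkey=0))  # ~0.40
```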

Interpreting extra variables (2)
(Figure: predicted probability curves against CS for monkey owners and non-monkey owners; the two curves are not parallel.)
• Generally, we want to present information from a logistic regression in the form of probabilities, as these are easiest to understand.
• If we have lots of variables, then we normally set them to particular values and then examine how the predicted probability of the dependent outcome varies.
• e.g. if I had more independent variables (age, sex, eyesight), I would produce the earlier graph for men of average age with average eyesight not owning a monkey. Then I could see how concentration alone affected the predicted probability of a gorilla sighting.
Interactive monkeys (1)
• We can also include interaction effects. Again, though, we need to be careful interpreting these.

Variable             | Coefficient value | Standard error | p-value
Concentration        | -0.12             | 0.02           | 0.00
Monkey owner         | -1.92             | 2.00           | 0.34
Monkey*concentration | 0.08              | 0.04           | 0.02
Intercept            | 5.07              | 1.14           | 0.00

Interactive monkeys (2)
(Figure: predicted probability curves against CS for monkey owners and non-monkey owners under the interaction model.)
Comparing models (1)
• One of the most important differences between logistic regression and linear regression is in how we compare models.
• Remember, for linear regression we looked at how the adjusted R² changed. If there was a significant increase when we added another variable (or interaction) then we thought the model had improved.
• For logistic regression there are a variety of ways of assessing model improvement.

Comparing models (2)
• The best way of comparing models is to use something called the likelihood-ratio test.
• When we were using OLS regression, we were trying to minimize the sum of squares; for logistic regression we are trying to maximize something called the likelihood function (normally called L).
• To see whether our model has improved by adding a variable (or interaction, or squared term), we can compare the maximum of the likelihood function for each model (just like we compared the R² before for OLS regressions).

Comparing models (3)
• In fact, just to complicate matters, we actually compare the maximised values of −2·log L:
  test statistic = (−2 log L for the first model) − (−2 log L for the second model)
• By logging the Ls and multiplying them by −2, this statistic conveniently ends up with a chi-square distribution. This means we test whether there is a statistically significant improvement with reference to the χ² distribution.
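A minimal sketch of the likelihood-ratio test mechanics with SciPy (the two log-likelihood values are made up for illustration; the degrees of freedom equal the number of added parameters):

```python
from scipy.stats import chi2

# Hypothetical maximised log-likelihoods of two nested models
logL_small = -70.2   # e.g. concentration only
logL_big = -64.5     # e.g. concentration + monkey owner

lr_stat = -2 * logL_small - (-2 * logL_big)  # difference of -2*log L values
df = 1                                       # one extra parameter added
p_value = chi2.sf(lr_stat, df)               # chi-square survival function
print(lr_stat, p_value)  # a small p-value -> the bigger model fits better
```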
Discriminative vs Generative Models

Acknowledgments: The information in the slides in this presentation has been obtained from a wide variety of publicly available Internet sources such as:
https://www.baeldung.com/cs/ml-generative-vs-discriminative
Some of the slides have been modified.

Introduction
• ML models can be:
  – discriminative
  – generative

Discriminative models
• The core goal of discriminative models is to separate between classes
• They try to learn boundaries between classes
• They cannot generate new data
• They use conditional probability
  – hence, discriminative models are sometimes referred to as conditional models
  – they also use maximum likelihood
• Discriminative models are used mostly for supervised ML: classification, regression, etc.
• Types of discriminative models: logistic regression, support vector machine, decision tree, random forest

Generative Models
• Generative models are essentially statistical models
• They can generate new data instances
• They attempt to model how a dataset is populated
• They are used in unsupervised ML
• Types of generative models include the Bayesian network, the hidden Markov model, and the generative adversarial network (GAN)
Discriminative vs Generative models
• In the data space, discriminative models aim at drawing class boundaries
  – Trains the model by learning parameters for maximizing conditional probability
• In the data space, generative models aim at modeling/predicting how the data is distributed throughout the space
  – Learns parameters by maximizing joint probability

An example
• Say you are given a document and you need to find out the language in which the document is written
  – Is it written in English, Spanish, Japanese etc.?
• Option 1: Generative model
  – You can learn each language one-by-one
  – Then you will know which language the document is written in
  – This is the generative model's way of doing things
• Option 2: Discriminative model
  – Learn distinguishing patterns of languages, but no need to learn the whole language as such
  – This means you know enough to discriminate/distinguish among the languages, but don't know any of the languages in depth
  – This is the discriminative model's way of doing things

Source: https://www.baeldung.com/cs/ml-generative-vs-discriminative
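A minimal scikit-learn sketch (ours, not from the slides) contrasting the two model families: Gaussian Naive Bayes is generative (it models P(x|y)P(y)), logistic regression is discriminative (it models P(y|x) directly); the synthetic data is an assumption for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

gen = GaussianNB().fit(X, y)           # learns per-class feature distributions
disc = LogisticRegression().fit(X, y)  # learns only the class boundary

# Both can classify...
print(gen.score(X, y), disc.score(X, y))

# ...but only the generative model gives us enough to sample new data:
# draw a synthetic class-0 point from the fitted per-class Gaussians.
rng = np.random.default_rng(0)
new_point = rng.normal(gen.theta_[0], np.sqrt(gen.var_[0]))
print(new_point)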

Some insights
• Discriminative models try to learn decision boundaries to distinguish cases
• Generative models try to learn the main properties of the data
• The generative approach focuses on modeling, whereas the discriminative approach focuses on a solution. So, we can use generative algorithms to generate new data points. Discriminative algorithms don't serve that purpose.
• Generative algorithms converge faster than discriminative algorithms
• Generative models converge to a higher asymptotic error than discriminative models
  – More training samples means the error rate decreases for discriminative models
• Generative models have more "explainability" than discriminative models → better for applications in anomaly detection, NLP etc.
• Generative models can deduce from a context because they are more comprehensive

Source: https://www.baeldung.com/cs/ml-generative-vs-discriminative

Some applications of generative models
• Data augmentation
• Restoration of old images, automatic image coloring
• Re-constructing missing regions in an image

Source: https://www.baeldung.com/cs/applications-of-generative-models
Feature Selection

Acknowledgments: The slides in this presentation have been obtained from a wide variety of
publicly available Internet sources such as:
https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/. Some of
the slides have been modified.

Why is feature selection important?
• Real-world datasets can have lots of features
  – Features can also sometimes be referred to as attributes, dimensions etc.
• When trying to design an ML model for doing predictions in real-world
  scenarios, some features will be relevant; other features will be irrelevant
• Think of it like filtering out the noise from the data
• Remember that the features you consider irrelevant to your analysis or
  ML model may be considered relevant by others

Why is feature selection important?
• Different stakeholders will need to use different features for building ML models for their respective purposes
• What happens when you use irrelevant features in your dataset?
  – More noise in the dataset w.r.t. the ML model that you are trying to build
  – Unnecessarily increases the complexity of the model
  – Decreases the generalization capability of the model
  – Increases the bias in the model
Bottomline: Why deal with irrelevant features?

Goal of feature selection
• To find the best set of features, which are most relevant towards building your ML model
• Some caveats need to be added here
  • The "best" depends upon domain knowledge, and the decision concerning which feature set constitutes the "best" may differ among experts
  • You need to have adequate data corresponding to each of the features
    – No point saying that a feature is relevant if you do not have enough data concerning the feature
    – Sometimes, the data may exist for a given feature, but you may not have access to it due to reasons such as cost, privacy etc.
Widely used feature selection techniques in ML
• Filter methods
• Wrapper methods
• Embedded methods

Filter methods
• Selects features from the dataset without using any ML algorithm
• Typically used for pre-processing datasets
• Very good for getting rid of duplicate and irrelevant features
• Usually fast and inexpensive computationally
• Each feature is evaluated and selected on an individual basis, but does not account for dependencies among features
• When multiple features together can result in better performance of the ML model, filter methods are generally not good to use

Techniques used in filter methods
• Information Gain
  – How much information is being given by the feature towards building a "good" ML model
  – Measures reduction in the entropy values
  – Compute information gain for each attribute w.r.t. the target values for feature selection
• Chi-square test
  – Evaluates the relationship between categorical variables

Techniques used in filter methods
• Fisher's Score
  – Selects each feature based on the Fisher scores
  – Features are selected on an independent basis
  – Usually gives a suboptimal set of features
  – Higher Fisher score means a better selected feature
• Correlation Coefficient
  – Pearson's Correlation Coefficient
  – How associated are two continuous variables
Techniques used in filter methods
• Variance Threshold
  – Core idea: Higher variance features usually contain more information
  – Sets a threshold for variance and gets rid of features that do not satisfy this variance threshold
• Mean Absolute Difference (MAD)
  – Similar to the variance threshold method
  – Computes the mean absolute difference from the mean value

Techniques used in filter methods
• Dispersion Ratio
  – Computes the ratio of the Arithmetic mean to Geometric mean for a specific feature
  – Higher value of dispersion ratio means that the feature is more relevant from a feature selection perspective
• Mutual Dependence
  – Computes if two variables are mutually dependent
  – If a particular feature is present/absent, how much information does that feature contribute to the prediction that you are trying to do in your ML model
• Relief
  – Measures the quality of attributes by means of random sampling of instances from the dataset
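A hedged scikit-learn sketch (ours) of two of the filter techniques listed above, a variance threshold and a mutual-information score; the dataset and threshold value are assumptions for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop features whose variance is below a cutoff.
vt = VarianceThreshold(threshold=0.2)
X_reduced = vt.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Mutual dependence: score each feature by its mutual information with
# the target, evaluated one feature at a time (no feature interactions).
scores = mutual_info_classif(X, y, random_state=0)
print(scores)  # higher score = individually more informative feature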

Wrapper methods
• These are basically greedy algorithms
• Train by using a subset of features
• Training happens iteratively
• Depending upon the results/outcome of the training, perform addition or removal of features
• Stopping criteria could be based on threshold performance criteria or when the model has a desired number of features
• Wrapper methods generally select a more optimal set of features than filter methods; this means generally better accuracy in case of wrapper methods
• The downside is that wrapper methods involve more computational cost than filter methods

Techniques used in wrapper methods
• Forward selection
  – Start with an empty set of features and keep adding the feature that provides the best improvement to your ML model
  – This is an iterative approach
  – Stop when the addition of a new feature fails to improve the performance of your ML model
• Backward elimination
  – Also an iterative approach
  – Start with all features and keep getting rid of the least significant feature after each iteration
  – Stop when no improvement happens in the performance of your model after removal of the feature
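A hedged sketch (ours) of forward selection using scikit-learn's SequentialFeatureSelector; passing direction="backward" gives backward elimination instead. The dataset and wrapped model are assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedily add features one per iteration, keeping whichever addition
# most improves the cross-validated accuracy of the wrapped model.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features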
Techniques used in wrapper methods
• Bi-directional elimination
  – Uses forward selection and backward elimination in tandem
• Exhaustive selection
  – Basically, a brute force approach
  – Creates all possible subsets of features
  – Builds a learning algorithm for each subset
  – Selects the subset that has the best performance of the ML model
• Recursive elimination
  – Basically, a greedy optimization method
  – Selects features by recursively considering smaller and smaller sets of features
  – Attributes some importance value to each feature
  – Removes least important features from the dataset until you get the target number of features

Embedded methods
• The feature selection algorithm is embedded in the learning algorithm
• Basically, has its own built-in feature selection methods
• Considers a combination of features
• Tries to address the deficiencies in filter and wrapper methods, while attempting to use their advantages
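A hedged sketch (ours) of recursive elimination via scikit-learn's RFE, which repeatedly fits the model, ranks features by importance, and drops the weakest; the estimator and dataset are illustration choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Eliminate one feature per round until two remain.
rfe = RFE(DecisionTreeClassifier(random_state=0),
          n_features_to_select=2, step=1)
rfe.fit(X, y)
print(rfe.ranking_)  # 1 = kept; higher numbers were eliminated earlier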

Techniques used in embedded methods
• Regularization
  • Adds a penalty to the ML model's parameters to avoid over-fitting
  • Uses Lasso (L1 regularization) and Elastic nets (L1 and L2 regularization)
  • The penalty is applied over the coefficients
    – This means that some coefficients can become zero
    – The features with zero coefficient can be removed
• Tree-based methods
  – Examples: Random Forest, Gradient Boosting
  – Uses feature importance towards selecting features

Feature selection is different from PCA
• Observe that in feature selection, you are essentially deciding which features to select for arriving at better prediction results of the ML model
• Similar to feature selection, in PCA, you also reduce the number of dimensions (features), but you do so by converting to a lower dimensional space and NOT by getting rid of the features themselves
• This is a very important conceptual point
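A hedged sketch (ours) of embedded selection via L1 regularization: Lasso drives some coefficients exactly to zero, and SelectFromModel keeps only the features with nonzero coefficients. The synthetic regression data and alpha value are assumptions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ != 0), "features kept by the L1 penalty")

selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)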

PAC LEARNING AND VC DIMENSION

Acknowledgments: Information in this presentation has been obtained from a
wide variety of publicly available Internet sources. Slides created by Mr. Saransh
Gupta for academic use only, as part of course material for Introduction to
Machine Learning course.

CONTENTS
• Computational Learning Theory
• PAC Learning
• VC Dimension
• VC Dimension: Learners and Complexity
• VC Dimension continued
• VC Dimension: Shattering
• Using VC Dimension
• Training Error vs Prediction/Test Error
• No-free-lunch Theorem
• Occam's Razor (Principle of Parsimony)

COMPUTATIONAL LEARNING THEORY
Computational learning theory, or statistical learning theory, refers to
mathematical frameworks for quantifying learning tasks and algorithms.
Statistical learning theory can be extended to computational learning theory
by taking the computational complexity of the learner into account.

COMPUTATIONAL LEARNING THEORY
• It is a field of study concerned with the use of formal mathematical methods
  applied to learning systems.
• It seeks to use the tools of theoretical computer science to quantify learning
  problems. This includes characterizing the difficulty of learning specific tasks.
• The division between 'Learning tasks' and 'Learning algorithms' is arbitrary,
  and in practice there is a lot of overlap between the two fields:
  • Computational Learning Theory (CoLT): Formal study of learning tasks.
  • Statistical Learning Theory (SLT): Formal study of learning algorithms.
• CoLT and SLT are largely synonymous in modern usage.
COMPUTATIONAL LEARNING THEORY
• Hypothesis Space: It is the set of all the possible legal hypotheses.
  This is the set from which the ML algorithm would determine the best
  possible (only one) hypothesis which would best describe the target
  function or the outputs.
• Hypothesis: A hypothesis is a function that best describes the target
  in supervised machine learning. The hypothesis that an algorithm would
  come up with depends upon the data and the restrictions and bias that
  we have imposed on the data.
Every learning algorithm requires assumptions about the hypothesis space.
• Eg: "My hypothesis space is
  • …linear"
  • …decision trees with 5 nodes"
  • …a three layer neural network with rectifier hidden units"

COMPUTATIONAL LEARNING THEORY
The main unanswered question in learning is this: How can we be sure that
our learning algorithm has produced a hypothesis that will predict the
correct value for previously unseen inputs?

COMPUTATIONAL LEARNING THEORY COMPUTATIONAL LEARNING THEORY

The focus in computational learning theory is typically on Yet…


supervised learning tasks.
as a machine learning practitioner, it is important to to know
Formal analysis of real problems and real algorithms is very about computational learning theory and some of the main
challenging. As such, it is common to reduce the complexity of areas of investigation.
the analysis by focusing on binary classification tasks and even
simple binary rule-based systems.
The CoLT field provides an immensely useful grounding and a
mathematically rigorous framework for understanding what we
Consequently, the practical application of the theorems may be are trying to achieve when we are trying to fit models on data,
limited or challenging to interpret for real problems and as it may provide both valuable insight and the limitations of
algorithms. the methods.

7 8
COMPUTATIONAL LEARNING THEORY
Questions explored in computational learning theory might include:
• How do we know a model has a good approximation for the target function?
• What hypothesis space should be used?
• How do we know if we have a local or globally good solution?
• How do we avoid overfitting?
• How many data examples are needed?
• … and so on.

The main unanswered question in learning is this: How can we be sure that
our learning algorithm has produced a hypothesis that will predict the
correct value for previously unseen inputs?
— Page 713, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

COMPUTATIONAL LEARNING THEORY
Among many subfields of CoLT, two of the most widely discussed areas of
study are PAC Learning and VC Dimension.
Crudely speaking,
PAC (Probably Approximately Correct) Learning is the theory of machine
learning problems
whereas
VC (Vapnik–Chervonenkis) Dimension is the theory of machine learning
algorithms.

PAC LEARNING
• Probably approximately correct learning, or PAC learning, refers to a
  theoretical machine learning framework developed by Leslie Valiant.
• PAC learning seeks to quantify the difficulty of a learning task and
  might be considered one of the premier sub-fields of computational
  learning theory.
• Consider that in supervised learning, we are trying to approximate an
  unknown underlying mapping function from inputs to outputs. This
  function that maps the input to its correct labels is called the
  target function.
• We don't know what this mapping function (target function) looks like,
  but we suspect it exists in nature, and we have examples of data
  possibly produced by the function.

PAC LEARNING
Think of the target function as a hidden random hypothesis that maps all
the input to its correct labels and could potentially exist. It may have
been decided by 'Nature' but we have no idea about it. Our best bet is to
come up with a model/hypothesis that can perhaps approximate this
Nature-given target function.
PAC learning is concerned with how much computational effort is required
to find a hypothesis (fit model) that is a close match for the possibly
underlying and unknown Nature-given target function.
PAC is a framework under which numerous results on learning theory were
proved, and has the following set of assumptions:
• the training and testing sets follow the same distribution
• the training examples are drawn independently
PAC LEARNING
Consider the following scenario, continuing it, and arriving at an
important result:
[Worked derivation with equations shown on the original slides]
PAC LEARNING
What does the PAC Learning theory say (in simple words)?
• The idea is that a bad hypothesis will be found out based on the
  predictions it makes on new data, i.e. based on its generalization error.
• A hypothesis that gets most or a large number of predictions correct,
  i.e. has a small generalization error, is probably a good approximation
  for the target function.
• This probabilistic language gives the theorem its name: "probably
  approximately correct." That is, a hypothesis seeks to "approximate" a
  target function and is "probably" good if it has a low generalization
  error.

PAC LEARNING
The underlying principle is that any hypothesis that is seriously wrong
will almost certainly be "found out" with high probability after a small
number of examples, because it will make an incorrect prediction. Thus,
any hypothesis that is consistent with a sufficiently large set of
training examples is unlikely to be seriously wrong: that is, it must be
probably approximately correct.
— Page 714, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

PAC LEARNING
What does the PAC Learning theory say (in simple words)?
• A PAC learning algorithm refers to an algorithm that returns a
  hypothesis that is Probably Approximately Correct (PAC).
• Using formal methods, a minimum generalization error can be specified
  for a supervised learning task. The theorem can then be used to estimate
  the expected number of samples from the problem domain that would be
  required to determine whether a hypothesis was PAC or not. That is, it
  provides a way to estimate the number of samples required to find a PAC
  hypothesis.
• Additionally, a hypothesis space (machine learning algorithm) is
  efficient under the PAC framework if an algorithm can find a PAC
  hypothesis (fit model) in polynomial time.

PAC LEARNING
The goal of the PAC framework is to understand how large a data set needs
to be in order to give good generalization. It also gives bounds for the
computational cost of learning …
— Page 344, Pattern Recognition and Machine Learning, 2006.

A hypothesis space is said to be efficiently PAC-learnable if there is a
polynomial time algorithm that can identify a function that is PAC.
— Page 210, Machine Learning: A Probabilistic Perspective, 2012.
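The sample-size estimate itself appears only as an equation image on the original slides. A standard form of this bound, for a finite hypothesis space H in the realizable case (a textbook result, stated here as the likely content rather than recovered from the slide), is:

m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

With at least m independent examples, any hypothesis consistent with all of them has true error at most ε with probability at least 1 − δ.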
PAC LEARNING
ε-Representative and Uniform Convergence
The formal definitions appear as equations on the original slides. In
standard form: a training sample S is ε-representative (w.r.t. a
hypothesis class H and distribution D) if for every h in H,
|L_S(h) − L_D(h)| ≤ ε; and H has the uniform convergence property if
there is a sample size m(ε, δ) such that a sample of at least that size
is ε-representative with probability at least 1 − δ.
PAC LEARNING
Estimating the number of samples required to find a PAC hypothesis: this
is the kind of bound given above.

VC DIMENSION
• Vapnik–Chervonenkis theory, or VC theory for short, refers to a
  theoretical machine learning framework developed by Vladimir Vapnik and
  Alexey Chervonenkis.
• VC theory seeks to quantify the capability of a learning algorithm and
  might be considered another one of the premier sub-fields of statistical
  learning theory.
• VC theory is comprised of many elements, most notably the VC dimension.
• The VC dimension quantifies the complexity of a hypothesis space, e.g.
  the models that could be fit given a representation and learning
  algorithm.
VC DIMENSION
Why do we need VC Dimension?
• One way to consider the complexity of a hypothesis space (space of
  models that could be fit) is based on the number of distinct hypotheses
  it contains and perhaps how the space might be navigated. The VC
  dimension is a clever approach that instead measures the number of
  examples from the target problem that can be discriminated by
  hypotheses in the space.
• The VC dimension estimates the capability or capacity of a
  classification machine learning algorithm for a specific dataset
  (number and dimensionality of examples).
• The VC dimension is used as part of the PAC learning framework.

VC DIMENSION
A key quantity in PAC learning is the Vapnik-Chervonenkis dimension, or
VC dimension, which provides a measure of the complexity of a space of
functions, and which allows the PAC framework to be extended to spaces
containing an infinite number of functions.
— Page 344, Pattern Recognition and Machine Learning, 2006.

VC DIMENSION: LEARNERS AND COMPLEXITY
• We've seen many versions of underfit/overfit trade-off
  – Complexity of the learner
  – "Representational Power"
• Different learners have different power
[Figure, repeated across several slides: measured feature values
x1, x2, …, xn fed with parameters into a classifier that outputs a
predicted class, alongside example decision boundaries of increasing
complexity]

VC DIMENSION: LEARNERS AND COMPLEXITY
• Usual trade-off:
  – More power = represent more complex systems, might overfit
  – Less power = won't overfit, but may not find "best" learner
• How can we quantify representational power?
  – Not easily…
  – One solution is VC (Vapnik-Chervonenkis) dimension
(c) Alexander Ihler

VC DIMENSION
What does VC Dimension mean though?
• Formally, the VC dimension is the largest number of examples from the
  training dataset that the space of hypotheses from the algorithm can
  "shatter."
• Shatter or a shattered set, in the case of a dataset, means points in
  the feature space can be selected or separated from each other using
  hypotheses in the space such that the labels of examples in the
  separate groups are correct (whatever they happen to be).
• Whether a group of points can be shattered by an algorithm depends on
  the hypothesis space and the number of points.
• For example, a line (hypothesis space) can be used to shatter three
  points, but not four points.

VC DIMENSION CONTINUED
What does VC Dimension mean though?
• Any placement of three points on a 2d plane with class labels 0 or 1
  can be "correctly" split by label with a line, e.g. shattered. But
  there exist placements of four points on the plane with binary class
  labels that cannot be correctly split by label with a line, e.g.
  cannot be shattered. Instead, another "algorithm" must be used, such
  as ovals.
VC DIMENSION CONTINUED
What does VC Dimension mean though?
• Therefore, the VC dimension of a machine learning algorithm is the
  largest number of data points in a dataset that a specific
  configuration of the algorithm (hyperparameters) or specific fit model
  can shatter.
• A classifier that predicts the same value in all cases will have a VC
  dimension of 0, no points.
• A large VC dimension indicates that an algorithm is very flexible,
  although the flexibility may come at the cost of additional risk of
  overfitting.

VC DIMENSION: SHATTERING
• We say a classifier f(x) can shatter points x(1)…x(h) iff for all
  labelings y(1)…y(h), f(x) can achieve zero error on training data
  (x(1),y(1)), (x(2),y(2)), …, (x(h),y(h))
  (i.e., there exists some θ that gets zero error)
• Can f(x;θ) = sign(θ0 + θ1x1 + θ2x2) shatter these points?
  [Two example points shown on the slide]

VC DIMENSION: SHATTERING
• Can f(x;θ) = sign(θ0 + θ1x1 + θ2x2) shatter two points?
• Yes: there are 4 possible training sets (labelings), and a line can
  realize each one.

VC DIMENSION: SHATTERING
• Can f(x;θ) = sign(x1² + x2² - θ) shatter these points?
• Nope! (The zero-centered circle cannot realize every labeling of the
  two points shown; the previous example was general.)

VC DIMENSION
• The VC dimension H is defined as the maximum number of points h that
  can be arranged so that f(x) can shatter them
• Example: what's the VC dimension of the (zero-centered) circle,
  f(x;θ) = sign(x1² + x2² - θ)?
• VCdim = 1: can arrange one point, cannot arrange two

VC DIMENSION: SHATTERING
• Example: what's the VC dimension of the two-dimensional line,
  f(x;θ) = sign(θ1x1 + θ2x2 + θ0)?
• VC dim >= 3? Yes: three points (in general position) can be shattered
  by a line.
• VC dim >= 4? No… Any line through these points must split one pair
  (by crossing one of the lines).
• Turns out: for a general linear classifier (perceptron) in d
  dimensions with a constant term: VC dim = d + 1
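A hedged sketch (ours, not from the slides) that checks shattering by brute force: for a small point set, try every labeling and see whether a linear classifier can fit it with zero training error; the use of weakly regularized logistic regression as the "line" is an illustration choice:

import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def can_shatter(points):
    """True if a linear classifier achieves zero error on every labeling."""
    n = len(points)
    for labels in itertools.product([0, 1], repeat=n):
        if len(set(labels)) < 2:   # trivial labelings need no boundary
            continue
        clf = LogisticRegression(C=1e6, max_iter=10000).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # XOR arrangement
print(can_shatter(three))  # True: a line shatters 3 points in general position
print(can_shatter(four))   # False: no line realizes the XOR labeling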

VC DIMENSION
• VC dimension measures the "power" of the learner
• Does *not* necessarily equal the # of parameters!
• Number of parameters does not necessarily equal complexity
  – Can define a classifier with a lot of parameters but not much power (how?)
  – Can define a classifier with one parameter but lots of power (how?)
• It is a lot of work to determine what the VC dimension of various
  learners is…

USING VC DIMENSION
• Use validation / cross-validation to select complexity
[Table sketch: models f1…f6 with # params, train error, and
cross-validation error]
(c) Alexander Ihler
USING VC DIMENSION
• Use validation / cross-validation to select complexity
• Use VC dimension based bound on test error similarly
• "Structural Risk Minimization" (SRM)
[Table sketch: models f1…f6 with # params, train error, VC term, and VC
test bound]

USING VC DIMENSION
• Other alternatives
  – Probabilistic models: likelihood under model (rather than
    classification error)
  – AIC (Akaike Information Criterion)
    • Log-likelihood of training data - # of parameters
  – BIC (Bayesian Information Criterion)
    • Log-likelihood of training data - (# of parameters)*log(m)
  – Similar to VC dimension: performance + penalty
  – BIC conservative; SRM very conservative
  – Also, "true Bayesian" methods (take prob. learning…)
(c) Alexander Ihler


TRAINING ERROR VS PREDICTION/TEST ERROR
[Figure: MSEtrain vs MSEtest as model capacity grows]

NO-FREE-LUNCH THEOREM
• The theorem states that all optimization algorithms perform equally
  well when their performance is averaged across all possible problems.
• It implies that there is no single best optimization algorithm, i.e.
  no classifier is inherently better than any other: you need to make
  assumptions to generalize.
• Because of the close relationship between optimization, search, and
  machine learning, it also implies that there is no single best machine
  learning algorithm for predictive modeling problems such as
  classification and regression.
OCCAM'S RAZOR (PRINCIPLE OF PARSIMONY)
• This philosophical idea in the context of ML suggests that, all else
  being equal, a simpler model is preferred over a more complex model.
• It does not mean that simpler models are universally better than
  complex models, but rather that a model must be complex enough to learn
  the patterns in a dataset and to avoid underfitting, but simple enough
  to avoid overfitting.
• When choosing between two models, we can only say a simpler model is
  better if its generalization error is equal to or less than that of
  the more complex model.

REFERENCES
• https://machinelearningmastery.com/introduction-to-computational-learning-theory
• Artificial Intelligence: A Modern Approach, 3rd edition, 2009 (Book)
• The Nature of Statistical Learning Theory, 1999 (Book)
• Pattern Recognition and Machine Learning, 2006 (Book)
• Machine Learning: A Probabilistic Perspective, 2012 (Book)
• Machine Learning, 1997 (Book)
• Slides of Andrew W. Moore (Associate Professor, School of Computer
  Science, Carnegie Mellon University)
• An Introduction to Computational Learning Theory by Kearns and Vazirani
Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 9 —
Classification: Advanced Methods
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Acknowledgments: The slides in this presentation are mostly the textbook
slides from the data mining textbook "Data Mining: Concepts and Techniques"
by Jiawei Han, Micheline Kamber, and Jian Pei. The slides have been
modified i.e., some new slides have been added and some slides have been
deleted. Information in this presentation has also been obtained from a
wide variety of publicly available Internet sources.

Bayesian Belief Networks
■ Bayesian belief networks (also known as Bayesian networks,
  probabilistic networks): allow class conditional independencies
  between subsets of variables
■ A (directed acyclic) graphical model of causal relationships
■ Represents dependency among the variables
■ Gives a specification of joint probability distribution
  ❑ Nodes: random variables
  ❑ Links: dependency
  ❑ [Example graph: X and Y are the parents of Z, and Y is the parent of P]
  ❑ No dependency between Z and P
  ❑ Has no loops/cycles

Bayesian Belief Network: An Example
[Network diagram: Family History (FH), Smoker (S), LungCancer (LC),
Emphysema, PositiveXRay, and Dyspnea; FH and S are the parents of LC]

CPT: Conditional Probability Table for variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination
of its parents.

Derivation of the probability of a particular combination of values of X,
from CPT:
P(x1, …, xn) = ∏ P(xi | Parents(Xi))

Training Bayesian Networks: Several Scenarios
■ Scenario 1: Given both the network structure and all variables
  observable: compute only the CPT entries
■ Scenario 2: Network structure known, some variables hidden: gradient
  descent (greedy hill-climbing) method, i.e., search for a solution
  along the steepest descent of a criterion function
  ■ Weights are initialized to random probability values
  ■ At each iteration, it moves towards what appears to be the best
    solution at the moment, w.o. backtracking
  ■ Weights are updated at each iteration & converge to local optimum
■ Scenario 3: Network structure unknown, all variables observable:
  search through the model space to reconstruct network topology
■ Scenario 4: Unknown structure, all hidden variables: No good
  algorithms known for this purpose
■ D. Heckerman. A Tutorial on Learning with Bayesian Networks. In
  Learning in Graphical Models, M. Jordan, ed. MIT Press, 1999.
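A hedged Python sketch (ours) of the factorization above, computing the joint probability of one assignment from CPTs stored as plain dictionaries. Only the LungCancer CPT values come from the slide; the priors for FH and S are assumptions for illustration:

p_fh = {True: 0.1, False: 0.9}   # assumed prior, not from the slide
p_s = {True: 0.3, False: 0.7}    # assumed prior, not from the slide
p_lc = {  # P(LC | FH, S) from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)."""
    p_lc_given_parents = p_lc[(fh, s)] if lc else 1 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

print(joint(fh=True, s=True, lc=True))  # 0.1 * 0.3 * 0.8 = 0.024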
Classification by Backpropagation
■ Backpropagation: A neural network learning algorithm
■ Started by psychologists and neurobiologists to develop and test
  computational analogues of neurons
■ A neural network: A set of connected input/output units where each
  connection has a weight associated with it
■ During the learning phase, the network learns by adjusting the weights
  so as to be able to predict the correct class label of the input tuples
■ Also referred to as connectionist learning due to the connections
  between units

Neural Network as a Classifier
■ Weakness
  ■ Long training time
  ■ Require a number of parameters typically best determined
    empirically, e.g., the network topology or "structure."
  ■ Poor interpretability: Difficult to interpret the symbolic meaning
    behind the learned weights and of "hidden units" in the network
■ Strength
  ■ High tolerance to noisy data
  ■ Ability to classify untrained patterns
  ■ Well-suited for continuous-valued inputs and outputs
  ■ Successful on an array of real-world data, e.g., hand-written letters
  ■ Algorithms are inherently parallel
  ■ Techniques have recently been developed for the extraction of rules
    from trained neural networks

A Multi-Layer Feed-Forward Neural Network
[Figure: input vector X feeds an input layer, which connects through
weights wij to a hidden layer and then to an output layer that emits the
output vector]

How A Multi-Layer Neural Network Works
■ The inputs to the network correspond to the attributes measured for
  each training tuple
■ Inputs are fed simultaneously into the units making up the input layer
■ They are then weighted and fed simultaneously to a hidden layer
■ The number of hidden layers is arbitrary, although usually only one
■ The weighted outputs of the last hidden layer are input to units making
  up the output layer, which emits the network's prediction
■ The network is feed-forward: None of the weights cycles back to an
  input unit or to an output unit of a previous layer
■ From a statistical point of view, networks perform nonlinear
  regression: Given enough hidden units and enough training samples, they
  can closely approximate any function
Defining a Network Topology
■ Decide the network topology: Specify # of units in the input layer,
  # of hidden layers (if > 1), # of units in each hidden layer, and
  # of units in the output layer
■ Normalize the input values for each attribute measured in the training
  tuples to [0.0–1.0]
■ One input unit per domain value, each initialized to 0
■ Output, if for classification and more than two classes, one output
  unit per class is used
■ Once a network has been trained and its accuracy is unacceptable,
  repeat the training process with a different network topology or a
  different set of initial weights

Backpropagation
■ Iteratively process a set of training tuples & compare the network's
  prediction with the actual known target value
■ For each training tuple, the weights are modified to minimize the mean
  squared error between the network's prediction and the actual target
  value
■ Modifications are made in the "backwards" direction: from the output
  layer, through each hidden layer down to the first hidden layer, hence
  "backpropagation"
■ Steps
  ■ Initialize weights to small random numbers, associated with biases
  ■ Propagate the inputs forward (by applying activation function)
  ■ Backpropagate the error (by updating weights and biases)
  ■ Terminating condition (when error is very small, etc.)
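A hedged numpy sketch (ours, not from the textbook slides) of the backpropagation steps just listed, on a tiny XOR problem; the network size, learning rate, and epoch count are illustration choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialize weights to small random numbers, with biases
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(5000):
    # Step 2: propagate the inputs forward (apply activation function)
    h = sigmoid(X @ W1 + b1)    # hidden layer
    out = sigmoid(h @ W2 + b2)  # output layer

    # Step 3: backpropagate the error (update weights and biases);
    # deltas come from squared-error loss and the sigmoid derivative
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

# Step 4: terminating condition here is simply a fixed number of epochs
print(out.round(3).ravel())  # should approach [0, 1, 1, 0]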

Neuron: A Hidden/Output Layer Unit
[Figure: input vector x = (x0, …, xn) with weight vector w = (w0, …, wn),
a weighted sum plus bias μk, and an activation function f producing
output y, i.e. y = f(Σi wi·xi + μk)]
■ An n-dimensional input vector x is mapped into variable y by means of
  the scalar product and a nonlinear function mapping
■ The inputs to the unit are outputs from the previous layer. They are
  multiplied by their corresponding weights to form a weighted sum, which
  is added to the bias associated with the unit. Then a nonlinear
  activation function is applied to it.

Efficiency and Interpretability
■ Efficiency of backpropagation: Each epoch (one iteration through the
  training set) takes O(|D| * w), with |D| tuples and w weights, but # of
  epochs can be exponential to n, the number of inputs, in worst case
■ For easier comprehension: Rule extraction by network pruning
  ■ Simplify the network structure by removing weighted links that have
    the least effect on the trained network
  ■ Then perform link, unit, or activation value clustering
  ■ The set of input and activation values are studied to derive rules
    describing the relationship between the input and hidden unit layers
■ Sensitivity analysis: assess the impact that a given input variable
  has on a network output. The knowledge gained from this analysis can
  be represented in rules
Lazy vs. Eager Learning
■ Lazy vs. eager learning
  ■ Lazy learning (e.g., instance-based learning): Simply stores training
    data (or only minor processing) and waits until it is given a test
    tuple
  ■ Eager learning (the above discussed methods): Given a set of training
    tuples, constructs a classification model before receiving new
    (e.g., test) data to classify
■ Lazy: less time in training but more time in predicting
■ Accuracy
  ■ Lazy method effectively uses a richer hypothesis space since it uses
    many local linear functions to form an implicit global approximation
    to the target function
  ■ Eager: must commit to a single hypothesis that covers the entire
    instance space

Lazy Learner: Instance-Based Methods
■ Instance-based learning:
  ■ Store training examples and delay the processing ("lazy evaluation")
    until a new instance must be classified
■ Typical approaches
  ■ k-nearest neighbor approach
    ■ Instances represented as points in a Euclidean space.
  ■ Locally weighted regression
    ■ Constructs local approximation
  ■ Case-based reasoning
    ■ Uses symbolic representations and knowledge-based inference

Case-Based Reasoning (CBR)
■ CBR: Uses a database of problem solutions to solve new problems
■ Store symbolic description (tuples or cases)—not points in a Euclidean
  space
■ Applications: Customer-service (product-related diagnosis), legal ruling
■ Methodology
  ■ Instances represented by rich symbolic descriptions (e.g., function
    graphs)
  ■ Search for similar cases, multiple retrieved cases may be combined
  ■ Tight coupling between case retrieval, knowledge-based reasoning, and
    problem solving
■ Challenges
  ■ Find a good similarity metric
  ■ Indexing based on syntactic similarity measure, and when failure,
    backtracking, and adapting to additional cases

Genetic Algorithms (GA)
■ Genetic Algorithm: based on an analogy to biological evolution
■ An initial population is created consisting of randomly generated rules
  ■ Each rule is represented by a string of bits
  ■ E.g., if A1 and ¬A2 then C2 can be encoded as 100
  ■ If an attribute has k > 2 values, k bits can be used
■ Based on the notion of survival of the fittest, a new population is
  formed to consist of the fittest rules and their offspring
■ The fitness of a rule is represented by its classification accuracy on
  a set of training examples
■ Offspring are generated by crossover and mutation
■ The process continues until a population P evolves when each rule in P
  satisfies a prespecified threshold
■ Slow but easily parallelizable
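A hedged Python sketch (ours) of the two GA operators named above, acting on bit-string encoded rules like the "100" example on the slide; the function names and mutation rate are illustration choices:

import random

random.seed(0)

def crossover(parent_a, parent_b):
    """Single-point crossover: swap the tails of two bit strings."""
    point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(rule, rate=0.1):
    """Flip each bit independently with the given probability."""
    return "".join(b if random.random() > rate else str(1 - int(b))
                   for b in rule)

child1, child2 = crossover("100", "011")
print(child1, child2, mutate(child1))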
Active Learning
■ Class labels are expensive to obtain
■ Active learner: query human (oracle) for labels
■ Pool-based approach: Uses a pool of unlabeled data
  ■ L: a small subset of D is labeled, U: a pool of unlabeled data in D
  ■ Use a query function to carefully select one or more tuples from U
    and request labels from an oracle (a human annotator)
  ■ The newly labeled samples are added to L, and learn a model
  ■ Goal: Achieve high accuracy using as few labeled data as possible
■ Evaluated using learning curves: Accuracy as a function of the number
  of instances queried (# of tuples to be queried should be small)
■ Research issue: How to choose the data tuples to be queried?
  ■ Uncertainty sampling: choose the least certain ones
  ■ Reduce version space, the subset of hypotheses consistent w. the
    training data
  ■ Reduce expected entropy over U: Find the greatest reduction in the
    total number of incorrect predictions

Transfer Learning: Conceptual Framework
■ Transfer learning: Extract knowledge from one or more source tasks and
  apply the knowledge to a target task
■ Traditional learning: Build a new classifier for each new task
■ Transfer learning: Build new classifier by applying existing knowledge
  learned from source tasks
[Figures: Traditional Learning Framework vs Transfer Learning Framework]
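A hedged sketch (ours) of pool-based uncertainty sampling: at each round, query the unlabeled point whose predicted class probability is closest to 0.5; the synthetic data and the use of the hidden labels as a stand-in oracle are assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
labeled = list(range(10))    # L: small labeled subset
pool = list(range(10, 200))  # U: unlabeled pool (labels hidden)

for _ in range(5):           # five querying rounds
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]  # least certain tuple
    labeled.append(query)    # the oracle (here: the hidden y) labels it
    pool.remove(query)

print(model.score(X, y))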

Transfer Learning: Methods and Applications
■ Applications: Especially useful when data is outdated or distribution
  changes, e.g., Web document classification, e-mail spam filtering
■ Instance-based transfer learning: Reweight some of the data from source
  tasks and use it to learn the target task
■ TrAdaBoost (Transfer AdaBoost)
  ■ Assume source and target data each described by the same set of
    attributes (features) & class labels, but with different distributions
  ■ Require only labeling a small amount of target data
  ■ Use source data in training: When a source tuple is misclassified,
    reduce the weight of such tuples so that they will have less effect
    on the subsequent classifier
■ Research issues
  ■ Negative transfer: When it performs worse than no transfer at all
  ■ Heterogeneous transfer learning: Transfer knowledge from different
    feature space or multiple source domains
  ■ Large-scale transfer learning

BIAS VARIANCE TRADE-OFF

Acknowledgments: Information in this presentation has been obtained from a
wide variety of publicly available Internet sources. Slides created by Mr.
Saransh Gupta for academic use only, as part of course material for
Introduction to Machine Learning course.

CONTENTS
• Bias and Variance: Introduction, why and what?
• Bias and Variance: Conceptual Definition
• Bias-Variance: Tradeoff
• Bias and Variance: Graphical Definition
• An Illustrative Example: Voting Intentions
• An Illustrative Example: Voting Intentions Caveats
• Bias and Variance: Mathematical Definition
• Bias-Variance: Tradeoff Revisited
• ML Model Space, Hypothesis Space & Hyperparameters
• Generalization and Bias Variance Tradeoff Intuitions
• Error comparisons, No Free Lunch Theorem and Occam's Razor
• Going beyond theory and Managing Bias and Variance
• Bias-Variance Tradeoff: Overfitting/Underfitting & Summary
• More about Managing Bias and Variance practically
• An Applied Example: Voter Party Registration

BIAS AND VARIANCE: INTRODUCTION
Understanding these two types of error, i.e. the error due to "Bias" and
the error due to "Variance", can help us diagnose model results and avoid
the mistake of over-fitting or under-fitting.

BIAS AND VARIANCE: WHY AND WHAT?
• Think Conceptually
• Think Graphically
• Think Mathematically
BIAS AND VARIANCE: CONCEPTUAL DEFINITION
If we repeat the entire model building process multiple times with new
data every time, the resulting models will have a range of predictions
because of the randomness in underlying data sets. Bias measures how far
off in general these models' predictions are from the correct value.
If you repeat the entire model building process multiple times, the
variance tells us how much the predictions for a given point vary between
different realizations of the model.

BIAS-VARIANCE: TRADEOFF
Remember: There is a tradeoff between a model's ability to minimize bias
and variance.
In statistics and machine learning, the bias–variance tradeoff is the
property of a model that the variance of the parameter estimated across
samples can be reduced by increasing the bias in the estimated parameters.
Note: We will revisit this again in detail in this slide deck.

BIAS AND VARIANCE: GRAPHICAL DEFINITION
• Imagine that the center of the target is a model that perfectly
  predicts the correct values. As we move away from the bulls-eye, our
  predictions get worse and worse.
• Imagine we can repeat our entire model building process to get a
  number of separate hits on the target. Each hit represents an
  individual realization of our model, given the chance variability in
  the training data we gather.
• Sometimes we will get a good distribution of training data so we
  predict very well and we are close to the bulls-eye, while sometimes
  our training data might be full of outliers or non-standard values
  resulting in poorer predictions.
• These different realizations result in a scatter of hits on the target.

BIAS AND VARIANCE: GRAPHICAL DEFINITION
Bias: Bias measures how far off in general models' predictions are from
the correct value.
Variance: The variance is how much the predictions for a given point vary
between different realizations of the model.
AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS

Voting Republican   Voting Democratic   Non-Respondent   Total
13                  16                  21               50

AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS
We put out our press release that the Democrats are going to win by over
10 points; but, when the election comes around, it turns out they
actually lose by 10 points.
That certainly reflects poorly on us!
Where did we go wrong in our model?

AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS
• Multiple issues with the trivial model we built:
  1. We only sample people from the phone book and so only include
     people with listed numbers.
  2. We did not follow up with non-respondents. They may have different
     voting patterns from the respondents.
  3. We do not try to weight responses by likeliness to vote.
  4. We have a very small sample size.
  … and so on.
• Can we classify these sources of errors as bias and variance?

AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS

Sources of Bias
1. Using a phonebook to select participants in our survey – surveying
   only certain classes of people, consistently!
2. Not following up with non-respondents – it consistently changes the
   mixture of responses we get!
3. Not trying to weight responses by likeliness to vote – treating all
   responses with an equal weightage.
Result: On our bulls-eye diagram, these biases move us away from the
center of the target, but they would not result in an increased scatter
of estimates.

Source of Variance
4. Small sample size – increasing sample size will give consistent
   results each time the survey and prediction are repeated.*
Result: On the bulls-eye diagram, the low sample size results in a wide
scatter of estimates. Increasing the sample size would make the estimates
clump closer together, but they still might miss the center of the target.
AN ILLUSTRATIVE EXAMPLE: VOTING INTENTIONS CAVEATS
For instance, in general the data set used to build the model is provided
prior to model construction and the modeler cannot simply say, "Let's
increase the sample size to reduce variance."
*Note: Even if we increase the sample size and get consistent results,
they still might be highly inaccurate due to our large sources of bias,
but the variance of predictions will be reduced.

BIAS AND VARIANCE: MATHEMATICAL DEFINITION
Bias: error due to incorrect assumptions (or restriction of model class)
Variance: error due to variance of training samples (or randomness of
training data)
[Formal definitions shown as equations on the original slides]

BIAS-VARIANCE: TRADEOFF REVISITED
Three kinds of error:
1. Inherent (because of noise): unavoidable and irreducible
2. Bias: due to over-simplifications
3. Variance: due to inability to perfectly estimate parameters from
   limited data
In practice an explicit tradeoff exists between bias and variance where
decreasing one increases the other.
Minimizing the total error of the model requires a careful balancing of
these two forms of error.
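The equations on these slides are images. The standard decomposition of expected squared error, matching the three kinds of error listed above (a textbook identity, not recovered from the slide itself), is:

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{inherent noise}}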
MACHINE LEARNING MODEL SPACE
[Figure: Venn diagram of all possible models, the models consistent with
the data, a restricted model class, and the correct model]
Restricting the model class can help. Or it can hurt!
Depends on whether restrictions are domain appropriate.

HYPOTHESIS SPACE AND HYPOTHESIS
• Hypothesis Space: It is the set of all the possible legal hypotheses.
  This is the set from which the ML algorithm would determine the best
  possible (only one) hypothesis which would best describe the target
  function or the outputs.
• Hypothesis: A hypothesis is a function that best describes the target
  in supervised machine learning. The hypothesis that an algorithm would
  come up with depends upon the data and the restrictions and bias that
  we have imposed on the data.
Every learning algorithm requires assumptions about the hypothesis space.
• Eg: "My hypothesis space is
  • …linear"
  • …decision trees with 5 nodes"
  • …a three layer neural network with rectifier hidden units"

HYPERPARAMETER SELECTION
• Hyperparameters: Every Machine Learning model has parameters that are
  explicitly defined by the user to control the learning process and need
  to be tuned for an improved learning of the model. Examples:
  • k in k-NN or k-Nearest Neighbors
  • Learning rate for training a neural network
  • Train-test split ratio
  • Batch size
  • Branches in a Decision Tree
  • Number of clusters in a Clustering Algorithm
  • … and so on.
We want to choose hyperparameter values which would give us the best
performance on test data.

GENERALIZATION (ON TEST DATA)
[Equations shown on the original slide]

GENERALIZATION (ON TEST DATA)
• On one hand, while choosing a hypothesis space, we may impose the
  inductive bias too restrictively – leads to Underfitting!
• A fundamental question in learning theory is, over which hypothesis
  classes, learning will not result in overfitting.
• On the other hand, if we minimize the error (loss) with respect to
  training data alone, it does not result in a good expected/predictive
  performance – leads to Overfitting!
• We need to be able to generalize on the test data better, i.e.
  minimize the expected loss (prediction risk).
Intuitively, choosing a more restricted hypothesis class better protects
us against overfitting but at the same time might cause us a stronger
inductive bias. We will get back to this fundamental tradeoff later.

BIAS–VARIANCE TRADEOFF - INTUITION 1
• Model is too simple
• Too few parameters
• Does not fit the data well
• It is a biased solution
• Not enough flexibility
• Inaccurate!

• Model is too complex
• Too many parameters
• Small changes to the data change the solution a lot
• It is a high-variance solution
• Too much sensitivity to the sample
• Inaccurate!
BIAS–VARIANCE TRADEOFF - INTUITION 2
• Bias is the true error (loss) of the best predictor in the hypothesis
  set
• Variance describes how much the best classifier depends on the
  training set
[Figure: Under-fitting (too simple to explain variance),
Appropriate-fitting, Over-fitting (forcefitting – too good to be true)]

BIAS–VARIANCE TRADEOFF - INTUITION 3
[Figure: Under-fitting (high bias), Appropriate-fitting, Over-fitting
(high variance)]
How to reduce bias?
• Increase model size (make complex)
• Change model architecture
• Feature addition and parameter tuning
How to reduce variance?
• Decrease model size (make simple)
• Cross-validate the parameters
• Feature selection and dimensionality reduction
• Ensure the data is truly representative
• Get more training data
GENERALIZATION (ON TEST DATA): UNDERFITTING AND OVERFITTING
• Underfitting occurs when the model is too "simple" to represent all
  the relevant class characteristics.
  • High bias and low variance
  • There can be a high training error and a high test error
  • Arises when the classifier cannot represent the data.
• Overfitting occurs when the model is too "complex" and fits irrelevant
  characteristics (noise) in the data.
  • Low bias and high variance
  • There may be a low training error but a high test error
  • Arises when the learner tries to fit every data point in the
    training set very closely, i.e. overfits.


TRAINING ERROR VS PREDICTION/TEST ERROR
[Figure: MSEtrain vs MSEtest as model capacity grows]

NO-FREE-LUNCH THEOREM
• The theorem states that all optimization algorithms perform equally
  well when their performance is averaged across all possible problems.
• It implies that there is no single best optimization algorithm, i.e.
  no classifier is inherently better than any other: you need to make
  assumptions to generalize.
• Because of the close relationship between optimization, search, and
  machine learning, it also implies that there is no single best machine
  learning algorithm for predictive modeling problems such as
  classification and regression.

OCCAM'S RAZOR (PRINCIPLE OF PARSIMONY)
• This philosophical idea in the context of ML suggests that, all else
  being equal, a simpler model is preferred over a more complex model.
• It does not mean that simpler models are universally better than
  complex models, but rather that a model must be complex enough to learn
  the patterns in a dataset and to avoid underfitting, but simple enough
  to avoid overfitting.
• When choosing between two models, we can only say a simpler model is
  better if its generalization error is equal to or less than that of
  the more complex model.

GOING BEYOND THEORY: PRACTICAL CONSIDERATIONS
• There are practical situations wherein we may choose a simpler model
  even if its generalization error is higher than that of the more
  complex models.
• In fact simpler models may provide the following advantages:
  • Less memory usage.
  • Faster inference times.
  • Better explainability.
• Ex: Consider a fraud detection problem where a decision tree achieves
  an accuracy of 98% and a neural network achieves an accuracy of 99%.
• Consider other restrictions/limitations of the problem:
  • Requires quick inference times
  • Our deployment server has memory limitations
  • The model's predictions need to be explained to a regulatory body
    within the company that is working on this use case.
• Clearly, a decision tree is a much better model than a neural network
  here as it is likely a smaller model with faster inference times and is
  much easier to explain than a neural network.
• Unless the 1% drop in accuracy is immensely significant in such a
  problem, the practical decision would be to choose the decision tree
  over the neural network.

MANAGING BIAS AND VARIANCE
Fight your instincts: A gut feeling of minimizing bias even at the
expense of variance is not a good idea. This is mistaken logic!
• It is common to assume the presence of bias indicates something
  basically wrong with the model and algorithm. It is wrongly inferred
  that while variance is also bad, a model with high variance could at
  least predict well on average, i.e. at least it is not fundamentally
  wrong.
• It is true that a high variance and low bias model can perform well in
  some sort of long-run average sense. However, in practice modelers are
  always dealing with a single realization of the data set. In these
  cases, long run averages are irrelevant; what is important is the
  performance of the model on the data you actually have, and in this
  case bias and variance are equally important and one should not be
  improved at an excessive expense to the other.

MANAGING BIAS AND VARIANCE
Bagging and Resampling Techniques: Can be used to reduce the variance in
model predictions.
• In Bagging (Bootstrap Aggregating), numerous replicates of the original
  data set are created using random selection with replacement. Each
  derivative data set is then used to construct a new model and the
  models are gathered together into an ensemble. To make a prediction,
  all of the models in the ensemble are polled and their results are
  averaged.
• One powerful modeling algorithm that makes good use of bagging is
  Random Forests (RF). It works by training numerous decision trees each
  based on a different resampling of the original training data. In RF,
  the bias of the full model is equivalent to the bias of a single
  decision tree (which itself has high variance).
• By creating many of these trees, in effect a "forest", and then
  averaging them, the variance of the final model can be greatly reduced
  over that of a single tree. In practice the only limitation on the size
  of the forest is computing time, as an infinite number of trees could
  be trained without ever increasing bias and with a continual (if
  asymptotically declining) decrease in the variance.
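A hedged scikit-learn sketch (ours) of the variance reduction described above: compare a single deep decision tree with a bagged ensemble of such trees on synthetic data (the dataset and ensemble size are illustration choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)  # low bias, high variance
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=100, random_state=0)  # bootstrap replicates

print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())  # typically higher and stabler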

MANAGING BIAS AND VARIANCE
Asymptotic Properties of Algorithms: Theoretically speaking, as your
training sample size grows towards infinity, your model's bias will fall
to 0 (asymptotic consistency) and your model will have a variance that is
no worse than any other potential model you could have used (asymptotic
efficiency).
• However, we do not live in a world of infinite sample sizes so
  asymptotic properties generally have very little practical use.
• An algorithm that may have close to no bias when you have a million
  points, may have very significant bias when you only have a few hundred
  data points.
• More importantly, an asymptotically consistent and efficient algorithm
  may actually perform worse on small sample size data sets than an
  algorithm that is neither asymptotically consistent nor efficient.
When working with real data, it is best to leave aside theoretical
properties of algorithms and to instead focus on their actual accuracy in
a given scenario.

MANAGING BIAS AND VARIANCE
Understanding Over- and Under-Fitting: At its root, dealing with bias and
variance is really about dealing with over- and under-fitting. Bias is
reduced and variance is increased in relation to model complexity. As
more and more parameters are added to a model, the complexity of the
model rises and variance becomes our primary concern while bias steadily
falls.
Understanding bias and variance is critical for understanding the
behavior of prediction models, but in general what you really care about
is overall error, not the specific decomposition.
The sweet spot for any model is the level of complexity at which the
increase in bias is equivalent to the reduction in variance.

BIAS-VARIANCE TRADEOFF: OVERFITTING AND UNDERFITTING

• Tradeoff between bias and variance:
• Simple Models: High Bias, Low Variance
• Complex Models: Low Bias, High Variance
• If our model complexity exceeds this sweet spot, we are in effect
over-fitting our model; while if our complexity falls short of the
sweet spot, we are under-fitting the model.
• In practice, there is no analytical way to find this location.
• Instead we must use an accurate measure of prediction error and
explore differing levels of model complexity and then choose the
complexity level that minimizes the overall error. 35

BIAS-VARIANCE TRADEOFF: SUMMARY

[Figure: Mean Squared Error (true risk) plotted against model complexity,
decomposed into estimation error (variance) and approximation error (bias²).] 36
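To make the "explore differing levels of model complexity" step concrete, here is a minimal sketch (the synthetic data and the polynomial-degree range are assumptions for illustration) that sweeps one complexity knob and keeps the degree with the lowest held-out error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy nonlinear target

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

errors = {}
for degree in range(1, 13):  # complexity knob: polynomial degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = mean_squared_error(y_val, model.predict(X_val))

best = min(errors, key=errors.get)
print(f"best degree = {best}, validation MSE = {errors[best]:.3f}")
# Low degrees under-fit (high bias); high degrees over-fit (high variance).
```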
MORE ABOUT MANAGING BIAS AND VARIANCE, PRACTICALLY

• Ensemble methods reduce variance
• Multiple classifiers are combined
• E.g.: Bagging, boosting
• Decision trees of a given depth
• Increasing depth decreases bias, increases variance
• SVMs
• Higher-degree polynomial kernels decrease bias, increase variance
• Stronger regularization increases bias, decreases variance
• Neural networks
• Deeper models can increase variance, but decrease bias
• K-Nearest Neighbors
• Increasing k generally increases bias, reduces variance 37

AN APPLIED EXAMPLE: VOTER PARTY REGISTRATION

Consider a bit more realistic example:
Assume we have a training data set of voters, each tagged with
three properties: voter party registration, voter wealth, and a
quantitative measure of voter religiousness.
We want to predict voter registration using wealth and
religiousness as predictors.
Red circles: Republican voters
Blue circles: Democratic voters 38

AN APPLIED EXAMPLE: THE K-NEAREST NEIGHBOR ALGORITHM

• There are many ways to go about this modeling task.
• For binary data like ours, logistic regressions are often used.
• However, if we think there are non-linearities in the
relationships between the variables, a more flexible,
data-adaptive approach might be desired.
• One such very flexible machine-learning technique is
k-Nearest Neighbors or k-NN.
• In k-NN, the party registration of a given voter will be found by
plotting him or her on the plane with the other voters.
• The nearest k other voters to him or her will be found using a
geographic measure of distance and the average of their
registrations will be used to predict his or her registration.
• So if the nearest voter to him/her (in terms of wealth and
religiousness) is a Democrat, s/he will also be predicted to be a
Democrat. 39

THE K-NEAREST NEIGHBOR ALGORITHM

• The following figure shows the nearest neighborhoods for each of
the original voters. If k was specified as 1, a new voter's party
registration would be determined by whether they fall within a
red or blue region. 40

[Figure slides: nearest-neighborhood regions for the training voters.
Red circles: Republican voters; Blue circles: Democratic voters.] 41 42
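A minimal sketch of this k-NN setup (the feature values, labels, and the synthetic decision rule are assumptions for illustration, not the slides' actual data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Columns: [wealth, religiousness]; labels: 1 = Republican, 0 = Democrat.
# Purely synthetic: wealthier/more religious voters skew Republican here.
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=200) > 1.0).astype(int)

knn = KNeighborsClassifier(n_neighbors=1)  # k = 1: copy the single nearest voter
knn.fit(X, y)

new_voter = [[0.8, 0.7]]  # high wealth, fairly religious
print("predicted party:", "Republican" if knn.predict(new_voter)[0] else "Democrat")
```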

THE K-NEAREST NEIGHBOR ALGORITHM

• If we sample new voters, we can use our existing training data to predict their
registration.
• The following figure plots the wealth and religiousness for these new voters
and uses the k-NN algorithm to predict their registration.
Red circles: Republican voters
Blue circles: Democratic voters 43

KNN: BEST K?

• We can also plot the full prediction regions for where individuals will be
classified as either Democrats or Republicans.
• Depending upon the choice of k in kNN (1 to the number of data points in the
training set), our accuracy of predictions/placement of new data points varies.
• What is the best value of k?
• Let us try experimenting with the value of k to find the best prediction
algorithm that matches up well with the black boundary line.
• Visit here to try this example out yourself.
Note that the red region represents Republican voters, the blue region
represents Democratic voters, and the black dotted line represents the
actual model's split. 44

COMPARING RESULTS FOR VARIOUS VALUES OF K

[Figure slides comparing prediction regions for several values of k.
Red circles: Republican voters; Blue circles: Democratic voters.] 45

K AND BIAS/VARIANCE

• Increasing k results in the averaging of more voters in each prediction. This
results in smoother prediction curves.
• With a k of 1, the separation between Democrats and Republicans is very
rough. Furthermore, there are "islands" of Democrats in generally Republican
territory and vice versa. 46

K AND BIAS/VARIANCE

• As k is increased to, say, 20, the transition becomes smoother, the islands disappear,
and the split between Democrats and Republicans does a good job of following the
boundary line.
• At small k's the roughness and islands are signs of variance. The locations of the islands
and the exact curves of the boundaries will change radically as new data is gathered.
Red circles: Republican voters
Blue circles: Democratic voters 47

K AND BIAS/VARIANCE

• As k becomes very large, say, 80, the distinction between the two categories becomes
more blurred and the boundary prediction line is not matched very well at all.
• On the other hand, at large k's the transition is very smooth, so there isn't much variance,
but the lack of a match to the boundary line is a sign of high bias.
Red circles: Republican voters
Blue circles: Democratic voters 48

KNN AND BIAS/VARIANCE TRADEOFF

Results and Inferences (k vs Bias/Variance):
• Increasing k will decrease variance and increase bias.
• Decreasing k will increase variance and decrease bias.
• See how variable the predictions are for different data sets at
low k. As k increases, this variability is reduced.
• However, if we increase k too much, then we no longer follow the
true boundary line and we observe high bias.
This is the nature of the Bias-Variance Tradeoff.

KNN AND ANALYTICAL BIAS/VARIANCE

For k-NN the tradeoff can also be written in closed form (the equation appeared
here as a figure; this is the decomposition from the Fortmann-Roe essay cited in
the references):
Err(x0) = [f(x0) − (1/k) Σ i=1..k f(x(i))]² + σε²/k + σε²
where x(1), …, x(k) are the k nearest neighbors of x0 and σε² is the irreducible
noise. The first term (bias²) grows with k as ever more distant neighbors are
averaged in, while the variance term σε²/k shrinks as k grows.
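A minimal sketch of the k sweep (reusing the synthetic voter data from the earlier sketch; all values are assumptions for illustration): watch training accuracy (a variance symptom) and held-out accuracy move as k grows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 2))  # [wealth, religiousness]
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=400) > 1.0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 20, 80):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train acc={knn.score(X_tr, y_tr):.2f}  test acc={knn.score(X_te, y_te):.2f}")
# Expect: k=1 fits the training set perfectly (high variance);
# very large k over-smooths the boundary (high bias).
```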

REFERENCES

• Huang, Catherine. "A Blog About Lunch and Data Science — How There is No Such a Thing As
Free Lunch." Medium, 27 Sept. 2018,
towardsdatascience.com/a-blog-about-lunch-and-data-science-how-there-is-no-such-a-thing-as-free-lunch-e46fd57c7f27.
• "No Free Lunch Theorem for Machine Learning." Machine Learning Mastery,
machinelearningmastery.com/no-free-lunch-theorem-for-machine-learning/.
• Mavuduru, Amol. "What Occam's Razor Means in Machine Learning." Medium, 9 Aug. 2022,
towardsdatascience.com/what-occams-razor-means-in-machine-learning-53f07effc97c.
• "ML | Underfitting and Overfitting." GeeksforGeeks, 22 Aug. 2022,
www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/.
• "ML | Understanding Hypothesis." GeeksforGeeks, 21 Nov. 2019,
www.geeksforgeeks.org/ml-understanding-hypothesis/.
• Shah, Abhishek. "Generalization Error in Machine Learning (Bias Vs. Variance)." Medium,
19 July 2022,
medium.com/@jwbtmf/generalization-error-in-machine-learning-4617141932b7.
• Fortmann-Roe, Scott. "Understanding the Bias-Variance Tradeoff."
scott.fortmann-roe.com/docs/BiasVariance.html.
• www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf. 49
Data Mining
Ensemble Techniques

Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar

Acknowledgments: The slides in this presentation are mostly the textbook slides from the
data mining textbook "Introduction to Data Mining (2nd edition)" by Tan, Steinbach,
Karpatne, Kumar. The slides have been modified, i.e., some new slides have been added and
some slides have been deleted. Information in this presentation has also been obtained from
a wide variety of publicly available Internet sources.
10/11/2021 Introduction to Data Mining, 2nd Edition 1

Ensemble Methods

● Construct a set of base classifiers learned from the training data
● Predict class label of test records by combining the predictions made by
multiple classifiers (e.g., by taking majority vote)
10/11/2021 Introduction to Data Mining, 2nd Edition 2

Example: Why Do Ensemble Methods Work?

[Figure: classification error for an ensemble of 25 base classifiers,
assuming their errors are uncorrelated.]
10/11/2021 Introduction to Data Mining, 2nd Edition 3

Necessary Conditions for Ensemble Methods

● Ensemble Methods work better than a single base classifier if:
1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
(error rate < 0.5 for binary classification)
10/11/2021 Introduction to Data Mining, 2nd Edition 4
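The missing figure can be recomputed: with 25 independent base classifiers, each with error rate ε, the majority vote errs only when 13 or more classifiers are wrong, i.e. e_ensemble = Σ i=13..25 C(25, i) ε^i (1 − ε)^(25 − i). A minimal sketch of this binomial-tail computation (an illustration, not the textbook's code):

```python
from math import comb

def ensemble_error(eps: float, n: int = 25) -> float:
    """Majority vote of n independent classifiers, each with error rate eps."""
    k = n // 2 + 1  # minimum number of wrong votes for the ensemble to err
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(k, n + 1))

for eps in (0.35, 0.5, 0.55):
    print(f"base error {eps:.2f} -> ensemble error {ensemble_error(eps):.3f}")
# base error 0.35 -> ensemble error 0.060  (ensemble helps when eps < 0.5)
# base error 0.50 -> ensemble error 0.500  (no gain at coin-flip accuracy)
# base error 0.55 -> ensemble error 0.692  (ensemble hurts when eps > 0.5)
```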
Rationale for Ensemble Learning

● Ensemble Methods work best with unstable base classifiers
– Classifiers that are sensitive to minor perturbations in the
training set, due to high model complexity
– Examples: Unpruned decision trees, ANNs, …
10/11/2021 Introduction to Data Mining, 2nd Edition 5

Bias-Variance Decomposition

● [Equation shown as a figure in the original slide.]
10/11/2021 Introduction to Data Mining, 2nd Edition 6

Bias-Variance Trade-off and Overfitting

[Figure: generalization error vs. model complexity, with the
underfitting and overfitting regimes marked.]
● Ensemble methods try to reduce the variance of complex
models (with low bias) by aggregating responses of
multiple base classifiers
10/11/2021 Introduction to Data Mining, 2nd Edition 7

General Approach of Ensemble Learning

[Figure: multiple base classifiers are built and their predictions combined]
using majority vote or weighted majority vote
(weighted according to their accuracy or relevance)
10/11/2021 Introduction to Data Mining, 2nd Edition 8
Constructing Ensemble Classifiers

● By manipulating training set
– Example: bagging, boosting, random forests
● By manipulating input features
– Example: random forests
● By manipulating class labels
– Example: error-correcting output coding
● By manipulating learning algorithm
– Example: injecting randomness in the initial weights of ANN
10/11/2021 Introduction to Data Mining, 2nd Edition 9

Ensemble Learning

• The core idea of Ensemble learning is to improve ML
predictions/results by amalgamating/combining different
models to provide better predictions than if you had used
any of those models on an individual or standalone basis
• Remember that ensemble in English means a group of people
playing music together, some kind of collection: hence, the
name "Ensemble learning"
• Core idea is to learn multiple classifiers and then use a voting
mechanism
• Types of ensemble learning
– Bagging
– Boosting
10

Bagging

• Bootstrap Aggregating
• ML ensemble algorithm for improving stability and accuracy
• Decreases the variance → avoids overfitting
• Used in classification and regression, decision tree methods
• Special case of model averaging approach
11

Bagging (Bootstrap AGGregatING)

● Bootstrap sampling: sampling with replacement
● Build classifier on each bootstrap sample
● Probability of a training instance being selected in a bootstrap sample is:
1 – (1 – 1/n)^n   (n: number of training instances)
≈ 0.632 when n is large
● Each classifier provides its class prediction
● The ensemble bagging classifier counts votes and assigns the
class that got the highest votes to the test sample
10/11/2021 Introduction to Data Mining, 2nd Edition 12
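A quick numerical check of the 0.632 figure (a sketch, not from the slides): the selection probability 1 − (1 − 1/n)^n approaches 1 − 1/e ≈ 0.632 as n grows.

```python
import math

for n in (10, 100, 1000, 1_000_000):
    p = 1 - (1 - 1 / n) ** n  # P(instance appears in one bootstrap sample of size n)
    print(f"n={n:>9,}  p={p:.4f}")
print(f"limit 1 - 1/e = {1 - 1 / math.e:.4f}")
```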
Big picture of Bagging

• From the dataset, create multiple subsets (samples)
with an equal number of tuples (with replacement)
• Build a classifier model on each sample
– Observe how each model is learned in parallel and independently
• Combine predictions from all models based on a voting mechanism
13

Bagging Algorithm

[Pseudocode shown as a figure in the original slide.]
10/11/2021 Introduction to Data Mining, 2nd Edition 14

Boosting

• The core idea is to build a model using the training data, and
then build another model that attempts to rectify the errors of
the first model
• This is an iterative process with stopping conditions such as:
– stop when you have already added the maximum number of models, or
– stop when the complete training set has been correctly predicted
• Observe how the learning in the case of boosting is sequential
(not parallel) and adaptive
– Observe how this is different from bagging, where the
learning happens in parallel and independently
15

Boosting

● An iterative procedure to adaptively change the
distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal
weights (for being selected for training)
– Unlike bagging, weights may change at the
end of each boosting round
10/11/2021 Introduction to Data Mining, 2nd Edition 16
Boosting

● Records that are wrongly classified will have their
weights increased in the next round
● Records that are classified correctly will have
their weights decreased in the next round
10/11/2021 Introduction to Data Mining, 2nd Edition 17

Boosting

• There are many boosting algorithms
• AdaBoost algorithm by Robert Schapire and Yoav Freund
• The Adaptive Boosting (AdaBoost) algorithm is among the most
popular boosting algorithms
• Combines multiple weak classifiers to create one strong classifier
Source: https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/
18

Big picture of AdaBoost

• Initialize the dataset and assign equal weight to each of the data points
• Provide this as input to the model and identify the wrongly
classified data points
• Increase the weights of the wrongly classified data points and
decrease the weights of correctly classified data points. Then
normalize the weights of all data points.
• If (required results have been obtained) then end, otherwise
continue iteratively
Source: https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/
(Adapted from the above source with some minor modifications)
19

AdaBoost

● Base classifiers: C1, C2, …, CT
● Error rate of a base classifier (the equation appeared as a figure;
in the textbook's notation):
ε_i = (1/N) Σ j w_j δ(C_i(x_j) ≠ y_j)
● Importance of a classifier:
α_i = (1/2) ln((1 − ε_i) / ε_i)
10/11/2021 Introduction to Data Mining, 2nd Edition 20
AdaBoost Algorithm

● Weight update (the equation appeared as a figure; in the textbook's
notation, with Z_i a normalization factor so the weights sum to 1):
w_j^(i+1) = (w_j^(i) / Z_i) · exp(−α_i) if C_i(x_j) = y_j,
and (w_j^(i) / Z_i) · exp(+α_i) if C_i(x_j) ≠ y_j
● If any intermediate round produces an error rate
higher than 50%, the weights are reverted back to
1/n and the resampling procedure is repeated
● Classification:
C*(x) = arg max_y Σ i α_i δ(C_i(x) = y)
10/11/2021 Introduction to Data Mining, 2nd Edition 21 10/11/2021 Introduction to Data Mining, 2nd Edition 22
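A minimal runnable sketch of the boosting loop just described (an illustration under assumptions, not the textbook's code): decision stumps as weak base classifiers via scikit-learn's AdaBoostClassifier; the estimator keyword assumes scikit-learn ≥ 1.2 (older versions call it base_estimator).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)          # a weak learner: one split
boost = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=0)
boost.fit(X_tr, y_tr)                                # reweights records each round

print(f"single stump: {stump.fit(X_tr, y_tr).score(X_te, y_te):.3f}")
print(f"50 boosted stumps: {boost.score(X_te, y_te):.3f}")
```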

Random Forest Algorithm

● Construct an ensemble of decision trees by
manipulating the training set as well as the features
– Use a bootstrap sample to train every decision
tree (similar to Bagging)
– Use the following tree induction algorithm:
◆ At every internal node of the decision tree, randomly
sample p attributes for selecting the split criterion
◆ Repeat this procedure until all leaves are pure
(unpruned tree)
10/11/2021 Introduction to Data Mining, 2nd Edition 23

Characteristics of Random Forest

[Slide content shown as a figure in the original.]
10/11/2021 Introduction to Data Mining, 2nd Edition 24
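The two randomization knobs just described (bootstrap sampling of rows, a random feature subset at each split) map directly onto scikit-learn's RandomForestClassifier; a minimal sketch on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    bootstrap=True,        # each tree trains on a bootstrap sample of the rows
    max_features="sqrt",   # p = sqrt(#features) attributes sampled at each split
    random_state=0,
)
print(f"5-fold CV accuracy: {cross_val_score(rf, X, y, cv=5).mean():.3f}")
```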
Gradient Boosting

● Constructs a series of models
– Models can be any predictive model that has a
differentiable loss function
– Commonly, trees are the chosen model
◆ XGBoost (eXtreme Gradient Boosting) is a popular
package because of its impressive performance
● Boosting can be viewed as optimizing the loss
function by iterative functional gradient descent.
● Implementations of various boosted algorithms
are available in Python, R, Matlab, and more.
10/11/2021 Introduction to Data Mining, 2nd Edition 25
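A minimal sketch of iterative functional gradient descent for squared loss (an illustration, not any package's internals; data, depth, and learning rate are assumptions): each round fits a small tree to the current residuals, i.e. the negative gradient, and adds it with a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)

lr, trees = 0.1, []
F = np.full_like(y, y.mean())             # start from a constant model
for _ in range(100):
    residual = y - F                      # negative gradient of squared loss
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    F += lr * t.predict(X)                # functional gradient step

print(f"training MSE after 100 rounds: {np.mean((y - F) ** 2):.4f}")
```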


Data Mining
Classification: Alternative Techniques

Lecture Notes for Chapter 4
Classification_Part2

Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar

Acknowledgments: The slides in this presentation are mostly the textbook slides from the
textbook "Introduction to Data Mining" by Tan, Steinbach, Karpatne, Kumar. The slides have
been modified, i.e., some new slides have been added and some slides have been deleted.
Information in this presentation has also been obtained from a wide variety of publicly
available Internet sources.
2/10/2021 Introduction to Data Mining, 2nd Edition 1

Instance-Based Learning

Nearest Neighbor Classifiers

● Basic idea:
– If it walks like a duck, quacks like a duck, then
it's probably a duck
[Figure: compute the distance from the test record to the training
records, then choose the k "nearest" records.]
2/10/2021 Introduction to Data Mining, 2nd Edition 3

Nearest-Neighbor Classifiers

● Requires the following:
– A set of labeled records
– A proximity metric to compute the distance/similarity
between a pair of records
– e.g., Euclidean distance
– The value of k, the number of nearest neighbors to retrieve
– A method for using the class labels of the k nearest neighbors
to determine the class label of the unknown record (e.g., by
taking a majority vote)
2/10/2021 Introduction to Data Mining, 2nd Edition 4
How to Determine the class label of a Test Sample?

[Slide content shown as a figure in the original.]
2/10/2021 Introduction to Data Mining, 2nd Edition 5

Choice of proximity measure matters

● For documents, cosine is better than correlation or Euclidean
111111111110 vs 011111111111
000000000001 vs 100000000000
Euclidean distance = 1.4142 for both pairs, but
the cosine similarity measure has different
values for these pairs.
2/10/2021 Introduction to Data Mining, 2nd Edition 6
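A quick check of the claim (a sketch, not from the slides): both pairs sit at Euclidean distance √2 ≈ 1.4142, yet their cosine similarities differ sharply.

```python
import numpy as np

a = np.array([1]*11 + [0]); b = np.array([0] + [1]*11)   # 111111111110 vs 011111111111
c = np.array([0]*11 + [1]); d = np.array([1] + [0]*11)   # 000000000001 vs 100000000000

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.linalg.norm(a - b), cosine(a, b))  # 1.4142..., 0.909 (10 of 11 terms shared)
print(np.linalg.norm(c - d), cosine(c, d))  # 1.4142..., 0.0   (no terms shared)
```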

Nearest Neighbor Classification…

● Data preprocessing is often required
– Attributes may have to be scaled to prevent distance
measures from being dominated by one of the attributes
◆ Example:
– height of a person may vary from 1.5m to 1.8m
– weight of a person may vary from 90lb to 300lb
– income of a person may vary from $10K to $1M
– Time series are often standardized to have zero
mean and a standard deviation of 1
2/10/2021 Introduction to Data Mining, 2nd Edition 7

Nearest Neighbor Classification…

● Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, the neighborhood may include points from
other classes
2/10/2021 Introduction to Data Mining, 2nd Edition 8
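A minimal sketch of the scaling fix (illustrative numbers throughout): without standardization, income in dollars swamps height in meters in the Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows: [height in m, income in $]; income dominates the raw Euclidean distance.
X = np.array([[1.5, 10_000.0],
              [1.8, 12_000.0],
              [1.6, 950_000.0]])

raw = np.linalg.norm(X[0] - X[1])            # ~2000: driven entirely by income
Xs = StandardScaler().fit_transform(X)       # each column: zero mean, unit variance
scaled = np.linalg.norm(Xs[0] - Xs[1])       # height now matters too
print(f"raw distance: {raw:.1f}   standardized distance: {scaled:.3f}")
```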
Nearest-neighbor classifiers

● Nearest neighbor classifiers are local classifiers
● They can produce decision boundaries of arbitrary shapes.
● The 1-NN decision boundary is a Voronoi Diagram
2/10/2021 Introduction to Data Mining, 2nd Edition 9

Nearest Neighbor Classification…

● How to handle missing values in training and test sets?
– Proximity computations normally require the
presence of all attributes
– Some approaches use the subset of attributes
present in two instances
◆ This may not produce good results since it
effectively uses different proximity measures for
each pair of instances
◆ Thus, proximities are not comparable
2/10/2021 Introduction to Data Mining, 2nd Edition 10

K-NN Classifiers… Handling Irrelevant and Redundant Attributes

– Irrelevant attributes add noise to the proximity measure
– Redundant attributes bias the proximity measure towards certain attributes
2/10/2021 Introduction to Data Mining, 2nd Edition 11

K-NN Classifiers: Handling attributes that are interacting

[Slide content shown as a figure in the original.]
2/10/2021 Introduction to Data Mining, 2nd Edition 12
Improving KNN Efficiency

● Avoid having to compute distance to all objects in the training set
– Multi-dimensional access methods (k-d trees)
– Fast approximate similarity search
– Locality Sensitive Hashing (LSH)
● Condensing
– Determine a smaller set of objects that give the same performance
● Editing
– Remove objects to improve efficiency
2/10/2021 Introduction to Data Mining, 2nd Edition 13

Rule-based Classification
2/10/2021 Introduction to Data Mining, 2nd Edition 14

Data Mining
Rule-Based Classifier
Classification: Alternative Techniques
● Classify records by using a collection of
Lecture Notes for Chapter 4 “if…then…” rules
● Rule: (Condition) → y
Rule-Based – where
◆ Condition is a conjunction of tests on attributes
◆ y is the class label
Introduction to Data Mining , 2nd Edition – Examples of classification rules:
by ◆ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
Tan, Steinbach, Karpatne, Kumar ◆ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

2/10/2021 Introduction to Data Mining, 2nd Edition 15 2/10/2021 Introduction to Data Mining, 2nd Edition 16
Rule-based Classifier (Example) Application of Rule-Based Classifier

● A rule r covers an instance x if the attributes of


the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
2/10/2021 Introduction to Data Mining, 2nd Edition 17 2/10/2021 Introduction to Data Mining, 2nd Edition 18

Rule Coverage and Accuracy

● Coverage of a rule:
– Fraction of records that satisfy the antecedent of a rule
● Accuracy of a rule:
– Fraction of records that satisfy the antecedent that
also satisfy the consequent of the rule
Example: (Status=Single) → No
Coverage = 40%, Accuracy = 50%
2/10/2021 Introduction to Data Mining, 2nd Edition 19

How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
2/10/2021 Introduction to Data Mining, 2nd Edition 20
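A minimal sketch of computing rule coverage and accuracy over a toy record set (the records are invented for illustration, chosen so the numbers match the 40%/50% example above):

```python
records = [  # (status, defaulted) — toy data, 10 records
    ("Single", "No"), ("Single", "No"), ("Single", "Yes"), ("Single", "Yes"),
    ("Married", "No"), ("Married", "No"), ("Married", "No"),
    ("Divorced", "No"), ("Divorced", "Yes"), ("Married", "No"),
]

# Rule: (Status = Single) -> No
covered = [r for r in records if r[0] == "Single"]
correct = [r for r in covered if r[1] == "No"]

coverage = len(covered) / len(records)   # fraction satisfying the antecedent
accuracy = len(correct) / len(covered)   # fraction of covered satisfying the consequent
print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")  # 40%, 50%
```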
Characteristics of Rule Sets: Strategy 1 Characteristics of Rule Sets: Strategy 2

● Mutually exclusive rules ● Rules are not mutually exclusive


– Classifier contains mutually exclusive rules if – A record may trigger more than one rule
the rules are independent of each other – Solution?
– Every record is covered by at most one rule ◆ Ordered rule set
◆ Unordered rule set – use voting schemes
● Exhaustive rules
– Classifier has exhaustive coverage if it ● Rules are not exhaustive
accounts for every possible combination of – A record may not trigger any rules
attribute values – Solution?
– Each record is covered by at least one rule ◆ Use a default class

2/10/2021 Introduction to Data Mining, 2nd Edition 21 2/10/2021 Introduction to Data Mining, 2nd Edition 22

Ordered Rule Set Rule Ordering Schemes

● Rules are rank ordered according to their priority ● Rule-based ordering


– An ordered rule set is known as a decision list – Individual rules are ranked based on their quality
● When a test record is presented to the classifier ● Class-based ordering
– It is assigned to the class label of the highest ranked rule it has – Rules that belong to the same class appear together
triggered
– If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) →
Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

2/10/2021 Introduction to Data Mining, 2nd Edition 23 2/10/2021 Introduction to Data Mining, 2nd Edition 24
Building Classification Rules

● Direct Method:
◆ Extract rules directly from data
◆ Examples: RIPPER, CN2, Holte's 1R
● Indirect Method:
◆ Extract rules from other classification models (e.g.
decision trees, neural networks, etc.).
◆ Examples: C4.5rules
2/10/2021 Introduction to Data Mining, 2nd Edition 25

Direct Method: Sequential Covering

1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion
is met (a small illustrative sketch of this loop follows below)
2/10/2021 Introduction to Data Mining, 2nd Edition 26
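A minimal sketch of the sequential-covering loop (illustrative throughout: the toy records, the greedy one-condition stand-in for Learn-One-Rule, and the stopping criterion are all assumptions, not the textbook's algorithm):

```python
# Toy records: (attribute dict, class label)
records = [
    ({"fly": "yes", "birth": "no"}, "bird"),
    ({"fly": "yes", "birth": "no"}, "bird"),
    ({"fly": "no", "birth": "yes"}, "mammal"),
    ({"fly": "no", "birth": "yes"}, "mammal"),
    ({"fly": "no", "birth": "no"}, "reptile"),
]

def learn_one_rule(data, target):
    """Greedy stand-in for Learn-One-Rule: pick the single (attr, val)
    condition with the highest accuracy for the target class."""
    best = None
    for attr in data[0][0]:
        for val in {r[0][attr] for r in data}:
            covered = [r for r in data if r[0][attr] == val]
            acc = sum(r[1] == target for r in covered) / len(covered)
            if best is None or acc > best[0]:
                best = (acc, attr, val)
    return best[1], best[2]

rules, remaining, target = [], list(records), "bird"
while any(r[1] == target for r in remaining):        # until stopping criterion
    attr, val = learn_one_rule(remaining, target)    # step 2: grow a rule
    rules.append((attr, val, target))
    remaining = [r for r in remaining if r[0][attr] != val]  # step 3: remove covered
print(rules)  # e.g. [('fly', 'yes', 'bird')]
```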

Example of Sequential Covering Example of Sequential Covering…

2/10/2021 Introduction to Data Mining, 2nd Edition 27 2/10/2021 Introduction to Data Mining, 2nd Edition 28
Rule Growing

● Two common strategies
2/10/2021 Introduction to Data Mining, 2nd Edition 29

Rule Evaluation

● FOIL (First Order Inductive Learner) – an early
rule-based learning algorithm
2/10/2021 Introduction to Data Mining, 2nd Edition 30

Direct Method: RIPPER Direct Method: RIPPER

● For 2-class problem, choose one of the classes as ● Growing a rule:


positive class, and the other as negative class – Start from empty rule
– Learn rules for positive class – Add conjuncts as long as they improve FOIL’s
– Negative class will be default class information gain
– Stop when rule no longer covers negative examples
● For multi-class problem
– Prune the rule immediately using incremental reduced
– Order the classes according to increasing class error pruning
prevalence (fraction of instances that belong to a
– Measure for pruning: v = (p-n)/(p+n)
particular class)
◆ p: number of positive examples covered by the rule in
– Learn the rule set for smallest class first, treat the rest the validation set
as negative class ◆ n: number of negative examples covered by the rule in
the validation set
– Repeat with next smallest class as positive class
– Pruning method: delete any final sequence of
conditions that maximizes v
2/10/2021 Introduction to Data Mining, 2nd Edition 31 2/10/2021 Introduction to Data Mining, 2nd Edition 32
Direct Method: RIPPER Direct Method: RIPPER

● Building a Rule Set: ● Optimize the rule set:


– Use sequential covering algorithm – For each rule r in the rule set R
◆ Finds the best rule that covers the current set of ◆ Consider 2 alternative rules:
positive examples – Replacement rule (r*): grow new rule from scratch
◆ Eliminate both positive and negative examples – Revised rule(r′): add conjuncts to extend the rule r
covered by the rule ◆ Compare the rule set for r against the rule set for r*
and r′
– Each time a rule is added to the rule set,
◆ Choose rule set that minimizes MDL principle (see
compute the new description length
next slide)
◆ Stop adding new rules when the new description
length is d bits longer than the smallest description – Repeat rule generation and rule optimization
length obtained so far for the remaining positive examples

2/10/2021 Introduction to Data Mining, 2nd Edition 33 2/10/2021 Introduction to Data Mining, 2nd Edition 34

What is MDL (Minimum Description Length)? Indirect Methods

● The minimum description length (MDL) principle


– Given a limited set of observed data, the best
explanation is the one that allows for the maximum
compression of the data.
● The minimum description length (MDL) principle is
very important in ML, pattern recognition, inductive
inference and statistical modeling.
● Used in scenarios where the models are very
complex, and overfitting is a serious issue
Source:
https://mitpress.mit.edu/9780262529631/the-minimum-description-length-principle/#:~:text=T
he%20minimum%20description%20length%20(MDL)%20principle%20is%20a%20powerful
%20method,greatest%20compression%20of%20the%20data.
2/10/2021 Introduction to Data Mining, 2nd Edition 35 2/10/2021 Introduction to Data Mining, 2nd Edition 36
Indirect Method: C4.5rules Indirect Method: C4.5rules

● Extract rules from an unpruned decision tree ● Instead of ordering the rules, order subsets of
● For each rule, r: A → y, rules (class ordering)
– consider an alternative rule r′: A′ → y where A′ – Each subset is a collection of rules with the
is obtained by removing one of the conjuncts same rule consequent (class)
in A
– Compare the pessimistic error rate for r
against all r’s
– Prune if one of the alternative rules has lower
pessimistic error rate
– Repeat until we can no longer improve
generalization error

2/10/2021 Introduction to Data Mining, 2nd Edition 37 2/10/2021 Introduction to Data Mining, 2nd Edition 38

Example C4.5 versus C4.5rules versus RIPPER


C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No)
→ Reptiles
(Can Fly=Yes,Give Birth=No) → Birds
() → Mammals

2/10/2021 Introduction to Data Mining, 2nd Edition 39 2/10/2021 Introduction to Data Mining, 2nd Edition 40
C4.5 versus C4.5rules versus RIPPER Advantages of Rule-Based Classifiers

C4.5 and C4.5rules: ● Has characteristics quite similar to decision trees


– As highly expressive as decision trees
– Easy to interpret (if rules are ordered by class)
– Performance comparable to decision trees
◆Can handle redundant and irrelevant attributes
RIPPER: ◆ Variable interaction can cause issues (e.g., X-OR problem)
● Better suited for handling imbalanced classes
● Harder to handle missing values in the test set

2/10/2021 Introduction to Data Mining, 2nd Edition 41 2/10/2021 Introduction to Data Mining, 2nd Edition 42

Data Mining
Classification: Alternative Techniques

Imbalanced Class Problem


Imbalanced Class Problem
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar

2/10/2021 Introduction to Data Mining, 2nd Edition 43 2/10/2021 Introduction to Data Mining, 2nd Edition 44
Class Imbalance Problem

● Lots of classification problems where the classes are skewed
(more records from one class than another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
– COVID-19 test results on a random sample
● Key Challenge:
– Evaluation measures such as accuracy are not
well-suited for imbalanced classes
2/10/2021 Introduction to Data Mining, 2nd Edition 45

Confusion Matrix

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     a         b
CLASS  Class=No      c         d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
2/10/2021 Introduction to Data Mining, 2nd Edition 46

Accuracy

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes   a (TP)    b (FN)
CLASS  Class=No    c (FP)    d (TN)

● Most widely-used metric (the formula appeared as a figure):
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + FN + FP + TN)
2/10/2021 Introduction to Data Mining, 2nd Edition 47

Problem with Accuracy

● Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
● If a model predicts everything to be class NO, accuracy is
990/1000 = 99%
– This is misleading because this trivial model does not detect any class
YES example
– Detecting the rare class is usually more interesting (e.g., frauds,
intrusions, defects, etc.)

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     0        10
CLASS  Class=No      0       990
2/10/2021 Introduction to Data Mining, 2nd Edition 48
Which model is better?

A:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes     0        10
       Class=No      0       990
Accuracy: 99%

B:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
       Class=No    500       490
Accuracy: 50%
2/10/2021 Introduction to Data Mining, 2nd Edition 49

Which model is better?

A:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes     5         5
       Class=No      0       990

B:                PREDICTED
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
       Class=No    500       490
2/10/2021 Introduction to Data Mining, 2nd Edition 50
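To see why accuracy alone cannot rank the two models on the first slide above, here is a minimal sketch (counts taken from that slide; the precision/recall/F1 definitions are the standard ones, added for illustration):

```python
def metrics(tp, fn, fp, tn):
    acc = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, precision, recall, f1

for name, cm in {"A": (0, 10, 0, 990), "B": (10, 0, 500, 490)}.items():
    acc, p, r, f1 = metrics(*cm)
    print(f"{name}: accuracy={acc:.2f} precision={p:.3f} recall={r:.2f} F1={f1:.3f}")
# A: accuracy=0.99 precision=0.000 recall=0.00 F1=0.000  (never finds the rare class)
# B: accuracy=0.50 precision=0.020 recall=1.00 F1=0.038  (finds every rare case)
```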

Alternative Measures

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     a         b
CLASS  Class=No      c         d
2/10/2021 Introduction to Data Mining, 2nd Edition 51

Alternative Measures

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
CLASS  Class=No     10       980
2/10/2021 Introduction to Data Mining, 2nd Edition 52
Alternative Measures

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    10         0
CLASS  Class=No     10       980

                  PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes     1         9
CLASS  Class=No      0       990
2/10/2021 Introduction to Data Mining, 2nd Edition 53

Which of these classifiers is better?

A:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No     10        40

B:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No   1000      4000
2/10/2021 Introduction to Data Mining, 2nd Edition 54

Measures of Classification Performance

            PREDICTED CLASS
              Yes     No
ACTUAL  Yes   TP      FN
CLASS   No    FP      TN

α is the probability that we reject the null hypothesis when it is
true. This is a Type I error or a false positive (FP).
β is the probability that we accept the null hypothesis when it is
false. This is a Type II error or a false negative (FN).
2/10/2021 Introduction to Data Mining, 2nd Edition 55

Alternative Measures

A:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No     10        40

B:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No   1000      4000
2/10/2021 Introduction to Data Mining, 2nd Edition 56
Which of these classifiers is better?

A:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    10        40
CLASS  Class=No     10        40

B:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    25        25
CLASS  Class=No     25        25

C:                PREDICTED CLASS
                Class=Yes  Class=No
ACTUAL Class=Yes    40        10
CLASS  Class=No     40        10
2/10/2021 Introduction to Data Mining, 2nd Edition 57


Clustering_Part2
Quick recap on clustering and its
applications

Acknowledgments: The slides in this presentation are mostly the textbook slides
from the textbook “Data Mining: Concepts and Techniques” by Jiawei Han,
Micheline Kamber, and Jian Pei. The slides have been modified i.e., some new
slides have been added and some slides have been deleted. Information in this
presentation has also been obtained from a wide variety of publicly available
Internet sources.
1
Han/Eick: Clustering II 2

General Applications of Clustering

■ Pattern Recognition
■ Spatial Data Analysis
■ create thematic maps in GIS by clustering feature spaces
■ detect spatial clusters and explain them in spatial data mining
■ Image Processing
■ Economic Science (especially market research)
■ WWW
■ Document classification
■ Cluster Weblog data to discover groups of similar access patterns
Han/Eick: Clustering II 3

Examples of Clustering Applications

■ Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
■ Land use: Identification of areas of similar land use in an
earth observation database
■ Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
■ City-planning: Identifying groups of houses according to
their house type, value, and geographical location
■ Earthquake studies: Observed earthquake epicenters
should be clustered along continent faults
Han/Eick: Clustering II 4


What Is Good Clustering?

■ A good clustering method will produce high-quality clusters with
■ high intra-class similarity
■ low inter-class similarity
■ The quality of a clustering result depends on both the
similarity measure used by the method and its implementation.
■ The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Han/Eick: Clustering II 5

Requirements of Clustering in Data Mining

■ Scalability
■ Ability to deal with different types of attributes
■ Discovery of clusters with arbitrary shape
■ Minimal requirements for domain knowledge to determine input parameters
■ Able to deal with noise and outliers
■ Insensitive to order of input records
■ High dimensionality
■ Incorporation of user-specified constraints
■ Interpretability and usability
Han/Eick: Clustering II 6

Conceptual note on REALLY learning


algorithms

■ In this course, you have seen several


algorithms for clustering & classification
Conceptual notes ■ What is more important?
A) Memorizing the key steps of an algorithm
B) Understanding the core idea of an algorithm

Han/Eick: Clustering II 7 Han/Eick: Clustering II 8


Conceptual note on REALLY learning algorithms

■ In this course, you have seen several algorithms for clustering & classification
■ What is more important?
A) Memorizing the key steps of an algorithm
B) Understanding the core idea of an algorithm
■ B) is the correct answer.
■ If you memorize just the steps of an algorithm, you won't remember them for very long
■ And more importantly, you won't know how and where to apply it
■ Focus on understanding, not memorizing

Han/Eick: Clustering II 9 Han/Eick: Clustering II 10

■ You should understand that most of these algorithms
were developed to solve specific problems
■ Hence, when you see a new problem, simply using one
algorithm may not be effective
■ Need to combine ideas from different algorithms
■ As technology progresses and technological
environments keep changing, you would have to make
some modifications to these existing algorithms to
make them applicable to solving current problems

■ Example: The classical algorithms you studied were
proposed more than a decade ago
■ At that time, main memory, computational power, etc.
were not as good as they are now
■ But now a lot of data management work, including ML, will
occur on mobile platforms, whereas those algorithms
were designed for fixed computing infrastructures such
as centralized systems and clusters
■ Mobile platforms mean energy constraints, mobility
issues, network partitioning issues, connectivity,
distributed and autonomous settings
■ Hence, you would need to make changes to all these
algorithms to apply them to these new settings

Han/Eick: Clustering II 11 Han/Eick: Clustering II 12


Scalability issue

■ For solving real-world ML problems, you need ■ As you all know by now, datasets can be VERY
to understand which mix of ideas from different LARGE
algorithms you want to use ■ The usual techniques used for dealing in a
■ This is a judgment call, hence some amount scalable manner with VERY LARGE datasets
of thinking is required ■ Sampling

■ Partitioning (divide and conquer)

Han/Eick: Clustering II 13 Han/Eick: Clustering II 14

Examples for how to combine ideas from Examples for how to combine ideas from
different algorithms different algorithms

■ Suppose you need to cluster a very large ■ Suppose you need to cluster a very large
spatial dataset spatial dataset
■ You could first use a grid-based approach by
imposing a grid structure on the dataset
■ Then run any clustering algorithm within
each grid
■ You can define an epsilon factor to take
“fringe” objects into consideration
■ You could cluster for different grids in
parallel → faster execution time
■ This is essentially divide and conquer.

Han/Eick: Clustering II 15 Han/Eick: Clustering II 16


Examples for how to combine ideas from Examples for how to combine ideas from
different algorithms different algorithms

■ Suppose you need to cluster a very large ■ Instead of random sampling, can you use some
spatial dataset other sampling approach?
■ Yes!
■ You could use sampling

■ Instead of clustering the whole dataset, you


■ Better way of selecting samples would likely
lead to better clustering results
can select a sample of k representative
■ Can use domain knowledge to select “good”
points and then run any clustering algorithm
samples
only on these k points
■ Can use stratified sampling
■ Larger sample size generally leads to better
■ Can select the sample points as far away
accuracy from each other as possible
Han/Eick: Clustering II 17 Han/Eick: Clustering II 18

Examples for how to combine ideas from Examples for how to combine ideas from
different algorithms different algorithms

■ Suppose you need to cluster a very large ■ Suppose you need to cluster a very large
spatial dataset spatial dataset
■ You could simply use domain knowledge to
■ You could use hierarchical agglomerative

clustering on a sample of representative figure out which areas are dense and which
areas are sparse
points
■ Then use any clustering algorithm in each
■ Then use any clustering algorithm at the
dense area
desired level of the hierarchy ■ The sparse areas could be combined (if
appropriate) and then you can use any
clustering algorithm on the merged areas

Han/Eick: Clustering II 19 Han/Eick: Clustering II 20


How to cluster a dataset that has plenty Can you combine ideas from clustering
of dimensions? with association rule mining algorithms?

■ First figure out which dimensions are most relevant to ■ Suppose you need to do some data analysis on
your analysis
40,000 different items in a supermarket
■ Look at the question that you are trying to answer,
and you will know which dimensions are most ■ You could first run a clustering algorithm to
relevant to that question break those items into clusters
■ Can use any dimensionality reduction technique
■ Now on each cluster, you can do your
■ Doing clustering in high-dimensional space may give
you results, whose significance is hard to interpret analysis (this is more like divide and
■ At very high dimensions, similarity/dissimilarity
conquer)
among the points (objects) may get blurred
■ Bottomline: Do the clustering only on the dimensions
that are relevant to your analysis

Han/Eick: Clustering II 21 Han/Eick: Clustering II 22

Can you combine ideas from clustering


with association rule mining algorithms?

■ Suppose you need to do some data analysis on


40,000 different items in a supermarket
■ You could select a set of representative items
Broad Overview of clustering
first (could be based on domain knowledge algorithms
or could be found by clustering)
■ Then run your data analysis algorithm(s) only

on the representative items


■ Note that this approach is conceptually similar
to the approach on the previous slide.

Han/Eick: Clustering II 23 Han/Eick: Clustering II 24


Major Clustering Approaches

■ Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
■ Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
■ Grid-based: based on a multiple-level granularity structure
■ Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of that model to
each other
■ Density-based: based on connectivity and density functions
Han/Eick: Clustering II 25

Major clustering approaches

■ Note: The first three approaches to clustering
(and especially using a permutation and
combination of ideas from these approaches)
are more than adequate to deal with almost all
kinds of clustering problems you can think of
■ Can you think of any real-world clustering
problem that cannot be solved by applying
ideas from the first three approaches?
■ In fact, most clustering problems can be solved
by using partitioning approaches like k-means.
■ K-means is among the most popular
clustering algorithms!
Han/Eick: Clustering II 26

Partitioning algorithms

Partitioning Algorithms: Basic Concept
■ Quick recap
■ Heuristic methods: k-means and k-medoids algorithms
■ k-means: Each cluster is represented by the center of the cluster
■ k-medoids or PAM (Partition around medoids): Each
cluster is represented by one of the objects in the cluster

Han/Eick: Clustering II 27 Han/Eick: Clustering II 28


The K-Medoids Clustering Method

■ Find representative objects, called medoids, in clusters
■ PAM (Partitioning Around Medoids, 1987)
■ starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the
non-medoids if it improves the total distance of the
resulting clustering
■ PAM works effectively for small data sets, but does not
scale well for large data sets
■ CLARA (Kaufmann & Rousseeuw, 1990)
■ CLARANS (Ng & Han, 1994): Randomized sampling
Han/Eick: Clustering II 29

Using sampling

■ PAM does not scale well to large data sets
■ Can you use sampling?
■ YES!
■ Now we will discuss two variations of PAM
which use sampling
Han/Eick: Clustering II 30

CLARA (Clustering Large Applications) (1990)

■ CLARA (Kaufmann and Rousseeuw in 1990)
■ Built into statistical analysis packages, such as S+
■ It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
■ Strength: deals with larger data sets than PAM
■ Weakness:
■ Efficiency depends on the sample size
■ A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
Han/Eick: Clustering II 31

CLARANS ("Randomized" CLARA) (1994)

■ CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han'94)
■ CLARANS draws a sample of neighbors dynamically
■ The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
■ If the local optimum is found, CLARANS starts with a new
randomly selected node in search for a new local optimum
■ It is more efficient and scalable than both PAM and CLARA
Han/Eick: Clustering II 32
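A minimal CLARA-style sketch of the sample-then-cluster idea (an illustration only, not the original algorithm: it uses k-means as a stand-in partitioner since PAM is not in scikit-learn, and the sample size and number of samples are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)
rng = np.random.default_rng(0)

best_model, best_cost = None, np.inf
for _ in range(5):                                   # CLARA: several independent samples
    sample = X[rng.choice(len(X), size=1000, replace=False)]
    model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample)
    # Score each sample's clustering on the FULL data set; keep the best.
    cost = ((X - model.cluster_centers_[model.predict(X)]) ** 2).sum()
    if cost < best_cost:
        best_model, best_cost = model, cost

print(f"best total cost over 5 samples: {best_cost:.3e}")
```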
Hierarchical algorithms

Hierarchical Clustering
■ Quick recap of hierarchical clustering
■ Use distance matrix as clustering criteria.
■ This method does not require the number of clusters k
as an input, but needs a termination condition
■ Please revise from the lecture slides on hierarchical
clustering, especially using the hierarchical clustering
animation for the agglomerative case.

Han/Eick: Clustering II 33 Han/Eick: Clustering II 34

Remarks on Hierarchical Clustering Methods

■ Major weakness of agglomerative clustering methods
■ do not scale well: time complexity of at least O(n²),
where n is the number of total objects
■ can never undo what was done previously
■ Integration of hierarchical with distance-based clustering
■ BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
■ CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of the
cluster by a specified fraction
■ CHAMELEON (1999): hierarchical clustering using
dynamic modeling
Han/Eick: Clustering II 35

Remarks on Hierarchical Clustering Methods

■ Use hierarchical clustering to obtain a "rough cut"
■ That is, don't apply hierarchical clustering on a
very LARGE dataset because hierarchical
clustering algorithms are generally not very scalable
■ Look at the hierarchical clustering animation and you
will understand the reason for this
■ Since hierarchical clustering is such that undo is
not possible, your initial set on which you want to
do the clustering must ensure that no undo would
actually be required
Han/Eick: Clustering II 36
Remarks on Hierarchical Clustering Methods

■ Once you have obtained a "rough cut" by doing
hierarchical clustering, you can use any
partitioning approach to do clustering at a finer
granularity

Grid-based clustering algorithms

Han/Eick: Clustering II 37 Han/Eick: Clustering II 38

Steps of Grid-based Clustering Algorithms

Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and
compute the density of each cell.
3. Eliminate cells whose density is below a
certain threshold τ.
4. Form clusters from contiguous (adjacent)
groups of dense cells (usually minimizing a
given objective function)
Han/Eick: Clustering II 39

Advantages of Grid-based Clustering Algorithms

■ fast:
■ No distance computations
■ Clustering is performed on summaries and not
individual objects; complexity is usually
O(#-populated-grid-cells) and not O(#objects)
■ Easy to determine which clusters are neighboring
■ Shapes are limited to unions of grid-cells
Han/Eick: Clustering II 40
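A minimal sketch of the four steps on 2-D data (illustrative assumptions: the grid resolution, the threshold τ = 5, and 4-connectivity as the meaning of "contiguous"), using scipy's connected-component labeling to merge adjacent dense cells:

```python
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(0)
# Two blobs plus background noise.
pts = np.vstack([rng.normal((2, 2), 0.4, (200, 2)),
                 rng.normal((7, 7), 0.4, (200, 2)),
                 rng.uniform(0, 10, (50, 2))])

# Steps 1-2: define grid cells and compute per-cell density (counts).
density, xe, ye = np.histogram2d(pts[:, 0], pts[:, 1], bins=20, range=[[0, 10], [0, 10]])

# Step 3: eliminate cells below threshold tau.
tau = 5
dense = density >= tau

# Step 4: form clusters from contiguous groups of dense cells (4-connectivity).
clusters, n_clusters = label(dense)
print(f"found {n_clusters} grid-based clusters")  # likely 2 for this synthetic data
```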


Grid-Based Clustering Methods

■ Using multi-resolution grid data structure
■ Clustering complexity depends on the number of
populated grid cells and not on the number of objects in
the dataset
■ Several interesting methods (in addition to the basic
grid-based algorithm)
■ STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
Han/Eick: Clustering II 41

STING: A Statistical Information Grid Approach

■ Wang, Yang and Muntz (VLDB'97)
■ The spatial area is divided into rectangular cells
■ There are several levels of cells corresponding to different
levels of resolution
Han/Eick: Clustering II 42

STING: A Statistical Information Grid Approach (2)

■ Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
■ Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
■ Parameters of higher-level cells can be easily calculated from
parameters of lower-level cells
■ count, mean, s (standard deviation), min, max
■ type of distribution—normal, uniform, etc.
■ Use a top-down approach to answer spatial data queries
Han/Eick: Clustering II 43

STING: A Statistical Information Grid Approach (3)

■ Advantages:
■ Query-independent, easy to parallelize, incremental update
■ O(K), where K is the number of grid cells at the lowest level
■ Disadvantages:
■ All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
Han/Eick: Clustering II 44


Model-based clustering algorithms

Self-organizing feature maps (SOMs)
■ Clustering is also performed by having several units
competing for the current object
■ The unit whose weight vector is closest to the current object wins
■ The winner and its neighbors learn by having their weights adjusted
■ SOMs are believed to resemble processing that can occur in the brain
■ Useful for visualizing high-dimensional data in 2- or 3-D space

Han/Eick: Clustering II 45 Han/Eick: Clustering II 46

Model-Based Clustering Methods COBWEB Clustering Method


■ Attempt to optimize the fit between the data and some
mathematical model A classification tree
■ Statistical and AI approach
■ Conceptual clustering

■ A form of clustering in machine learning


■ Produces a classification scheme for a set of unlabeled objects
■ Finds characteristic description for each concept (class)
■ COBWEB (Fisher’87)
■ A popular and simple method of incremental conceptual learning
■ Creates a hierarchical clustering in the form of a classification
tree
■ Each node refers to a concept and contains a probabilistic
description of that concept

Han/Eick: Clustering II 47 Han/Eick: Clustering II 48


More on Statistical-Based Clustering

■ Limitations of COBWEB
■ The assumption that the attributes are independent
of each other is often too strong because correlation
may exist
■ Not suitable for clustering large database data –
skewed tree and expensive probability distributions
Han/Eick: Clustering II 49

Outlier discovery
Han/Eick: Clustering II 50

What Is Outlier Discovery?

■ What are outliers?
■ The set of objects that are considerably dissimilar from
the remainder of the data
■ Example: Sports: Michael Jordan, Wayne Gretzky, ...
■ Problem
■ Find top n outlier points
■ Applications:
■ Credit card fraud detection
■ Telecom fraud detection
■ Customer segmentation
■ Medical analysis
Han/Eick: Clustering II 51

Outlier Discovery: Statistical Approaches

● Assume a model of the underlying distribution that generates the
data set (e.g. normal distribution)
■ Use discordancy tests depending on
■ data distribution
■ distribution parameter (e.g., mean, variance)
■ number of expected outliers
■ Drawbacks
■ most tests are for a single attribute
■ In many cases, the data distribution may not be known
Han/Eick: Clustering II 52
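A minimal sketch of a statistical discordancy test under an assumed normal model (the 3σ threshold and the planted data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 1000), [95.0, 8.0]])  # two planted outliers

z = (x - x.mean()) / x.std()          # standardize under the assumed normal model
outliers = x[np.abs(z) > 3]           # discordancy test: |z| > 3
print("flagged outliers:", np.sort(outliers))
```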


Outlier Discovery: Deviation-Based Approach

■ Identifies outliers by examining the main characteristics
of objects in a group
■ Objects that "deviate" from this description are considered outliers
■ Sequential exception technique
■ simulates the way in which humans can distinguish
unusual objects from among a series of supposedly like objects
■ OLAP data cube technique
■ uses data cubes to identify regions of anomalies in
large multidimensional data
Han/Eick: Clustering II 53

Summary

■ Cluster analysis groups objects based on their similarity
and has wide applications
■ Measure of similarity can be computed for various types of data
■ Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
■ Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
■ There are still lots of research issues on cluster analysis,
such as constraint-based clustering
Han/Eick: Clustering II 54

Problems and Challenges

■ Considerable progress has been made in scalable clustering methods
■ Partitioning: k-means, k-medoids, CLARANS
■ Hierarchical: BIRCH, CURE
■ Density-based: DBSCAN, CLIQUE, OPTICS
■ Grid-based: STING, WaveCluster
■ Model-based: Autoclass, Denclue, Cobweb
■ Current clustering techniques do not address all the
requirements adequately
■ Constraint-based clustering analysis: Constraints exist in
data space (bridges and highways) or in user queries
Han/Eick: Clustering II 55

References (1)

■ R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD'98.
■ M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
■ M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify
the clustering structure. SIGMOD'99.
■ P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
■ M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
■ M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
■ D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning,
2:139-172, 1987.
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based
on dynamic systems. In Proc. VLDB'98.
■ S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
■ A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
Han/Eick: Clustering II 56


References (2)
■ L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
■ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
■ G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
■ P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
■ R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
■ E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
■ G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
■ W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial
Data Mining, VLDB’97.
■ T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method
for very large databases. SIGMOD'96.

Han/Eick: Clustering II 57
