Introduction to Support Vector Machines
Thanks: Andrew Moore (CMU) and Martin Law (Michigan State University)
Linear Classifiers
A linear classifier maps an input x to an estimate y_est via
f(x, w, b) = sign(w · x − b)
(Figure: training points, with the legend denoting the two classes as +1 and −1, and several candidate separating lines.)
…but which separating line is best?
Classifier Margin
f(x, w, b) = sign(w · x − b)
Define the margin of a linear classifier as the width by which the decision boundary could be increased before hitting a datapoint.

Maximum Margin
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM): the Linear SVM.
Maximum Margin / Why Maximum Margin?
f(x, w, b) = sign(w · x + b)
Support Vectors are the datapoints that the margin pushes up against. The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM): the Linear SVM.
How to calculate the distance from a point to a line? / Estimate the Margin
The decision boundary is the line w · x + b = 0, where
  x – data point (vector)
  w – normal vector to the boundary
  b – offset (scale value)
(Figure: the two classes, the boundary w · x + b = 0, and the margin width m between the boundary and the closest point of Class 1.)
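A small numeric illustration (added here, not part of the original slides) of the distance implied by this figure: for a boundary w · x + b = 0, the distance from a point x to the boundary is |w · x + b| / ‖w‖, and the margin is the smallest such distance over the data. The numbers below are made up.

import numpy as np

# Hypothetical boundary w.x + b = 0 and a few points (illustrative values only).
w = np.array([2.0, 1.0])
b = -4.0
points = np.array([[3.0, 1.0], [0.5, 0.5], [2.0, 0.0]])

# Distance of each point to the hyperplane: |w.x + b| / ||w||
dists = np.abs(points @ w + b) / np.linalg.norm(w)
print(dists)

# The geometric margin of this separator on these points is the smallest distance.
print("margin estimate:", dists.min())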
Next step…
■ Converting SVM to a form we can solve
■ Dual form
■ Allowing a nonlinear boundary

Optional: The Dual Problem (we ignore the derivation)
■ The new objective function is in terms of the αi only
■ It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
■ The objective function of the dual problem needs to be maximized (subject to the constraints on the αi)
A Geometrical Interpretation
(Figure: the two classes, with the support vectors on the margin carrying non-zero multipliers, e.g. α1=0.8, α6=1.4, α8=0.6, and all other αi=0.)

Allowing errors in our solutions
■ We allow "error" ξi in classification; it is based on the output of the discriminant function wᵀx + b
■ ξi approximates the number of misclassified samples
More on Kernel Functions
■ Not all similarity measures can be used as a kernel function, however
■ The kernel function needs to satisfy the Mercer condition, i.e., the function is "positive semi-definite"
■ This implies that the n by n kernel matrix, in which the (i,j)-th entry is K(xi, xj), is always positive semi-definite
■ This also means that the optimization problem can be solved in polynomial time!

Examples of Kernel Functions
■ Polynomial kernel with degree d: K(x, y) = (x · y + 1)^d
■ Radial basis function kernel with width σ: K(x, y) = exp(−‖x − y‖² / (2σ²))
  ■ Closely related to radial basis function neural networks
  ■ The feature space is infinite-dimensional
■ Sigmoid with parameters κ and θ: K(x, y) = tanh(κ x · y + θ)
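To make these kernels concrete, here is a brief NumPy sketch (not from the original slides) evaluating the three kernels listed above on two example vectors; the values of d, sigma, kappa and theta are arbitrary illustrative choices.

import numpy as np

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (x . y + 1)^d
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=0.5, theta=-1.0):
    # K(x, y) = tanh(kappa * x . y + theta)
    return np.tanh(kappa * np.dot(x, y) + theta)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))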
Example
■ By using a QP solver, we get α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
■ Note that the constraints are indeed satisfied
■ The support vectors are {x2=2, x4=5, x5=6}
■ The discriminant function follows (figure: the value of the discriminant function plotted over the input domain)
Software
■ A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
■ Some implementations (such as LIBSVM) can handle multi-class classification
■ SVMLight is among the earliest implementations of SVM
■ Several Matlab toolboxes for SVM are also available

Summary: Steps for Classification
■ Prepare the pattern matrix
■ Select the kernel function to use
■ Select the parameters of the kernel function and the value of C
  ■ You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
■ Execute the training algorithm and obtain the αi
■ Unseen data can be classified using the αi and the support vectors (a small sketch of these steps follows below)
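As a hedged illustration of the steps above (not from the original slides), using scikit-learn's SVC, which wraps LIBSVM; the kernel, C and gamma values here are arbitrary and would normally be chosen on a validation set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Prepare the pattern matrix (toy data for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2-3. Select the kernel function, its parameter(s), and the value of C.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)

# 4. Execute the training algorithm (the alpha_i are obtained internally).
clf.fit(X_train, y_train)

# 5. Unseen data are classified using the alphas and the support vectors.
print("number of support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:", clf.score(X_test, y_test))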
Conclusion
■ SVM is a useful alternative to neural networks
■ Two key concepts of SVM: maximize the margin and the kernel trick

Resources
■ http://www.kernel-machines.org/
■ http://www.support-vector.net/
Appendix: Distance from a point to a line
■ Let P1=(x1,y1) and P2=(x2,y2) define the line, and let P3=(x3,y3) be the point
■ The foot of the perpendicular is P = P1 + u(P2 − P1) for some scalar u
■ The vector (P2 − P1) is orthogonal to (P3 − P): that is, (P3 − P) · (P2 − P1) = 0

Distance and margin
■ The distance between the point P3 and the line is therefore the distance between P = (x, y) above and P3
■ Thus, d = |P3 − P|
Linear Regression: Part 2 — Introduction to Linear Regression (cont.)
The big picture of MLE
• Take a look at the data in detail
• Try to figure out / determine a model for how the data could have been created
• Assign values to the parameters of the model such that the likelihood of the parameters is maximized w.r.t. the data

Now let's delve a bit deeper into what constitutes a model.

What is a model?
• A model is a way to represent your beliefs, assumptions, etc. about how some event or process works. It is a formal way to represent how you view that event or process.
• Are models a perfect representation of the real world? Generally not.
• Models are usually approximate because real-world scenarios are hard to model perfectly.
• You can have simple models to represent an event or a process, or you can have models of much higher complexity to represent the exact same event or process.
(Source: towardsdatascience) The model assigns a probability density to each data point; we need to maximize this likelihood to obtain the best-fit model for our data.
Regression
Wish to learn f: X → Y, where Y is real-valued, given training data {<x1,y1>, …, <xn,yn>}
Approach:
1. Choose some parameterized form for P(Y|X; θ) (θ is the vector of parameters)
2. Derive the learning algorithm as the MLE or MAP estimate for θ

Choose a parameterized form for P(Y|X; θ)
Assume Y is some deterministic f(X), plus random noise ε ~ N(0, σ²); therefore Y is a random variable that follows the distribution N(f(X), σ²)
Example:
Consider developing a model to forecast a company's stock price. You noticed that the stock price rose significantly overnight. There could be a variety of causes for it. Maximum Likelihood Estimation seeks to determine the most likely cause. This idea is applied, among other things, to satellite imaging, MRIs, and economics.

Notation: to make our parameters explicit, let's write the conditional distribution as P(Y|X; θ).

MLE can be defined as a method for estimating population parameters (such as the mean and variance for a Normal distribution, the rate λ for a Poisson distribution, etc.) from sample data, such that the probability (likelihood) of obtaining the observed data is maximized.
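A minimal sketch (added here, not in the original slides) of that definition for a Normal model: the sample mean and the biased sample variance are exactly the values of μ and σ² that maximize the likelihood of the observed data.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # observed sample (simulated)

# MLE for a Normal(mu, sigma^2) model has a closed form:
mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()   # note: divides by n, not n-1

print("MLE mean:", mu_hat, "MLE variance:", sigma2_hat)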
Training Linear Regression
(The derivation on these slides was lost in extraction: it writes out the conditional log-likelihood, so that maximizing it under the Gaussian noise assumption reduces to minimizing the sum of squared errors, where the sum runs over the training examples.)
Learn the Maximum Conditional Likelihood Estimate
Can we derive a gradient descent rule for training?

• MLE is powerful when you have enough data. However, it doesn't work well when the observed data size is small. For example, if Liverpool had played only 2 matches and won both, then the value of Ɵ estimated by MLE is 2/2 = 1. That estimate says Liverpool wins 100% of the time, which is unrealistic. MAP can help deal with this issue.
• Assume we have prior knowledge that Liverpool's winning percentage over the past few seasons was around 50%. Then, even without the data from this season, we already have some idea of the potential value of Ɵ. Based (only) on the prior knowledge, the value of Ɵ is most likely to be 0.5, and less likely to be 0 or 1.
• In other words, the probability of Ɵ = 0.5 is higher than that of Ɵ = 0 or 1. We call this the prior probability P(Ɵ), and we can visualise it as a distribution peaked around 0.5.

Regression – key points
Under general assumptions:
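A hedged numeric sketch (not from the slides) of the Liverpool example: with a Beta(a, b) prior centred near 0.5, the MAP estimate of Ɵ after 2 wins in 2 matches is pulled back towards the prior instead of jumping to 1. The prior strength a = b = 5 is an illustrative assumption.

# MAP vs MLE for a Bernoulli parameter theta (win probability).
wins, matches = 2, 2

# MLE: wins / matches
theta_mle = wins / matches

# MAP with a Beta(a, b) prior; a = b = 5 encodes "around 50%" prior belief.
a, b = 5, 5
theta_map = (wins + a - 1) / (matches + a + b - 2)

print("MLE:", theta_mle)   # 1.0
print("MAP:", theta_map)   # (2 + 4) / (2 + 8) = 0.6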
• A partial correlation measures the relationship between two variables (X and Y) while eliminating the influence of a third variable (Z).
• Partial correlations are used to reveal the real, underlying relationship between two variables when researchers suspect that the apparent relation may be distorted by a third variable.
• For example, there probably is no underlying relationship between weight and mathematics skill for elementary school children.
• However, both of these variables are positively related to age: older children weigh more and, because they have spent more years in school, have higher mathematics skills.
Partial Correlation
• As a result, weight and mathematics skill will show a positive correlation for a sample of children that includes several different ages.
• A partial correlation between weight and mathematics skill, holding age constant, would eliminate the influence of age and show the true correlation, which is near zero.

Error Minimization
• "Minimization" and "Maximization" are both forms of "Optimization"
• Given a function f(x) [in our case, the mean-squared error function while fitting a regression line to points], how do we minimize or maximize it?
  – Least squares optimization
  – Lagrange multipliers (for continuous and partially differentiable functions)
  – Convex optimization techniques
  – Greedy algorithms like gradient descent
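A brief numeric sketch of the partial correlation idea (added here, using simulated data): the raw correlation between two age-driven variables is large, while the partial correlation controlling for age is near zero.

import numpy as np

def partial_corr(x, y, z):
    # r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(0)
age = rng.uniform(6, 12, 500)
weight = 5 * age + rng.normal(0, 4, 500)        # driven by age only
math_skill = 8 * age + rng.normal(0, 6, 500)    # driven by age only

print("raw correlation    :", np.corrcoef(weight, math_skill)[0, 1])
print("partial correlation:", partial_corr(weight, math_skill, age))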
Bottom line: the gradient descent algorithm essentially arrives at the least-squares regression line by performing multiple iterations that minimize the sum of squared errors.
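A minimal sketch (added for illustration) of that bottom line: batch gradient descent on the squared error converges to essentially the same line as the closed-form least-squares fit. The data are simulated.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)   # noisy line (simulated)

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    w -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)

print("gradient descent  :", w, b)
print("closed-form lstsq :", np.polyfit(x, y, 1))   # slope, intercept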
NEURAL NETWORKS

CONTENTS
• Revisiting Biology
• Intelligence: Biological vs Artificial
• History of AI
• Why Deep Learning
• Neural roots of Deep Learning
• Artificial Neural Network
• Revisiting and understanding Neurons
• Neurons in Artificial Neural Networks
• Activation Functions in Artificial Neural Networks
• An illustrative example

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for the Introduction to Machine Learning course.
HISTORY OF AI
The birth of Artificial Intelligence (1952-1956)
• 1955: Allen Newell, J.C. Shaw and Herbert A. Simon created the "first AI program", named "Logic Theorist". It proved 38 of the first 52 theorems in Bertrand Russell and Alfred Whitehead's Principia Mathematica, and found new and more elegant proofs for some.
• 1956: The Dartmouth College summer AI conference was organized by John McCarthy, Marvin Minsky, Nathan Rochester of IBM and Claude Shannon. McCarthy coined the term "artificial intelligence" for the conference.
• 1959: The "General Problem Solver (GPS)" was created by Newell, Shaw and Simon while at CMU. John McCarthy and Marvin Minsky founded the MIT AI Lab.

HISTORY OF AI
Golden years: Early enthusiasm & optimism (1956-1972)
• 1956-1960: High-level computer languages such as FORTRAN, LISP and COBOL were invented in this decade, and the excitement and optimism for AI was very high at that time.
• 1965: Researchers emphasized developing algorithms that could solve mathematical problems. Joseph Weizenbaum created the first natural language processing computer program, ELIZA, at the MIT AI lab in 1966.
• 1972: The first intelligent anthropomorphic robot, WABOT-1, was built in Japan. It consisted of a limb-control system, a vision system and a conversation system. WABOT-1 was able to communicate with a person in Japanese and to measure distances and directions to objects using external receptors, artificial ears and eyes, and an artificial mouth.
HISTORY OF AI
The first AI winter (1972-1980)
• 1969: Marvin Minsky and Seymour Papert published 'Perceptrons', demonstrating previously unrecognized limits of the feed-forward two-layered structure.
• 1970: Seppo Linnainmaa published the reverse mode of automatic differentiation, which later became known as backpropagation and is heavily used to train artificial neural networks.
• late 1970s: The period between 1974 and 1980 was the first AI winter. "AI winter" refers to a time when computer scientists dealt with a severe shortage of both confidence and funding from government for AI research. During AI winters, public interest in artificial intelligence declined significantly.

HISTORY OF AI
AI re-emerges (1980-1987)
• 1980: After the AI winter, AI re-emerged publicly with the development and marketing of Lisp machines and the offering of "Expert System" shells and commercial applications. Expert systems were programmed to emulate the decision-making ability of a human expert. The first national conference of the American Association of Artificial Intelligence was held at Stanford University in the same year.
• mid-1980s: Neural networks became widely used with the backpropagation algorithm, published by Seppo Linnainmaa in 1970 and applied to neural networks by Paul Werbos.
• 1983: DARPA again began to fund AI research through the Strategic Computing Initiative.
HISTORY OF AI
The second AI winter (1987-1993)
• By the early 1990s: The earliest successful expert systems, such as XCON, proved too expensive to maintain, and the few remaining expert system shell companies were forced to downsize and search for new markets.
• 1987: Expert systems were dismissed as "clever programming", and DARPA changed its strategy to focus its funding only on those technologies which showed the most promise, believing strongly that AI was not "the next wave".
• late 1980s: Investors and governments stopped funding AI research once again owing to the high cost and fewer results than had been optimistically promised in earlier years. Expectations had run much higher than what was actually possible.

HISTORY OF AI
AI re-emerges again with intelligent agents (1993-2009)
• early 1990s: TD-Gammon, a backgammon program written by Gerry Tesauro, demonstrated that reinforcement learning is powerful enough to create a championship-level game-playing program, competing favorably with world-class players.
• 1997: IBM's Deep Blue defeated the world chess champion, Garry Kasparov, and became the first computer to beat a world chess champion.
• late 1990s: Web crawlers and other AI-based information extraction programs became essential to the widespread use of the World Wide Web.
HISTORY OF AI
Deep learning, big data and artificial general intelligence (2011-present)
• 2015: Google DeepMind's AlphaGo defeated three-time European Go champion and professional Fan Hui by 5 games to 0.
• 2018: Alibaba's language processing AI outscored top humans on a Stanford University reading and comprehension test, scoring 82.44 against 82.304 on a set of 100,000 questions.
• 2018: Google announced "Duplex", a service that allows an AI assistant to book appointments over the phone on the user's behalf in a manner indiscernible from a human.
• early 2020: Microsoft introduced its Turing Natural Language Generation (T-NLG), which was then the "largest language model ever published at 17 billion parameters."
• mid-2020: OpenAI released GPT-3, a state-of-the-art autoregressive language model that uses deep learning to produce computer code, poetry and other language output that is exceptionally similar to, and almost indistinguishable from, text written by humans. Its capacity was ten times greater than that of the T-NLG.
WHY DEEP LEARNING?
• This has changed over time, which has led to deep learning's prominence today.

ARTIFICIAL NEURAL NETWORK
Neural Roots of Deep Learning
NEURONS IN ARTIFICIAL NEURAL NETWORKS
• Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each signal multiplied by its corresponding weight and passes them on to an activation function.
• The activation function calculates the output value for the neuron. This output value is then passed on to the next layer of the neural network through another synapse.

ACTIVATION FUNCTIONS IN ARTIFICIAL NEURAL NETWORKS
• Activation functions allow neurons in a neural network to communicate with each other through their synapses.
• We have established that neurons receive input signals from the preceding layer of a neural network. A weighted sum of these signals is fed into the neuron's activation function. Then the activation function's output is passed on to the next layer of the network.
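A small sketch (not part of the original slides) of the computation described above: a single neuron takes the weighted sum of its inputs plus a bias and passes it through an activation function; the weights and inputs below are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

inputs = np.array([0.5, -1.2, 3.0])     # signals from the preceding layer
weights = np.array([0.4, 0.1, -0.6])    # one weight per incoming synapse
bias = 0.2

z = np.dot(weights, inputs) + bias      # weighted sum
print("sigmoid output:", sigmoid(z))    # value passed on to the next layer
print("relu output   :", relu(z))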
REFERENCES
• https://machinelearningmastery.com/what-is-deep-learning/
• https://www.javatpoint.com/history-of-artificial-intelligence
• https://en.wikipedia.org/wiki/Timeline_of_artificial_intelligence
• https://en.wikipedia.org/wiki/AI_winter
• https://www.humanoid.waseda.ac.jp/booklet/kato_2.html
• https://www.freecodecamp.org/news/deep-learning-neural-networks-explained-in-plain-english/
Recap:
Simple Regression Model / Multiple Regression Model
Categorical Explanatory Variables in Regression Models
• Categorical independent variables can be incorporated into a regression model by converting them into 0/1 ("dummy") variables
• For binary variables, code dummies "0" for "no" and "1" for "yes"
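A very short sketch of dummy coding with pandas (added here; the column names are made up for illustration).

import pandas as pd

df = pd.DataFrame({"owns_home": ["yes", "no", "yes"], "income": [40, 55, 62]})
# Binary categorical variable -> a single 0/1 dummy ("no" = 0, "yes" = 1).
df["owns_home_dummy"] = (df["owns_home"] == "yes").astype(int)
print(df)

# For multi-category variables, one dummy per category (drop one to avoid collinearity):
print(pd.get_dummies(df["owns_home"], prefix="owns_home", drop_first=True))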
Introduction to Data Analysis
Introduction to Logistic Regression
Slides are adapted from SPIA, University of Georgia
"Gorillas in our midst" (1)
• Thus after the video has ended you should have two totals, one for aerial passes by white shirts and one for bounce passes by white shirts.
http://viscog.beckman.uiuc.edu/grafs/demos/15.html
• This is a real piece of psychology research by Simons and Chabris (1999) at Harvard.
• They find that the harder the task, the more likely it is that people don't spot the gorilla.
• Only 50% of their subjects spotted the gorilla…
• How is this relevant to us?
• Imagine we wanted to predict whether someone saw the gorilla or not; this is a binary dependent variable.
• We might have independent variables like concentration span, difficulty of the task, time of day and so on.

• Our dependent variable is just like the variables we were using earlier.
• But let's say with this example we want to predict whether the gorilla will be spotted by a person with a particular set of characteristics.
• In this case, let's say with a particular concentration span (measured on a 1-100 scale).
• Since our independent variable is interval-level data, we can't use cross-tabs.
Predicting gorilla sightings (2)
• So, what we want to know is the probability that any person will be a gorilla spotter or not, for any value of concentration span.
• Remember, if we know this, we will know the proportion of people that will spot the gorilla at each level of concentration span, on average.
• We could use simple linear regression (SLR) here, with the dependent variable coded as 0 (no gorilla spotted) or 1 (gorilla spotted).
• Well, why can't we…?

What's wrong with SLR?
• We want to predict a probability, and this can only vary between zero and 1.
• But our SLR may predict values that are below zero or above 1…
• Let's quickly fit an SLR to our example.
• Our sample here is the 108 subjects that Simons and Chabris used. I've added some extra data on their concentration spans.
• A scatter-plot isn't all that much use here.
• But the relationship on the graph is actually described by the logistic function.
• This is just the odds: as the probability increases (from zero to 1), the odds increase from 0 to infinity.
• The log of the odds then increases from –infinity to +infinity.
• So if β is 'large', then as X increases the log of the odds will increase steeply; the steepness of the curve will therefore increase as β gets bigger.
• (Regression output fragment: Intercept 4.01, s.e. 0.83, p = 0.000.)
• A linear model would have two parallel lines for each type of person (gorilla spotted or not) by CS. Our lines are NOT parallel.
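To make the odds/log-odds statements concrete, a brief sketch (not from the original slides) with illustrative, not fitted, coefficients: the log-odds are linear in X, so the predicted probability follows the S-shaped logistic curve and always stays between 0 and 1.

import numpy as np

beta0, beta1 = -6.0, 0.12            # illustrative coefficients, not fitted values
cs = np.array([10, 30, 50, 70, 90])  # concentration span (1-100 scale)

log_odds = beta0 + beta1 * cs        # linear in X, ranges from -inf to +inf
odds = np.exp(log_odds)              # ranges from 0 to infinity
prob = odds / (1 + odds)             # equivalently 1 / (1 + exp(-(b0 + b1*X)))

for c, p in zip(cs, prob):
    print(f"CS={c:3d}  P(spot gorilla)={p:.3f}")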
Comparing models (3)
• Remember, for linear regression we looked at how the adjusted R² changed. If there was a significant increase when we added another variable (or interaction), then we thought the model had improved.
• For logistic regression there are a variety of ways of looking at model improvement.
• When we were using OLS regression, we were trying to minimize the sum of squares; for logistic regression we are trying to maximize something called the likelihood function (normally called L).
• To see whether our model has improved by adding a variable (or interaction, or squared term), we can compare the maximum of the likelihood function for each model (just like we compared the R² before for OLS regressions).
Discriminative vs Generative Models

Acknowledgments: The information in the slides in this presentation has been obtained from a wide variety of publicly available Internet sources such as: https://www.baeldung.com/cs/ml-generative-vs-discriminative. Some of the slides have been modified.

(Source: https://www.baeldung.com/cs/ml-generative-vs-discriminative)
Feature Selection

Acknowledgments: The slides in this presentation have been obtained from a wide variety of publicly available Internet sources such as: https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/. Some of the slides have been modified.

Why is feature selection important?
• Real-world datasets can have lots of features
  – Features can also sometimes be referred to as attributes, dimensions, etc.
• When trying to design an ML model for making predictions in real-world scenarios, some features will be relevant; other features will be irrelevant
• Think of it like filtering out the noise from the data
• Remember that the features you consider irrelevant to your analysis or ML model may be considered relevant by others
Techniques used in filter methods (a small sketch of a few of these scores follows below)
• Variance Threshold
  – Core idea: higher-variance features usually contain more information
  – Sets a threshold for variance and gets rid of features that do not satisfy this variance threshold
• Mean Absolute Difference (MAD)
  – Similar to the variance threshold method
  – Computes the mean absolute difference from the mean value
• Dispersion Ratio
  – Computes the ratio of the arithmetic mean to the geometric mean for a specific feature
  – A higher value of the dispersion ratio means that the feature is more relevant from a feature selection perspective
• Mutual Dependence
  – Computes whether two variables are mutually dependent
  – If a particular feature is present/absent, how much information does that feature contribute to the prediction you are trying to make in your ML model?
• Relief
  – Measures the quality of attributes by means of random sampling of instances from the dataset
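A compact sketch (added here, not from the source slides) of three of the filter scores above on a toy feature matrix; in practice you would rank features by these scores and keep the top ones.

import numpy as np

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 4))) + 0.1    # positive toy features (rows = samples)

variance = X.var(axis=0)                                   # variance threshold score
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)          # mean absolute difference

arithmetic_mean = X.mean(axis=0)
geometric_mean = np.exp(np.log(X).mean(axis=0))            # needs strictly positive values
dispersion_ratio = arithmetic_mean / geometric_mean

print("variance         :", variance)
print("MAD              :", mad)
print("dispersion ratio :", dispersion_ratio)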
PAC LEARNING AND VC DIMENSION

CONTENTS
• Computational Learning Theory
• PAC Learning
• VC Dimension
• VC Dimension: Learners and Complexity
• VC Dimension continued
• VC Dimension: Shattering
• Using VC Dimension
• Training Error vs Prediction/Test Error
• No-free-lunch Theorem
• Occam's Razor (Principle of Parsimony)

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for Introduction to Machine Learning course.
Computational learning theory, or statistical learning theory, refers to mathematical frameworks for quantifying learning tasks and algorithms.

• The division between 'learning tasks' and 'learning algorithms' is arbitrary, and in practice there is a lot of overlap between the two fields:
  • Computational Learning Theory (CoLT): formal study of learning tasks
  • Statistical Learning Theory (SLT): formal study of learning algorithms
• CoLT and SLT are largely synonymous in modern usage.
COMPUTATIONAL LEARNING THEORY
• Hypothesis: A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm would come up with depends upon the data and the restrictions and bias that we have imposed on the data.
• Hypothesis Space: The set of all possible legal hypotheses. This is the set from which the ML algorithm would determine the best possible (only one) hypothesis which would best describe the target function or the outputs.
• Every learning algorithm requires assumptions about the hypothesis space. E.g., "My hypothesis space is
  • …linear"
  • …decision trees with 5 nodes"
  • …a three-layer neural network with rectifier hidden units"

The main unanswered question in learning is this: How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?
COMPUTATIONAL LEARNING THEORY
Questions explored in computational learning theory might include:
• How do we know a model is a good approximation of the target function?
• What hypothesis space should be used?
• How do we know if we have a locally or globally good solution?
• How do we avoid overfitting?
• How many data examples are needed?
• … and so on.

The main unanswered question in learning is this: How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?
— Page 713, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Among the many subfields of CoLT, two of the most widely discussed areas of study are PAC Learning and VC Dimension. Crudely speaking, PAC (Probably Approximately Correct) Learning is the theory of machine learning problems, whereas VC (Vapnik–Chervonenkis) Dimension is the theory of machine learning algorithms.
PAC LEARNING
What does the PAC Learning theory say (in simple words)?
• The idea is that a bad hypothesis will be found out based on the predictions it makes on new data, i.e. based on its generalization error.
• A hypothesis that gets most or a large number of predictions correct, i.e. has a small generalization error, is probably a good approximation of the target function.
• This probabilistic language gives the theorem its name: "probably approximately correct." That is, a hypothesis seeks to "approximate" a target function and is "probably" good if it has a low generalization error.

The underlying principle is that any hypothesis that is seriously wrong will almost certainly be "found out" with high probability after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct.
— Page 714, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
VC DIMENSION
Why do we need VC Dimension?
• One way to consider the complexity of a hypothesis space (the space of models that could be fit) is based on the number of distinct hypotheses it contains, and perhaps how the space might be navigated. The VC dimension is a clever approach that instead measures the number of examples from the target problem that can be discriminated by hypotheses in the space.
• The VC dimension estimates the capability or capacity of a classification machine learning algorithm for a specific dataset (number and dimensionality of examples).
• The VC dimension is used as part of the PAC learning framework.

A key quantity in PAC learning is the Vapnik-Chervonenkis dimension, or VC dimension, which provides a measure of the complexity of a space of functions, and which allows the PAC framework to be extended to spaces containing an infinite number of functions.
— Page 344, Pattern Recognition and Machine Learning, 2006.
(Example figures, (c) Alexander Ihler: scatter plots of training data over feature xn, axes from −3 to 3.)
VC DIMENSION: LEARNERS AND COMPLEXITY
• We've seen many versions of the underfit/overfit trade-off
  – Complexity of the learner
  – "Representational power"
• Different learners have different power
• How can we quantify representational power?
  – Not easily…
  – One solution is the VC (Vapnik-Chervonenkis) dimension

• Formally, the VC dimension is the largest number of examples from the training dataset that the space of hypotheses from the algorithm can "shatter."
• Shatter, or a shattered set, in the case of a dataset, means points in the feature space can be selected or separated from each other using hypotheses in the space such that the labels of examples in the separate groups are correct (whatever they happen to be).
• Any placement of three points on a 2D plane with class labels 0 or 1 can be "correctly" split by label with a line, i.e. shattered. But there exist placements of four points on the plane with binary class labels that cannot be correctly split by label with a line, i.e. cannot be shattered. Instead, another "algorithm" must be used, such as ovals.
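A brute-force sketch (not from the original slides) of the shattering idea: every one of the 2³ labelings of three non-collinear points can be separated by a line, while the four-point XOR arrangement cannot. A logistic regression with a very large C is used here as a stand-in for "some linear classifier".

import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def can_shatter(points):
    # Check whether a linear classifier sign(theta0 + theta.x) can realize
    # every possible labeling of the given points (brute force over labelings).
    for labels in itertools.product([0, 1], repeat=len(points)):
        y = np.array(labels)
        if len(set(labels)) == 1:
            continue  # a constant prediction (sign of the bias) handles this labeling
        clf = LogisticRegression(C=1e6, max_iter=10000).fit(points, y)
        if clf.score(points, y) < 1.0:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_xor = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

print("3 points shattered by a line  :", can_shatter(three))      # True
print("4 XOR points shattered by line:", can_shatter(four_xor))   # False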
VC DIMENSION CONTINUED
What does VC Dimension mean, though?
• The VC dimension of a machine learning algorithm is the largest number of data points in a dataset that a specific configuration of the algorithm (hyperparameters) or a specific fit model can shatter.
• A classifier that predicts the same value in all cases will have a VC dimension of 0: no points.
• A large VC dimension indicates that an algorithm is very flexible, although the flexibility may come at the cost of an additional risk of overfitting.

VC DIMENSION: SHATTERING
• We say a classifier f(x) can shatter points x(1)…x(h) iff for all labelings y(1)…y(h), f(x) can achieve zero error on the training data (x(1),y(1)), (x(2),y(2)), …, (x(h),y(h)) (i.e., there exists some θ that gets zero error).
• Can f(x;θ) = sign(θ0 + θ1x1 + θ2x2) shatter these points?
VC DIMENSION
• The VC dimension H is defined as the maximum number of points h that can be arranged so that f(x) can shatter them.
• VC dim >= 4?
VC DIMENSION: SHATTERING
• Example: what's the VC dimension of the two-dimensional line, f(x;θ) = sign(θ1x1 + θ2x2 + θ0)?
• With four points placed as in the figure, any line through these points must split one pair (by crossing one of the lines), so four points cannot be shattered.
• For a linear classifier in d dimensions, VC dim = d + 1.
(c) Alexander Ihler
USING VC DIMENSION
• Use validation / cross-validation to select complexity
• Use a VC-dimension-based bound on test error similarly
• "Structural Risk Minimization" (SRM)
  (Table in the original slide: models f1…f6 with columns # Params, Train Error, VC term, VC Test Bound.)
• Other alternatives
  – Probabilistic models: likelihood under the model (rather than classification error)
  – AIC (Akaike Information Criterion)
    • Log-likelihood of training data − # of parameters
  – BIC (Bayesian Information Criterion)
    • Log-likelihood of training data − (# of parameters)·log(m)
  • Similar to VC dimension: performance + penalty
OCCAM'S RAZOR (PRINCIPLE OF PARSIMONY)
• This philosophical idea, in the context of ML, suggests that all else being equal, a simpler model is to be preferred over a more complex model.
• It does not mean that simpler models are universally better than complex models, but rather that a model must be complex enough to learn the patterns in a dataset and avoid underfitting, yet simple enough to avoid overfitting.
• When choosing between two models, we can only say the simpler model is better if its generalization error is equal to or less than that of the more complex model.

REFERENCES
• https://machinelearningmastery.com/introduction-to-computational-learning-theory
• Artificial Intelligence: A Modern Approach, 3rd edition, 2009 (Book)
• The Nature of Statistical Learning Theory, 1999 (Book)
• Pattern Recognition and Machine Learning, 2006 (Book)
• Machine Learning, 1997 (Book)
• Slides of Andrew W. Moore (Associate Professor, School of Computer Science, Carnegie Mellon University)
• An Introduction to Computational Learning Theory by Kearns and Vazirani
Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 9 — Classification: Advanced Methods
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

Acknowledgments: The slides in this presentation are mostly the textbook slides from the data mining textbook "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber, and Jian Pei. The slides have been modified, i.e., some new slides have been added and some slides have been deleted. Information in this presentation has also been obtained from a wide variety of publicly available Internet sources.

Bayesian Belief Networks
■ Bayesian belief networks (also known as Bayesian networks, probabilistic networks): allow class conditional independencies between subsets of variables
■ A (directed acyclic) graphical model of causal relationships
  ■ Represents dependency among the variables
  ■ Gives a specification of the joint probability distribution
❑ Nodes: random variables
❑ Links: dependency
❑ In the example graph, X and Y are the parents of Z, and Y is the parent of P
❑ There is no dependency between Z and P
❑ The graph has no loops/cycles
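A small sketch (not part of the textbook slides) of what "a specification of the joint probability distribution" means for the example graph: the joint factors as P(X)·P(Y)·P(Z|X,Y)·P(P|Y). The conditional probability values below are made up purely for illustration.

# Factorization for the example DAG: X -> Z, Y -> Z, Y -> P (all variables binary).
P_X = {1: 0.3, 0: 0.7}
P_Y = {1: 0.6, 0: 0.4}
P_Z_given_XY = {(1, 1): 0.9, (1, 0): 0.5, (0, 1): 0.4, (0, 0): 0.1}  # P(Z=1 | X, Y)
P_P_given_Y = {1: 0.2, 0: 0.7}                                        # P(P=1 | Y)

def joint(x, y, z, p):
    pz = P_Z_given_XY[(x, y)] if z == 1 else 1 - P_Z_given_XY[(x, y)]
    pp = P_P_given_Y[y] if p == 1 else 1 - P_P_given_Y[y]
    return P_X[x] * P_Y[y] * pz * pp

# e.g. P(X=1, Y=0, Z=1, P=0) = 0.3 * 0.4 * 0.5 * 0.3
print(joint(1, 0, 1, 0))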
Defining a Network Topology
■ Decide the network topology: specify the # of units in the input layer, the # of hidden layers (if > 1), the # of units in each hidden layer, and the # of units in the output layer
■ Normalize the input values for each attribute measured in the training tuples to [0.0–1.0]
■ One input unit per domain value, each initialized to 0
■ Output: if for classification and more than two classes, one output unit per class is used
■ Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

Backpropagation
■ Iteratively process a set of training tuples & compare the network's prediction with the actual known target value
■ For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
■ Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
■ Steps (a tiny numerical sketch follows below)
  ■ Initialize weights to small random numbers, associated with biases
  ■ Propagate the inputs forward (by applying the activation function)
  ■ Backpropagate the error (by updating weights and biases)
  ■ Terminating condition (when error is very small, etc.)
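A tiny numerical sketch (not from the textbook slides) of one forward pass and one backward weight update for a 2-1-1 network with sigmoid units, following the steps above; the initial weights and learning rate are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])        # one training tuple
target = 1.0
lr = 0.5

W1 = np.array([[0.1, -0.2]])    # hidden layer: 1 unit, 2 inputs
b1 = np.array([0.0])
W2 = np.array([[0.3]])          # output layer: 1 unit, 1 hidden input
b2 = np.array([0.0])

# Propagate the inputs forward.
h = sigmoid(W1 @ x + b1)
out = sigmoid(W2 @ h + b2)

# Backpropagate the error (squared-error loss, sigmoid derivatives).
delta_out = (out - target) * out * (1 - out)
delta_hidden = (W2.T @ delta_out) * h * (1 - h)

# Update weights and biases in the "backwards" direction.
W2 -= lr * np.outer(delta_out, h)
b2 -= lr * delta_out
W1 -= lr * np.outer(delta_hidden, x)
b1 -= lr * delta_hidden

print("prediction before this update:", out)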
BIAS VARIANCE TRADE-OFF

CONTENTS
• Bias and Variance: Introduction, why and what?
• Bias and Variance: Conceptual Definition
• Bias-Variance: Tradeoff
• Bias and Variance: Graphical Definition
• An Illustrative Example: Voting Intentions
• An Illustrative Example: Voting Intentions Caveats
• Bias and Variance: Mathematical Definition
• Bias-Variance: Tradeoff Revisited
• ML Model Space, Hypothesis Space & Hyperparameters
• Generalization and Bias Variance Tradeoff Intuitions
• Error comparisons, No Free Lunch Theorem and Occam's Razor
• Going beyond theory and Managing Bias and Variance
• Bias-Variance Tradeoff: Overfitting/Underfitting & Summary
• More about Managing Bias and Variance practically
• An Applied Example: Voter Party Registration

Acknowledgments: Information in this presentation has been obtained from a wide variety of publicly available Internet sources. Slides created by Mr. Saransh Gupta for academic use only, as part of course material for Introduction to Machine Learning course.
BIAS AND VARIANCE: INTRODUCTION — WHY AND WHAT?
Understanding these two types of error, i.e. the error due to "bias" and the error due to "variance", can help us diagnose model results and avoid the mistakes of over-fitting or under-fitting.
• Think Conceptually
• Think Graphically
• Think Mathematically
BIAS AND VARIANCE: CONCEPTUAL DEFINITION
If you repeat the entire model-building process multiple times, the variance tells us how much the predictions for a given point vary between the different realizations of the model.

BIAS-VARIANCE: TRADEOFF
Note: We will revisit this again in detail in this slide deck.
Voting Republican: 13 | Voting Democratic: 16 | Non-Respondent: 21 | Total: 50
That certainly reflects poorly on us!

For instance, in general the data set used to build the model is provided prior to model construction, and the modeler cannot simply say, "Let's increase the sample size to reduce variance."
• Clearly, a decision tree is a much better model than a neural network here, as it is likely a smaller model with faster inference times and is much easier to explain than a neural network.
• Unless the 1% drop in accuracy is immensely significant in such a problem, the practical decision would be to choose the decision tree over the neural network.
• Tradeoff between bias and variance:
  • Simple models: high bias, low variance
  • Complex models: low bias, high variance
• If our model complexity exceeds the sweet spot, we are in effect over-fitting our model; while if our complexity falls short of the sweet spot, we are under-fitting the model. (Figure: Mean Squared Error (true risk) as a function of model complexity.)
• SVMs
  • Higher-degree polynomial kernels decrease bias, increase variance
  • Stronger regularization increases bias, decreases variance
• Neural networks
  • Deeper models can increase variance, but decrease bias
• K-Nearest Neighbors
  • Increasing k generally increases bias, reduces variance

An Applied Example: Voter Party Registration
We want to predict voter registration using wealth and religiousness as predictors. (Figure: red circles are Republican voters, blue circles are Democratic voters.)
An Applied Example (cont.)
• Let us try experimenting with the value of k to find the prediction algorithm that matches up best with the black boundary line.
References
• towardsdatascience.com/a-blog-about-lunch-and-data-science-how-there-is-no-such-a-thing-as-free-lunch-e46fd57c7f27
• machinelearningmastery.com/no-free-lunch-theorem-for-machine-learning/
• towardsdatascience.com/what-occams-razor-means-in-machine-learning-53f07effc97c
• www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
• www.geeksforgeeks.org/ml-understanding-hypothesis/
• medium.com/@jwbtmf/generalization-error-in-machine-learning-4617141932b7 (19 July 2022)
• www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf
Data Mining: Ensemble Techniques
Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar

Acknowledgments: The slides in this presentation are mostly the textbook slides from the data mining textbook "Introduction to Data Mining (2nd edition)" by Tan, Steinbach, Karpatne, Kumar. The slides have been modified, i.e., some new slides have been added and some slides have been deleted. Information in this presentation has also been obtained from a wide variety of publicly available Internet sources.

Ensemble Methods
● Predict the class label of test records by combining the predictions made by multiple classifiers (e.g., by taking a majority vote)
Example: Why Do Ensemble Methods Work?
Necessary Conditions for Ensemble Methods
Rationale for Ensemble Learning
Bias-Variance Decomposition
Overfitting
Big picture of Bagging (Bagging Algorithm)
• From the dataset, create multiple subsets (samples) with an equal number of tuples, sampled with replacement
• Build a classifier model on each sample
  – Observe how each model is learned in parallel and independently
• Combine the predictions from all models based on a voting mechanism (a short sketch follows below)
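A hedged sketch (not from the textbook slides) of that big picture using scikit-learn's BaggingClassifier, which packages exactly these steps: bootstrap samples with replacement, one base model per sample (a decision tree by default), and a majority vote at prediction time.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 10 base models, each trained on a bootstrap sample the same size as the training
# set; the default base estimator is a decision tree, and predictions are combined
# by voting.
bag = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print("bagging test accuracy:", bag.score(X_te, y_te))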
Boosting
• The core idea is to build a model using the training data, and then build another model that attempts to rectify the errors of the first model
• This is an iterative process with stopping conditions such as:
  – stop when you have already added the maximum number of models, or
  – stop when the complete training set has been correctly predicted
• Observe how the learning in the case of boosting is sequential (not parallel) and adaptive
  – Observe how this is different from bagging, where the learning happens in parallel and independently

Boosting
● An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
  – Initially, all N records are assigned equal weights (for being selected for training)
  – Unlike bagging, weights may change at the end of each boosting round
Boosting
● Records that are wrongly classified will have their weights increased in the next round
● Records that are classified correctly will have their weights decreased in the next round

Boosting
• There are many boosting algorithms
• The AdaBoost algorithm is by Robert Schapire and Yoav Freund
• The adaptive boosting (AdaBoost) algorithm is among the most popular boosting algorithms
• It combines multiple weak classifiers to create one strong classifier
Source: https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/
Big picture of AdaBoost
• Initialize the dataset and assign an equal weight to each data point
• Provide this as input to the model and identify the wrongly classified data points
• Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points, and then normalize the weights of all data points
• If the required results have been obtained, then end; otherwise continue iteratively
Source: https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/ (adapted with some minor modifications)

AdaBoost
● Base classifiers: C1, C2, …, CT
● Error rate of a base classifier: εi = Σj wj δ(Ci(xj) ≠ yj)
● Importance of a classifier: αi = ½ ln((1 − εi) / εi)

AdaBoost Algorithm
● Weight update: the weight of a record is multiplied by exp(−αi) if it is classified correctly in round i and by exp(αi) if it is misclassified, and the weights are then renormalized (a short numeric sketch follows below)
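A compact sketch (not from the textbook slides) of the weight arithmetic above for one boosting round, using made-up predictions; real AdaBoost would refit a weak classifier on the reweighted data in each round.

import numpy as np

y_true = np.array([1, 1, -1, -1, 1])       # true labels
y_pred = np.array([1, -1, -1, 1, 1])       # predictions of the current weak classifier
w = np.full(len(y_true), 1.0 / len(y_true))    # initially equal weights

miss = (y_pred != y_true)
error = np.sum(w[miss])                     # weighted error rate of the classifier
alpha = 0.5 * np.log((1 - error) / error)   # importance of the classifier

# Increase weights of misclassified records, decrease weights of correct ones,
# then renormalize so the weights sum to 1.
w = w * np.exp(np.where(miss, alpha, -alpha))
w = w / w.sum()

print("error:", error, "alpha:", alpha)
print("updated weights:", w)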
Gradient Boosting

Nearest-Neighbor Classifiers
How to determine the class label of a test sample? The choice of proximity measure matters
● For documents, cosine similarity is better than correlation or Euclidean distance: compare, for example, the binary vectors
  111111111110 vs 011111111111
  000000000001 vs 100000000000
K-NN Classifiers…
Handling Irrelevant and Redundant Attributes
K-NN Classifiers: Handling attributes that are interacting
Improving KNN Efficiency
Data Mining — Classification: Alternative Techniques
Lecture Notes for Chapter 4: Rule-Based Classifiers
Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar

Rule-Based Classifier
● Classify records by using a collection of "if…then…" rules
● Rule: (Condition) → y
  – where
    ◆ Condition is a conjunction of tests on attributes
    ◆ y is the class label
  – Examples of classification rules:
    ◆ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
    ◆ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
● Accuracy of a rule:
  – Fraction of the records that satisfy the antecedent that also satisfy the consequent of the rule
  – Example: (Status=Single) → No, with Coverage = 40%, Accuracy = 50%

Application of Rule-Based Classifier
● A lemur triggers rule R3, so it is classified as a mammal
● A turtle triggers both R4 and R5
● A dogfish shark triggers none of the rules
Characteristics of Rule Sets: Strategy 1 / Strategy 2
Building Classification Rules
Direct Method: Sequential Covering

Rule Growing
● Two common strategies

Rule Evaluation
● FOIL: First Order Inductive Learner – an early rule-based learning algorithm
● Extract rules from an unpruned decision tree
● For each rule, r: A → y,
  – consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A
  – compare the pessimistic error rate for r against all r′
  – prune if one of the alternative rules has a lower pessimistic error rate
  – repeat until we can no longer improve the generalization error

● Instead of ordering the rules, order subsets of rules (class ordering)
  – Each subset is a collection of rules with the same rule consequent (class)
RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
() → Mammals

C4.5 versus C4.5rules versus RIPPER
Advantages of Rule-Based Classifiers
Data Mining — Classification: Alternative Techniques

Class Imbalance Problem
● Key challenge:
  – Evaluation measures such as accuracy are not well-suited for imbalanced classes

Confusion Matrix
  a: TP (true positive)
  b: FN (false negative)
  c: FP (false positive)
  d: TN (true negative)
● Most widely-used metric: Accuracy = (a + d) / (a + b + c + d)

Example:
                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes          0        10
ACTUAL Class=No           0       990
Which model is better?
Model A (Accuracy: 99%):
                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes          0        10
ACTUAL Class=No           0       990

Model B (Accuracy: 50%):
                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         10         0
ACTUAL Class=No         500       490

Which model is better?
Model A:
                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes          5         5
ACTUAL Class=No           0       990

Model B:
                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         10         0
ACTUAL Class=No         500       490
Generic confusion matrix:
                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes          a         b
ACTUAL Class=No           c         d

Alternative Measures (example):
                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         10         0
ACTUAL Class=No          10       980

Which of these classifiers is better?
A:                PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         40        10
ACTUAL Class=No          10        40

                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes          1         9
ACTUAL Class=No           0       990

                  PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         40        10
ACTUAL Class=No        1000      4000
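A short sketch (added here) computing the usual alternative measures from the confusion-matrix entries a=TP, b=FN, c=FP, d=TN, using classifier A above as the example.

def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Classifier A above: TP=40, FN=10, FP=10, TN=40
print(metrics(40, 10, 10, 40))    # (0.8, 0.8, 0.8, 0.8)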
Generic layout:
                  PREDICTED CLASS
                      Yes        No
ACTUAL   Yes           TP        FN
CLASS    No            FP        TN

A:                PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         40        10
ACTUAL Class=No          10        40
Which of these classifiers is better?
A:                PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         10        40
ACTUAL Class=No          10        40

B:                PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         25        25
ACTUAL Class=No          25        25

C:                PREDICTED: Class=Yes   Class=No
ACTUAL Class=Yes         40        10
ACTUAL Class=No          40        10
Acknowledgments: The slides in this presentation are mostly the textbook slides from the textbook "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber, and Jian Pei. The slides have been modified, i.e., some new slides have been added and some slides have been deleted. Information in this presentation has also been obtained from a wide variety of publicly available Internet sources.
■ In this course, you have seen several algorithms for clustering & classification
■ What is more important?
  A) Memorizing the key steps of an algorithm
  B) Understanding the core idea of an algorithm
■ B) is the correct answer.
■ If you memorize just the steps of an algorithm, you won't remember them very long
■ And more importantly, you won't know how and where to apply them
■ Focus on understanding, not memorizing
■ You should understand that most of these algorithms were developed to solve specific problems
■ Hence, when you see a new problem, simply using one algorithm may not be effective
  ■ You need to combine ideas from different algorithms
■ As technology progresses and technological environments keep changing, you will have to make some modifications to these existing algorithms to make them applicable to current problems
■ Example: the classical algorithms you studied were proposed more than a decade ago
  ■ At that time, main memory, computational power, etc. were not as good as they are now
  ■ But now a lot of data management work, including ML, will occur on mobile platforms, whereas those algorithms were designed for fixed computing infrastructures such as centralized systems and clusters
  ■ Mobile platforms mean energy constraints, mobility issues, network partitioning issues, connectivity, and distributed and autonomous settings
  ■ Hence, you would need to make changes to all these algorithms to apply them to these new settings
■ For solving real-world ML problems, you need to understand which mix of ideas from different algorithms you want to use
■ This is a judgment call, hence some amount of thinking is required

■ As you all know by now, datasets can be VERY LARGE
■ The usual technique for dealing with VERY LARGE datasets in a scalable manner:
  ■ Sampling
Examples of how to combine ideas from different algorithms
■ Suppose you need to cluster a very large spatial dataset
  ■ You could first use a grid-based approach by imposing a grid structure on the dataset
  ■ Then run any clustering algorithm within each grid
  ■ You can define an epsilon factor to take "fringe" objects into consideration
  ■ You could cluster different grids in parallel → faster execution time
  ■ This is essentially divide and conquer.

Examples of how to combine ideas from different algorithms
■ Suppose you need to cluster a very large spatial dataset
  ■ You could use sampling
■ Instead of random sampling, can you use some other sampling approach?
  ■ Yes!
Examples of how to combine ideas from different algorithms
■ Suppose you need to cluster a very large spatial dataset
  ■ You could use hierarchical agglomerative clustering on a sample of representative points
  ■ Then use any clustering algorithm at the desired level of the hierarchy

Examples of how to combine ideas from different algorithms
■ Suppose you need to cluster a very large spatial dataset
  ■ You could simply use domain knowledge to figure out which areas are dense and which areas are sparse
  ■ Then use any clustering algorithm in each dense area
  ■ The sparse areas could be combined (if appropriate), and then you can use any clustering algorithm on the merged areas
■ First figure out which dimensions are most relevant to your analysis
  ■ Look at the question you are trying to answer, and you will know which dimensions are most relevant to that question
  ■ You can use any dimensionality reduction technique
■ Doing clustering in high-dimensional space may give you results whose significance is hard to interpret
  ■ At very high dimensions, similarity/dissimilarity among the points (objects) may get blurred
■ Bottom line: do the clustering only on the dimensions that are relevant to your analysis

■ Suppose you need to do some data analysis on 40,000 different items in a supermarket
  ■ You could first run a clustering algorithm to break those items into clusters
  ■ Now on each cluster, you can do your analysis (this is more like divide and conquer)
CLARA
■ It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
■ Strength: deals with larger data sets than PAM
■ Weaknesses:
  ■ Efficiency depends on the sample size
  ■ A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS
■ CLARANS draws a sample of neighbors dynamically
■ The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
■ If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
■ It is more efficient and scalable than both PAM and CLARA
Hierarchical Clustering
■ Quick recap of hierarchical clustering
■ Uses a distance matrix as the clustering criterion
■ This method does not require the number of clusters k as an input, but needs a termination condition
■ Please revise from the lecture slides on hierarchical clustering, especially the hierarchical clustering animation for the agglomerative case
■ Major weaknesses of agglomerative clustering methods
  ■ They do not scale well: time complexity of at least O(n²), where n is the total number of objects
  ■ They can never undo what was done previously
■ Integration of hierarchical with distance-based clustering
  ■ BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  ■ CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  ■ CHAMELEON (1999): hierarchical clustering using dynamic modeling

■ Use hierarchical clustering to obtain a "rough cut"
  ■ That is, don't apply hierarchical clustering on a very LARGE dataset, because hierarchical clustering algorithms are generally not very scalable
  ■ Look at the hierarchical clustering animation and you will understand the reason for this
■ Since hierarchical clustering is such that undo is not possible, the initial set on which you want to do the clustering must ensure that no undo would actually be required
Remarks on Hierarchical Clustering Methods

Basic Grid-based Algorithm (a minimal code sketch follows below)
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold τ.
4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).

Remarks on grid-based clustering:
■ Fast:
  ■ No distance computations
  ■ Clustering is performed on summaries and not on individual objects; complexity is usually O(#-populated-grid-cells) and not O(#objects)
  ■ Easy to determine which clusters are neighboring
■ Shapes are limited to unions of grid-cells
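A minimal sketch (not from the original slides) of the four steps above on 2D points: bin the points into grid cells, drop sparse cells, then connect adjacent dense cells into clusters with a flood fill. The cell size and threshold are arbitrary.

import numpy as np
from collections import deque

def grid_cluster(points, cell_size=1.0, tau=3):
    # 1-2. Define grid cells, assign points, and compute each cell's density.
    cells = {}
    for p in points:
        key = tuple(np.floor(p / cell_size).astype(int))
        cells.setdefault(key, []).append(p)
    # 3. Eliminate cells whose density is below the threshold tau.
    dense = {k for k, pts in cells.items() if len(pts) >= tau}
    # 4. Form clusters from contiguous (4-adjacent) groups of dense cells.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        queue, group = deque([start]), set()
        while queue:
            c = queue.popleft()
            if c in group:
                continue
            group.add(c)
            for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
                nb = (c[0] + dx, c[1] + dy)
                if nb in dense and nb not in group:
                    queue.append(nb)
        seen |= group
        clusters.append(group)
    return clusters

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([5, 5], 0.5, (50, 2))])
print("number of clusters:", len(grid_cluster(pts)))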
■ Uses a top-down approach to answer spatial data queries; cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
■ Limitations of COBWEB
  ■ The assumption that the attributes are independent of each other is often too strong, because correlations may exist
  ■ Not suitable for clustering large database data: skewed tree and expensive probability distributions
What Is Outlier Discovery?
■ What are outliers?
  ■ A set of objects that are considerably dissimilar from the remainder of the data
  ■ Example: sports: Michael Jordan, Wayne Gretzky, ...
■ Problem
  ■ Find the top n outlier points
■ Applications:
  ■ Credit card fraud detection

Outlier Discovery: Statistical Approaches
● Assume a model of the underlying distribution that generates the data set (e.g. a normal distribution)
■ Use discordancy tests depending on
  ■ the data distribution
  ■ the distribution parameters (e.g., mean, variance)
  ■ the number of expected outliers
Outlier Discovery: Deviation-Based Approach
■ Identifies outliers by examining the main characteristics of objects in a group
■ Objects that "deviate" from this description are considered outliers
■ Sequential exception technique
  ■ simulates the way in which humans can distinguish unusual objects from among a series of supposedly similar objects
■ OLAP data cube technique
  ■ uses data cubes to identify regions of anomalies in large multidimensional data

Summary
■ Cluster analysis groups objects based on their similarity and has wide applications
■ A measure of similarity can be computed for various types of data
■ Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
■ Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based or deviation-based approaches
■ There are still many research issues in cluster analysis, such as constraint-based clustering
■ Current clustering techniques do not address all the requirements adequately
■ Constraint-based clustering analysis: constraints exist in data space (bridges and highways) or in user queries

References
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB'98.
■ S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
■ A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.