Intelligent System: Lecture Notes For Chapter 7


Cluster Analysis: Basic Concepts and Algorithms

02/14/2018 Introduction to Data Mining, 2nd Edition 1


Difference between K-NN and K-means

 K-NN is a supervised machine learning algorithm, while K-means is an unsupervised one.
 K-NN is a classification or regression algorithm, while K-means is a clustering algorithm.
 K-NN is a lazy learner, while K-means is an eager learner. An eager learner fits a model in a training step; a lazy learner has no training phase and defers the work to prediction time.
 Both K-NN and K-means rely on distances between points, so both generally perform better when all features are on the same scale.

What is Cluster Analysis?
 Cluster analysis divides data into groups (clusters) that are meaningful,
useful, or both
 Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in other
groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]



Applications of Cluster Analysis
 Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Discovered Clusters (Industry Group in parentheses):
1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

 Data Summarization
– Reduce the size of large data sets

[Figure: Clustering precipitation in Australia]



Cluster Analysis

 Clusters are potential classes, and cluster analysis is the study of techniques for automatically finding classes. Examples are:
 Biology: clustering has been used to find groups of genes that have
similar functions.
 Information Retrieval: Clustering can be used to group the search
results into a small number of clusters, each of which captures a
particular aspect of the query.
 Climate: cluster analysis has been applied to find patterns in the
atmospheric pressure of polar regions and areas of the ocean that have
a significant impact on land climate.
 Psychology and Medicine: cluster analysis has been used to identify different types of depression.
It can also be used to detect patterns in the spatial or
temporal distribution of a disease.
 Business: Clustering can be used to segment customers into a small
number of groups for additional analysis and marketing activities.

Notion of a Cluster can be Ambiguous

[Figure: the same set of points grouped in different ways. Panels: How many clusters?, Two Clusters, Four Clusters, Six Clusters]



What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

Helps users understand the natural grouping or structure in a data set


 Cluster:
– a collection of data objects that are
“similar” to one another and thus can be
treated collectively as one group
– but as a collection, they are sufficiently
different from other groups
 Clustering
– clustering can be regarded as a form of
classification in that it creates a labeling
of objects with class (cluster) labels.
– unsupervised classification
– no predefined classes



Types of Clusterings
 Partitional and Hierarchical Clustering
– A partitional clustering is simply a division of the set of data
objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset.
– A hierarchical clustering is a set of nested clusters organized as a hierarchical tree.

 Exclusive versus non-exclusive
– Exclusive clusterings assign each object to a single cluster.
– In non-exclusive clusterings, points may belong to multiple clusters.
– Non-exclusive clusterings can represent multiple classes or ‘border’ points.



Types of Clusterings

 Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight between 0 (absolutely does not belong) and 1 (absolutely belongs)
– For each point, the weights must sum to 1
– Probabilistic clustering has similar characteristics
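
The weight constraint can be illustrated with a minimal sketch. The weighting rule used here (normalized inverse squared distance to each cluster center) is an assumption chosen for illustration, as are the function name `fuzzy_weights` and the sample centers; the slides do not prescribe a particular rule.

```python
import math

def fuzzy_weights(point, centers):
    """Return one membership weight per cluster center; the weights sum to 1.

    Assumption: weights are proportional to the inverse squared distance
    to each center. Other weighting rules are possible.
    """
    d2 = [math.dist(point, c) ** 2 for c in centers]
    on_center = [d == 0.0 for d in d2]
    if any(on_center):
        # The point coincides with one or more centers: share the weight among them.
        n = sum(on_center)
        return [1.0 / n if hit else 0.0 for hit in on_center]
    inverse = [1.0 / d for d in d2]
    total = sum(inverse)
    return [v / total for v in inverse]

# Hypothetical example: two centers on the x-axis, one point near the first.
centers = [(0.0, 0.0), (4.0, 0.0)]
weights = fuzzy_weights((1.0, 0.0), centers)
print(weights)       # the first weight is much larger than the second
print(sum(weights))  # approximately 1
```

A point close to one center gets a weight near 1 for that cluster and near 0 for the others, which is how a non-fuzzy assignment appears as a limiting case.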

 Partial versus complete
– A complete clustering assigns every object to a cluster
– In some cases, we only want to cluster some of the data (partial)



Partitional Clustering

[Figure: two panels, Original Points and A Partitional Clustering]



Hierarchical Clustering

[Figure: Traditional Hierarchical Clustering and Traditional Dendrogram of points p1, p2, p3, p4]

[Figure: Non-traditional Hierarchical Clustering and Non-traditional Dendrogram of points p1, p2, p3, p4]



Types of Clusters

Clustering aims to find useful groups of objects (clusters)

 Well-separated clusters
 Center-based (Prototype-based) clusters
 Contiguity-based (Graph-based) clusters
 Density-based clusters
 Shared-property or conceptual clusters
 Described by an Objective Function



Types of Clusters: Well-Separated

 Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more
similar) to every other point in the cluster than to any point not in the cluster.
– Sometimes a threshold is used to specify that all the objects in a cluster must
be sufficiently close (or similar) to one another.
– This idealistic definition of a cluster is satisfied only when the data contains
natural clusters that are quite far from each other.

[Figure: 3 well-separated clusters. Each point is closer to all of the points in its cluster than to any point in another cluster.]

– The distance between any two points in different groups is larger than the
distance between any two points within a group.
– Well-separated clusters do not need to be globular, but can have any shape
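
The definition above can be checked directly on a small labeled data set. The following is a minimal sketch, assuming Euclidean distance; the function name `is_well_separated` and the sample points are hypothetical.

```python
import math

def is_well_separated(points, labels):
    """True if every point is closer to all points in its own cluster
    than to any point in a different cluster (Euclidean distance assumed)."""
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        # The farthest same-cluster point must still be nearer than
        # the nearest point of any other cluster.
        if same and other and max(same) >= min(other):
            return False
    return True

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(is_well_separated(points, [0, 0, 1, 1]))  # groups far apart
print(is_well_separated(points, [0, 1, 0, 1]))  # each cluster mixes both groups
```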



Types of Clusters: Center-Based

 Center-based (Prototype-based)
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of its cluster than to the
center of any other cluster.
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster.
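
The difference between the two kinds of center can be seen in a short sketch. It assumes Euclidean distance and takes "most representative" to mean the cluster member with the smallest total distance to all other members, which is one common reading; the sample cluster is hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise average of the points; need not be a data point."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def medoid(points):
    """The actual data point with the smallest total distance to the others."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Hypothetical cluster with one far-away member.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(centroid(cluster))  # pulled toward the far-away point
print(medoid(cluster))    # always one of the original points
```

The medoid is always a member of the cluster, while the centroid is generally not.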

[Figure: 4 center-based clusters. Each point is closer to the center of its cluster than to the center of any other cluster.]



Types of Clusters: Contiguity-Based

 Contiguous Cluster (Nearest neighbor or Transitive) (Graph-based)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
– Two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster.
 When noise is present, a small bridge of points can merge two distinct clusters.

[Figure: 8 contiguous clusters]
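
One way to make the graph-based view concrete is to connect every pair of points within a distance threshold and treat each connected component as a cluster. This is a minimal sketch under that assumption; the function name `contiguity_clusters`, the threshold value, and the sample points are hypothetical.

```python
import math

def contiguity_clusters(points, max_dist):
    """Label points by connected component, linking points within max_dist."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        # Grow a new cluster from this unlabeled point.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and math.dist(points[i], points[j]) <= max_dist:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
labels = contiguity_clusters(points, max_dist=1.5)
print(labels)  # two separate groups

# A thin bridge of noise points links the two groups into a single cluster.
bridge = [(x, 0) for x in range(3, 10)]
bridged_labels = contiguity_clusters(points + bridge, max_dist=1.5)
print(len(set(bridged_labels)))
```

The second call shows the weakness noted above: a few bridging points are enough to merge two otherwise distinct clusters.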



Types of Clusters: Density-Based

 Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
– Unlike in the previous (contiguity-based) figure, the two circular clusters are not merged, because the bridge between them fades into the noise.

[Figure: 6 density-based clusters]
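
The idea of a "dense region" can be sketched by counting neighbors within a radius, in the spirit of the core-point test used by DBSCAN. This is not a full DBSCAN implementation; the parameter names `eps` and `min_pts`, their values, and the sample points are assumptions for illustration.

```python
import math

def core_points(points, eps, min_pts):
    """Return the points that have at least min_pts points (themselves
    included) within distance eps, i.e. points lying in a dense region."""
    cores = []
    for p in points:
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbors >= min_pts:
            cores.append(p)
    return cores

blob_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
blob_b = [(10, 0), (10, 1), (11, 0), (11, 1)]
sparse_bridge = [(4, 0.5), (7, 0.5)]  # low-density points between the blobs

cores = core_points(blob_a + blob_b + sparse_bridge, eps=1.5, min_pts=4)
print(cores)  # only the blob points qualify
```

Because the bridge points have too few neighbors, they are not part of any dense region and do not join the two blobs.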



Types of Clusters: Conceptual Clusters

 Shared Property or Conceptual Clusters
 Consider the clusters shown in the figure: a triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters).
 In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of finding such clusters is called conceptual clustering. However, too sophisticated a notion of a cluster would take us into the area of pattern recognition.

[Figure: 2 Overlapping Circles]

Clustering Algorithms

 K-means and its variants
– K-means clustering is an unsupervised learning algorithm that attempts to group similar data points together into clusters.

 Agglomerative hierarchical clustering

 Density-based clustering (DBSCAN)



K-means Clustering

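 Partitional clustering approach
 Number of clusters, K, must be specified
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 The basic algorithm is very simple: select K initial centroids, then repeatedly assign each point to its closest centroid and recompute each centroid, until the centroids no longer change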
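
The basic algorithm can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes Euclidean distance and, to stay deterministic, uses the first K points as initial centroids (random initialization is more common). The function name `kmeans` and the sample points are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Basic K-means on a list of coordinate tuples.

    Assumption: the first k points are used as the initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster's centroid alone
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Even though both initial centroids start inside the same group of points, the repeated assign-and-update steps move one centroid to each group.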



Example of K-means Clustering
[Figure: Iteration 6; axes x and y]

[Figure: six panels, Iteration 1 through Iteration 6; axes x and y]



Choosing a K value
 There is no easy answer for choosing a “best” K value
 One way is the elbow method

 First, compute the sum of squared errors (SSE) for several values of K (for example 2, 4, 6, 8, etc.). Then plot SSE against K and choose the K at the “elbow”, where further increases in K give only small reductions in SSE.

 The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid.

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
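
The elbow idea can be tried on a tiny data set. To keep the sketch short and deterministic it does not run K-means: for sorted one-dimensional data it searches all contiguous splits exhaustively and reports the lowest SSE for each K. That substitution, the function names, and the sample data are assumptions made for illustration.

```python
from itertools import combinations

def cluster_sse(values):
    """Squared distances of the values to their mean, summed."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def best_sse(sorted_values, k):
    """Lowest total SSE over all ways to cut sorted 1-D data into k contiguous groups."""
    n = len(sorted_values)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        total = sum(
            cluster_sse(sorted_values[a:b]) for a, b in zip(bounds, bounds[1:])
        )
        best = min(best, total)
    return best

# Hypothetical data with three obvious groups.
data = sorted([1, 2, 3, 11, 12, 13, 21, 22, 23])
sse_by_k = {k: best_sse(data, k) for k in range(1, 7)}
for k, value in sse_by_k.items():
    print(k, round(value, 2))
```

For this data the SSE falls steeply up to K = 3 and only slightly afterwards, so K = 3 is the elbow.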



Evaluating K-means Clusters

 Most common measure is the Sum of Squared Errors (SSE)
– For each point, the error is the distance to the nearest cluster centroid
– To get SSE, we square these errors and sum them.

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$

– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
 One can show that m_i corresponds to the center (mean) of the cluster
– Given two sets of clusters, we prefer the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
 However, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K
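
A small worked example makes the last two points concrete. It is a sketch on hypothetical one-dimensional data, with each cluster's representative point m_i taken to be its mean.

```python
def sse(clusters):
    """Sum over clusters of the squared distances of members to the cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total

# Four points on a line: 0, 1, 10, 11.
good_k2 = [[0.0, 1.0], [10.0, 11.0]]         # K = 2, natural grouping
poor_k3 = [[0.0, 10.0], [1.0], [11.0]]       # K = 3, but one cluster spans both groups
all_singletons = [[0.0], [1.0], [10.0], [11.0]]  # K = 4, every point is its own cluster

print(sse(good_k2))
print(sse(poor_k3))
print(sse(all_singletons))
```

The K = 2 clustering has a far lower SSE than the poorly chosen K = 3 clustering, while K = 4 drives the SSE to zero without revealing any structure, which is why SSE alone cannot be used to pick K.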

Choosing a K Value (Elbow method)



Limitations of K-means

 K-means has problems when clusters have
– Differing sizes
– Differing densities
– Non-globular shapes

 K-means has problems when the data contains outliers.
