Intelligent System: Lecture Notes For Chapter 7

Intelligent System
Lecture Notes for Chapter 7

Cluster Analysis: Basic Concepts
and Algorithms
02/14/2018 Introduction to Data Mining, 2nd Edition 1

Difference between KNN and K mean
 K-NN is a Supervised machine learning while K-

means is an unsupervised machine learning.
 K-NN is a classification or regression machine
learning algorithm while K-means is
a clustering machine learning algorithm.
 K-NN is a lazy learner while K-Means is
an eager learner. An eager learner has a model
fitting that means a training step but a lazy
learner does not have a training phase.
 K-NN performs much better if all of the data have
the same scale but this is not true for K-means.
What is Cluster Analysis?
 Cluster analysis divides data into groups (clusters) that are meaningful,
useful, or both
 Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in other
groups
Inter-cluster
Intra-cluster distances are
distances are maximized
minimized

Applications of Cluster Analysis
Discovered Clusters Industry Group
 Understanding Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
– Group related documents 1 Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Technology1-DOWN
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
for browsing, group genes Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
and proteins that have 2 ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,

Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Technology2-DOWN
similar functionality, or Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
group stocks with similar 3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN
price fluctuations Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
4 Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP
Oil-UP
 Data Summarization
– Reduce the size of large
data sets
Clustering precipitation
in Australia

Cluster Analysis
 Clusters are potential classes and cluster analysis is

the study of techniques for automatically finding
classes. Examples are:
 Biology: clustering has been used to find groups of genes that have
similar functions.
 Information Retrieval: Clustering can be used to group the search
results into a small number of clusters, each of which captures a
particular aspect of the query.
 Climate: cluster analysis has been applied to find patterns in the
atmospheric pressure of polar regions and areas of the ocean that have
a significant impact on land climate.
 Psychology and Medicine: identify different types of depression.
Cluster analysis can also be used to detect patterns in the spatial or
temporal distribution of a disease.
 Business: Clustering can be used to segment customers into a small
number of groups for additional analysis and marketing activities.
Notion of a Cluster can be Ambiguous
How many clusters? Six Clusters
Two Clusters Four Clusters

What is Clustering in Data Mining?
Clustering
Clusteringisisaaprocess
processof
ofpartitioning
partitioningaaset
setof
ofdata
data(or
(orobjects)
objects)in
inaaset
set
of
ofmeaningful
meaningfulsub-classes,
sub-classes,called
calledclusters
clusters
Helps users understand the natural grouping or structure in a data set

 Cluster:
– a collection of data objects that are
“similar” to one another and thus can be
treated collectively as one group
– but as a collection, they are sufficiently
different from other groups
 Clustering
– clustering can be regarded as a form of
classification in that it creates a labeling
of objects with class (cluster) labels.
– unsupervised classification
– no predefined classes

Types of Clusterings
 Partitional and Hierarchical Clustering
– A partitional clustering is simply a division of the set of data
objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset.
– A set of nested clusters organized as a hierarchical tree.
 Exclusive versus non-exclusive

– Exclusive assign each object to a single cluster
– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points

Types of Clusterings
 Fuzzy versus non-fuzzy

– In fuzzy clustering, a point belongs to every cluster with some
weight between 0(absolutely does not belong) and
1(absolutely belong)
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
 Partial versus complete

– complete clustering assigns every object to a cluster
– In some cases, we only want to cluster some of the data
(partial)

Partitional Clustering
Original Points A Partitional Clustering

Hierarchical Clustering
p1
p3 p4
p2
p1 p2 p3 p4
Traditional Hierarchical Clustering Traditional Dendrogram
p1
p3 p4
p2
p1 p2 p3 p4
Non-traditional Hierarchical Clustering Non-traditional Dendrogram

Types of Clusters
Clustering aims to find useful groups of objects (clusters)
 Well-separated clusters
 Center-based (Prototype-based) clusters
 Contiguous-based (Graph-based) clusters
 Density-based clusters
 Property or Conceptual
 Described by an Objective Function

Types of Clusters: Well-Separated
 Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more
similar) to every other point in the cluster than to any point not in the cluster.
– Sometimes a threshold is used to specify that all the objects in a cluster must
be sufficiently close (or similar) to one another.
– This idealistic definition of a cluster is satisfied only when the data contains
natural clusters that are quite far from each other.
3 well-separated clusters :
Each point is closer to all of the
points in its cluster than to any
point in another cluster.
– The distance between any two points in different groups is larger than the
distance between any two points within a group.
– Well-separated clusters do not need to be globular, but can have any shape

Types of Clusters: Center-Based
 Center-based (Prototype-based)
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster.
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster.
4 center-based clusters: Each point is closer to the center of its

cluster than to the center of any other cluster.

Types of Clusters: Contiguity-Based
 Contiguous Cluster (Nearest neighbor or

Transitive) (Graph based)
– two objects are connected only if they are within a specified distance of each
other
– A cluster is a set of points such that a point in a cluster is closer (or more
similar) to one or more other points in the cluster than to any point not in the
cluster.
– two objects are connected only if they are within a specified distance of each
other. This implies that each object in a contiguity-based cluster is closer to
some other object in the cluster than to any point in a different cluster
 When noise is present, a small bridge of points can merge two distinct clusters.
8 contiguous clusters

Types of Clusters: Density-Based
 Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
– The two circular clusters are not merged, in previous fig
because the bridge between them fades into the noise.
6 density-based clusters

Types of Clusters: Conceptual Clusters
 Shared Property or Conceptual Clusters

 Consider the clusters shown in Figure.
 A triangular area (cluster) is adjacent to a rectangular one,
 and there are two intertwined circles (clusters).
 In both cases, a clustering algorithm would need a very specific
concept of a cluster to successfully detect these clusters. The
process of finding such clusters is called conceptual clustering.
However, too sophisticated a notion of a cluster would take us into
the area of pattern recognition.
2 Overlapping Circles
Clustering Algorithms
 K-means and its variants

– K mean clustering is an unsupervised learning
algorithm that will attempt to group similar clusters
together in your data.
 Agglomreative Hierarchical clustering
 Density-based clustering (DBSCAN)

K-means Clustering
 Partitional clustering approach

 Number of clusters, K, must be specified
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest
centroid
 The basic algorithm is very simple

Example of K-means Clustering
Iteration 6
1
2
3
4
5
3
2.5
1.5
y
0.5
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

02/14/2018 Introduction to Data Mining,x2nd Edition 20
Example of K-means Clustering
Iteration 1 Iteration 2 Iteration 3
3 3 3
2.5 2.5 2.5
2 2 2
1.5 1.5 1.5

y
y
1 1 1
0.5 0.5 0.5
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
Iteration 4 Iteration 5 Iteration 6

3 3 3
2.5 2.5 2.5
2 2 2
1.5 1.5 1.5

y
y
1 1 1
0.5 0.5 0.5
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x

Choosing a K value
 There is no easy answer for choosing a ‘best” K value
 One way is the elbow method
 First of all, compute the sum of the square error (SSE) for some values
of k (for example 2, 4, 6, 8, etc.).
 The SSE is defined as the sum of the square distance between each
member of the cluster and its centroid.
K
SSE    dist 2 ( mi , x )
i 1 xCi

Evaluating K-means Clusters
 Most common measure is Sum of Squared Error (SSE)

– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them.
K
SSE    dist 2 ( mi , x )
i 1 xCi
– x is a data point in cluster Ci and mi is the representative point for

cluster Ci
 can show that mi corresponds to the center (mean) of the cluster
– Given two sets of clusters, we prefer the one with the smallest
error
– One easy way to reduce SSE is to increase K, the number of
clusters
 A good clustering with smaller K can have a lower SSE than a
poor clustering with higher K
Choosing a K Value (Elbow method)

Limitations of K-means
 K-means has problems when clusters are of

differing
– Sizes
– Densities
– Non-globular shapes
 K-means has problems when the data contains

outliers.

Intelligent System: Lecture Notes For Chapter 7

Uploaded by

Copyright:

Available Formats

You might also like

Intelligent System: Lecture Notes For Chapter 7

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Intelligent System: Lecture Notes For Chapter 7

Uploaded by

Copyright:

Available Formats

Intelligent System

Lecture Notes for Chapter 7

02/14/2018 Introduction to Data Mining, 2nd Edition 1

 K-NN is a Supervised machine learning while K-

02/14/2018 Introduction to Data Mining, 2nd Edition 3

– Group related documents 1 Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

and proteins that have 2 ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,

price fluctuations Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,

02/14/2018 Introduction to Data Mining, 2nd Edition 4

 Clusters are potential classes and cluster analysis is

How many clusters? Six Clusters

Two Clusters Four Clusters

02/14/2018 Introduction to Data Mining, 2nd Edition 6

Helps users understand the natural grouping or structure in a data set

02/14/2018 Introduction to Data Mining, 2nd Edition 7

 Exclusive versus non-exclusive

02/14/2018 Introduction to Data Mining, 2nd Edition 8

 Fuzzy versus non-fuzzy

 Partial versus complete

02/14/2018 Introduction to Data Mining, 2nd Edition 9

Original Points A Partitional Clustering

02/14/2018 Introduction to Data Mining, 2nd Edition 10

Traditional Hierarchical Clustering Traditional Dendrogram

Non-traditional Hierarchical Clustering Non-traditional Dendrogram

02/14/2018 Introduction to Data Mining, 2nd Edition 11

Clustering aims to find useful groups of objects (clusters)

02/14/2018 Introduction to Data Mining, 2nd Edition 12

02/14/2018 Introduction to Data Mining, 2nd Edition 13

4 center-based clusters: Each point is closer to the center of its

02/14/2018 Introduction to Data Mining, 2nd Edition 14

 Contiguous Cluster (Nearest neighbor or

02/14/2018 Introduction to Data Mining, 2nd Edition 15

02/14/2018 Introduction to Data Mining, 2nd Edition 16

 Shared Property or Conceptual Clusters

 K-means and its variants

 Agglomreative Hierarchical clustering

 Density-based clustering (DBSCAN)

02/14/2018 Introduction to Data Mining, 2nd Edition 18

 Partitional clustering approach

02/14/2018 Introduction to Data Mining, 2nd Edition 19

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

2.5 2.5 2.5

1.5 1.5 1.5

0.5 0.5 0.5

Iteration 4 Iteration 5 Iteration 6

2.5 2.5 2.5

1.5 1.5 1.5

0.5 0.5 0.5

02/14/2018 Introduction to Data Mining, 2nd Edition 21

02/14/2018 Introduction to Data Mining, 2nd Edition 22

 Most common measure is Sum of Squared Error (SSE)

– x is a data point in cluster Ci and mi is the representative point for

02/14/2018 Introduction to Data Mining, 2nd Edition 24

 K-means has problems when clusters are of

 K-means has problems when the data contains

02/14/2018 Introduction to Data Mining, 2nd Edition 25

You might also like