Data Mining Unit 3 Cluster Analysis: Types of Clusters


Data Mining Unit 3


Cluster Analysis

What is Cluster Analysis?


Cluster analysis is a multivariate data mining technique whose goal is to group objects (e.g.,
products, respondents, or other entities) based on a set of user-selected characteristics or attributes.
It is a basic and important step of data mining and a common technique for statistical data
analysis, and it is used in many fields such as data compression, machine learning, pattern
recognition, and information retrieval.

Types of clusters:
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Shared-property or conceptual clusters

Well-Separated Clusters:
A cluster is a set of objects in which each object is closer (or more similar) to every other object in
the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the
objects in a cluster must be sufficiently close or similar to one another. This definition of a cluster is
satisfied only when the data contains natural clusters that are quite far from one another.

Center-based (prototype-based):


– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the
“center” of its own cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a
medoid, the most “representative” point of a cluster (see the sketch below).
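To make the centroid/medoid distinction concrete, here is a minimal sketch, assuming NumPy is available; the data points and variable names are made up for illustration:

```python
import numpy as np

# A small 2-D "cluster"; the last point is an outlier (illustrative data only).
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [8.0, 8.0]])

# Centroid: the coordinate-wise mean; it need not be an actual data point.
centroid = points.mean(axis=0)

# Medoid: the actual point with the smallest total distance to all other points.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)  # pulled toward the outlier
print("medoid:  ", medoid)    # remains a real, central member of the cluster
```

The medoid is often preferred when outliers are present or when only pairwise dissimilarities (and no coordinates) are available.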
Contiguous Cluster (Graph-based):
If the data is depicted as a graph whose nodes are the objects, then a cluster can be described as a
connected component: a group of objects that are connected to one another but have no connection
to objects outside the group. An important example of graph-based clusters is contiguity-based
clusters, where two objects are connected when they lie within a specified distance of each other.
This implies that each object in a contiguity-based cluster is closer to some other object in the
cluster than to any object in a different cluster, as illustrated in the sketch below.
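As a rough illustration of this idea, the sketch below links any two objects that lie within a chosen distance threshold and takes each connected component of the resulting graph as one cluster (assuming NumPy and SciPy; the data and the threshold value are invented):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

# Illustrative data: two chains of nearby points.
X = np.array([[0.0, 0.0], [0.4, 0.0], [0.8, 0.0],
              [5.0, 5.0], [5.4, 5.0], [5.8, 5.0]])
threshold = 0.5  # two objects are "connected" if they are within this distance

# Adjacency matrix of the graph: an edge wherever two objects are close enough.
adjacency = csr_matrix(cdist(X, X) <= threshold)

# Each connected component of the graph is one contiguity-based cluster.
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)  # e.g. 2 clusters with labels [0 0 0 1 1 1]
```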

Density-based:
– A cluster is a dense region of points, which is separated by low-density regions, from other
regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are
present.

Shared-property or Conceptual Clusters:


We can describe a cluster as a set of objects that share some property. The objects in a center-based
cluster share the property that they are all closest to the same centroid or medoid. However, the
shared-property approach also encompasses new types of clusters. Consider the clusters shown in
the figure: a triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined
circles (clusters). In both cases, a clustering algorithm would need a very specific notion of a cluster
to detect these clusters successfully. The process of discovering such clusters is called conceptual
clustering.
Q2) What are the different types of clusterings?
A) An entire collection of clusters is commonly referred to as a clustering, and the
various types of clusterings are:
• hierarchical clustering (nested)
• partitional clustering (unnested)
• exclusive clustering
• overlapping clustering
• fuzzy clustering
• complete clustering
• partial clustering

Q3) Explain Hierarchical clustering


A)

• This type of clustering groups together unlabeled data points having similar
characteristics.
• Hierarchical clustering initially treats every data point as a separate cluster.
• Then it repeatedly executes the following steps: identify the two clusters that are closest
together, and merge these two most comparable clusters.
• This process continues until all the clusters are merged.
• Hence, this method creates a hierarchical decomposition of the given set of data objects.
• Based on how the hierarchical decomposition is formed, this clustering is further
classified into two types:
1. Agglomerative Approach
2. Divisive Approach
Agglomerative Approach
• This approach is also known as the Bottom-Up Approach.
• This approach starts with each object forming a separate group.
• It keeps on merging the objects or groups that are close to one another.
• It keeps on doing so until all of the groups are merged into one or until the termination
condition holds.
Algorithm for Agglomerative Hierarchical Clustering:
Step 1 - Consider every data point as an individual cluster.
Step 2 - Calculate the similarity of each cluster with all the other clusters, i.e., compute the
proximity matrix.
Step 3 - Merge the clusters that are most similar or closest to each other.
Step 4 - Recalculate the proximity matrix for the new set of clusters.
Step 5 - Repeat Steps 3 and 4 until only a single cluster remains.
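The steps above can be reproduced with off-the-shelf hierarchical clustering routines. A minimal sketch using SciPy (an assumption; the data is invented) that builds the merge hierarchy and then cuts it into a flat clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two small groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

# Each point starts as its own cluster (Step 1); linkage() computes proximities
# and repeatedly merges the two closest clusters (Steps 2-5).
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting hierarchy to obtain, say, 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```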
Divisive Approach
• This approach is also known as the Top-Down Approach.
• This approach starts with all of the objects in the same cluster.
• In each iteration, a cluster is split into smaller clusters.
• This is done until each object is in its own cluster or the termination condition holds.
• Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.

Q) Explain partitional clustering

A) Partitional clustering algorithms generate various partitions and then evaluate them by some
criterion. They are also referred to as nonhierarchical, since each instance is placed in exactly one of k
mutually exclusive clusters. Because a typical partitional clustering algorithm outputs only one set of
clusters, the user is required to input the desired number of clusters (usually called k).
One of the most commonly used partitional clustering algorithms is the k-means clustering
algorithm. The user must provide the number of clusters (k) before starting, and the algorithm
first initializes the centers (or centroids) of the k partitions. In a nutshell, the k-means clustering
algorithm then assigns members based on the current centers and re-estimates the centers based on the
current members. These two steps are repeated until the intra-cluster similarity and inter-cluster
dissimilarity objective functions are optimized. Therefore, sensible initialization of the centers is a
very important factor in obtaining quality results from partitional clustering algorithms.

Q)What is the difference between Hierarchical and Partitional Clustering?

Hierarchical and Partitional Clustering have key differences in running time, assumptions, input
parameters and resultant clusters. Typically, partitional clustering is faster than hierarchical
clustering. Hierarchical clustering requires only a similarity measure, while partitional clustering
requires stronger assumptions such as number of clusters and the initial centers. Hierarchical
clustering does not require any input parameters, while partitional clustering algorithms require the
number of clusters to start running. Hierarchical clustering returns a much more meaningful and
subjective division of clusters but partitional clustering results in exactly k clusters. Hierarchical
clustering algorithms are more suitable for categorical data as long as a similarity measure can be
defined accordingly.

Q) Explain the K-means clustering method and algorithm. What are the limitations
of K-means?
Ans:
• K-means is a partitional method of cluster analysis.
• The objects are divided into non-overlapping clusters (or partitions) such that each object is in
exactly one cluster.
• This method obtains a single-level partition of the objects.
• This method can only be used if the data objects are located in main memory.
• The method is called K-means since each of the K clusters is represented by the mean of the
objects (called the centroid) within it.
• It is also called the centroid method since
→ at each step, the centroid point of each cluster is assumed to be known, and
→ each of the remaining points is allocated to the cluster whose centroid is closest to it.
K-MEANS ALGORITHM
1) Select the number of clusters = k.
2) Pick k seeds as centroids of the k clusters. The seeds may be picked randomly.
3) Compute the Euclidean distance of each object in the dataset from each of the centroids.
4) Allocate each object to the cluster it is nearest to.
5) Recompute the centroids of the clusters.
6) Check whether the stopping criterion has been met (i.e., cluster membership is unchanged).
If yes, go to step 7.
If not, go to step 3.
7) One may decide to stop at this stage, or split a cluster or combine two clusters until a stopping
criterion is met.
The complexity is O(n * K * I * d), where
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes.
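A from-scratch sketch of these steps, assuming NumPy (the function and variable names are illustrative, and empty clusters are not handled, to keep the sketch short):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k objects at random as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: Euclidean distance of every object from every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: allocate each object to the cluster whose centroid is nearest.
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when cluster membership no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 5: recompute each centroid as the mean of the objects assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```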

LIMITATIONS OF K MEANS
1) The results of the method depend strongly on the initial guesses of the seeds.
2) The method can be sensitive to outliers.
3) The method does not consider the size of the clusters.
4) The method does not deal with overlapping clusters.
5) Often, the local optimum is not as good as the global optimum.
6) The method implicitly assumes spherical probability distribution.
7) The method cannot be used with categorical data.

Q) Explain the evaluation of K-means clusters


• The most common measure is the Sum of Squared Error (SSE).
• For each point, the error is its distance to the nearest cluster center.
• To get the SSE, we square these errors and sum them:

SSE = Σ (i = 1..K) Σ (x ∈ Ci) dist(mi, x)²

• x is a data point in cluster Ci and mi is the representative point for cluster Ci.
• mi corresponds to the center (mean) of the cluster.
• Given two clusterings, we can choose the one with the smaller error.
• One easy way to reduce the SSE is to increase K, the number of clusters.
• Even so, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.
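A small sketch of how the SSE could be computed for a clustering (assuming NumPy; `labels` and `centroids` are in the form returned by the k-means sketch above):

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum, over all clusters, of the squared distances from each point
    # to the centroid of the cluster it belongs to.
    total = 0.0
    for i, m in enumerate(centroids):
        members = X[labels == i]
        total += ((members - m) ** 2).sum()
    return total
```

Given two clusterings of the same data with the same K, the one with the smaller SSE would normally be preferred.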

Q) Explain the working of Bisecting K-means algorithm


A)
• It is a combination of K-means and hierarchical clustering
• Instead of partitioning the data into k clusters in each iteration, bisecting k-means splits one
cluster into two sub-clusters at each bisecting step (using basic k-means) until k
clusters are obtained.
• Note that running Bisecting K-Means with the same data does not always generate the same
result because Bisecting K-Means initializes clusters randomly.
• The number of trials specifies how many times the algorithm should repeat a split and keep the best
one. Setting it to a higher value should give better results, but the algorithm will be slower.
Splits are evaluated using the Sum of Squared Errors (SSE).
• There are a number of ways to choose which cluster to split.
• choose the largest cluster at each step
• choose the one with the largest SSE
• Use a criterion based on both size and SSE
• Different choices result in different clusters
• Because we are using the K-means algorithm “locally” to bisect individual clusters, the final
set of clusters does not necessarily represent a local minimum with respect to the total SSE: each
individual bisection is locally optimized, but the clustering as a whole need not be.
• The clusters can be improved by using the cluster centroids as initial centroids for the
standard K-means algorithm

Bisecting K-Means Algorithm:


1. Initialize the list of clusters to contain the cluster consisting of all points.
2. repeat
3. Remove a cluster from the list of clusters.
4. { Perform several “trial” bisections of the selected cluster. }
5. for i = 1 to number of trials do
6. Bisect the selected cluster using basic K-means.
7. end for
8. Select the two clusters from the bisection with the lowest total SSE (Sum of Squared Errors).
9. Add these two clusters to the list of clusters.
10. until the list of clusters contains K clusters.
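A rough sketch of this algorithm, assuming NumPy and the `kmeans()` helper from the earlier sketch (the bookkeeping and function names are illustrative, and it assumes every cluster chosen for bisection has at least two distinct points):

```python
import numpy as np

def cluster_sse(points):
    # SSE of one cluster around its own centroid.
    return ((points - points.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(X, K, n_trials=5, seed=0):
    # Step 1: the list of clusters starts with one cluster holding all points.
    clusters = [np.arange(len(X))]                  # clusters stored as index arrays
    while len(clusters) < K:                        # Step 10
        # Step 3: remove the cluster with the largest SSE for bisection.
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(X[clusters[i]]))
        idx = clusters.pop(worst)
        best_split, best_sse = None, np.inf
        # Steps 4-7: perform several trial bisections with basic 2-means.
        for t in range(n_trials):
            labels, _ = kmeans(X[idx], k=2, seed=seed + t)
            split = [idx[labels == 0], idx[labels == 1]]
            total = sum(cluster_sse(X[s]) for s in split)
            # Step 8: keep the pair of clusters with the lowest total SSE.
            if total < best_sse:
                best_split, best_sse = split, total
        # Step 9: add the two resulting clusters back to the list.
        clusters.extend(best_split)
    return clusters
```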

Q) What are the problems with selecting initial centroid points in K-means clustering? Suggest
a few solutions.
Choosing the proper initial centroids is the key step of the basic K-means procedure. A common
approach is to choose the initial centroids randomly, but the resulting clusters are often poor. When
random initialization of centroids is used, different runs of K-means typically produce different total
SSEs. In some cases, even when all the initial centroids come from one natural cluster, the
minimum-SSE clustering is still found; in other cases, even though the initial centroids seem to be
better distributed, we obtain a suboptimal clustering with higher squared error.
Solutions to initial centroid problem
1. One technique that is commonly used to address the problem of choosing initial centroids is to
perform multiple runs, each with a different set of randomly chosen initial centroids, and then select
the set of clusters with the minimum SSE. This strategy may not work very well, depending on the
data set and the number of clusters sought.
2. Another effective approach is to take a sample of points and cluster them using a hierarchical
clustering technique. K clusters are extracted from the hierarchical clustering, and the centroids of
those clusters are used as the initial centroids. This approach often works well, but it is practical
only if
(1) the sample is relatively small, and
(2) K is relatively small compared to the sample size.

3. The following procedure is another approach to selecting initial centroids. Select the first point at
random or take the centroid of all points. Then, for each successive initial centroid, select the point
that is farthest from any of the initial centroids already selected. In this way, we obtain a set of
initial centroids that is guaranteed to be not only randomly selected but also well separated.
Unfortunately, such an approach can select outliers rather than points in dense regions (clusters).
Also, it is expensive to compute the farthest point from the current set of initial centroids. To
overcome these problems, this approach is often applied to a sample of the points. Since outliers are
rare, they tend not to show up in a random sample, whereas points from every dense region are
likely to be included unless the sample size is very small. The computation involved in finding the
initial centroids is also greatly reduced because the sample size is typically much smaller than the
number of points. A sketch of this idea is given below.
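A hedged sketch of this farthest-point initialization applied to a random sample (assuming NumPy; the function name and the sample size are illustrative):

```python
import numpy as np

def farthest_first_centroids(X, k, sample_size=500, seed=0):
    rng = np.random.default_rng(seed)
    # Work on a random sample: outliers rarely show up, and the search is cheaper.
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    # First centroid: the centroid of all sampled points.
    centroids = [sample.mean(axis=0)]
    for _ in range(k - 1):
        # Each new centroid is the sample point farthest from its nearest chosen centroid.
        d = np.linalg.norm(sample[:, None, :] - np.array(centroids)[None, :, :], axis=2)
        centroids.append(sample[d.min(axis=1).argmax()])
    return np.array(centroids)
```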

4. More recently, a new approach for initializing K-means, called K-means++, has been developed.
This procedure is guaranteed (in expectation) to find a K-means clustering solution whose SSE is
within a factor of O(log K) of the optimal SSE, and in practice it produces noticeably better
clusterings with lower SSE.
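In practice, K-means++ initialization is available off the shelf; for example, scikit-learn's KMeans uses it by default. A brief illustration, assuming scikit-learn and NumPy (the data is invented):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # illustrative data

# init="k-means++" spreads the initial centroids out probabilistically,
# which typically yields a lower final SSE than purely random seeding.
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # final SSE
```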

Q) List and explain the important issues concerned with respect to cluster validation
A)
The following is a list of several important issues for cluster validation.
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random
structure actually exists in the data.
2. Determining the correct number of clusters.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external
information.
4. Comparing the results of a cluster analysis to externally known results, such as externally
provided class labels.
5. Comparing two sets of clusters to determine which is better.

Q) Explain the evaluation measures, or indices, that are applied to judge various
aspects of cluster validity
or
Why do we need cluster validity indices? Explain internal and external validity indices with
examples.
A) We need cluster validity indices for the following reasons:
• To compare clustering algorithms.
• To compare two sets of clusters.
• To compare two clusters, i.e., to decide which one is better in terms of compactness and connectedness.
• To determine whether the structure found in the data is real or merely the result of noise and
random variation.

Generally, cluster validity measures are categorized into three classes:
1. Unsupervised (internal validity indices): The clustering result is evaluated based on the clustered
data itself (internal information), without reference to external information. An example is the SSE.
Unsupervised measures of cluster validity are often further divided into two classes:
• measures of cluster cohesion (compactness, tightness), which determine how closely related the
objects in a cluster are;
• measures of cluster separation (isolation), which determine how distinct or well-separated a cluster
is from other clusters.
Unsupervised measures are often called internal indices because they use only information present
in the data set.
2. Supervised (external validity indices): Clustering results are evaluated based on some
externally known result, such as externally provided class labels. An example of a supervised index
is entropy, which measures how well cluster labels match the externally supplied class labels.
Supervised measures are often called external indices because they use information not present in
the data set.
3. Relative: The clustering results are evaluated by varying different parameters for the same
algorithm (e.g., changing the number of clusters). As an example, two K-means clusterings
can be compared using either the SSE or entropy.
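As an illustration of a supervised (external) index, the sketch below computes the entropy of a clustering against externally supplied class labels (assuming NumPy; the tiny example labels are made up). Lower entropy means the clusters match the classes more purely:

```python
import numpy as np
from collections import Counter

def clustering_entropy(cluster_labels, class_labels):
    # Weighted average, over clusters, of the entropy of the class distribution
    # inside each cluster; 0 means every cluster contains only a single class.
    n = len(cluster_labels)
    total = 0.0
    for c in set(cluster_labels):
        classes_in_c = [class_labels[i] for i in range(n) if cluster_labels[i] == c]
        p = np.array(list(Counter(classes_in_c).values())) / len(classes_in_c)
        total += (len(classes_in_c) / n) * -(p * np.log2(p)).sum()
    return total

print(clustering_entropy([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))  # ~0.55
```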

Q) How are density-based methods used for clustering? Explain.


Or
What is DBSCAN? Explain the basic DBSCAN algorithm with example
DENSITY-BASED METHODS
• A cluster is a dense region of points, which is separated by a region of lower density from other
regions of high density.
• Typically, for each data-point in a cluster, at least a minimum number of points must exist
within a given radius.
• Data that is not within such high-density clusters is regarded as outliers or noise.
• For example: DBSCAN (Density Based Spatial Clustering of Applications with Noise).
DBSCAN
• It requires 2 input parameters:
1) Size of the neighborhood (R) &
2) Minimum points in the neighborhood (N).
• The point-parameter N determines the density of acceptable-clusters & also determines which
objects will be labeled outliers or noise.
• The size-parameter R determines the size of the clusters found. If R is big enough, there will be
one big cluster and no outliers. If R is small, there will be small dense clusters and there might be
many outliers.
• We define a number of terms :
1. Neighborhood: The neighborhood of an object y is defined as all the objects that are within the
radius R from y.
2. Core-object: An object y is called a core-object if there are at least N objects within its neighborhood.
3. Proximity: Two objects are defined to be in proximity to each other if they belong to the same
cluster.
Object x1 is in proximity to object x2 if two conditions are satisfied:
i) the objects are close enough to each other, i.e., within a distance of R;
ii) x2 is a core object.
4. Connectivity: Two objects x1 and xn are connected if there is a chain of
objects x1, x2, ..., xn from x1 to xn such that each xi+1 is in proximity to object xi.
DBSCAN ALGORITHM
1. Select values of R and N.
2. Arbitrarily select an object p.
3. Retrieve all objects that are connected to p, given R and N.
4. If p is a core object, a cluster is formed.
5. If p is a border object, no objects are in its proximity.
Choose another object. Go to step 3.
6. Continue the process until all of the objects have been processed.
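A brief usage sketch with scikit-learn's DBSCAN implementation (an assumption; the data is invented). Here eps plays the role of R and min_samples the role of N, and points labeled -1 are the outliers/noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: two dense blobs plus one far-away noise point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0.0, 0.0), 0.2, size=(40, 2)),
               rng.normal((3.0, 3.0), 0.2, size=(40, 2)),
               [[10.0, 10.0]]])

# eps corresponds to R (neighborhood radius), min_samples to N.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)  # cluster id per point; noise points are labeled -1
```
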
Q) What are the features of cluster analysis?
Ans:
DESIRED FEATURES OF CLUSTER ANALYSIS METHOD :
Scalability
• Data-mining problems can be large.
• Therefore, a cluster-analysis method should be able to deal with large problems gracefully.
• The method should be able to deal with datasets in which number of attributes is large.
Only One Scan of the Dataset
• For large problems, the data must be stored on disk, so the cost of disk I/O becomes significant in
solving the problem. Therefore, the method should not require more than one scan of the data on disk.
Ability to Stop & Resume
• For large datasets, cluster analysis may require huge processor time to complete the task.
Therefore, it should be possible to stop the task and then resume it as and when required.
Robustness
• Most data obtained from a variety of sources has errors. Therefore, the method should be able to
deal with i) noise, ii) outlier & iii) missing values gracefully.
Ability to Discover Different Cluster-Shapes
• Clusters appear in different shapes, and not all clusters are spherical. Therefore, the method should be
able to discover cluster shapes other than spherical.
Different Data Types
• Many problems have a mixture of data types, for e.g. numerical, categorical & textual.
• Therefore, the method should be able to deal with
i) Numerical data
ii) Boolean data &
iii) Categorical data.
