
Practical 10:-

CLUSTERING MODEL
A. CLUSTERING ALGORITHMS FOR UNSUPERVISED
CLASSIFICATION.
B. PLOT THE CLUSTER DATA USING R VISUALIZATIONS
Supervised and Unsupervised learning
There are two types of learning in data analysis: supervised and unsupervised learning.
Supervised learning – Labeled data is given as input to the machine, which learns from it. Regression, classification, decision trees,
etc. are supervised learning methods.
Example of supervised learning:
Simple linear regression has only one independent variable. Equation: y = mx + c, where y depends on x.
Eg: The age and circumference of a tree are the two labels in the input dataset; after learning from the dataset it was fed, the machine
must predict the age of a tree given its circumference as input. The age depends on the circumference.
The learning is thus supervised on the basis of the labels.
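The tree example above can be sketched in R with the built-in lm() function (the measurements below are invented purely for illustration):

```r
# Hypothetical labeled data: circumference in cm, age in years
circumference <- c(30, 55, 80, 110, 140)
age <- c(5, 10, 16, 22, 28)

# Fit y = mx + c with age as the dependent variable
model <- lm(age ~ circumference)

# Predict the age of a new tree from its circumference
predict(model, newdata = data.frame(circumference = 95))
```

Because the machine is given both labels (circumference and age), this is supervised learning.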
Unsupervised learning – Unlabeled data is fed to the machine, which finds patterns on its own. Clustering is an unsupervised
learning method whose models include k-means, hierarchical clustering, DBSCAN, etc.
Visual representation of clusters shows the data in an easily understandable format, as it groups elements of a large dataset
according to their similarities, which makes analysis easy. Unsupervised learning is not always accurate, though, and is a
complex process for the machine because the data is unlabeled.
Clustering in R
It is basically a type of unsupervised learning method. An unsupervised learning method is a method in which
we draw references from datasets consisting of input data without labeled responses. Generally, it is used as a
process to find meaningful structure, explanatory underlying processes, generative features, and groupings
inherent in a set of examples. 
Clustering is the task of dividing the population or data points into a number of groups such that data points
in the same groups are more similar to other data points in the same group and dissimilar to the data points in
other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them. 
Clustering Methods
Density-Based Methods: These methods consider clusters as dense regions of the space that differ from its lower-density
regions. They have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on the
hierarchy. New clusters are formed using previously formed ones. It is divided into two categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
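As a minimal sketch, agglomerative (bottom-up) hierarchical clustering is available in base R through hclust(), shown here on the built-in mtcars dataset:

```r
# Agglomerative hierarchical clustering on the built-in mtcars data
d <- dist(scale(mtcars))              # Euclidean distance matrix on scaled data
hc <- hclust(d, method = "complete")  # bottom-up merging of clusters
plot(hc)                              # dendrogram shows the tree-type structure
groups <- cutree(hc, k = 3)           # cut the tree into 3 clusters
table(groups)                         # cluster sizes
```

Cutting the dendrogram at different heights yields different numbers of clusters, which is how new clusters are formed from previously formed ones.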

Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster.
They optimize an objective criterion, typically a distance-based similarity function. Examples: K-means,
CLARANS (Clustering Large Applications based upon Randomized Search), etc.
Grid-based Methods: In these methods, the data space is formulated into a finite number of cells that form a grid-
like structure. All clustering operations done on these grids are fast and independent of the number of data
objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
Methods of Clustering

There are 2 types of clustering in R programming: 


Hard clustering: In this type of clustering, each data point either belongs to a cluster completely or
not at all, and is assigned to exactly one cluster. A typical algorithm used for hard clustering is k-
means clustering.
Soft clustering: Instead of putting each data point into exactly one cluster, a probability or
likelihood of belonging to each cluster is assigned to every data point, so each data point exists in
all clusters with some probability. Typical algorithms for soft clustering are fuzzy clustering
methods such as fuzzy (soft) k-means.
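As a minimal sketch of soft clustering, the fanny() function from the cluster package (a fuzzy clustering method) assigns each point a membership degree in every cluster, shown here on the iris petal measurements:

```r
library(cluster)

# Fuzzy (soft) clustering: each point gets a membership degree per cluster
fc <- fanny(iris[, 3:4], k = 3)
head(fc$membership)   # membership degrees sum to 1 across clusters per row
head(fc$clustering)   # hard assignment derived from the highest membership
```

Contrast this with kmeans(), where each point receives exactly one cluster label and no membership probabilities.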
K-Means Clustering in R Programming language
K-Means is an iterative hard clustering technique that uses an unsupervised learning algorithm. The total number of clusters
is pre-defined by the user, and the data points are clustered based on their similarity. The algorithm also finds
the centroid of each cluster.
Algorithm:
Specify the number of clusters (K): let us take an example of k = 2 and 5 data points.
Randomly assign each data point to a cluster: for instance, red and green colors could mark the 2 clusters with their
respective randomly assigned data points.
Calculate the cluster centroids: a cross mark would represent the centroid of the corresponding cluster.
Re-allocate each data point to its nearest cluster centroid: a green data point is reassigned to the red cluster if it is nearer to the
centroid of the red cluster.
Re-compute the cluster centroids, and repeat the last two steps until the assignments no longer change.
Syntax: kmeans(x, centers, nstart)
where,
x – a numeric matrix or data frame object
centers – the K value, i.e. the number of distinct cluster centers
nstart – the number of random starting sets to be chosen
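A minimal sketch of this syntax, run on the built-in iris petal measurements:

```r
set.seed(42)                          # make the random starts reproducible
df <- scale(iris[, 3:4])              # x: a numeric matrix (petal length/width)
km <- kmeans(df, centers = 3, nstart = 25)  # centers = K, nstart = random sets
km$centers                            # centroids found by the algorithm
head(km$cluster)                      # cluster assignment of each data point
```

A larger nstart tries more random initial centroid sets and keeps the best result, making the clustering less sensitive to the random start.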
Applications of Clustering in R Programming Language

Marketing: In R programming, clustering is helpful in the marketing field. It helps in finding market
patterns and thus the likely buyers. Learning the interests of customers through clustering and showing
them products matching those interests can increase the chance of a purchase.
Medical Science: In the medical field, new medicines and treatments are developed on a daily
basis. Sometimes, new species are also discovered by researchers and scientists. Their category can easily be
determined by using a clustering algorithm based on their similarities.
Games: A clustering algorithm can also be used to recommend games to a user based on their interests.
Internet: A user browses many websites based on their interests. The browsing history can be aggregated
and clustered, and the user's profile is generated from the clustering results.
Example
# Library required for fviz_cluster function
install.packages("factoextra")
library(factoextra)

# Loading dataset
df <- mtcars

# Omitting any NA values
df <- na.omit(df)

# Scaling dataset
df <- scale(df)

# output to be present as PNG file
png(file = "KMeansExample.png")

km <- kmeans(df, centers = 4, nstart = 25)

# Visualize the clusters
fviz_cluster(km, data = df)

# saving the file
dev.off()

# output to be present as PNG file
png(file = "KMeansExample2.png")

km <- kmeans(df, centers = 5, nstart = 25)

# Visualize the clusters
fviz_cluster(km, data = df)

# saving the file
dev.off()
Another Example
Step 1
The iris dataset has 5 columns, namely Sepal Length, Sepal Width, Petal Length, Petal
Width, and Species. Iris is a flower, and this dataset records 3 of its species: Setosa, Versicolor,
and Virginica. We will cluster the flowers according to their species.
data("iris")
head(iris) #will show top 6 rows only
x=iris[,3:4] #using only petal length and width columns
head(x)
Step 2

The next step is to separate the 3rd and 4th columns into a separate object x, since we are using an
unsupervised learning method. We remove the labels so that the machine performs clustering
unsupervised, using only the petal length and petal width columns.
x=iris[,3:4] #using only petal length and width columns
head(x)
Step 3
The next step is to apply the K-Means algorithm. The kmeans() function takes as parameters the data and the number of
clusters or groups. Here our data is the x object, and we use k = 3 clusters as there are 3 species in the dataset.

Then the 'cluster' package is loaded. Its clusplot() function creates a 2D graph of the clusters and performs the
necessary mathematics. (The kmeans() function itself is part of base R.)
model=kmeans(x,3)
library(cluster)
clusplot(x,model$cluster)
Component 1 and Component 2 seen in the graph are the first two components of PCA (Principal Component
Analysis), a feature-extraction method that keeps the most important components and removes the rest.
It reduces the dimensionality of the data so that the clusters can be plotted in 2D. All of this is done by the
cluster package itself in R.
These two components explain 100% of the variability in the plot because the data object x contains only two
variables (petal length and petal width), so projecting onto two principal components loses no information; the
clusters formed by K-Means are clearly separated, with minimum (negligible) overlap amongst them.
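This can be checked directly with prcomp(), base R's PCA function, on the same two-column object:

```r
x <- iris[, 3:4]                  # petal length and petal width only

# PCA on the two petal variables
pca <- prcomp(x, scale. = TRUE)
summary(pca)                      # proportion of variance per component

# Two variables yield two components, so the proportions sum to 1 (100%)
pca$sdev^2 / sum(pca$sdev^2)
```

With only two input variables, the two components together always account for all of the variance, which is why clusplot() reports 100% point variability here.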
Step 4
The next step is to assign different colors to the clusters and shade them, so we use the color
and shade parameters, setting them to T, which means TRUE.

clusplot(x,model$cluster,color=T,shade=T)
