Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 9

Experiment No.

Aim : Write a program to implement Clustering algorithm using Map reduce.


Lab Outcome No. : 8.ITL 801.4
Lab Outcome : Construct algorithms for Clustering and finding association in Big Data.
Date of Performance: 8/4/22
Date of Submission: 13/4/22

Program Documentation Timely Viva Experiment Teacher


formation/ (02) Submission Answer Marks (15) Signature
Execution / (03) (03) with date
Ethical
practices (07 )
EXPERIMENT NO : 7

AIM : Write a program to implement Clustering algorithm using Map reduce.

THEORY :
K-means is one of the simplest unsupervised learning algorithms that solve the well known
clustering problem. The procedure follows a simple and easy way to classify a given data set
through a certain number of clusters (assume k clusters) fixed apriori.
The main idea is to define k centers, one for each cluster. These centers should be placed in a
cunning way because different locations cause different results. So, the better choice is to
place them as much as possible far away from each other. The next step is to take each point
belonging to a given data set and associate it to the nearest center. When no point is
pending, the first step is completed and an early group age is done. At this point we need to
re-calculate k new centroids as barycenters of the clusters resulting from the previous step.
After we have these k new centroids, a new binding has to be done between the same data set
points and the nearest new center. A loop has been generated.
As a result of this loop we may notice that the k centers change their location step by step
until no more changes are done or in other words centers do not move any more. Finally, this
algorithm aims at minimizing an objective function know as squared error function given by:

where,
‘||x i - vj||’ is the Euclidean distance between x i and v j.
‘c i’ is the number of data points in ith cluster.
‘c’ is the number of cluster centers.

K-means clustering is a classical clustering algorithm that uses an expectation maximization


like technique to partition a number of data points into k clusters.
K-means clustering is commonly used for a number of classification applications. Because k-
means is run on such large data sets, and because of certain characteristics of the algorithm, it
is a good candidate for parallelization.
The goal is to implement a framework in java for performing k-means clustering using
Hadoop MapReduce. In this problem, we have considered inputs a set of n 1-dimensional
points and desired clusters of size 3. Once the k initial centers are chosen, the distance is
calculated (Euclidean distance) from every point in the set to each of the 3 centers & points
with the corresponding center emitted by the mapper. Reducer collects all of the points of a
particular centroid and calculates a new centroid and emits.
Termination Condition : When difference between old and new centroid is less than or equal
to 0.1
ALGORITHM :
Step 1 : Initially randomly centroid is selected based on data. In our implementation we used
3 centroids.
Step 2 : The Input file contains initial centroid and data.
Step 3 : In the Mapper class the "configure" function is used to first open the file and read the
centroids and store them in the data structure(we used ArrayList).
Step 4 : Mapper read the data file and emit the nearest centroid with the point to the reducer.
Step 5 : Reducer collects all this data and calculates the new corresponding centroids and
emit.
Step 6 : In the job configuration, we are reading both files and checking if the difference
between old and new centroid is less than 0.1 then Convergence is reached else Repeat step 2
with new centroids.

CONCLUSION :
K-means Clustering algorithm is used using map reduce to reduce runtime for processing
large data sets (petabytes of data). By using map reduce in k-means algorithm it uses
commodity hardware and storage for large data.

You might also like