
Map-Reduce (Hadoop) based Data Clustering for BigData: A Survey

A. J. Umbarkar1, S. A. Jadhav2, P. D. Sheth3 and A. S. Sathe4
Walchand College of Engineering, Sangli, Maharashtra, India
1anantumbarkar@gmail.com

Abstract- Data clustering problems are challenging for big data. The volume, value, variety, velocity, veracity, and structure of the data make clustering more complex, and the processing time for data mining is directly proportional to the amount of data to be processed. To access big data efficiently, data clustering may be a good solution. Traditional data clustering has been considered an NP-Hard problem, and the clustering of big data comes under the same roof. Clustering algorithms require huge computation to build clusters of data, and an increase in data increases the number of computations. Hence, research focuses on the use of distributed computing for clustering algorithms and on improving resource utilization. However, distributed computing gives poor results when the data has dependencies. Distributed systems like Hadoop run a clustering algorithm as the same code on partitioned data on different computing nodes and finally gather the results in one place. This paper gives a survey of Map-Reduce based big data clustering algorithms and the problems they solve.

Keywords- Big-data, Hadoop, Spark, Data Clustering, Map-Reduce.
I. INTRODUCTION
Data storage has become cost-efficient and faster, but accessing storage devices remains a bottleneck due to limited I/O speed. Data can be structured or unstructured, with unstructured data lacking a predefined data model. Database management systems (DBMS) are effective in handling structured data. However, when data surpasses processing capacity or system RAM, it becomes big data, which requires distributed processing across multiple computing nodes. This consumes more computational power and storage resources [1]. Hadoop and Spark are commonly used solutions for big data processing and clustering.
By combining similar data and separating it from unrelated data, the concept of clustering makes processing data more straightforward. In order to generate clusters, a similarity matrix must be created and comparable items must be chosen. Clustering belongs to the class of NP-Hard problems [2]. In several disciplines, including artificial intelligence, data mining, and image analysis, clustering is a critical task. However, because of the data's increased complexity and size, clustering large data can be difficult and time-consuming. Parallelization can help, although data dependency presents a problem that must be overcome because it lowers performance. A simple programming framework that provides a high degree of abstraction for parallelization, data distribution, and load balancing is required to handle these problems. To maximize the use of all resources, a fully parallel processing technique is also necessary.
To achieve faster clustering of big data, distributed environments and the Map-Reduce architecture are helpful. The Map-Reduce library provides an abstraction layer that simplifies parallelization, fault tolerance, data distribution, and load balancing for programmers [3]. Developing algorithms suitable for big data processing can be challenging, and the Hadoop framework has been developed for this purpose. Hadoop is a distributed environment architecture that allows remote computers to communicate and coordinate their actions using messages. HDFS is closely integrated with Map-Reduce for efficient big-data processing [4].

The architecture of Map-Reduce is intended to enable the distribution of programming and the processing of massive datasets. It consists of three primary phases: Map, Shuffle, and Reduce. The Map function is responsible for mapping data to different nodes in a distributed environment, while the Reduce function performs a summary operation and gathers all results in one location. The infrastructure of Map-Reduce manages parallel processing, data transfers, and communication among different parts of the system, while also ensuring fault tolerance and redundancy [5].
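As a minimal illustration of this three-phase model, the following Python sketch simulates Map, Shuffle, and Reduce in memory for a word-count job. The phase functions and the driver line are illustrative assumptions, not the Hadoop API itself; in Hadoop, only the map and reduce logic is user code, while the framework performs the shuffle and handles distribution and fault tolerance.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit (key, value) pairs from each input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate pairs by key, as the framework does."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    """Reduce: apply a summary operation to all values sharing a key."""
    for key, values in grouped:
        yield key, sum(values)

records = ["big data needs clustering", "clustering big data"]
print(dict(reduce_phase(shuffle_phase(map_phase(records)))))
# {'big': 2, 'clustering': 2, 'data': 2, 'needs': 1}
```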
II. LITERATURE REVIEW

There are various clustering techniques/algorithms to distinguish dissimilar data. A brief taxonomy of clustering algorithms is given below.

The K-means clustering algorithm creates "hard" clusters by dividing M points in N dimensions into K clusters and reducing the sum of squares of distances between points within each cluster [6], [7]. In contrast, the "soft" Fuzzy C-means clustering algorithm assigns a degree of membership to each feature vector in each cluster [7], [8]. Hierarchical clustering can be either "Bottom-Up" (Agglomerative) or "Top-Down" (Divisive) and uses metrics such as Euclidean distance, squared Euclidean distance, and Manhattan distance to calculate proximity [9]. Canopy Clustering is often used as a pre-processing step for other clustering algorithms, and variations of K-means, such as K-medians and K-medoids, also exist [10].
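The difference between "hard" and "soft" assignment can be made concrete with a short sketch using Euclidean distance as the proximity metric (illustrative code, not taken from the surveyed papers):

```python
import numpy as np

def kmeans_assign(X, centers):
    """Hard assignment: each point belongs only to its closest center."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def fcm_memberships(X, centers, m=2.0):
    """Soft assignment: a degree of membership in every cluster."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
print(kmeans_assign(X, C))    # [0 0 1]: one label per point
print(fcm_memberships(X, C))  # graded memberships in (0, 1)
```

Here m is the usual fuzzifier; as m approaches 1, the memberships approach hard assignments.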
Anchalia, Koundinya, and Srinath [11] focused on data mining and aimed to improve information retrieval speed through clustering. They proposed a parallel implementation architecture to speed up data clustering. The authors emphasized the importance of handling outliers and suggested improvements in stopping criteria and cluster initialization for better results. They provided the algorithm and implementation details for K-means on the Map-Reduce architecture.
Zhao, Ma, and He [12] observed that clustering data has become challenging with increasing data volume and that a more scalable clustering approach is required. Their proposed solution for dealing with the challenge of clustering large datasets is a parallel K-means algorithm that uses Map-Reduce. In this algorithm, the map function assigns each sample to its closest center, while the reduce function updates the center values. A combiner function reduces network communication, and the dataset is divided and broadcast to all mappers. To enable each mapper to calculate the closest center point for each data point, K-means maintains a globally shared set of centers for every map task. The intermediate values generated during the algorithm's execution include the index of the closest center point and related information.
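A minimal sketch of one such iteration is given below, with the mapper, combiner, and reducer written as plain Python functions. The function names and the in-memory driver are illustrative assumptions; in [12] these roles run as Hadoop tasks over HDFS splits.

```python
import numpy as np

def mapper(points, centers):
    """Map: emit (index of closest center, (sample, 1)) for each sample."""
    for p in points:
        idx = int(np.argmin([np.linalg.norm(p - c) for c in centers]))
        yield idx, (p, 1)

def combiner(pairs):
    """Combine: pre-sum samples per center locally to cut network traffic."""
    acc = {}
    for idx, (p, n) in pairs:
        s, c = acc.get(idx, (0.0, 0))
        acc[idx] = (s + p, c + n)
    yield from acc.items()

def reducer(pairs):
    """Reduce: merge partial sums and recompute each center as a mean."""
    acc = {}
    for idx, (s, n) in pairs:
        t, c = acc.get(idx, (0.0, 0))
        acc[idx] = (t + s, c + n)
    return [s / c for _, (s, c) in sorted(acc.items())]

# One iteration over two input splits; a driver would repeat this
# until the centers stop moving.
centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
splits = [np.random.randn(50, 2), np.random.randn(50, 2) + 5.0]
partials = [kv for split in splits for kv in combiner(mapper(split, centers))]
centers = reducer(partials)
```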
Ene, Im, and Moseley [13] experimented with designing Map-Reduce algorithms for clustering, specifically for the k-median and k-center problems. They evaluated various serial and parallel algorithms for the k-median problem and proposed a solution to reduce the overhead of storing extra information for each un-sampled point. The solution involves selecting the nearest sampled point and assigning additional weight to it based on the number of un-sampled points it is closest to. However, assigning weight to each point in the sample adds extra time to the Map-Reduce K-Median algorithm, which needs to be addressed. The authors acknowledge that the Map-Reduce architecture is not the best model for iterative programming.
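The weighting idea can be pictured with a small in-memory sketch (a toy version; in [13] this bookkeeping happens inside Map-Reduce rounds, which is exactly the overhead the authors discuss):

```python
import numpy as np

def weight_samples(points, sample_idx):
    """Give each sampled point a weight of 1 plus the number of
    un-sampled points for which it is the nearest sampled point."""
    samples = points[sample_idx]
    weights = np.ones(len(sample_idx))
    for i, p in enumerate(points):
        if i in sample_idx:
            continue
        nearest = int(np.argmin(np.linalg.norm(samples - p, axis=1)))
        weights[nearest] += 1
    return samples, weights

points = np.random.rand(1000, 2)
samples, weights = weight_samples(points, sample_idx=[0, 10, 20, 30])
# A weighted k-median solver now works on 4 weighted points, not 1000.
```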
Esteves and Rong [14] experimented on a large, realistic, and noisy dataset, the latest Wikipedia articles, using two data clustering algorithms, namely k-means and FCM (fuzzy c-means), and evaluated the results using Apache Mahout, a free cloud computing solution. They demonstrated that dimensionality reduction plays a critical role in document clustering and that FCM performs worse than k-means in the presence of noise. The convergence speed of both algorithms is affected by the initialization of the cluster centers, and different initialization methods provide varying convergence times for large datasets. In general, FCM is faster than k-means, but random initialization can yield different results, and it is difficult to predict which algorithm will be faster.
Lou, Li, and Liu [15] proposed a solution to improve the performance of the fuzzy C-means (FCM) algorithm by introducing a distance regularity factor. The use of Euclidean distance as the similarity measurement criterion in the conventional FCM algorithm can lead to unequal data distribution and reduced clustering performance, especially when there are variations in cluster shape and density. The distance regularity factor was designed to overcome these limitations by ensuring that correct similarity measurements are assigned when calculating similarity criteria between the cluster center and sample points. The factor takes into account the cluster density, which provides information on the global distribution of points within a cluster. It was applied to the conventional FCM algorithm to correct the distance measurement, and the results showed good performance across various clustering densities.
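The flavor of such a correction can be sketched as follows. The exact form of the regularity factor in [15] differs, so the density proxy and the scaling below are illustrative assumptions only:

```python
import numpy as np

def cluster_densities(X, memberships, centers):
    """A crude per-cluster density proxy: membership mass / weighted spread."""
    dens = []
    for j, c in enumerate(centers):
        w = memberships[:, j]
        spread = np.sqrt(np.average(np.sum((X - c) ** 2, axis=1), weights=w))
        dens.append(w.sum() / (spread + 1e-12))
    return np.asarray(dens)

def regularized_distance(x, center_j, density_j, mean_density):
    """Scale the Euclidean distance so that denser clusters appear
    relatively closer, compensating for the bias of the raw criterion."""
    return (mean_density / (density_j + 1e-12)) * np.linalg.norm(x - center_j)
```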
Xie, Yin, Ruan, Ding, Tian, Majors, Manzanares, and Qin [16] demonstrate that neglecting data locality in heterogeneous environments can decrease Map-Reduce performance. Although the literature focuses on clustering using Map-Reduce, the authors argue that data clustering is a prerequisite for improving Map-Reduce performance. They highlight the challenge of distributing data in a distributed environment under the Map-Reduce architecture. Instead of distributing random data to random nodes, the authors propose performing data clustering first, so that similar data is grouped together and transferred to the same node. Data clustering can enhance data locality and improve performance for data-intensive applications by balancing data distribution across nodes.

Cordeiro, Traina, Traina, López, Kang, and Faloutsos [17] identify bottlenecks as a problem when using Map-Reduce for data clustering and propose minimizing I/O costs by considering existing data partitions that reduce network costs between processing nodes. To address this, they introduce the "Best of Both Worlds" (BOW) strategy, which uses a cost function to select the best approach and achieves good results without user-defined parameters. The approach was evaluated on both real and synthetic data with billions of points, utilizing up to 1,024 cores in parallel.
Ekanayake, Li, Zhang, Gunarathne, Bae, and Fox [18] proposed Twister, a distributed in-memory Map-Reduce runtime, with the goal of optimizing iterative computations over Map-Reduce. They presented the extended programming model and architecture of Twister and compared it with Map-Reduce in terms of performance on various applications, including data clustering, computer vision, and machine learning. The experiments showed that Twister performed well and scaled effectively for many iterative Map-Reduce computations.

Pal, Pal, Keller, and Bezdek [19] proposed the possibilistic fuzzy c-means (PFCM) clustering algorithm, which is less sensitive to outliers and avoids coincident clusters. PFCM is more suitable for clustering overlapping big data datasets.
Heidari, Alborzi, Radfar, Afsharkazemi, and Ghatari [20] proposed a Map-Reduce based framework for clustering big data that addresses the challenge of varied densities using an efficient and scalable parallel method. They proposed a new algorithm to overcome the problem of varying densities in density-based clustering and designed a three-layer Map-Reduce based algorithm on top of the Hadoop platform to analyze clusters with a variety of densities. Their approach improved scalability and execution time while focusing on the local density of points.

Gerakidis, Megarchioti, and Mamalis [21] experimented with two clustering techniques for document clustering on large-scale data, namely a K-Means-based fast clustering technique called Big K-Clustering and a hybrid clustering approach. They adapted these techniques for use in the Map-Reduce model.
Liu, He, He, Zhang, and Guizani [22] propose a new segmentation algorithm for agricultural image processing, designed to handle large datasets using the Apache Spark framework, and provide experimental results on several real-world datasets.

Table 1 shows the pros and cons of various Map-Reduce (Hadoop) based data clustering articles.

TABLE I. PROS AND CONS OF MAP-REDUCE BASED DATA CLUSTERING

[12] Parallel k-means clustering based on Map-Reduce
Pros:
- Proposes a Map-Reduce based parallel K-means technique that can handle massive amounts of data and be easily scaled up.
- Introduces several optimizations to the algorithm to improve its performance.
- Compares the proposed algorithm against existing cutting-edge parallel K-means algorithms and provides a thorough evaluation on a sizable dataset.
Cons:
- Assumes that the Map-Reduce framework is available.
- The proposed algorithm is specifically designed for K-means.
- The proposed algorithm is evaluated on a single dataset.

[13] Fast clustering using Map-Reduce
Pros:
- The algorithm is designed to minimize data movement and network communication, which can significantly improve the overall performance of the clustering process.
- Provides a comprehensive evaluation of the proposed algorithm on various datasets and shows that it outperforms several other state-of-the-art clustering algorithms.
Cons:
- Evaluation is based on a limited set of datasets; further testing on other datasets and platforms is needed to confirm the generalizability of the proposed algorithm.

[14] A comparison of K-means versus fuzzy C-means in the cloud, using Mahout to cluster Wikipedia's latest articles
Pros:
- Compares the performance of two clustering algorithms, K-means and fuzzy C-means, for clustering Wikipedia's latest articles using the Mahout machine learning library in a cloud environment.
- Highlights the advantages and limitations of each algorithm and provides insights into their strengths and weaknesses in different clustering scenarios.
Cons:
- Evaluation is limited to a specific dataset (Wikipedia articles); further testing on other datasets and platforms may be needed to confirm the generalizability of the results.

[15] Fuzzy C-means clustering enhancement based on cluster density
Pros:
- Introduces an improved version of the Fuzzy C-means clustering algorithm.
- The proposed algorithm considers both the distance between data points and the density of data points in the cluster when assigning membership.
Cons:
- Experimental evaluations are conducted on a limited number of datasets, which may not be representative of all clustering scenarios.

[16] Improving Map-Reduce performance through data placement in heterogeneous Hadoop clusters
Pros:
- Proposes a data placement strategy to improve Map-Reduce performance in heterogeneous Hadoop clusters.
- The proposed strategy considers the heterogeneity of nodes, data size, and network bandwidth to optimize data placement.
- Experiments demonstrate that the proposed strategy can significantly improve Map-Reduce performance in heterogeneous Hadoop clusters compared to baseline approaches.
Cons:
- The proposed strategy assumes a static workload, which may not be applicable to dynamic workloads.
- The paper lacks a detailed comparison with other data placement strategies in the literature.

[17] Clustering very large multi-dimensional datasets using Map-Reduce
Pros:
- Proposes a Map-Reduce based clustering algorithm for very large multi-dimensional datasets.
- The proposed algorithm is scalable and can handle datasets with billions of points and dimensions.
Cons:
- The proposed algorithm has high computational overhead due to the use of Map-Reduce, which may not be suitable for real-time clustering applications.
- The paper does not discuss the performance impact of the choice of Map-Reduce platform and configuration on the proposed algorithm.

[18] Twister: a runtime for iterative Map-Reduce
Pros:
- Proposes Twister, a runtime for iterative Map-Reduce applications that can reduce overhead and improve performance compared to traditional Map-Reduce.
- Supports multiple programming languages and can run on various platforms, including Hadoop and Mesos.
- Experiments demonstrate that Twister outperforms traditional Map-Reduce for iterative applications, including PageRank, k-means, and support vector machines.
Cons:
- The paper lacks a comparison with other iterative Map-Reduce runtimes in the literature, which may limit the generalizability of the findings.
- The proposed runtime may not be suitable for non-iterative Map-Reduce applications, which are common in big data processing.

[19] A possibilistic fuzzy c-means clustering algorithm
Pros:
- Proposes a new clustering algorithm that addresses some limitations of existing algorithms, such as sensitivity to noise and difficulty in handling overlapping clusters.
- Introduces a new parameter called the "possibilistic exponent" to control the degree of overlap between clusters.
Cons:
- Experimental evaluation may not be comprehensive enough to fully assess the strengths and weaknesses of the algorithm, and it is unclear how sensitive the results are to the choice of parameters.
- The paper does not discuss the computational complexity of the algorithm or provide any implementation details, which may limit its practical applicability.

[20] Map-Reduce based big data clustering with varying densities
Pros:
- Proposes a clustering algorithm for big data based on the VDBSCAN algorithm and the Map-Reduce framework.
- Experiments demonstrate that the proposed algorithm outperforms traditional clustering algorithms, including DBSCAN and K-means, on various large datasets.
Cons:
- The proposed algorithm assumes that the data can be partitioned into equal-sized chunks, which may not always be possible or efficient in practice.
- The paper does not discuss the performance impact of the choice of Map-Reduce platform and configuration on the proposed algorithm.

[21] Big text data clustering techniques using Hadoop and Spark
Pros:
- Proposes two efficient text data clustering algorithms, based on the K-Means and Affinity Propagation algorithms, implemented using Hadoop and Spark.
- Experiments demonstrate that the proposed algorithms can handle large text datasets and achieve high clustering accuracy compared to traditional clustering algorithms.
Cons:
- The proposed algorithms are limited to text data.
- The paper does not discuss the trade-off between performance and ease of use of the proposed algorithms compared to other text data clustering algorithms.

[22] Agricultural image segmentation using fuzzy c-means
Pros:
- Proposes a new segmentation algorithm for agricultural image processing.
- Provides test results on a number of real-world datasets.
Cons:
- The paper does not discuss the limitations or potential drawbacks of the proposed algorithm, such as the trade-offs between accuracy and computational complexity or the sensitivity to noise and outliers.

III. DISCUSSION

Overlapping datasets present a challenge for data clustering because of their mixed characteristics and the limitations of hard clustering. Soft approaches such as Fuzzy C-means are more suitable for clustering overlapping big data datasets. The Map-Reduce (Hadoop) architecture is useful for distributed computing with automatic load balancing when handling large datasets. Researchers face the future challenge of solving data clustering problems using fuzzy clustering with Map-Reduce for voluminous overlapping datasets, or big data.
When using the Fuzzy C-means algorithm for soft clustering on large amounts of data through Map-Reduce, several factors must be considered carefully. These include choosing the right method for initializing cluster center points, selecting efficient criteria for measuring similarity to speed up the convergence of the algorithm, determining appropriate stopping criteria, and correctly placing data in a heterogeneous network as part of the pre-processing stage. By varying the Fuzzy C-means algorithm in these ways, its performance in soft clustering can be improved, and it is necessary to parallelize these variations in order to evaluate their performance effectively.
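A compact serial reference implementation makes these design points explicit (an illustrative sketch; in a Map-Reduce version the membership and center updates would be split across map and reduce tasks, as in the algorithms surveyed above):

```python
import numpy as np

def fuzzy_c_means(X, k, m=2.0, tol=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # (1) initialization
    for _ in range(max_iter):
        # (2) similarity criterion: Euclidean distance to every center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = dist ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)                # fuzzy memberships
        um = u ** m
        new_centers = (um.T @ X) / um.sum(axis=0)[:, None]
        if np.linalg.norm(new_centers - centers) < tol:  # (3) stopping criterion
            return new_centers, u
        centers = new_centers
    return centers, u
```

The three commented decision points correspond to the first three factors listed above; the fourth, data placement, is a property of the cluster configuration rather than of the algorithm itself.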
As future scope, Fuzzy C-means should be experimented with on Spark for clustering voluminous overlapping datasets, or big data. Deep learning approaches such as neural networks and convolutional neural networks have shown considerable potential in a variety of data processing applications, and there is growing interest in exploring their potential for clustering big data.
ACKNOWLEDGMENT

The authors express their gratitude to the authors whose research work in the field of data clustering and Map-Reduce has been published in various conference proceedings and journals.
REFERENCES
[1] R. A. Arsekar, A. V. Chikhale, V. T. Kamble, and V. N. Malavade, "Comparative Study of MapReduce and Pig in Big Data," International Journal of Current Engineering and Technology, vol. 5, pp. 688-691, 2015.
[2] M. Mahajan, P. Nimbhorkar, and K. Varadarajan, "The planar k-means problem is NP-hard," in WALCOM: Algorithms and Computation, Springer, 2009, pp. 274-285.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[4] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Communications of the ACM, vol. 53, no. 1, pp. 72-77, Jan. 2010.
[5] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, "MapReduce Online," in Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010, pp. 20-34.
[6] R. Suganya and R. Shanthi, "Fuzzy C-Means Algorithm - A Review," International Journal of Scientific and Research Publications, vol. 2, no. 12, pp. 1-3, Dec. 2012.
[7] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Applied Statistics, vol. 28, no. 1, pp. 100-108, 1979.
[8] http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/fk_means.htm, accessed 3 April 2023.
[9] R. L. Cannon, J. V. Dave, and J. C. Bezdek, "Efficient implementation of the fuzzy c-means clustering algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 2, pp. 248-255, Mar. 1986, doi: 10.1109/TPAMI.1986.4767778.
[10] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, Jun. 2010.
[11] P. Anchalia, A. Koundinya, and N. Srinath, "MapReduce Design of K-Means Clustering Algorithm," in Proceedings of the 2013 International Conference on Information Science and Applications (ICISA), 2013, pp. 1-5, doi: 10.1109/ICISA.2013.6579448.
[12] W. Zhao, H. Ma, and Q. He, "Parallel k-means clustering based on MapReduce," in Cloud Computing, vol. 5931, Springer, 2009, pp. 674-679, doi: 10.1007/978-3-642-10665-1_71.
[13] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 681-689, doi: 10.1145/2020408.2020515.
[14] R. M. Esteves and C. Rong, "Using Mahout for clustering Wikipedia's latest articles: A comparison between k-means and fuzzy C-means in the cloud," in Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), 2011, pp. 565-569, doi: 10.1109/CloudCom.2011.86.
[15] X. Lou, J. Li, and H. Liu, "Improved fuzzy C-means clustering algorithm based on cluster density," Journal of Computational Information Systems, vol. 8, pp. 727-737, 2012.
[16] J. Xie, S. Yin, X. Ruan, Z. Ding, et al., "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters," in Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010, pp. 1-9, doi: 10.1109/IPDPSW.2010.5470880.
[17] R. L. F. Cordeiro, C. Traina, A. J. M. Traina, J. López, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 690-698, doi: 10.1145/2020408.2020516.
[18] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, and G. Fox, "Twister: A runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010, pp. 810-818, doi: 10.1145/1851476.1851592.
[19] N. R. Pal, K. Pal, J. M. Keller, and J. C. Bezdek, "A possibilistic fuzzy c-means clustering algorithm," IEEE Transactions on Fuzzy Systems, vol. 13, pp. 517-530, 2005, doi: 10.1109/TFUZZ.2004.840099.
[20] S. Heidari, M. Alborzi, R. Radfar, M. A. Afsharkazemi, and A. R. Ghatari, "Big data clustering with varied density based on MapReduce," Journal of Big Data, vol. 6, p. 77, 2019, doi: 10.1186/s40537-019-0236-x.
[21] S. Gerakidis, S. Megarchioti, and B. Mamalis, "Efficient big text data clustering algorithms using Hadoop and Spark," arXiv preprint arXiv:2112.00200, 2021.
[22] B. Liu, S. He, D. He, Y. Zhang, and M. Guizani, "A Spark-Based Parallel Fuzzy c-Means Segmentation Algorithm for Agricultural Image Big Data," IEEE Access, vol. 7, pp. 42169-42180, 2019, doi: 10.1109/ACCESS.2019.2900635.
