
3/10/24, 12:58 PM Canopy with k-means clustering algorithm for big data analytics | AIP Conference Proceedings | AIP Publishing

RESEARCH ARTICLE | MARCH 02 2021

Canopy with k-means clustering algorithm for big data analytics
Noor S. Sagheer; Suhad A. Yousif


AIP Conf. Proc. 2334, 070006 (2021)
https://doi.org/10.1063/5.0042398

Big data is now gathered from many sources and in many formats, which makes it
difficult to analyze with traditional methods. Apache Hadoop addresses the problems
of storing and processing large datasets by providing HDFS (Hadoop Distributed File
System) for storage and MapReduce for processing. Clustering is one of the essential
methods for analyzing big data to discover new patterns. In this paper, we use the
canopy clustering algorithm, as provided by the distributed machine learning library
Apache Mahout, as a preprocessing step for the k-means clustering algorithm. The
results show that using canopy clustering as a preprocessing step speeds up the
processing of a massive-scale healthcare insurance dataset and reduces the execution
time of k-means by providing initial centroids for the given dataset.
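
The idea of seeding k-means with canopy output can be illustrated with a minimal single-machine sketch. This is not the Apache Mahout MapReduce implementation used in the paper. The function names, the thresholds t1 and t2, the choice of each canopy's mean as its centroid, and the toy points are all assumptions made for illustration.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def canopy_centroids(points, t1, t2):
    """Canopy clustering with loose threshold t1 and tight threshold t2.

    Returns one centroid per canopy (assumed here: the mean of its members).
    """
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    remaining = list(points)
    centroids = []
    while remaining:
        center = remaining[0]  # any remaining point may serve as the center
        # Every point within t1 of the center joins this canopy.
        members = [p for p in points if math.dist(center, p) < t1]
        centroids.append(mean(members))
        # Points within t2 (including the center) can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, centroids, max_iter=100):
    """Lloyd's k-means started from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean(members) if members else old)
        if updated == centroids:  # assignments stopped changing
            break
        centroids = updated
    return centroids


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = canopy_centroids(points, t1=3.0, t2=1.5)
final = kmeans(points, initial)
print(len(initial), "canopies; final centroids:", final)
```

In this sketch the number of canopies fixes k, and k-means starts close to the final solution, which is the mechanism the abstract credits for the shorter execution time. The thresholds have to be tuned to the distance scale of the data; the values above only suit the toy points.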

Topics
Machine learning, Data visualization, Health care

https://pubs.aip.org/aip/acp/article-abstract/2334/1/070006/586472/Canopy-with-k-means-clustering-algorithm-for-big?redirectedFrom=PDF





© 2021 Author(s).
