Canopy With K-Means Clustering Algorithm For Big Data Analytics - AIP Conference Proceedings - AIP Publishing

RESEARCH ARTICLE | MARCH 02 2021
Canopy with k-means clustering algorithm for big data analytics
Noor S. Sagheer; Suhad A. Yousif

AIP Conf. Proc. 2334, 070006 (2021)
https://doi.org/10.1063/5.0042398
Big Data is now gathered from many sources and in many formats, which makes it
difficult to analyze with traditional methods. Apache Hadoop is a robust solution to
the problems of storing and processing large datasets: it provides HDFS (Hadoop
Distributed File System) for storage and MapReduce for processing. Clustering
algorithms are among the essential methods for analyzing big data to discover new
patterns. In this paper, we use the canopy clustering algorithm provided by Apache
Mahout, a distributed machine learning library, as a preprocessing step for the
k-means clustering algorithm. The results show that using Canopy as a preprocessing
step speeds up the processing of a massive healthcare insurance dataset and reduces
the execution time of k-means by providing initial centroids for the given dataset.
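The idea of using Canopy to seed k-means can be illustrated with a minimal single-machine sketch. This is not the Apache Mahout or MapReduce implementation used in the paper. It assumes Euclidean distance, takes the mean of each canopy's members as its center, and uses hypothetical names (`canopy_centers`, `kmeans`), thresholds (`t1`, `t2`), and toy data chosen only for illustration.

```python
import math


def canopy_centers(points, t1, t2):
    """Return one center per canopy (t1 = loose threshold, t2 = tight threshold)."""
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every remaining point within the loose threshold joins this canopy.
        members = [p for p in remaining if math.dist(seed, p) < t1]
        # Points within the tight threshold can no longer seed a new canopy.
        remaining = [p for p in remaining if math.dist(seed, p) >= t2]
        # Assumption: use the mean of the members as the canopy center.
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd-style k-means starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return centroids, clusters


# Toy data (illustrative only): two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers = canopy_centers(points, t1=4.0, t2=2.0)
centroids, clusters = kmeans(points, centers)
print(len(centers), centroids)
```

In this sketch the number of canopies also fixes k, so k-means starts from centroids that are already close to the final ones and needs fewer iterations than a random start would typically require.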
Topics
Machine learning, Data visualization, Health care
https://pubs.aip.org/aip/acp/article-abstract/2334/1/070006/586472/Canopy-with-k-means-clustering-algorithm-for-big?redirectedFrom=PDF

© 2021 Author(s).