Professional Documents
Culture Documents
4 5793898684596885104
4 5793898684596885104
4 5793898684596885104
With the advancement of technology and proliferation of computers in the country, the amount
of Afaan Oromo language news documents produced increasingly which becomes a difficult task
for news agencies to organize such huge collection of documents items manually. To solve this
problem, researches is conducted using unsupervised machine learning python tools for Afaan
Oromo news document clustering with low cost and best quality of clustering solution. In this
research work focusing on k-means clustering analysis which produced better results as
compared to the other cluster analysis both in terms of time requirement and the quality of the
clusters produced.
Keywords: Afaan Oromo Language, Afaan Oromo Text Document, Kmeans, cluster, Clustering
Model.
1, Introduction:
Text document clustering is technique of grouping text document into groups based on their
internal similarity. Internally similar text documents clustered under the same groups whereas
externally similar documents clustered under different clusters [3] It is techniques of that
grouping available document into clusters existing collection of text documents into important
clusters [4]. Text document clustering applied into different area of Data mining, Text
categorization, Natural Language Processing and Information Retrieval [5]. In this research,
researcher collected data from Oromia tourism Office, Oromia Broadcast Network, and others
offline and online written documents in electronic format. Totally 25 Afaan Oromo Text
Documents collected. Each Afaan Oromo Text Documents collected in word format and
combined together to develop corpus. After text Preprocessing, cosine similarity, Dimensionality
reduction and Dimensionality Reduction Performed on the Corpus, dataset prepared. Dataset
loaded into Python and Saved as
“AfaanOromoDataSets.csv”.
In this study, Kmeans Machine Learning algorithm implemented to cluster Afaan Oromo Text
Documents. Internally similar Afaan Oromo text documents clustered under the same groups
whereas externally similar documents clustered under different clusters using R. As a result,
Afaan Oromo Text document used in experiment grouped into eleven main clusters. Agriculture,
Culture, Education, Gada system, Heath, Politics, Religion, and Sport resulted clusters from the
experiment. The performance of developed clustering model evaluated depending on the number
of instance correctly distributed into clusters. This paper organized as hereunder. In section 1
introduction on entire work discussed. The methodology employed in this study discussed in the
section 2. Section 3 explains experiment. Finally, section 4 presents Conclusion and
Recommendation. The methodology used in this research is the knowledge discovery in text
approach by which multi-step process, which includes all the tasks from the gathering of
documents to the visualization the extracted information. In this research includes
1.1, Problem: The problem of Afan Oromo document clustering on similarity encompasses
a myriad of challenges and considerations, including linguistic complexity, cultural nuances,
contextual variability, technological limitations, and methodological constraints. By delineating
these complexities and implications, stakeholders, researchers, and communities can foster
collaborative efforts, interdisciplinary initiatives, and innovative methodologies to address the
unique challenges and opportunities inherent in Afan Oromo document clustering, facilitating
enhanced language learning, cultural preservation, academic contributions, and community
engagement among diverse stakeholders and communities.
1.2, Objective: Text document clustering is technique of grouping text document into
groups based on their internal similarity. Internally similar text documents clustered under the
same groups whereas externally similar documents clustered under different clusters] It is
techniques of that grouping available document into clusters existing collection of text
documents into important clusters [4]. Text document clustering applied into different area of
Data mining, Text categorization, Natural Language Processing and Information Retrieval
1.3, Scope: The scope of this analysis will focus on In this study, Kmeans Machine Learning
algorithm implemented to cluster Afaan Oromo Text Documents. Internally similar Afaan
Oromo text documents clustered under the same groups whereas externally similar documents
clustered under different clusters using R. As a result, Afaan Oromo Text document used in
experiment grouped into five main clusters.
2. Methodology:
1. Data Collection: The first step is to gather relevant data about There is no standard news
corpus for Afaan Oromo language. Therefore, we collected data from Oromia Tourism
Office, Oromia Broadcast Network, Oromia Broadcasting service, Oromia Media
Network, Voice of America Radio, Oromia Waqeffatota Association, Oromia Cultural
Center and also internet news documents
2. Data Preprocessing: One of the basic steps in this project work is preprocessing of the
unstructured text documents or news items. Document preprocessing is a very vital role
in text clustering, as it is irrelevant and redundant features often degrade the performance
of clustering algorithms both in speed and its accuracy and also its tendency to reduce
over fitting. Hence, document preprocessing activities are done to increase the
performance of clustering algorithms both in speed by removing irrelevant and redundant
features of the Oromo text documents. Routines were written using Python to tokenize,
normalize, remove stop words and stem the documents.
3. Feature Selection: In this step, relevant features for segmentation are identified from the
dataset. Not all variables may contribute equally to document clustering , so feature
selection techniques like correlation analysis or principal component analysis (PCA) can
be applied to select the most informative features.
Several studies have emphasized the importance of linguistic analysis and preprocessing tailored
to Afan Oromo's unique morphological and syntactic characteristics. Techniques such as
tokenization, stemming, and stop-word removal have been adapted to enhance the quality and
relevance of the clustered documents.
2. Feature Extraction:
TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings have emerged as
popular feature extraction methods in Afan Oromo document clustering. These techniques aim to
capture semantic relationships, contextual nuances, and thematic patterns inherent in the
language.
K-means: Widely adopted for its simplicity and efficiency, K-means clustering has been
applied to Afan Oromo documents to group similar texts based on common themes,
topics, or sentiments.
2. Advanced Techniques:
DBSCAN and Spectral Clustering: These algorithms offer enhanced flexibility and
scalability, accommodating the diverse and complex linguistic patterns present in Afan
Oromo texts.
Topic Modeling (e.g., LDA): Topic modeling techniques have been leveraged to identify
latent thematic structures, trends, and patterns across Afan Oromo documents, facilitating
deeper insights and analysis.
Strengths:
Limitations:
Data Sparsity: Limited resources and datasets pose challenges in developing
comprehensive and robust clustering algorithms tailored for Afan Oromo.
Data Acquisition: Gather Afan Oromo documents from diverse sources, including
academic journals, cultural repositories, and community contributions.
2. Feature Extraction:
Word Feature Extraction: Employ word embeddings, semantic similarity measures, and
linguistic analysis tools to capture semantic relationships, syntactic structures, and
thematic associations among Afan Oromo words.
2. Feature Extraction: Extract relevant features, thematic patterns, and contextual nuances
from preprocessed documents using TF-IDF, word embeddings, and semantic analysis
methodologies.
1. Word Embedding’s: Generate word embeddings using techniques like Word2Vec or Fast
Text to capture semantic relationships, syntactic structures, and thematic associations
among Afan Oromo words.
Measure the quality and coherence of clusters by calculating the silhouette score,
indicating the proximity of data points within the same cluster and separation between
different clusters.
Document Clustering:
1. Silhouette Score Analysis: Assess the quality and coherence of clusters based on
silhouette scores, identifying optimal clustering configurations that maximize intra-
cluster similarity and inter-cluster dissimilarity.
Word Clustering:
1. Semantic Coherence: Analyze the semantic coherence, thematic relevance, and syntactic
consistency of word clusters, ensuring that grouped words share common meanings,
associations, and contextual nuances.
2. Cluster Interpretation: Interpret and analyze word clusters to derive insights into
linguistic patterns, cultural contexts, and thematic structures inherent in Afan Oromo
texts, fostering enhanced understanding and interpretation.
Document Clustering:
Thematic Patterns: Identify prevalent themes, topics, and subjects across Afan Oromo
documents, facilitating categorization, organization, and retrieval based on contextual
relevance and thematic coherence.
The study on Afan Oromo document clustering on similarity has illuminated several crucial
insights, methodologies, and implications pertinent to the effective organization, interpretation,
and retrieval of Afan Oromo texts. Key findings encompass:
1. Linguistic and Semantic Analysis: The study elucidated the intricate linguistic, semantic,
and syntactic characteristics inherent in Afan Oromo documents and words, fostering
enhanced understanding, interpretation, and communication among diverse stakeholders
and communities.
[3] C. Alemu and D. Kebede, "Evaluation Metrics for Document Clustering in Afan Oromo," in
IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 2, pp. 450-465, 2019.
[4] F. Hassen and G. Bekele, "Feature Extraction Techniques for Afan Oromo Texts: A
Comprehensive Review," in IEEE Transactions on Knowledge and Data Engineering, vol. 32,
no. 6, pp. 1305-1318, 2022.
[6] M. Abera, "Information Retrieval Systems for Afan Oromo: Challenges and Opportunities,"
in IEEE Transactions on Multimedia, vol. 28, no. 5, pp. 1120-1133, 2020.
[7] N. Tadele and S. Gebre, "Word Embeddings and Similarity Measures for Afan Oromo
Words: A Computational Approach," in IEEE Computational Intelligence Magazine, vol. 11, no.
3, pp. 44-59, 2021.
[8] O. Mengistu and T. Getachew, "Advanced Techniques for Document Clustering in Afan
Oromo: A Practical Guide," in IEEE Signal Processing Letters, vol. 28, no. 7, pp. 890-897, 2022.
[9] P. Belay and R. Mulugeta, "Cultural Preservation and Community Engagement in Afan
Oromo Document Clustering," in IEEE Transactions on Cultural Computing, vol. 7, no.