
Bahir Dar University

Bahir Dar Institute Of Technology


Faculty of Computing
Department of Information Technology
Natural Language Processing

Presentation on: Lemmatization

Presented by: Agerie Belete

Atinkut Muche
Bizuayehu Tadege
Tiruedle Asteraye
Outline
 Introduction
 Statement of the problem
 Objective
 General Objective
 Specific Objective
 Significance of the Project
 Methodology
 Scope of the Project
 Literature Review
 Architecture
 Experimental Result
 Conclusion
Introduction

 Text normalization is the process of transforming a word into a
single canonical form. It can be done by two processes: stemming
and lemmatization.
 Stemming and lemmatization both normalize words, which means
reducing a word to its root form.
 Stemming is the process of finding the root of a word by removing
affixes. E.g.: (ሰዎች ፥ ሰ, ቤቶች ፥ ቤ, ለታሪክ ፥ ታሪክ)
 Lemmatization is an organized procedure for obtaining the lemma
of a word by considering its dictionary meaning. E.g.: (ሰዎች ፥
ሰዉ, ቤቶች ፥ ቤት, ለታሪካችን ፥ ታሪክ)
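The contrast between the two example sets above can be sketched in Python, the language listed for this project. This is a minimal illustration, not the project's code: the affix tuples, the `LEMMA_TABLE` dictionary, and the function names `naive_stem` and `lookup_lemma` are assumptions built only from the example words on this slide.

```python
# Hypothetical affix lists, chosen only to reproduce the slide's examples.
PREFIXES = ("ለ",)
SUFFIXES = ("ዎች", "ቶች")

# Hypothetical lemma dictionary holding the slide's three example pairs.
LEMMA_TABLE = {"ሰዎች": "ሰዉ", "ቤቶች": "ቤት", "ለታሪካችን": "ታሪክ"}


def naive_stem(word):
    """Cut one known prefix and one known suffix without checking the result."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[:-len(suffix)]
            break
    return word


def lookup_lemma(word):
    """Return the dictionary lemma, or the word itself if it is not listed."""
    return LEMMA_TABLE.get(word, word)


for w in ("ሰዎች", "ቤቶች"):
    print(w, naive_stem(w), lookup_lemma(w))
```

Because Amharic script is syllabic, cutting the plural ending also removes the stem's final consonant, which is why the stemmer returns ሰ and ቤ while the dictionary lookup returns full words.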
Statement of the Problem

 Amharic is a morphologically rich language, which makes
developing an efficient stemmer very difficult.
 The aim of stemming and lemmatization is to reduce
the inflectional forms of each word to a common base or root.
 Stemming algorithms work by cutting off the end or the
beginning of the word, taking into account a list of common
prefixes and suffixes that can be found in an inflected word.
Cont.
 This indiscriminate cutting can be successful on some occasions,
but not always, which is why we propose Amharic
lemmatization to overcome the limitations of stemming.
 Lemmatization takes into consideration the morphological
analysis of the words. To do so, it needs detailed
dictionaries that the algorithm can look through to link a
form back to its lemma.
Objective

General Objective
 The main objective of the project is to design a lemmatizer for
the Amharic language.

Specific Objective
 To gain a clear understanding of the area by conducting a
relevant literature review and identifying different lemmatization
algorithms that have been developed for other languages.
Specific Objectives ...

 To construct a lemma dictionary and a stop word list, and to compile a
list of affixes used in the corpus.
 To design an appropriate lemmatizing algorithm for the Amharic language.
 To develop a lemmatizer that identifies the inflectional affixes of
the Amharic language.
 To test the performance of the lemmatizer on the selected Amharic
text and report the results of the study.
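The testing objective could be sketched as exact-match accuracy against hand-checked lemmas. The slides do not say which metric the project uses, so the metric, the function name `lemma_accuracy`, and the toy lemmatizer below are all assumptions for illustration.

```python
def lemma_accuracy(lemmatize, gold_pairs):
    """Share of (word, gold_lemma) pairs for which lemmatize(word) == gold_lemma."""
    if not gold_pairs:
        return 0.0
    correct = sum(1 for word, gold in gold_pairs if lemmatize(word) == gold)
    return correct / len(gold_pairs)


# Toy stand-in for a real lemmatizer: a lookup that knows only two word forms.
toy_table = {"ሰዎች": "ሰዉ", "ቤቶች": "ቤት"}


def toy_lemmatize(word):
    return toy_table.get(word, word)


# Hypothetical gold standard built from the example words in the introduction.
gold = [("ሰዎች", "ሰዉ"), ("ቤቶች", "ቤት"), ("ለታሪካችን", "ታሪክ"), ("ታሪክ", "ታሪክ")]
print(lemma_accuracy(toy_lemmatize, gold))
```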
Significance of the Project

 It can reduce word variants and minimize the total number of
files.
 It enables native speakers to derive a large number of words from a
single lemma.
 It allows other researchers to investigate the lemmatization of
the Amharic language.
Methodology

Literature review
 Document analysis is used to understand the characteristics of the
language. Because studying the language's morphology is an
important component of the project, a literature survey was
made to gather information and to understand the language.
Data Source
 A text corpus is one of the resources required for stemming and
lemmatization in NLP work. The text was used for
compiling stop words, prefixes, and suffixes. It was also
used for testing the algorithm.
 For the purpose of this project, texts were gathered from Amharic
fiction books, the Amhara Mass Media Agency (AMMA), and other
sources.
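One way a corpus can help in compiling a stop word list is by ranking tokens by frequency and reviewing the top of the ranking by hand. The sketch below assumes whitespace tokenization and a tiny made-up corpus; the function name `stop_word_candidates` is hypothetical, and the slides do not describe how the project's list was actually compiled.

```python
from collections import Counter


def stop_word_candidates(sentences, top_n=3):
    """Return the top_n most frequent whitespace-separated tokens."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return [token for token, _ in counts.most_common(top_n)]


# Made-up three-sentence corpus; a real corpus would be far larger.
corpus = ["ቤት ነው", "ታሪክ ነው", "ሰዉ ነው"]
print(stop_word_candidates(corpus, top_n=1))
```

Frequency alone only proposes candidates; the final list would still need manual review.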
Methodology …

Tools and Techniques


 Python (3.8.5) programming language
 Anaconda Navigator (4.9.2)
 Jupyter Notebook
 Notepad

Scope of the Project

 The main aim of this project is to develop a lemmatizer
for the Amharic language.
 This project focuses only on inflectional word
variants of the Amharic language.
 The lemmatizer handles prefixes, suffixes, and circumfixes;
circumfix handling combines the prefix and suffix techniques.

Limitation of this Project


 Infix removal was not addressed in our project.
Literature Review

 Different works on lemmatization or stemming have been studied for
languages with different morphologies. Some of the related works
are:
 Lemmatization has been actively applied in recent biomedical
research. In this work, the authors developed a domain-specific
lemmatization tool, BioLemmatizer, for the morphological
analysis of biomedical text.
 The tool focuses on the inflectional morphology of English and is
based on the general English lemmatization tool MorphAdorner.
Literature Review …
 An innovative aspect of BioLemmatizer is its use of a
hierarchical strategy for searching the lexicon, which enables the
discovery of the correct lemma even if the input part-of-speech
information is inaccurate.
 It achieves an accuracy of 97.5%.
 However, the lemmatizer has to return all possible lemmas when it
lacks the word context needed to resolve ambiguity.
 An Analysis of Lemmatization on Topic Models of Morphologically
Rich Language: in this paper, the authors establish the first
measurements of the effect of token-based lemmatization on topic
models on a corpus of a morphologically rich language.
 They consider two preprocessing schemes to account for stop words
and other high-frequency terms in the corpus.
Literature Review …
 They remove any of the top 100 words from the lemmatized and
non-lemmatized corpora, producing a non-lemmatized vocabulary of
72,641 words. Although the large size of this vocabulary slows learning,
they do not believe it affects the results negatively.
 They conclude that lemmatization may benefit topic models on
morphologically rich languages, but that further investigation is
needed on large vocabularies.
 Stemmer development was at first restricted to English. Nowadays, stemmers are
adapted and constructed for other languages: Spanish, Arabic, Amharic, and
so on. Few works have been carried out on stemming Amharic texts. The
work in (Argaw & Asker, 2007) uses a rule-based method.
Literature Review …

 This stemmer relies on corpus statistics to resolve ambiguities of
citation forms.
 Their approach contains 65 rules to reduce an Amharic word to its
citation form for cross-lingual information retrieval applications.
 The accuracy of this stemmer is 60% and 75% for old-fashioned
fiction text and news text, respectively.
 However, the rule set used to reduce an Amharic word is complex,
and the dataset consisted only of fiction text and news text.
Literature Review …
 An Amharic light stemmer was developed for Amharic sentiment
classification. Because Amharic belongs to the Semitic family, its
nouns and verbs are derived from a limited set of roots.
 The authors design a light stemmer for the Amharic language that
keeps semantic information by removing frequent prefixes and
suffixes from the input words.
 The final result is the stem if the word has any prefix, infix,
and/or suffix; otherwise the word remains in one of its earlier states.
Literature Review …

 Their technique does not rely on any additional resource
(e.g. a dictionary) to verify the generated stem.
 Their result is compared with a state-of-the-art stemmer
for Amharic and shows an increase of 7% in stemmer
correctness.
 However, the performance of the generated stemmer is
evaluated using manually interpreted Amharic words.
Literature Review …

 Sentence retrieval using Stemming and Lemmatization with
Different Length of the Queries
 In this paper, the authors use the TF-ISF (term frequency - inverse sentence
frequency) method for sentence retrieval.
 As pre-processing steps, they use stop word removal and the
language modeling techniques of stemming and lemmatization.
 Their results show that data pre-processing with stemming and
lemmatization is as useful for sentence retrieval as it is for document
retrieval.
 Lemmatization produces better results with longer queries, while
stemming shows worse results with longer queries.
 They used data from the Text Retrieval Conference (TREC) novelty
tracks.
 However, the positive effects appeared only with the MAP (mean
average precision) measure, which indicates an improvement in recall
rather than precision.
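The slides do not give the TF-ISF formula. As a rough sketch, one formulation commonly used in the sentence-retrieval literature scores a sentence s for a query q as the sum over query terms t of log(tf(t,q) + 1) · log(tf(t,s) + 1) · log((n + 1) / (0.5 + sf(t))), where n is the number of sentences and sf(t) is the number of sentences containing t. Whether the cited paper uses exactly this variant is an assumption, and the function name `tf_isf_score` is hypothetical.

```python
import math
from collections import Counter


def tf_isf_score(query_tokens, sentence_tokens, all_sentences):
    """Score one tokenized sentence against a tokenized query with TF-ISF."""
    n = len(all_sentences)
    query_tf = Counter(query_tokens)
    sentence_tf = Counter(sentence_tokens)
    score = 0.0
    for term, q_count in query_tf.items():
        # Sentence frequency: how many sentences contain the term at least once.
        sf = sum(1 for sentence in all_sentences if term in sentence)
        score += (
            math.log(q_count + 1)
            * math.log(sentence_tf[term] + 1)  # zero when the term is absent
            * math.log((n + 1) / (0.5 + sf))
        )
    return score


# Toy collection of already tokenized (and, in practice, lemmatized) sentences.
sentences = [["cats", "sleep"], ["dogs", "bark"], ["cats", "chase", "dogs"]]
ranked = sorted(sentences, key=lambda s: tf_isf_score(["cats"], s, sentences), reverse=True)
```

Stemming or lemmatization would be applied to both the query and the sentences before scoring, so that inflected variants count as the same term.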
Architecture
Cont.…
Experimental Result

 For our implementation we use a rule-based approach
for an efficient Amharic lemmatizer, because we consider a
rule-based approach appropriate for lemmatization.
 For our project we use a dictionary of 350 lemma words for
the implementation of the Amharic lemmatizer. We also
have Amharic prefix lists, suffix lists, prefix-suffix lists,
special suffix lists, and punctuation mark lists.
 Rule 1. Removing prefixes only
Cont.…..
 Rule 2. Removing suffix inflection only

 Rule 3. Removing both prefix and suffix inflection
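The three rules can be sketched as a small dictionary-checked lemmatizer. This is an illustration under stated assumptions, not the project's implementation: the three-word `LEMMAS` set, the `PREFIXES` list, the `SUFFIX_RULES` pairs (each pairing a surface ending with the characters that restore the lemma's final syllable), and all function names are hypothetical and cover only the example words from the introduction.

```python
# Hypothetical resources, far smaller than the project's 350-word dictionary.
LEMMAS = {"ሰዉ", "ቤት", "ታሪክ"}
PREFIXES = ["ለ"]
# Each rule pairs a surface ending with the characters that restore the
# lemma's final syllable (an assumption about how such rules could be stored).
SUFFIX_RULES = [("ዎች", "ዉ"), ("ቶች", "ት"), ("ካችን", "ክ")]


def strip_prefixes(word):
    """Yield the word with each matching prefix removed."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            yield word[len(prefix):]


def strip_suffixes(word):
    """Yield the word with each matching suffix rule applied."""
    for ending, restored in SUFFIX_RULES:
        if word.endswith(ending) and len(word) > len(ending):
            yield word[:-len(ending)] + restored


def lemmatize(word):
    """Try Rule 1, Rule 2, then Rule 3; accept only dictionary lemmas."""
    if word in LEMMAS:
        return word
    # Rule 1: remove a prefix only.
    for candidate in strip_prefixes(word):
        if candidate in LEMMAS:
            return candidate
    # Rule 2: remove a suffix only.
    for candidate in strip_suffixes(word):
        if candidate in LEMMAS:
            return candidate
    # Rule 3: remove both a prefix and a suffix.
    for without_prefix in strip_prefixes(word):
        for candidate in strip_suffixes(without_prefix):
            if candidate in LEMMAS:
                return candidate
    return word  # no rule produced a dictionary lemma


for w in ("ለታሪክ", "ቤቶች", "ለታሪካችን"):
    print(w, lemmatize(w))
```

Checking every candidate against the lemma dictionary is what separates this from plain stemming: a stripped form is accepted only if it is a listed lemma.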


Cont..
Concluding Remark
 Lemmatization plays a crucial role in text and natural language processing by
producing the base form of inflected words.
 The lemma of a word is a meaningful dictionary word that can be used.
 To make a lemmatizer efficient at identifying the lemma of a word, researchers first
review different papers to identify the gaps in them and then use the rules from
those papers to improve their own rule-based lemmatizer.

 The lemmatization process gives the lemma of the word, which has a contextually
meaningful sense.

 In a rule-based lemmatizer, adding more rules for a given sentence or word helps
the lemmatizer correctly identify the lemma of the word.

 The weaknesses of the Amharic lemmatizer are the existence of derivational words,
the increased number of rules needed for a single word, and the complexity of
infixes, which makes finding the lemma of a word very challenging.
Contribution of Group members
• All group members participated in all documentation and implementation of the
project work, but their particular contributions are as follows.
No  Name               Contribution
1   Agerie Belete      Introduction, Objective, Methodology, Dataset preparation
2   Atinkut Muche      Literature review, Significance, Scope, Design architecture
3   Bizuayehu Tadege   Dataset preparation, Design architecture, Implementation, Experimental result
4   Tiruedle Asteraye  Dataset preparation, Problem statement, Significance, Conclusion
Reference
 Alemneh, G. N. (September 2020). Amharic Light Stemmer.
ICLR (pp. 1-10). Addis Ababa Science and Technology
University: Amharic Sentiment Classification relying on
cross lingual resource Adaptation.
 Chandler May, Ryan Cotterell, Benjamin Van Durme. (10
May 2019). An Analysis of Lemmatization on Topic Models
of Morphologically Rich Language. Johns Hopkins:
arxiv.org.
 Divya Khyani, Siddhartha B S, Niveditha N M, Divya B
M. (Volume 22, Issue 10, October 2020). An
Interpretation of Lemmatization and Stemming in Natural
Language Processing. Journal of University of Shanghai for
Science and Technology, 350-357.
Cont.……
 https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html.
(n.d.).
 Ivan Boban, Alen Doko, Sven Gotovac. (Vol. 5, No. 3
(2020)). Sentence retrieval using Stemming and
Lemmatization with Different Length of the Queries. Advances
in Science, Technology and Engineering Systems Journal, 349-354.
 Kargın, K. (2021, Feb 26).
nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging.
Retrieved June 09, 2021, from medium.com:
https://medium.com/mlearning-ai/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768
Cont.…..
 Kettunen, Kimmo, Kunttu, Tuomas and Järvelin, Kalervo.
(August 2005). To stem or lemmatize a highly inflectional
language in a probabilistic IR environment? Journal of
Documentation , 476-496.
 Minnen G, C. J. (2001). Applied morphological processing
of English. Natural Language Engineering , 207 - 223.
 Srinidhi, S. (2020, Feb 26). Lemmatization in Natural
Language Processing (NLP) and Machine Learning.
Retrieved June 09, 2021, from towardsdatascience.com:
https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6
Thank you
