Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/280291756

Part of Speech Tagger for Marathi Language

Article  in  International Journal of Computer Applications · June 2015


DOI: 10.5120/21169-4245

CITATIONS READS

3 4,241

3 authors, including:

Sharvari Govilkar Shubhangi Rathod


Pillai college of Engineering Pillai Institute of Information Technology, Engineering, Media Studies and Research
34 PUBLICATIONS   317 CITATIONS    1 PUBLICATION   3 CITATIONS   

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Rule based Pos tagger for marathi View project

text normalization in code mix text and sentiment analysis View project

All content following this page was uploaded by Sharvari Govilkar on 03 January 2018.

The user has requested enhancement of the downloaded file.


International Journal of Computer Applications (0975 – 8887)
Volume 119 – No.18, June 2015

Part of Speech Tagger for Marathi Language

Sharvari Govilkar Bakal J. W Shubhangi Rathod


Department of Information SJCOE , Dombivli, Thane Department of Computer
Technology India Engineering
TSEC, Bandra, Mumbai, India PIIT, New Panvel, India

ABSTRACT Dictionary plays an important role to assign appropriate tag


A part of speech (POS) tagging is one of the most well studied to each word. The tagger divided into two stages. First, it
problem in the field of Natural Language Processing (NLP). A search words in corpus and second, it assigns a tag by
POS Tagger is the process of assigning correct tag like noun, resolving ambiguity. The small set of meaningful rules helps
adjective, verb, adverb etc to each word of the input sentence. to remove disambiguity of words. The paper presents a rule
Disambiguation rules and Tagset is vital parts of POS tagger. based part of speech tagger for Marathi language. Related
POS tagging is difficult for Marathi language due to work and past literature is discussed in section 2. Basic
unavailability of corpus for computational processing. In this working of POS tagger is discussed in section 3. Finally,
paper, a POS Tagger for Marathi language using Rule based paper concludes in section 4.
technique is presented. Our proposed system find root word
using morphological analyzer and compare the root word with
2. LITERATURE SURVEY
corpus to assign appropriate tag. If word has assigned more In this section we cite the relevant past literature that use the
than one tags then by using grammar rules ambiguity is various pos tagging techniques. Most of the researchers
removed. Meaningful rules are provided to improve the concentrate on rule base rather than statistical approach for
performance of the system. POS tagging. The small set of the meaningful rules of this
tagger provides the better improvements over statistical
General Terms tagger.
Part of Speech, Marathi Jyoti Singh, et.al. [1] Proposed a Development of Marathi Part
of Speech Tagger Using Statistical Approach. They used
Keywords statistical tagger using Unigram, Bigram, Trigram and HMM
Part of Speech (POS), Tagset, Tokenizer, Stemmer, Methods. To achieve higher accuracy they use set of Hand
Morphological analyzer, Disambiguation coded rules, it include frequency and probability. They use
most frequently used tag for a specific word from the
1. INTRODUCTION annotated training data and use this information to tag that
A part of speech (POS) tagging is one of the important word in the annotated text. They train and test their model by
problem in the field of Natural Language Processing (NLP) calculating frequency and probability of words of given
that has begun in the early 1960s. A POS Tagging is the corpus.
process of assigning exact tag like noun, verb, adverb,
adjective etc to each word in the input sentence, based on its H.B. Patil, et.al. [2] Proposed a Part-of-Speech Tagger for
classification as well as its relationship with adjacent and Marathi Language using Limited Training Corpora. It is also a
related words in a phrase, sentence, or paragraph. It is an rule based technique. Here sentence taken as an input
extremely powerful and accurate tool [1] used in any generated tokens. Once token generated apply the stemming
application that deals with natural language processing. The process to remove all possible affix and reduce the word to
tagging performance totally depends on tag dictionary. It is stem. SRR used to convert stem word to root word. The root-
also called grammatical tagging or word-category words that are identified are then given to morphological
disambiguation. analyzer. The morphological analysis is carried out by
dictionary lookup and morpheme analysis rules.
Taggers can be classified as supervised or unsupervised:
Supervised taggers are based on pre-tagged corpora, whereas Pallavi Bagul, et.al. [3] Proposed a Rule Based POS Tagger
unsupervised taggers automatically assign tags to words [6]. for Marathi Text. Which will assign part of speech to the
Furthermore, taggers divide into three types: (i) Rule Base words in a sentence given as an input and used a corpus which
Taggers: The rule based POS tagging approach that uses a set is based on tourism domain. The ambiguous words are those
of hand constructed rules. (ii) Stochastic Taggers: A words which can act as a noun and adjective in certain
stochastic approach assigns a tag to word using frequency, context, or act as an adjective and adverb in certain context.
probability or statistics [6]. It required vast stored contextual The ambiguity is resolved using Marathi grammar rules.
information because many high frequency words of POS are
ambiguous. (iii) Hybrid Taggers: The hybrid approach, assign Jyoti Singh, et.al. [4] Proposed a Part of speech tagging of
tag to the word using statistical approach after that, if wrong Marathi text using Trigram method. The main concept of
tag is found then by applying some rules tagger tries to Trigram is to explore the most likely POS for a token based
change it [7]. on given information of previous two tags by calculating the
transition probabilities between the tags and helps to capture
POS tagging is a necessary pre-module and building block for the context of the sentence. The probability of a sequence is
various NLP application. POS tagging is difficult for Marathi just the product of conditional probabilities of its trigrams.
language due to unavailability of corpus for computational Each tag transition probability is computed by calculating the
processing. The rule based part of speech tagger that assigns frequency count of two tags which come together in the
all possible tags to word and uses a set of hand written rules.

29
International Journal of Computer Applications (0975 – 8887)
Volume 119 – No.18, June 2015

corpus divided by the frequency count of the previous two from sentence and treat as single token so, we can deal with
tags coming in the corpus. each word separately. Information retrieval applications
requires list of root words for the purpose of easy searching,
maintain index of words, calculating term frequency of words
Nidhi Mishra, et.al. [5] Proposed Part of Speech Tagging for etc.
Hindi Corpus. The system scans the Hindi (Unicode) corpus
and then extracts the Sentences and words from the given Input
Hindi corpus. Finally Display the tag of each Hindi word like (Marathi Document)
noun tag, adjective tag, number tag, verb tag etc. and search
tag pattern from database.
Preprocessing
Namrata Tapaswi, Suresh Jain [6] proposed a Treebank Based
Deep Grammar Acquisition and Part-Of-Speech Tagging for Validation
Sanskrit Sentences. In the Sanskrit morphology meaning of
the word is remain same. When affixes are added to the stem,
words are differentiated at database level directly. The input is Tokenization
one sentence per line, split the sentence into words called
lexeme .read each word to find longest suffix, and eliminated
the suffix until the word length is 2. Apply the lexical rules
and assign the tag. Remove the disambiguity using context
Stemmer
sensitive rules.
Javed Ahmed MAHAR, Ghulam Qadir MEMON [7],
proposed a system for “Rule Based Part of Speech Tagging of Morphological
Sindhi Language”. Take input text, and generate token. Once Analyzer
token generated search and compare selected word from
lexicon (SWL) .If word is found one or more times, then store POS
associated tag and if not found add that word into lexicon by Assign Tag Tagged
generating linguistic rule for new word. Database

3. PROPOSED SYSTEM
We proposed a rule based part of speech tagger that assigns
parts of speech to each word, such as noun, verb, adjective, Disambiguation
adverb etc in a sentence. If word has more than one tag then Disambiguation Rule
it is an instance of ambiguity , so such a word can be
disambiguate by using small set of context rule. Finally word
is displayed with single associated tag as an output. The rule- Output
based tagger which we are going to develop has many Tagged Document
advantages over other taggers. The proposed approach Figure 1. Proposed System
consists of following phases:
1. Preprocessing
3.2 Stemmer
Stemming is important in the system, which uses a suffix list
2. Stemmer to remove suffixes from words and thus reduces the word to
its stem. The result of stemming is stem of word that can be
3. Morphological analyzer. given as input to Morphological Analyzer for further
4. Tag Generator processing. The stem word contains inflections. The
inflections in the stem word cannot be removed using simple
5. Disambiguation. stemming operation. In Marathi language the infected words
are the words, which belong to Noun, Pronoun, Adjective, or
3.1 Preprocessing Verb. So the Suffix Replacements Rules (SRRs) are used for
these categories words only.
3.1.1 Validation of Input document:
Validation of Input document is very important stage because 3.3. Morphological analyzer
the resultant information is totally depends on the language The aim of morphological analysis is to recognize the inner
and nature of query supplied to the system. The input structure of the word. The words after stemming are analyzed
document may contain some words or sentences in other to check whether they are inflected or not. If stem word is
script or language. Here we are analyzing whether the input inflected then the root word is formed by addition of
document is valid in Devanagari script or not. The words replacement characters with stem word. A morphological
which are not valid to Devanagari script are simply removed analyzer is expected to produce Root words for a given input
from further processing. The aim of this phase is to maintain document. There is need to design some standard rules called
pure Devanagari script document as an input to inflection rule which will enable the system to process the
Morphological Analyzer. stem of words and find the actual Root word.
3.1.2 Tokenization: 3.4. Tag Generator
This Tokenization is the process of separating tokens from
Corpus linguistics is the study of language as expressed in
input text. The division of input text into tokens is important
samples (corpora) of "real world" text. Corpus is a large
for POS tagging. This tokenization task is possible by
collection of texts. It is a body of written or spoken material
searching spaces between the words. The words separated
upon which a linguistic analysis is based [9]. This phase

30
International Journal of Computer Applications (0975 – 8887)
Volume 119 – No.18, June 2015

assigns corresponding part of speech tags to the words and we Table 1. Tagset
have used tagset developed by IIIT Hyderabad [9] [10]. A
well-chosen tagset is important to represents parts of speech. Descripti
Name Tag Example
The language tagset represents parts of speech and consist on on
syntactic classes [8]. Common
NN मुलगा, साखर
Algorithm for POS tagging System: Nouns
Proper
1) Take input text and generate a token. NNP मोहन, राम, सुरेश
Nouns
2) Use tokens to generate stem of word.
Verbal
3) Use rule to generate root word using morphological NV समर्पण, कुरवंडी
noun
analyzer and stored them. NOUN
Noun
4) Select each word one by one from array and compare with Denoting
corpus. Spatial
5) If word is found one or more times, then store associated NST and मागे, र्ुढे, वर, खाली
tag or tags of word and else display “the word is not found” Temporal
add this new word into corpus. Expressio
ns
6) If one tag is stored, then display word with associated tag
as an output. PPN
Personal मी, आम्ही, तम्
ु ही
pronoun
माझा, माझी, तुझा
7) Else apply rule to select most appropriate tag for word. Possessive
PPS
pronoun
3.5. Disambiguation PDM Demonstr
The Natural language has the ambiguity issues as the single /DE ative तो, ती, ते, हा, ही
word has different tags. To overcome the ambiguity issues M pronoun
and assigning a "correct” tag in particular contexts,
disambiguation rules are required. Disambiguation is based on आर्ण, आम्ही,
Reflexive
तुम्ही, तुम्हाला
contextual information or word/tag sequences. The ambiguity PRONOUN PRF
pronoun
which is identified in the tagging module is resolved using the
Marathi grammar rules. एकमेकांचा,
Following example demonstrates processing of our system: PRC
Reciproca एकमेकाला,
l pronoun
Input Query: . It is आर्ल्याला
given in history of Marathi Language. Interrogati
PIN ve काय,कसे,कोण,का
Validation: .
pronoun
Tokenization: उत्साही,
Modifier
JJ of Noun श्रेष्ठ,बळवान
ADJECTIVE
ककती,र्ुष्कळ,खूर्,भ
रर्ूर,चचक्कार
MQ Quantifier

बसणे, दिसणे,
Verb
ललदहणे,र्डला,
VM
Main
VERB
नाही, नको, करणे,
Verb
हवे, नये
VAX
Auxiliary
.
आता, काल, कधी,
ADVERB ADV (Modifier
नेहमी, लवकर
Stemmer: .
/RB of Verb)
Morphological Analyzer: .
आणण, र्ण, जर,
Coordinati
CONJUNCTION ng and
तर
POS Tagging Output: \\NN \\NN \\NNP CC
Subordina
\\NN \\JJ \\VM . \\RD_PUNC ting
आणण, वर, कडे,
POSTPOSITION Postpositi
जवळ
PSP
on

INTERJECTION INJ
Interjectio आहा, छान, अगो
n

31
International Journal of Computer Applications (0975 – 8887)
Volume 119 – No.18, June 2015

NU Cardinal एक, िोन, तीन 5. REFERENCES


NUMERAL MCD Numeral [1] Jyoti Singh Nisheeth Joshi Iti Mathur “Development of
NU Ordinal र्दहला,िस
ु रा Marathi Part of Speech Tagger Using Statistical
MO Numeral Approach”, Advances in Computing, Communications
Foreign and Informatics (ICACCI), 2013.
RDF word English, गुजराती
[2] H.B. Patil, A.S. Patil, B.V. Pawar “Part-of-Speech
residual
RESIDUAL Tagger for Marathi Language using Limited Training
Symbol $, &, *, (, ) Corpora” 2014 in International Journal of Computer
RDS
residual Applications (0975 –8887) Recent Advances in
RD_ Information Technology.
Punctuatio ?,;:!
PUN
n
C [3] Pallavi Bagul, Archana Mishra, Prachi Mahajan,
REDUPLICATIO Medinee Kulkarni, Gauri Dhopavkar, "Rule Based POS
RDP
Reduplica जवळ-जवळ Tagger for Marathi Text" 2014 in proceeding of:
N tions
International Journal of Computer Science and
काळे मांजर- Information Technologies, Vol. 5 (2) , 2014, 1322-1326 .
COMPOUNDS XC
Compoun काळमांजर, [4] Jyoti Singh Nisheeth Joshi Iti Mathur “PART OF
ds
तेलर्ाणी- तेलर्ाणी SPEECH TAGGING OF MARATHI TEXT USING
TRIGRAM METHOD” 2013 International Journal of
NEGATIVE NEG Negative नाही,नको Advanced Information Technology (IJAIT) Vol. 3, No.2.
[5] Nidhi Mishra, Amit Mishra, ”Part of Speech Tagging for
DEMONSTRATI
QF
Quantifier बहुत, थोडा, कम Hindi Corpus” 2011 in proceeding of : International
VE s Conference on Communication Systems and Network
QUESTION Technologies.
WQ
Question काय, कधी, कु ठे
WORDS Words [6] Namrata Tapaswi Suresh Jain, “Treebank Based Deep
Grammar Acquisition and Part- Of-Speech Tagging for
खर्
ू , Sanskrit Sentences” Software Engineering (CONSEG),
INTENSIFIER INTF Intensifier
फार,बराच,अततशय 2012 CSI Sixth International Conference on.
[7] Javed Ahmed MAHAR Ghulam Qadir MEMON,”Rule
PARTICLES RP Particles तर,ओहो
Based Part of Speech Tagging of Sindhi Language” 2010
proceeding of International Conference on Signal
नमस्कार, Acquisition and Processing.
PHRASE PHR Phrase
अलभनंिन,खेि आहे
[8] Sankaran Baskaran , Kalika Bali1, Tanmoy
Bhattacharya, Pushpak Bhattacharyya, Monojit
Echo जेवणबबवण, Choudhury, Girish Nath Jha, Rajendran S.5, Saravanan
ECHO ECH
Word डोकेबबके,र्डणबबडण K.1, Sobha L.6, and KVS Subbarao “A Common Parts-
of-Speech Tagset Framework for Indian Languages”.
4. CONCLUSION [9] Kh Raju Singha Bipul Syam Purkayastha Kh Dhiren
POS tagger is very useful in the context of Information Singha “Part of Speech Tagging in Manipuri: A Rule-
Extraction and Question Answering systems. A Rule-based based Approach ”International Journal of Computer
POS tagger simply assigns all possible tags to word and then Applications (0975 – 8887) Volume 51– No.14, August
uses context rules to remove disambiguation associated with 2012.
words. Rules based POS tagger would yield better accuracy
when there is large set of meaningful rules to remove [10] Bharati, A., Sharma, D.M., Bai, L., Sangal, R.,
disambiguation. Our proposed approach handles large set of “AnnCorra: Annotating Corpora Guidelines for POS and
rules for Marathi POS tagger. Chunk Annotation for Indian Languages” (2006).

IJCATM : www.ijcaonline.org
32

View publication stats

You might also like