2017 International Conference on Computer, Communications and Electronics (Comptelix)

Manipal University Jaipur, Malaviya National Institute of Technology Jaipur & IRISWORLD, July 01-02, 2017

HMM based Named Entity Recognition


for Inflectional Language
Nita V. Patil, Ajay S. Patil and B. V. Pawar
School of Computer Sciences
North Maharashtra University
Jalgaon (MS), India 425001
Email: nvpatil@nmu.ac.in, aspatil@nmu.ac.in and bvpawar@nmu.ac.in

Abstract - Named Entity Recognition (NER) is the problem of identifying named entities in natural language text, classifying them into various classes and assigning the proper class tag to each word in its context. This paper describes a Named Entity Recognition system for Marathi using the Hidden Markov Model (HMM). It addresses the problem of assigning the correct named entity class tag to each word using a probabilistic model trained on a manually tagged corpus for the Marathi language. The most probable named entity tag is assigned to each word using the Viterbi algorithm. The proposed system reports an overall F1-score of 62.70% when no preprocessing is applied, and an overall F1-score of 77.79% when preprocessing is applied to the same data. Thus, the performance of the system improves by 15% when linguistic knowledge is used to preprocess the test and training datasets.

Keywords - Hidden Markov Model, Marathi, Named Entity Recognition, Viterbi, Preprocessing

I. INTRODUCTION

Named Entity Recognition (NER) is an Information Extraction (IE) task which plays a significant role in many natural language processing tasks such as information retrieval, machine translation and question answering systems. Predefined entities in text such as people, organizations, locations and events, and expressions such as amounts, percentages, numbers, dates and times, are Named Entities (NEs). The identification of NEs in unstructured text and their classification into suitable NE classes is known as NER. This paper describes a system based on the Hidden Markov Model (HMM) to recognize named entities in Marathi. The objective of the system is to recognize twelve types of NEs, viz. Person, Organization, Location, Miscellaneous, Amount, Number, Date, Time, Year, Day and Measure, in Marathi text using a supervised learning technique based on HMM. The difficulties of unseen and poor probabilities caused by the sparseness of data are handled by replacing infrequent words. The Viterbi algorithm is used for decoding and word disambiguation. This paper is organized in five sections. The first two sections give the introduction and discuss related work. Section three describes the issues of Marathi NER that need to be addressed. The fourth section describes the supervised learning method for Marathi NER that uses HMM; it also includes details about the preprocessing, training and testing phases of the system. The final section compares the performance of the system with and without preprocessing.

II. RELATED WORK

Research on NER in different languages started during the 1990s. Patil et al. [1] presented a survey focused on NER research in foreign and Indian languages. The authors note that a lot of research has been done on NER for English, German, Chinese, etc., that good work is in progress for Indian languages, and that little NER work has been done for Marathi. A rule-based NER system for Marathi was developed by Patel et al. [2], in which inductive logic programming is used to construct language-specific rules; the system is built using the GATE framework. Kumar et al. [3] presented an approach in which NEs present in English are used to identify NEs in under-resourced languages, applying the bisecting k-means algorithm to cluster multilingual documents based on the identified NEs. Ekbal et al. [4] presented an HMM-based NER system for Bengali and Hindi and reported an average F1-score of 78.35% for Hindi.

III. DIFFICULTIES IN MARATHI NE RECOGNITION

Marathi is a morphologically rich, free-word-order and highly inflectional language. This section discusses some important challenges that must be considered in the development of a Marathi NER system.

A. Inflectional Nature of Marathi

To show the intensity of word inflection in Marathi, a frequent word was searched in 9,135 documents of the FIRE-2010 corpus. A total of 6,886 words were found that are inflected forms of that word. It is observed that various suffixes are added to the word to make meaningful context: 42 distinct words are formed by adding suffixes to it, as shown in Table I, which lists each word form and its frequency of appearance in the 9,135 documents. Variations of the word are not only syntactically different but also differ semantically.

978-1-5090-4708-6/17/$31.00 ©2017 IEEE
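The inflection study above, counting suffixed variants of a frequent stem across a corpus, can be sketched as follows. The stem, suffixes and corpus here are romanized stand-ins invented for illustration, not the paper's actual Devanagari data.

```python
from collections import Counter

def count_inflected_forms(tokens, stem):
    """Count surface forms that begin with the given stem.

    Approximates the study in Section III-A: every token that starts with
    the stem is treated as an inflected variant, and the frequency of each
    distinct variant is recorded.
    """
    variants = Counter(t for t in tokens if t.startswith(stem))
    total = sum(variants.values())
    return total, variants

# Illustrative corpus with Latin-script stand-ins for Marathi forms.
corpus = "bharat bharatane bharatala bharatane bharatacha pune".split()
total, variants = count_inflected_forms(corpus, "bharat")
print(total)            # total inflected occurrences -> 5
print(len(variants))    # distinct surface forms -> 4
```

In the real study the same counting is done over 9,135 documents, and the distinct-variant count is what motivates collapsing inflected forms before training.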


TABLE I
SAMPLE OF WORD INFLECTIONS IN MARATHI
[The table body is mis-encoded Devanagari in this extraction; it lists the 42 inflected word forms together with each form's frequency in the 9,135 documents.]

TABLE II
VARIATIONS IN WRITING STYLE
[The table body is mis-encoded Devanagari; it shows the same NEs written in three different styles (Style 1, Style 2, Style 3).]

At several places the same word is used to formulate different types of NEs. For instance, one word form appeared 3,022 times in 9,135 documents as part of Miscellaneous, Organization, Location and Event entities. This results in ambiguity that the statistical algorithm has to handle. If a word appears many times, that helps the algorithm predict its correct NE type, but many forms of the word appeared infrequently in the training data; for such word forms it becomes very difficult for the algorithm to predict the correct tag.

A similar study, on stemming of Marathi text terms, was made by H. B. Patil et al. [5], who show the variations of Marathi terms after suffixes are added. N. Patil et al. [6] explored 68 variations of a single word to illustrate the agglutinative nature of the Marathi language.

B. Writing Style Variations

Marathi is a free-word-order language: the positions of language components (nouns, verbs, adjectives) in a sentence can change freely, and such changes may not affect the sentence's semantics. A fixed position of nouns in a sentence can assist NE recognition, because proper nouns are mostly NEs of some type. Marathi names are very diverse, and many common nouns are used in naming persons. The language is flexible and does not force any standard style for writing names, hence no specific pattern for names can be discovered. Table II depicts three styles of writing the same NEs.

C. Polymorphic Textual Units

There are many variations in writing textual units in Marathi. Mhaske et al. [8] investigated many issues and challenges in Marathi text related to opinion mining, and concluded that it is very difficult to implement natural language processing tools for Marathi. The textual units of interest here are names of people, locations and organizations, dates, numbers and amounts. Names are written as full names, abbreviated full names using the Marathi character set, abbreviated full names using English letters spelled in Devanagari, only the first name, only the last name, a name with or without a title, and so on. Most organization names contain either a person or a location entity. Some sample ENAMEX NEs are shown in Table III, and NUMEX and TIMEX NEs are shown in Table IV. Numerals are written in English style, in Marathi style, or as number words in inflected form. Numbers written in multiple forms are part of almost every NE type, such as Date, Time, Amount and Measure. Contextual clue words are present for Time and Measure NEs, but it is very difficult to disambiguate between Amount and Number NEs. For instance, in a sentence meaning "11,125 houses were destroyed by the earthquake; the Maharashtra Government granted 1,00,000 to each", the first numeral (11,125) is a Number while the second (1,00,000) is an Amount. Implementing an NER system for Marathi is thus very difficult compared to English.

IV. PROPOSED LEMMATIZATION BASED NER SYSTEM ARCHITECTURE

The Named Entity Recognizer works in three phases. In the first phase, the tokenizing algorithm tokenizes raw Marathi text into individual tokens. These tokens are then subjected to preprocessing based on linguistic lexicons. In the second phase, the HMM is trained using the preprocessed data: word-tag frequencies are computed and language modeling is done. In the third phase (the test phase), the language models generated in phase two are subjected to Viterbi decoding. This phase outputs word-tag pairs, which are passed on to the evaluation module for determining the accuracy of the results. Fig. 1 shows the proposed preprocessing-based named entity recognition system for Marathi.

A. Dataset Details

The Marathi news corpus FIRE-2010, developed at IIT Bombay, is used to develop an NE-annotated corpus by manually tagging 12 types of NEs.
TABLE III
NAMED ENTITIES OF TYPE ENAMEX
(The Devanagari NE instances are mis-encoded in this extraction; the transliterations and descriptions are retained.)

NE Instance                              Description
K. B. Patil                              Multiword name in Marathi-style abbreviation
Jyulee Ditee                             English name in Devanagari
Liza Raymond                             Ambiguous surname (Raymond: Organization)
Mrudula Jayant Ghanekar                  Marathi-style person full name
J. V. R. Prasad Rao                      English name in Devanagari style
Palekarani                               Single-word person name with inflection
Satana Bhakshee                          Location
Sidako Bus Sthanak                       Name of a bus stand (Location)
Chandoli Abhayaranya                     Name of a forest (Location)
Naygaon Police Mukhyalaya                Mixed NE (Organization contains Location)
Deccan Queen                             Name of a train
77 vya Marathi Sahitya Sammelan          Name of a function
Bruhanmumbai Police Kayada Kalam 111     Law
Jaoo Titha Khaoo                         Marathi movie name
Rashtriya Junior Weightlifting Spardha   Sport competition

TABLE IV
NAMED ENTITIES OF TYPE NUMEX AND TIMEX

NE Instance                              Description
80 Thousand                              Number or Amount depending upon the context
11.83                                    Number (Marathi)
Two Lac Seventy Five Thousand Rupees     Amount in words in Marathi
Fifth                                    Single-word measure
140 Crore Rupees                         English numerals in an Amount
15 August                                Marathi number followed by English month in Devanagari
November 2006                            English month in Devanagari followed by year in Marathi

The Forum for Information Retrieval Evaluation (FIRE) aims to encourage research in South Asian language information access technologies. The FIRE-2010 dataset contains news stories from Maharashtra Times published between 2004 and 2007. 27,177 sentences of Marathi text have been manually annotated using the IOBES scheme. The training data developed for Marathi NER consists of 4,12,388 word forms. It is observed that many word-tag pairs are infrequent, most of them seen only once in the training corpus. Increasing the training corpus increases the frequent word counts, but rare words remain present in the training corpus in huge numbers. Hence words in infrequent word-tag pairs in the training corpus are replaced by the _RARE_ token. An independent test set is created by splitting the annotated data into two non-overlapping parts, one for training and the other for testing; the test data is held out and not seen during training. Table V shows the number of sentences, word forms and distinct word forms in the training and testing datasets, and Table VI gives the number of instances of each NE class in the train and test datasets. The Marathi Named Entity Recognizer is evaluated for tagging accuracy using held-out Dataset 1 and Dataset 2 (Dataset 2 is the preprocessed Dataset 1). The test datasets used are:

• Dataset 1: held-out dataset
• Dataset 2: Dataset 1 (preprocessed)

Fig. 1. NER System Architecture
[The block diagram shows text preprocessing with lexicons of Marathi numerals, units of measurement, date and time; tokenization; lexicon-based word replacement; n-gram counts and word-tag frequencies; language modeling; and Viterbi decoding feeding an evaluation module that reports precision, recall and F1-score.]
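The three-phase flow of Fig. 1, tokenize, replace via lexicons, then train and decode, can be sketched in outline as follows. The lexicon entries and sample text are romanized placeholders invented for illustration, not the paper's actual lexicons.

```python
# Minimal sketch of phases one and two of the pipeline: tokenization
# followed by lexicon-based word replacement. Lexicon and data are
# illustrative stand-ins, not the system's real resources.

def tokenize(text):
    # Phase 1: split raw text into individual tokens.
    return text.split()

def preprocess(tokens, lexicon):
    # Replace any token found in a lexicon group by that group's token,
    # so that many surface variants collapse into one uniform symbol.
    return [lexicon.get(t, t) for t in tokens]

lexicon = {"rupaye": "RUPEE", "hazar": "HAZAR"}   # illustrative entries
tokens = tokenize("don hazar rupaye")
print(preprocess(tokens, lexicon))  # ['don', 'HAZAR', 'RUPEE']
```

The replaced token stream is what the HMM is trained on in phase two, and what Viterbi decodes in phase three.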
TABLE V
DATASET DETAILS

#                 Training    Test
Sentences         26462       715
Words             401295      11093
Distinct Words    61957       5346

TABLE VI
TRAINING AND TEST DATA DETAILS

Features          Training    Test
People            11998       305
Organizations     7236        204
Locations         9723        292
Miscellaneous     3170        72
Amounts           1463        37
Numbers           6893        200
Measures          2887        34
Dates             1515        80
Years             384         11
Time              360         9
Months            193         4
Weekdays          441         15
Total NEs         46263       1263

B. Lemmatization for Preprocessing of Text

The agglutinative nature of Marathi introduces difficulties in NE recognition. A technique is needed that can deal with inflections so that the frequency counts of word-tag pairs increase significantly, allowing effective HMMs to be modelled and assisting the Viterbi algorithm in decoding the correct tag sequence for a given word sequence. Grammar allows text to be generated using different forms of a word, and families of derivationally related words with similar meanings exist. It is useful for text processing applications to replace each member of a derivationally related word family by the name of the family. Two techniques, stemming and lemmatization, are commonly used to deal with inflected word forms in NLP. Stemming usually chops off word endings, including the removal of derivational affixes. Lemmatization makes use of vocabulary and morphological analysis of words to chop off inflectional endings only and return the lemma. Lemmatization is a complicated process that groups related words by understanding word sense and the context in which words appear, in order to understand each word semantically; it covers basic word variations like singular vs. plural and thesaurus or synonym matching, and it normalizes the inflections.

The proposed technique is similar to lemmatization, except that derivationally related and inflected word forms are replaced by a specialized token, so that NEs are normalized and variations are reduced. Whereas lemmatization for Marathi maps inflected and derivationally related forms to a lemma (the original Devanagari examples are mis-encoded in this extraction), the proposed technique maps such forms to group tokens; for example, the inflected forms of the cubic volume units map to CUBVOL and the inflected forms of "dollar" map to DOLLOR.

The result of this mapping on a sample sentence replaces each numeral and clue word in place, yielding a token stream of the form NUM ... NUM HAZAR RUPEE ... NUM LAKH NUM HAZAR RUPEE ... NUM TAKKE ..., with the surrounding Marathi words (mis-encoded here) left unchanged.

Algorithm 1: Lemmatization

Input: text document
Initialization: L = set of lexicons; S = set of suffixes; W = set of words in the sentence; Found = False

for each word in W do
    if the word is found in the knowledge base and Found != True then
        replace the word form by the knowledge-representing token
        Found = True
        break
    end if
    if the word is not present in the knowledge base then
        for each suffix in S do
            if the word form ends with the suffix and the stripped form is in the knowledge base then
                replace the word form by the knowledge-representing token
                Found = True
                break
            end if
        end for
    end if
end for

The target word forms of the proposed algorithm (Algorithm 1) are only numerical NEs. A word form that could be a candidate for replacement is input to the algorithm, which searches for the word form in the knowledge base either as it appears or after stripping suffixes. Lexicons are developed, and the various types of numerals are replaced by a lexicon-representing token, prefixed and postfixed in the style of _RARE_, so that a uniform numeral count is obtained. Lexicons for different number patterns are developed: numbers in words, half and quarter numbers in words, currency, units, time, length, weight, electricity, temperature, area and volume measures. The replacement of matching Marathi tokens using these lexicons is termed preprocessing of the datasets; it is applied to both the training and test datasets. All the necessary groups of lexicons, and the number of lexicons in each group, are given in the following subsections.
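Algorithm 1 can be sketched directly in code: look a word up in the knowledge base as it appears, then by stripping known suffixes, and on a match replace it with the token representing its lexicon group. The knowledge base and suffix list below are romanized illustrations, not the paper's actual lexicons.

```python
# Sketch of Algorithm 1 (lemmatization-style replacement). Entries are
# illustrative stand-ins; DOLLOR follows the paper's token spelling.

KNOWLEDGE_BASE = {"rupaye": "RUPEE", "dollar": "DOLLOR"}
SUFFIXES = ["chya", "la", "ne"]  # illustrative Marathi-like suffixes

def replace_word(word):
    # Direct lookup first, as the word appears in the text.
    if word in KNOWLEDGE_BASE:
        return KNOWLEDGE_BASE[word]
    # Otherwise try stripping each suffix and looking up the stem.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[: -len(suffix)] in KNOWLEDGE_BASE:
            return KNOWLEDGE_BASE[word[: -len(suffix)]]
    return word  # no match: keep the word unchanged

print(replace_word("rupaye"))     # RUPEE
print(replace_word("dollarla"))   # DOLLOR (suffix 'la' stripped)
print(replace_word("ghar"))       # ghar (unchanged)
```

Collapsing all inflected variants of a unit or currency word into one token is what raises the word-tag pair counts that the HMM training relies on.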
(In Tables VII-X the token names and member counts are recovered from the extraction's substituted font; the Devanagari member examples are not recoverable.)

TABLE VII
NUMERICAL AND CLUE WORDS FOR TIME EXPRESSIONS

TOKEN      # Members
NUMT       336
NUMTD      48
NUMTPP     144
NUMTC      47
KAL        44
WAJE       16

TABLE VIII
NAMES IN DATE EXPRESSIONS

TOKEN      # Members
MONTH      127
DAYS       42

TABLE IX
UNITS OF TIME MEASUREMENT

TOKEN      # Members
TAS        20
SECOND     5
MINUTE     25
DIWS       35
THVD       26
MHIN       34
TRIKH      6
VRSH       66

TABLE X
UNITS OF AMOUNT EXPRESSIONS

TOKEN      # Members
RUPEE      48
PAISE      28
POUND      18
EURO       16
DOLLOR     18
BILLION    5
MILLION    15

1) Numeric Components of the Text: After studying many numerical and temporal expressions, it is found that the initial components of multiword NEs are common numerals followed by inflected clue words; these clue words decide the exact type of the NE. All the clue words found in the corpus are collected and categorized, where each category is represented by a specialized token, and whenever a clue word is found it is replaced by the respective token. Marathi uses many different types of tokens and their inflected forms to represent numbers. All tokens found in the corpus are grouped in ten categories: numbers built from Marathi digits and fractions, English digits and fractions, numbers representing ranks, number names and hundreds, number names representing quarter, half and three quarters, named fractions, and inflectional forms of the names for thousands, lakhs, crores, hundred crores, etc. These groups are assigned token names (e.g., NUMT, TAAS, MONTH, RUPEE) which represent the number group and are used as replacements for group members. Exclusive groups are formed to categorize numbers; sample members of each group and the number of members in each group are shown in Fig. 2.

2) Time and Date Expressions: Four groups, NUMT, NUMTD, NUMTPP and NUMTC, represent the numerical part of time expressions. A group of clue words (KAL) signals that the next expression must be a time, and another group of clue words (WAJE) follows the numerical component of a time expression. Table VII shows sample tokens (column one) to be replaced by the group token (column two), with the number of members in the group (column three). A date is made up of day, month and year components, all together or any two in any sequence. Day and month can be written using a number or a name, and a year can be a 2- or 4-digit number. To model any date format, two lists are created: the first, MONTH, includes 127 forms of Marathi and English month names and their inflections; the second includes 42 forms of Marathi and English day names and their inflections, as shown in Table VIII.

3) Time Measurement Expressions: NEs that represent temporal expressions differ semantically: one kind indicates the time of day and the other a time duration. Therefore the time of day is classified as a TIME NE and a time duration as a MEASURE NE. As shown in Table IX, eight different groups are created that represent all the tokens that can be part of a time measurement.

4) Amount Expressions: The number types shown in Fig. 2 are common components of amount, measure and number expressions. Each NE in these categories ends with a clue word that can be used to disambiguate among AMOUNT, MEASURE and NUMBER NEs. Table X shows the groups of clues that are part of AMOUNT NEs.

5) Measurement Expressions: A MEASURE NE is composed of a number and units of measurement. If the units can be identified correctly, it becomes easy to identify the MEASURE NE. Tables XI, XII, XIII, XIV and XV show the groups of clues for distance, electricity, weight, temperature and area measures respectively.

6) Miscellaneous Expressions: 42 clue words that are part of percentage expressions are grouped and replaced by the token TAKKE, and 28 clue words that are single-word measures are collected and replaced by the token SMEASURE. 63 clue words that do not represent any measure, categorized in Table XVI, but are used to express some distinct measurements (such as "first semester" or "eleventh in class") are replaced by the MISC and NUMMISC tokens respectively.
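The clue-word disambiguation in subsections 4-6 can be sketched as a lookup on the trailing group token of a preprocessed numeral phrase. The group contents below are small illustrative subsets, not the full lexicons.

```python
# Sketch of clue-token disambiguation among AMOUNT, MEASURE and NUMBER
# NEs: the phrase is typed by the group token of its final clue word.

AMOUNT_CLUES = {"RUPEE", "PAISE", "DOLLOR", "EURO"}   # Table X tokens
MEASURE_CLUES = {"KMS", "LITER", "TON", "TAS"}        # unit-group tokens

def classify_numeral_phrase(tokens):
    """Classify a preprocessed numeral phrase by its final clue token."""
    clue = tokens[-1]
    if clue in AMOUNT_CLUES:
        return "AMOUNT"
    if clue in MEASURE_CLUES:
        return "MEASURE"
    return "NUMBER"  # bare numeral with no disambiguating clue

print(classify_numeral_phrase(["NUM", "HAZAR", "RUPEE"]))  # AMOUNT
print(classify_numeral_phrase(["NUM", "KMS"]))             # MEASURE
print(classify_numeral_phrase(["NUM"]))                    # NUMBER
```

In the actual system this decision is left to the HMM, but replacing clue words by group tokens is what makes the distinction visible to it.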
Fig. 2. Number Formats
[The figure lists the ten number-token groups with sample members and member counts; its content is not recoverable from this extraction.]

(In Tables XI-XVI the token names and member counts are recovered from the extraction's substituted font; the Devanagari member examples are not recoverable.)

TABLE XI
UNITS OF DISTANCE MEASUREMENTS

TOKEN      # Members
MILE       4
KMS        17
CMS        8
FOOT       13
MM         5
METER      6
INCH       6

TABLE XII
UNITS OF ELECTRICITY MEASUREMENTS

TOKEN      # Members
UNIT       3
MWS        9
KV         6
VOLT       3

TABLE XIII
UNITS OF WEIGHT MEASUREMENTS

TOKEN      # Members
KGS        5
GMS        3
QUINTL     4
LITER      5
TON        10
METRIC     1
SQURE      11

TABLE XIV
UNITS OF TEMPERATURE MEASUREMENTS

TOKEN      # Members
CELCUIS    5
UNSH       7

TABLE XV
UNITS OF AREA MEASUREMENTS

TOKEN      # Members
CUBVOL     6
ACRE       7
GUNTHE     5
HECTRE     5

TABLE XVI
MISCELLANEOUS EXPRESSIONS

TOKEN      # Members
MISC       63
TAKKE      42
SMEASURE   28
NUMMISC    303

C. HMM Training

The treatment of HMMs in [7] is followed in the present study. The HMM is trained using a supervised training method based on the Maximum Likelihood Estimation (MLE) algorithm. An HMM relies on three parameters:

• A: matrix of tag transition probabilities
• B: matrix of observation probabilities
• π: matrix of the probability of each tag occurring in the initial state

The trigram HMM is defined as (K, V, λ), where

• K = {s1, s2, ..., sn}: finite set of possible states
• V = {w1, w2, ..., wn}: finite set of possible observation symbols
• λ = (π, A, B), where
  – π = {πi}: set of initial state probabilities, where πi is the probability that the system starts at state i
  – A = {aij}: set of state transition probabilities, where aij is the probability of going to state j from state i
  – B = {bi(wk)}: set of emission probabilities, where bi(wk) is the probability of generating symbol wk at state i

The λ model is created from large training samples by counting the frequencies of transitions and emissions, which are used to estimate the transition and observation probabilities of the λ model. The MLE algorithm estimates the parameters of the λ model as

  a(i, j, k) = Count(i, j, k) / Count(i, j)

and

  bi(wk) = Count(i → wk) / Count(i)

Two * symbols and one STOP tag are used to mark the start and end of each sentence. The probability of a state sequence s1, s2, ..., sn+1 for a given observation sequence w1, w2, ..., wn for NE tagging can be computed as

  P(w1, w2, ..., wn | s1, s2, ..., sn) ≈ ∏ (i = 1..n) q(si | si-2, si-1) × e(wi | si)

where q and e are the maximum likelihood parameter estimates.

D. Decoding

Decoding is the problem of predicting the most likely tag sequence for an input sequence: for a sequence of observations w1, w2, ..., wn, the problem is to find the most probable state sequence s1, s2, ..., sn which maximizes P(s1, s2, ..., sn | w1, w2, ..., wn), i.e.

  arg max (s1, ..., sn) P(s1, s2, ..., sn | *, *, w1, w2, ..., wn, STOP)

E. Unknown Words

The lexicons which are not seen in training are unseen or unknown words. The frequency count of an unseen word is zero, so its prediction probability also becomes zero; unknown words therefore must be treated properly. If the frequency of an observation in the test set is less than 5, that observation is treated as a rare, infrequent word. Unknown words in the test set are replaced by the pseudo-word _RARE_, and the probabilities of the unseen-word model _RARE_ are estimated from its counts just like those of other words in training.

V. RESULTS AND DISCUSSION

The performance of the system is measured using Precision (P), Recall (R) and F1-score (F1) metrics. The system was evaluated on test Dataset 1 as shown in Table XVII. The overall performance reported is 62.70%. The system recognized persons, locations, numbers and measures well, but other NEs were not recognized satisfactorily. Some technique is needed to help the Viterbi algorithm decode the test data more correctly.

TABLE XVII
PERFORMANCE OF SYSTEM FOR HELD OUT DATA SET 1

NE Class    P       R       F1
PER         83.10   77.38   80.14
ORG         52.66   58.33   55.35
LOC         73.05   70.55   71.78
MISC        31.13   45.83   37.08
NUM         64.10   75.00   69.12
AMT         50.00   72.97   59.34
DATE        42.25   88.24   57.14
MEASURE     57.85   87.50   69.65
YEAR        18.00   81.82   29.51
TIME        05.37   88.89   10.13
MONTH       20.00   75.00   31.58
WEEKDAY     43.75   93.33   59.58
Overall     55.73   71.66   62.70

Two important steps were performed to boost system performance. Preprocessing decreased confusion and assists the algorithm in focusing on contextual clues. Replacing unknown or rare words by the _RARE_ token in the test corpus only does not work; the HMM is also trained to treat unknown or _RARE_ tokens, which improved the performance of the system to a good extent. The system was tested on Dataset 2, the preprocessed version of held-out Dataset 1; its performance after preprocessing is shown in Table XVIII.

TABLE XVIII
PERFORMANCE OF SYSTEM FOR HELD OUT DATA SET 2

NE Class    P       R       F1
PER         84.94   76.03   80.24
ORG         82.40   55.08   66.03
LOC         87.11   61.01   71.76
MISC        49.30   49.30   49.30
NUM         89.85   90.31   90.08
AMT         75.00   81.08   77.92
DATE        94.29   97.06   95.65
MEASURE     94.59   93.33   93.96
YEAR        81.82   81.82   81.82
TIME        85.71   85.71   85.71
MONTH       100.0   75.00   85.71
WEEKDAY     100.0   100.0   100.0
Overall     84.34   72.18   77.79

The overall F1-score reported by the system is 62.70% for Dataset 1 and 77.79% for Dataset 2. Fig. 3 shows that the performance of the system increased by 15% after preprocessing was applied to both the test and training datasets.
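As a concrete illustration of the HMM training and decoding described above, the following sketch estimates transition and emission probabilities by MLE counting and decodes with Viterbi. It is deliberately simplified: the paper uses a trigram HMM with _RARE_ handling, while this sketch uses a bigram model over toy romanized data invented for illustration.

```python
# Simplified bigram HMM tagger: MLE counts give transition q(t|prev) and
# emission e(w|t); Viterbi recovers the best tag sequence. Toy data only.
from collections import defaultdict

tagged = [[("don", "NUM"), ("hazar", "NUM"), ("rupaye", "AMT")],
          [("pune", "LOC"), ("shahar", "O")]]

trans = defaultdict(lambda: defaultdict(int))
emit = defaultdict(lambda: defaultdict(int))
for sent in tagged:
    prev = "*"                       # start symbol, as in Section IV-C
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag
    trans[prev]["STOP"] += 1         # sentence-end marker

def q(t, prev):                      # MLE transition estimate
    total = sum(trans[prev].values())
    return trans[prev][t] / total if total else 0.0

def e(w, t):                         # MLE emission estimate
    total = sum(emit[t].values())
    return emit[t][w] / total if total else 0.0

def viterbi(words, tags):
    best = {"*": (1.0, [])}          # tag -> (score, path so far)
    for w in words:
        nxt = {}
        for t in tags:
            score, path = max(
                ((p * q(t, pt) * e(w, t), pp) for pt, (p, pp) in best.items()),
                key=lambda x: x[0])
            nxt[t] = (score, path + [t])
        best = nxt
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["don", "hazar", "rupaye"], ["NUM", "AMT", "LOC", "O"]))
# -> ['NUM', 'NUM', 'AMT']
```

The real system extends this to trigram transitions q(si | si-2, si-1) and replaces infrequent words by _RARE_ before the counts are taken.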
VI. CONCLUSION

This paper has demonstrated the use of a machine learning technique (HMM) for Named Entity Recognition in a highly inflectional language. The pre-processing of both the training and testing data using the lemmatization technique presented in this work, along with the replacement of rare terms by a special token, significantly improved the performance of the system, by 15%. The system also responds well to unseen data.

Fig. 3. F1-Score of Held Out and Preprocessed Held Out Data
[Bar chart comparing F1 (unprocessed) and F1 (processed) across NE categories; the chart itself is not recoverable from this extraction.]

ACKNOWLEDGMENT

The work presented in this paper is financially supported by the Rajiv Gandhi Science and Technology Commission, Government of Maharashtra, and by SAP DRS-II, UGC, New Delhi.

REFERENCES

[1] Nita Patil, Ajay S. Patil, and B. V. Pawar, Survey of Named Entity Recognition Systems with Respect to Indian and Foreign Languages, International Journal of Computer Applications (IJCA), 134(16):21-26, 2016.
[2] Anup Patel, Ganesh Ramakrishnan and Pushpak Bhattacharya, Incorporating Linguistic Expertise using ILP for Named Entity Recognition in Data Hungry Indian Languages, Proceedings of the 19th International Conference on Inductive Logic Programming, 178-185, 2009.
[3] N. Kiran Kumar, G. S. K. Santosh and Vasudeva Varma, A Language-independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents, Proceedings of the Second International Conference on Multilingual and Multimodal Information Access Evaluation, Springer-Verlag, 74-82, 2011.
[4] Asif Ekbal and Sivaji Bandyopadhyay, A Hidden Markov Model based Named Entity Recognition System: Bengali and Hindi as Case Studies, Proceedings of the 2nd International Conference on Pattern Recognition and Machine Intelligence, 545-552, 2009.
[5] Harshali B. Patil, Ajay S. Patil, and B. V. Pawar, A Comprehensive Analysis of Stemmers Available for Indic Languages, International Journal of Natural Language Computing (IJNLC), 5(6):45-55, 2016.
[6] Nita Patil, Ajay S. Patil, and B. V. Pawar, Issues and Challenges in Marathi Named Entity Recognition, International Journal of Natural Language Computing (IJNLC), 5(6):15-31, 2016.
[7] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Pearson Education International, 117-155, 2008.
[8] Neelima Mhaske and Ajay S. Patil, Issues and Challenges in Analyzing Opinions in Marathi Text, International Journal of Computer Science Issues (IJCSI), 13(2):19-25, 2016.
