Manipal University Jaipur, Malaviya National Institute of Technology Jaipur & IRISWORLD, July 01-02, 2017
Abstract—Named Entity Recognition (NER) is the problem of identifying named entities in natural language text, classifying them into various classes, and assigning the proper class tag to each word in its context. This paper describes a Named Entity Recognition system for Marathi using a Hidden Markov Model (HMM). It addresses the problem of assigning the correct named entity class tag to each word using a probabilistic model trained on a manually tagged corpus of the Marathi language. The most probable named entity tag is assigned to each word using the Viterbi algorithm. The proposed system reports an overall F1-score of 62.70% when no preprocessing is applied, and an overall F1-score of 77.79% when preprocessing is applied to the same data. Thus, the performance of the system improves by 15% when linguistic knowledge is used to preprocess the training and test datasets.

Keywords—Hidden Markov Model, Marathi, Named Entity Recognition, Viterbi, Preprocessing

I. INTRODUCTION

Named Entity Recognition (NER) is an Information Extraction (IE) task which plays a significant role in many natural language processing applications such as information retrieval, machine translation and question answering systems. Predefined entities in text, such as people, organizations, locations and events, and expressions such as amounts, percentages, numbers, dates and times, are Named Entities (NEs). Identifying NEs in unstructured text and classifying them into a suitable NE class is known as NER. This paper describes a system based on a Hidden Markov Model (HMM) to recognize named entities in the Marathi language. The objective of the system is to recognize twelve types of NEs, viz. Person, Organization, Location, Miscellaneous, Amount, Number, Date, Time, Year, Day and Measure, in Marathi text using a supervised learning technique based on an HMM. The difficulties of unseen and poorly estimated probabilities caused by data sparseness are handled by replacing less frequent words. The Viterbi algorithm is used for decoding and word disambiguation. This paper is organized in five sections. The first two sections give the introduction and discuss related work. Section three describes the issues of Marathi NER that need to be addressed. The fourth section describes the supervised learning method for Marathi NER that uses an HMM; it also includes details about the preprocessing, training and testing phases of the system. The final section of the paper compares the performance of the system without and with preprocessing.

II. RELATED WORK

Research on NER in different languages started during the 1990s. Patil et al. [1] presented a survey focusing on NER research in foreign and Indian languages. The authors note that a great deal of research has been done on NER in English, German, Chinese and other languages, that good work is in progress for Indian languages, and that little work has been done on NER for Marathi. A rule-based NER system for Marathi was developed by Patel et al. [2], in which inductive logic programming is used to construct language-specific rules; the NER system is built using the GATE framework. An approach is presented by Kumar et al. [3] in which NEs present in English are utilized to identify the NEs present in under-resourced languages, using the bisecting k-means algorithm to cluster multilingual documents based on the identified NEs. Ekbal et al. [4] presented an HMM-based NER system for Bengali and Hindi and reported an average F1-score of 78.35% for Hindi.

III. DIFFICULTIES IN MARATHI NE RECOGNITION

Marathi is a morphologically rich, free-word-order and highly inflectional language. Some important challenges in the development of a Marathi NER system are discussed in this paper and must be considered when building the system.

A. Inflectional Nature of Marathi

To show the intensity of word inflection in Marathi, a frequent word � was searched in 9,135 documents of the FIRE-2010 corpus. A total of 6,886 words were found that are inflected forms of the word �. It is observed that various suffixes are added to the word � to form meaningful context: 42 distinct words are formed by adding suffixes to the word, as shown in Table I, which lists each word form and its frequency of appearance in the 9,135 documents. Variations of a word are not only syntactically different but also differ semantically; at several places the same word form is used in different senses.
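The inflection count above can be reproduced in outline with a short script. This is a minimal sketch, assuming a directory of plain-text corpus files; the stem string is a placeholder, and prefix matching is only a rough stand-in for proper morphological analysis.

```python
# Sketch: count distinct suffixed forms of a stem across a corpus, in the
# spirit of the FIRE-2010 analysis above. Directory layout and stem are
# hypothetical; prefix matching approximates true inflection.
import os
from collections import Counter

def inflected_forms(stem, corpus_dir):
    forms = Counter()
    for name in os.listdir(corpus_dir):
        with open(os.path.join(corpus_dir, name), encoding="utf-8") as f:
            for token in f.read().split():
                if token.startswith(stem):
                    forms[token] += 1  # one entry per distinct word form
    return forms

# forms = inflected_forms("<stem>", "fire2010/")
# len(forms) -> number of distinct inflected forms (42 in Table I)
```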
TABLE III
NAMED ENTITIES OF TYPE ENAMEX

TABLE IV
NAMED ENTITIES OF TYPE NUMEX AND TIMEX

Fig. 1. NER System Architecture (pipeline: tokenization; word replacement based on lexicons of Marathi numerals, units of measurement, dates and times; 1-, 2-, 3- and 4-gram counts and word-tag frequencies; language modeling; Viterbi decoding; evaluation by Precision, Recall and F1-score)

Sentences of Marathi text have been manually annotated using the IOBES scheme. The training data developed for Marathi NER consists of 4,12,388 word forms. It is observed that many word-tag pairs are infrequent, most of them seen only once in the training corpus. Enlarging the training corpus increases the counts of frequent words, yet a huge number of rare words remain in it. Hence, the words of infrequent word-tag pairs in the training corpus are replaced by the _RARE_ token. An independent test set is created by splitting the annotated data into two non-overlapping parts, one for training and the other for testing; the test data is held out and not seen during training. Table V shows the number of sentences, word forms and distinct word forms in the training and test datasets, and Table VI gives the number of instances of each NE class present in the train and test datasets. The Marathi Named Entity Recognizer is evaluated for tagging accuracy using held-out Dataset 1 and Dataset 2 (Dataset 2 is the preprocessed version of Dataset 1). The test datasets used are
• Dataset 1: Held Out Dataset
• Dataset 2: Dataset 1 (Preprocessed)
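A minimal sketch of the rare-word replacement described above, assuming sentences are represented as lists of (word, IOBES-tag) pairs; the frequency cutoff of 5 follows the unknown-word treatment in Section E below.

```python
# Replace infrequent word forms in the training corpus with _RARE_ so the
# HMM reserves probability mass for unseen words at test time.
from collections import Counter

RARE, CUTOFF = "_RARE_", 5

def replace_rare(sentences):
    counts = Counter(w for sent in sentences for w, _ in sent)
    return [[(w if counts[w] >= CUTOFF else RARE, tag) for w, tag in sent]
            for sent in sentences]
```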
TABLE V
DATASET DETAILS

#         | Training  | Test
Sentences | 26,462    | 715
Words     | 4,01,295  | 11,093

The word mappings in the proposed technique are as given below:
• ⟨Marathi cubic-volume expression⟩ ⇒ CUBVOL
• ⟨Marathi dollar expression⟩ ⇒ DOLLOR
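The mapping step can be pictured as a dictionary lookup over the lexicons of Fig. 1. A minimal sketch follows; the lexicon entries are placeholders, since the actual system loads them from its Marathi lexicons of numerals, units, dates and times.

```python
# Lexicon-driven mapping: inflected Marathi unit and currency expressions
# collapse onto single class tokens such as CUBVOL or DOLLOR before tagging.
LEXICON = {
    # "<Marathi cubic-volume expression>": "CUBVOL",
    # "<Marathi dollar expression>": "DOLLOR",
}

def map_tokens(tokens):
    return [LEXICON.get(tok, tok) for tok in tokens]
```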
TABLE VII
NUMERICAL AND CLUE WORDS FOR TIME EXPRESSIONS

TABLE VIII
NAMES IN DATE EXPRESSIONS
Fig. 2. Number Formats
TABLE XI
UNITS OF DISTANCE MEASUREMENTS

Measure Types | TOKEN | # Members
⟨Marathi forms⟩ | MILE | 4
⟨Marathi forms⟩ | KMS | 17
⟨Marathi forms⟩ | CMS | 8
⟨Marathi forms⟩ | FOOT | 13
⟨Marathi forms⟩ | MM | 5
⟨Marathi forms⟩ | METER | 6
⟨Marathi forms⟩ | INCH | 6

TABLE XV
UNITS OF AREA MEASUREMENTS

Measure Types | TOKEN | # Members
⟨Marathi forms⟩ | CUBVOL | 6
⟨Marathi forms⟩ | CRE | 7
⟨Marathi forms⟩ | GUNTHE | 5
⟨Marathi forms⟩ | HECTRE | 5

(Temperature Measure Types | TOKEN | # Members: ⟨Marathi forms⟩ | CELCIUS | 5; ⟨Marathi forms⟩ | UNSH | 7)

TABLE XVI
MISCELLANEOUS EXPRESSIONS

The HMM model λ is specified by:
π = {π_i}: set of initial state probabilities, where π_i is the initial probability that the system starts at state i.
A = {a_ij}: set of state transition probabilities, where a_ij is the probability of going to state j from state i.
B = {b_i(W_k)}: set of emission probabilities, where b_i(W_k) is the probability of generating symbol W_k at state i.
The λ model is created from large training samples by counting the frequencies of transitions and emissions, which are used to estimate the transition and observation probabilities of the λ model. The MLE algorithm is used to estimate the parameters of λ as

a_{ijk} = \frac{\mathrm{Count}(i,j,k)}{\mathrm{Count}(i,j)}

and

b_i(W_k) = \frac{\mathrm{Count}(i \mapsto W_k)}{\mathrm{Count}(i)}
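A count-based sketch of these estimates, assuming tagged sentences are lists of (word, tag) pairs and using the *, STOP padding introduced below; the resulting q and e tables are the trigram-transition and emission parameters used in Eq. (1).

```python
# MLE estimation for the formulas above: q(s_k | s_i, s_j) from trigram and
# bigram tag counts, e(w | s) from emission and state counts.
from collections import Counter

def estimate_hmm(tagged_sentences):
    trans, ctx, emit, state = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["*", "*"] + [t for _, t in sent] + ["STOP"]
        for a, b, c in zip(tags, tags[1:], tags[2:]):
            trans[(a, b, c)] += 1   # Count(i, j, k)
            ctx[(a, b)] += 1        # Count(i, j)
        for w, t in sent:
            emit[(t, w)] += 1       # Count(i -> W_k)
            state[t] += 1           # Count(i)
    q = {k: n / ctx[k[:2]] for k, n in trans.items()}
    e = {k: n / state[k[0]] for k, n in emit.items()}
    return q, e
```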
Two * symbols and one STOP tag are used to mark the start and end of each sentence (so s_{-1} = s_0 = * and s_{n+1} = STOP). The probability of a state sequence s_1, s_2, ..., s_{n+1} for a given observation sequence w_1, w_2, ..., w_n for NE tagging can be computed as

P(s_1, \ldots, s_{n+1}, w_1, \ldots, w_n) = \prod_{i=1}^{n+1} q(s_i \mid s_{i-2}, s_{i-1}) \times \prod_{i=1}^{n} e(w_i \mid s_i) \quad (1)

where q and e are the maximum likelihood parameter estimates.
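Eq. (1) transcribes directly into code. This sketch assumes the q/e tables from the estimation sketch above and the same *, STOP padding.

```python
# Joint probability of a tag sequence s_1..s_n for words w_1..w_n per Eq. (1):
# n+1 transition factors (including the STOP transition) and n emissions.
def sequence_prob(words, states, q, e):
    tags = ["*", "*"] + list(states) + ["STOP"]
    p = 1.0
    for i in range(2, len(tags)):        # q(s_i | s_{i-2}, s_{i-1})
        p *= q.get((tags[i - 2], tags[i - 1], tags[i]), 0.0)
    for w, s in zip(words, states):      # e(w_i | s_i)
        p *= e.get((s, w), 0.0)
    return p
```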
D. Decoding

Decoding is the problem of predicting the most likely tag sequence for an input sequence: for a sequence of observations w_1, w_2, ..., w_n, the problem is to find the most probable state sequence s_1, s_2, ..., s_n which maximizes P(s_1, s_2, \ldots, s_n \mid w_1, w_2, \ldots, w_n), i.e.

\arg\max_{s_1, \ldots, s_n} P(s_1, s_2, \ldots, s_n, w_1, w_2, \ldots, w_n, \mathrm{STOP})
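A compact trigram Viterbi sketch matching the arg max above, using the q/e tables from the estimation sketch. The small probability floor for unseen events is an assumption of this sketch; the paper instead relies on the _RARE_ treatment described next.

```python
# Trigram Viterbi: pi[(k, u, v)] is the best log-probability of any tag
# prefix of length k ending in tags (u, v); bp holds back-pointers.
import math

def viterbi(words, q, e, tags, floor=1e-12):
    def lq(a, b, c): return math.log(q.get((a, b, c), floor))
    def le(s, w): return math.log(e.get((s, w), floor))
    def K(k): return ["*"] if k <= 0 else tags   # candidate tags at position k

    n = len(words)
    pi, bp = {(0, "*", "*"): 0.0}, {}
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best, arg = -math.inf, None
                for t in K(k - 2):
                    score = pi[(k - 1, t, u)] + lq(t, u, v) + le(v, words[k - 1])
                    if score > best:
                        best, arg = score, t
                pi[(k, u, v)], bp[(k, u, v)] = best, arg

    # close with the STOP transition, then follow back-pointers
    u, v = max(((a, b) for a in K(n - 1) for b in K(n)),
               key=lambda ab: pi[(n, ab[0], ab[1])] + lq(ab[0], ab[1], "STOP"))
    seq = [None] * (n + 1)
    seq[n] = v
    if n >= 2:
        seq[n - 1] = u
    for k in range(n - 2, 0, -1):
        seq[k] = bp[(k + 2, seq[k + 1], seq[k + 2])]
    return seq[1:]
```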
E. Unknown Words

Word forms not seen in training are unseen or unknown words. The frequency count of an unseen word is zero, hence its predicted probability also becomes zero, so unknown words must be treated properly. If the frequency of an observation in the test set is less than 5, that observation is treated as a rare or infrequent word. Unknown words in the test set are replaced by the pseudo-word _RARE_, and the probabilities of the unseen-word model _RARE_ are estimated from its counts just like those of the other words in training.
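A sketch of this test-time treatment, where membership in the (non-rare) training vocabulary stands in for the frequency test; `vocab` is an assumed set of retained training word forms.

```python
# Replace out-of-vocabulary observations with _RARE_ before decoding, so
# their emission probabilities come from the trained _RARE_ model.
def normalize_unknown(words, vocab, rare="_RARE_"):
    return [w if w in vocab else rare for w in words]

# tags = viterbi(normalize_unknown(test_words, vocab), q, e, tag_set)
```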
V. RESULTS AND DISCUSSION

Performance of the system is measured using Precision (P), Recall (R) and F1-score (F1) metrics. The system is evaluated on test Dataset 1, as shown in Table XVII; the overall performance reported is 62.70%. The system recognizes persons, locations, numbers and measures well, but other NEs are not recognized satisfactorily, so some technique is needed to help the Viterbi algorithm decode the test data more correctly. Two important steps have been performed to boost system performance. Preprocessing has decreased confusion and helps the algorithm focus on contextual clues. Replacing unknown or rare words by the _RARE_ token in the test corpus alone does not work; the HMM is also trained to handle unknown (_RARE_) tokens, which has improved the performance of the system to a good extent. The system was tested on Dataset 2, the preprocessed version of held-out Dataset 1; its performance after preprocessing is shown in Table XVIII. The overall F1-score reported by the system for Dataset 1 is 62.70%, whereas 77.79% is reported for Dataset 2. Fig. 3 shows that the performance of the system increased by 15% after preprocessing was applied to the test as well as the training datasets.

TABLE XVII
PERFORMANCE OF SYSTEM FOR HELD OUT DATA SET 1

NE Class | P | R | F1
PER | 83.10 | 77.38 | 80.14
ORG | 52.66 | 58.33 | 55.35
LOC | 73.05 | 70.55 | 71.78
MISC | 31.13 | 45.83 | 37.08
NUM | 64.10 | 75.00 | 69.12
AMT | 50.00 | 72.97 | 59.34
DATE | 42.25 | 88.24 | 57.14
MEASURE | 57.85 | 87.50 | 69.65
YEAR | 18.00 | 81.82 | 29.51
TIME | 05.37 | 88.89 | 10.13
MONTH | 20.00 | 75.00 | 31.58
WEEKDAY | 43.75 | 93.33 | 59.58
Overall | 55.73 | 71.66 | 62.70

TABLE XVIII
PERFORMANCE OF SYSTEM FOR HELD OUT DATA SET 2

NE Class | P | R | F1
PER | 84.94 | 76.03 | 80.24
ORG | 82.40 | 55.08 | 66.03
LOC | 87.11 | 61.01 | 71.76
MISC | 49.30 | 49.30 | 49.30
NUM | 89.85 | 90.31 | 90.08
AMT | 75.00 | 81.08 | 77.92
DATE | 94.29 | 97.06 | 95.65
MEASURE | 94.59 | 93.33 | 93.96
YEAR | 81.82 | 81.82 | 81.82
TIME | 85.71 | 85.71 | 85.71
MONTH | 100.0 | 75.00 | 85.71
WEEKDAY | 100.0 | 100.0 | 100.0
Overall | 84.34 | 72.18 | 77.79
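For reference, the per-class figures in Tables XVII and XVIII follow the standard definitions; a minimal sketch over flat (gold, predicted) tag pairs:

```python
# Per-NE-class precision, recall and F1 (harmonic mean of P and R).
def prf(gold, pred, ne_class):
    tp = sum(g == p == ne_class for g, p in zip(gold, pred))
    fp = sum(p == ne_class and g != p for g, p in zip(gold, pred))
    fn = sum(g == ne_class and g != p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```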
VI. CONCLUSION
This paper has demonstrated the use of a machine learning technique (HMM) for Named Entity Recognition in a highly inflectional language. The pre-processing of both training and testing data using the lemmatization technique presented in this work, along with the replacement of rare terms by a special token, has significantly improved the performance of the system, by 15%. The system also responds well to unseen data.

Fig. 3. F1-Score of Held Out and Preprocessed Held Out Data (per-NE-category comparison of F1 (Unprocessed) and F1 (Processed))

ACKNOWLEDGMENT

The work presented in this paper is financially supported by the Rajiv Gandhi Science and Technology Commission, Government of Maharashtra, and by SAP DRS-II, UGC, New Delhi.

REFERENCES

[1] Nita Patil, Ajay S. Patil, and B. V. Pawar, "Survey of Named Entity Recognition Systems with Respect to Indian and Foreign Languages," International Journal of Computer Applications (IJCA), 134(16):21-26, 2016.
[2] Anup Patel, Ganesh Ramakrishnan, and Pushpak Bhattacharya, "Incorporating Linguistic Expertise using ILP for Named Entity Recognition in Data Hungry Indian Languages," Proceedings of the 19th International Conference on Inductive Logic Programming, 178-185, 2009.
[3] N. Kiran Kumar, G. S. K. Santosh, and Vasudeva Varma, "A Language-independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents," Proceedings of the Second International Conference on Multilingual and Multimodal Information Access Evaluation, Springer-Verlag, 74-82, 2011.
[4] Asif Ekbal and Sivaji Bandyopadhyay, "A Hidden Markov Model based Named Entity Recognition System: Bengali and Hindi as Case Studies," Proceedings of the 2nd International Conference on Pattern Recognition and Machine Intelligence, 545-552, 2009.
[5] Harshali B. Patil, Ajay S. Patil, and B. V. Pawar, "A Comprehensive Analysis of Stemmers Available for Indic Languages," International Journal of Natural Language Computing (IJNLC), 5(6):45-55, 2016.
[6] Nita Patil, Ajay S. Patil, and B. V. Pawar, "Issues and Challenges in Marathi Named Entity Recognition," International Journal of Natural Language Computing (IJNLC), 5(6):15-31, 2016.
[7] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Pearson Education International, 117-155, 2008.
[8] Neelima Mhaske and Ajay S. Patil, "Issues and Challenges in Analyzing Opinions in Marathi Text," International Journal of Computer Science Issues (IJCSI), 13(2):19-25, 2016.