Proposed Myanmar Word Tokenizer Based On LIPIDIPIKAR Treatise


Thein Than Thwin
University of Computer Studies, Mandalay
Mandalay, Myanmar
theinthanthwin@gmail.com

Aye Thida Win
University of Computer Studies, Mandalay
Mandalay, Myanmar
kabyar22@gmail.com

Phyo Phyo Wai
University of Computer Studies, Mandalay
Mandalay, Myanmar
phyophyowai81@gmail.com

Mie Mie Su Thwin
University of Computer Studies, Mandalay
Mandalay, Myanmar
miemiesuthwinster@gmail.com

978-1-4244-6349-7/10/$26.00 © 2010 IEEE

Abstract: Technologies based on Natural Language Processing (NLP) are becoming increasingly important, and future intelligent systems will rely on them more as the technology improves rapidly. Asia, however, is a challenging region for NLP because of its linguistic diversity, and many Asian languages are inadequately supported on computers. Myanmar is an analytic language, but it includes special characters such as killers and medials. In English and other European languages, all syllables are formed by combining letters that represent only consonants and vowels, whereas Myanmar uses compound syllables that are more difficult to analyze. This causes difficulties in word sorting. In our proposed system, the condensed form of ordinary Myanmar script is transformed into an analyzable elaborated script based on the LIPIDIPIKAR treatise written by Yaw Min Gyi U Pho Hlaing. Words in elaborated form can be sorted easily by following this treatise. We also compare the complexity of sorting Myanmar condensed words with that of sorting elaborated words.

Keywords: Phonetic token, Unicode, NLP, Condensed form, Elaborated form

Introduction

Natural language processing (NLP) is gradually becoming a more multidisciplinary field, but most NLP tools and technologies are tailored for English or other European languages. Recently, the IT industry has grown rapidly in many Asian countries. This is now the right time to reduce the linguistic, computational and computational linguistics gap between the 'more privileged' and 'less privileged' languages. This paper is centered on the use of NLP techniques in a framework for converting Myanmar condensed script to elaborated script (phonetic tokens) [3].

The Burmese (Myanmar) language is a tonal and analytic language. It uses the Burmese script, which derives from the Mon script and ultimately from the Brahmi script. The Myanmar language is said to have basically 33 consonants, viramas, dependent vowels and independent vowels (altogether about 20 vowels), and other medials and killers. The basic consonants can be extended by medials. Syllables or words are formed by combining consonants with vowels. However, some syllables are formed by consonants alone, without any vowel. The Myanmar script also includes other special characters. It is therefore complex to handle for NLP purposes.

English and other European languages are convenient for NLP systems because their syllables and words are formed by combining only consonants and vowels. The Myanmar language can also be written in that simple form. The idea is briefly described in the LiPiDiPiKar treatise written by Yaw Min Gyi U Pho Hlaing, which has been used in the development of telegraph communication since 1870. Our proposed system is based on the technique described in that treatise, modified for use in computer systems. The system changes the traditional writing form into the elaborated form and then converts it into equivalent English words that are ready to be used with higher-level natural language processing systems. The output can be used in two languages, Myanmar and English, for NLP applications such as ontology, word romanization and sorting. In our system, the output text is tested with the Microsoft speech synthesis engine.

I. RELATED WORK

Antonio Bonafonte and his group in Spain developed a system for generating a database of speech segments, and their paper summarizes the text-to-speech system developed in the Speech Group of the Universitat Politècnica de Catalunya (UPC) [2]. Richard Sproat proposed a model of text analysis for text-to-speech (TTS) synthesis based on weighted finite-state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system. The transducers are constructed using a lexical toolkit that allows declarative
descriptions of lexicons, morphological rules, numeral expansion rules, and phonological rules, inter alia. That model has been applied to eight languages: Spanish, Italian, Romanian, French, German, Russian, Mandarin and Japanese [8]. Unfortunately, these systems are not compatible with the Myanmar language. In January 2009, Alexandre Trilla presented a working paper that describes the use of Natural Language Processing techniques in the production of voice from an input text, i.e. Text-To-Speech synthesis, and in the inverse process, the production of a written text transcription from an input voice [1]. The core of a text-to-speech system is composed of four modules: text normalization, phonetic transcription, prosody generation and speech synthesis [2], [8], [4], [11]. To apply that approach, the writing format used in the Myanmar language should be transformed into a form similar to the writing format of those languages. Many approaches have tried to analyze Myanmar words for NLP. Most of them use words that are already stored in a repository such as a database [9], [10]. In some systems, a database is used to store Myanmar syllables, and the system extracts and matches those syllables many times [5], [6]. An important open issue is how all of the syllables can be stored in a database completely. Moreover, some systems are based on speech files of words, and speech files of words cannot cover all of the words in our language.

II. DIFFERENT KINDS OF ALPHABET

The sounds produced by living beings (mental sounds) can be represented by letters. To record the sounds, they are written on media such as paper or palm leaf. There are many peoples in the world and they use different languages, so the number of letters they use also differs. However, the number of speech units in any language is at most 67 and at least 12. Moreover, every language has only two kinds of speech sound, vowels and consonants. This is an important point for NLP systems such as text-to-speech or speech recognition systems. The terms vowel and consonant were introduced by the Greek grammarians for analyzing mental sounds, and later came to be used mainly for writing systems. Although languages differ from each other, all human beings belong to the same species. So the speech of one language can be written using the letters of another language.

A sound such as 'အ၊ အာ' can be heard directly by human beings, and the letters used to represent such sounds are called vowels. The letters that represent sounds which cannot form speech on their own, and are used only to support the meaning, are called consonants. When a consonant is pronounced alone, the sound interval is too short to be heard. So, to make a consonant audible, it must be used with a vowel. When a consonant is combined with a vowel, in most cases the consonant is written before the vowel. As an exception, some vowel signs such as ( ေ ) are written before the consonant.

A. Condensed writing versus Elaborated writing

There are two formats of writing, condensed writing and elaborated writing. The format used by French and English is elaborated writing; they do not use the condensed format in their standard writing. Myanmar, Sri Lankan and Bengali writing use the condensed format. In the condensed format, there are two written forms for vowels, and also two written forms for some consonants, so that the writing can be condensed.

Since Myanmar can be written in both the condensed and the elaborated format, the format should be selected according to the specific domain. For example, condensed writing should be used for writing on paper or other physical media, and the elaborated form should be used in software media such as telegraph communication and for some NLP tasks in computer systems. Nowadays, the elaborated form of Myanmar writing is rarely used and is not well known.

B. The origin of Myanmar Alphabets

Originally, the Myanmar language had 11 vowels and 35 consonants, as shown in Table I and Table II.

TABLE I. THE VOWELS OF MYANMAR LANGUAGE

အ  အာ  ဣ  ဤ  ဥ  ဦ  ဧ  ဧဲ  ဩ  ဪ  အို

TABLE II. THE ORIGINAL CONSONANTS OF MYANMAR LANGUAGE

က  ခ  ဂ  ဃ  င
စ  ဆ  ဇ  ဈ  ည
ဋ  ဌ  ဍ  ဎ  ဏ
တ  ထ  ဒ  ဓ  န
ပ  ဖ  ဗ  ဘ  မ
ယ  ရ  လ  ဝ  သ
ဟ  ဠ  အံ့  အံ  အး

The alphabet currently in use, however, has the 33 consonants listed in Table III and 11 vowels. The letter 'အ' originally belonged to the vowel group but is now classified as a consonant. Moreover, some of the consonants as well as the vowels are transformed into short symbol forms for use in the condensed writing format (the format currently in use). As a result, the condensed writing format is very complex for computerized processing. By using the original 46 letters, on the other hand, we can write all Myanmar words.

[Volume 7] 2010 2nd International Conference on Computer Engineering and Technology V7-137
TABLE III. THE CURRENTLY USED CONSONANTS OF MYANMAR LANGUAGE

က  ခ  ဂ  ဃ  င
စ  ဆ  ဇ  ဈ  ည
ဋ  ဌ  ဍ  ဎ  ဏ
တ  ထ  ဒ  ဓ  န
ပ  ဖ  ဗ  ဘ  မ
ယ  ရ  လ  ဝ  သ
ဟ  ဠ  အ

C. Condensed Vowels

As Myanmar words are compound syllables, the original vowels are transformed into a condensed form so that they can be combined with the consonants. In that process, the vowels completely change their shapes (symbols) and their names, and keep only their original sound to support the consonants. For example, အာ becomes ာ or ါ, ဣ becomes ိ, and so on. The list of vowels that can be combined with consonants is shown in Table IV.

TABLE IV. LIST OF THE CONDENSED VOWEL

Condensed:  -ါ  -ိ  -ီ  -ု  -ူ  ေ-  -ဲ  ေ-ါ  ေ-ါ်  -ို
Original:   အာ  ဣ  ဤ  ဥ  ဦ  ဧ  ဧဲ  ဩ  ဪ  အို

D. The Condensed consonant

Some of the original consonants are also transformed into a condensed form so that they can be combined with other consonants or vowels. Like the condensed vowels, they change their shapes and their names. Their sounds are used to support the other consonants or vowels in expressing the meaning. As an exception, န changes only its shape. Changing the form of ရ also changes the order of the letters when a Myanmar word is formed. The list of consonants that are transformed into a condensed form is shown in Table V.

TABLE V. LIST OF THE CONDENSED CONSONANTS

Condensed:  -ျ  ြ-  -ွ  -ှ  န (short form)
Original:   ယ  ရ  ဝ  ဟ  န

III. ELABORATED WRITING

We can convert condensed writing into elaborated writing by placing the consonants in front of the vowels. In this form, the letters are placed in the same order as the sounds in speech, which makes the text more flexible for computerized processing. For example, it can be used in sorting by comparing the corresponding letters, in a text-to-speech system by assigning a sound to each letter, or in indexing. Some relations between elaborated writing and condensed writing are shown in Table VI.

TABLE VI. RELATIONSHIP BETWEEN ELABORATED WRITING AND CONDENSED WRITING

Elaborated:  ကအ  ကအာ  ကဣ  ကဤ  ကဥ  ကဦ  ကဧ  ကဧဲ  ကဩ  ကဪ  ကအို
Condensed:   က  ကာ  ကိ  ကီ  ကု  ကူ  ကေ  ကဲ  ကော  ကော်  ကို

Elaborated Writing:  စအ၊ ကအား၊ ပရဩ၊ နအနး၊ လဥပ၊ စဧ၊ သဦ၊ စအာ၊ ရဧး၊ ကရဤး
Condensed Writing:   စ၊ ကား၊ ပြော၊ နန်း၊ လုပ်၊ စေ၊ သူ၊ စာ၊ ရေး၊ ကြီး

As shown in the above examples, elaborated writing includes only two types of letters (consonants and vowels). It eliminates the complexity of condensed writing (the form currently in use) and is suitable for some signaling processes.

Moreover, the sounds of the letters are close to the sounds of the English words shown in Table VII and Table VIII, so we can easily substitute them with the corresponding English words and use them directly in existing English text-to-speech systems.

TABLE VII. RELATIONSHIP BETWEEN MYANMAR CONSONANTS AND ENGLISH

က k     ခ kh    ဂ g     ဃ g     င ng
စ s     ဆ s     ဇ z     ဈ z     ည nj
ဋ tt    ဌ ht    ဍ d     ဎ d     ဏ n
တ t     ထ ht    ဒ d     ဓ d     န n
ပ p     ဖ ph    ဗ b     ဘ b     မ m
ယ y     ရ r     လ l     ဝ w     သ th
ဟ h     ဠ l     အံ့ ant  အံ an   အး arr

TABLE VIII. RELATIONSHIP BETWEEN MYANMAR VOWELS AND ENGLISH WORDS

အ a   အာ ar   ဣ i   ဤ ie   ဥ u   ဦ uu   ဧ ae   ဧဲ el   ဩ aw   ဪ or   အို o

IV. PROPOSED SYSTEM

[Figure 1: flowchart of the proposed framework with the blocks Text Input, Check Myanmar Lexical (with Interactive Correction on error), Text Normalization, Elaborated Writing Alphabets Scheme, Tokenize to Elaborated Myanmar Script, Text in Elaborated Myanmar Writing Form, Tokenize to English phonetic word, English phonetic Words, and Higher level Multilingual NLP Application.]

Figure 1. Proposed Myanmar Word Tokenizer.

The first step of our proposed system in Figure 1 is the conversion of the input text into a linguistic representation. This is a complex task, since the written form of the Myanmar language is an imperfect representation of the corresponding spoken form. Our system therefore has to solve the following problems:

• The Myanmar language does not delimit words with white space, so word boundaries must be reconstructed in the proposed system.

• Digit sequences need to be expanded into words. For example, "243" would generally be expanded as နှစ်ရာလေးဆယ့်သုံး.

• Abbreviations must be expanded into full words. For example, the Myanmar language uses ြ for ရ, and writes a letter under the previous one in order to omit ် (for example, ကမ္ဘာ for ကမ်ဘာ).

• The Myanmar script symbols that represent the sounds are sometimes written out of order. For example, in the word ကြ, although ြ is written in front of က, in actual speech the sound of က comes before the sound of ြ. So we need to reorder the letters.

• The Myanmar script uses more than one symbol for the same meaning, so we have to convert them into the same format. For example, several glyph variants of the medial ြ are used interchangeably, and there are many other such symbols.

• Many shortened forms of writing have to be transformed into the corresponding normal form. For example, သိင်္ဂီ has to be transformed into သိင်ဂီ.

Then our proposed system performs the segmentation process on the input. Every Myanmar word starts with one of the consonants or with a vowel such as ဣ, ဤ, ဥ, ဦ, ဧ, ဩ, ဪ or အို. So the system inserts a marker before each of them and performs segmentation according to that marker.

An elaborated writing alphabets scheme is also created in the same way as in the example conversions shown in Section III. The system performs the tokenizing process by using that scheme. Then the Myanmar letters are assigned the corresponding English words, as shown in Section III.

V. CASE STUDY

The meaning of a word does not play an important role in a text-to-speech conversion system, which is concerned only with the output sound. We therefore translate the text input to speech blindly, without trying to determine the meaning of the words, and we skip some TTS sub-processes such as semantic analysis and tagging. We use the Myanmar letters only as symbols that represent sounds.

The proposed system therefore starts with the text input process. The input text is checked against the grammar rules for Myanmar writing, and the interactive correction process is repeated until the input text is ready to be processed. Then the text is split into Myanmar words. Myanmar words written in condensed form are transformed into elaborated form according to the Myanmar elaborated script writing rules stored in the repository. All of the above steps are performed by the proposed Myanmar Word Tokenizer.

[Figure 3: block diagram in which the proposed framework outputs Text in Elaborated Myanmar Writing Form (to Display) and English phonetic Words (to Display), and the English phonetic words are passed to the Microsoft Speech Synthesizer to produce Speech.]

Figure 3. Text to speech conversion Demo.

The converted English words in Figure 3 are used as input for the text-to-speech engine developed by Microsoft, which is tested as the Myanmar text-to-speech system.
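The transformation from condensed to elaborated writing performed by the tokenizer can be illustrated with a short Python sketch. The rules below are inferred from the examples in Tables IV to VI and are an assumption, not the authors' rule repository. The function name `elaborate` is hypothetical, the input is assumed to be Unicode text in standard storage order (consonant first, then medials, then vowel signs), and stacked letters, kinzi, digits and other shortened forms are not handled. Tone marks are simply copied.

```python
# Sketch only: condensed -> elaborated conversion for simple syllables,
# with rules inferred from Tables IV-VI (an assumption of this example).

MEDIAL_TO_CONSONANT = {
    "\u103b": "\u101a",  # medial ya -> ya
    "\u103c": "\u101b",  # medial ra -> ra
    "\u103d": "\u101d",  # medial wa -> wa
    "\u103e": "\u101f",  # medial ha -> ha
}

# Condensed vowel signs -> original vowels; longest sequences first.
VOWEL_SIGNS = [
    ("\u1031\u102c\u103a", "\u102a"),        # e + aa + asat -> AU
    ("\u1031\u102b\u103a", "\u102a"),        # e + tall aa + asat -> AU
    ("\u1031\u102c", "\u1029"),              # e + aa -> O
    ("\u1031\u102b", "\u1029"),              # e + tall aa -> O
    ("\u102d\u102f", "\u1021\u102d\u102f"),  # i + u -> a + i + u
    ("\u102c", "\u1021\u102c"),              # aa -> a + aa
    ("\u102b", "\u1021\u102c"),              # tall aa -> a + aa
    ("\u102d", "\u1023"),                    # i sign -> I
    ("\u102e", "\u1024"),                    # ii sign -> II
    ("\u102f", "\u1025"),                    # u sign -> U
    ("\u1030", "\u1026"),                    # uu sign -> UU
    ("\u1031", "\u1027"),                    # e sign -> E
    ("\u1032", "\u1027\u1032"),              # ai sign -> E + ai
]

ASAT = "\u103a"           # the killer
INHERENT_VOWEL = "\u1021"  # written out when no vowel sign follows


def is_consonant(ch):
    return "\u1000" <= ch <= "\u1020"


def elaborate(text):
    """Rewrite condensed Myanmar text in elaborated form."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        out.append(ch)
        if not is_consonant(ch):
            continue  # tone marks and other characters are copied
        # Medials are written out as full consonants after the base.
        while i < len(text) and text[i] in MEDIAL_TO_CONSONANT:
            out.append(MEDIAL_TO_CONSONANT[text[i]])
            i += 1
        # A killed consonant carries no vowel: drop the killer.
        if text.startswith(ASAT, i):
            i += 1
            continue
        for signs, vowel in VOWEL_SIGNS:
            if text.startswith(signs, i):
                out.append(vowel)
                i += len(signs)
                break
        else:
            out.append(INHERENT_VOWEL)
    return "".join(out)
```

With these rules, the condensed words ပြော, နန်း and ကြီး from Table VI are rewritten as ပရဩ, နအနး and ကရဤး. The vowel-sign list is ordered longest first so that a combination such as ေ-ာ is replaced as a whole rather than sign by sign.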

VI. EXPERIMENTAL RESULT

To test the accuracy of our proposed system, a board of 20 members from different departments of our university (U.C.S.M), especially from the English and Myanmar departments, was set up. They judged whether the output sound of our system was good or poor (understandable or not). The panel of judges evaluated over 1600 words grouped into four main groups: consonants, vowels, killers and medials. They judged both versions of the words, those tokenized by our system and those that were not. The overall accuracy percentages of the four groups are shown in Figure 4.

[Figure 4: bar chart. Vertical axis: accuracy in percentage (0 to 100). Horizontal axis: Consonants, Vowels, Killers, Medial. Series: unmodified words, and words modified by our system.]

Figure 4. The accuracy comparison for the two versions of words

According to Figure 4, the accuracy percentages of the two versions are equal for words that use only consonants and vowels. For words that use killers and medials, however, there are significant differences between the two versions. So our system gives significant support for words that use killers or medials. Moreover, the majority of Myanmar words use killers or medials to represent speech.

VII. CONCLUSION

Many approaches have tried to analyze Myanmar words for NLP. Most of them use words that are already stored in a repository such as a database, but the set of Myanmar spoken words is not limited. Moreover, in some kinds of media translation, such as text-to-speech translation, the meaning of the word does not play an important role, and we can translate the text to speech blindly, as written. In some systems, a database is used to store Myanmar syllables, and the system extracts and matches those syllables many times. An important open issue is how all of the syllables can be stored in a database completely. Moreover, some systems are based on speech files of words, and speech files of words cannot cover all of the words in our language.

In some cases, processing Myanmar letters is more suitable than processing words, because there are only 46 letters in total. So, in our proposed system, we store all possible combinations of the Myanmar letters in a repository and convert the input into elaborated form, which is more suitable for sequential processing. In this system, we used an English text-to-speech system to test our output. The output letters (elaborated words) are substituted with the closest English words, so the speech cannot be exactly the same as human speech in Myanmar. In future research, we have to develop a Myanmar TTS system.

REFERENCES

[1] A. Trilla, Natural Language Processing techniques in Text-To-Speech synthesis and Automatic Speech Recognition, Departament de Tecnologies Mèdia, Enginyeria i Arquitectura La Salle (Universitat Ramon Llull), Barcelona, Spain, atrilla@salle.url.edu.

[2] A. Bonafonte, Ignasi Esquerra, A. Febrer, J. A. R. Fonollosa, F. Vallverdú, The UPC Text-To-Speech System for Spanish and Catalan, Universitat Politècnica de Catalunya, C/Jordi Girona 1-3, 08034 Barcelona, Spain.

[3] C. Strapparava and R. Mihalcea, "Learning to identify emotions in text," in SAC'08: Proceedings of the 2008 ACM Symposium on Applied Computing, (New York, NY, USA), pp. 1556–1560, ACM, 2008.

[4] D. García and F. Alías, "Emotion identification from text using semantic disambiguation," in Procesamiento del Lenguaje Natural, no. 40, pp. 75–82, March 2008.

[5] H. M. Oo, P. Y. Mon, K. T. Nakahrat, Y. Mikami, "Romanized Myanmar Input Method for Mobile Phone," Proceeding of the 7th International Conference on Computer Applications, pp. 233–237, February 2009.

[6] K. K. Oo, N. L. Thein, "Implementation of Text-to-Speech (TTS) System with Myanmar Language," Proceeding of the 4th International Conference on Computer Applications, pp. 337–343, February 2006.

[7] P. Hlaing (Yaw Min Gyi), LiPiDiPiKar Treatise, 1870.

[8] R. Sproat, Multilingual Text Analysis for Text-To-Speech Synthesis, Speech Synthesis Research Department, Bell Laboratories, Murray Hill, NJ, USA.

[9] T. H. Nwe, N. L. Thein, A Framework for Natural Language Translation: English_Myanmar Translation Process, pp. 331–336, February 2006.

[10] U. D. Reichel and H. R. Pfitzinger, "Text preprocessing for speech synthesis," in Proceedings of the TC-STAR Workshop on Speech-to-Speech Translation, (Barcelona, Spain), pp. 207–212, June 2006.

[11] V. Francisco and R. Hervás, "EmoTag: Automated Mark Up of Affective Information in Texts," in Proceedings of the Doctoral Consortium in EUROLAN 2007 Summer School (C. Forascu, O. Postolache, G. Puscasu, and C. Vertan, eds.), (Iasi, Romania), pp. 5–12, July–August 2007.
