
A Hybrid Approach to Vietnamese Word Segmentation using Part of Speech Tags

Dang Duc Pham, Giang Binh Tran, Son Bao Pham


Faculty of Information Technology
College of Technology
Vietnam National University, Hanoi

Abstract

Word segmentation is one of the most important tasks in NLP. This task, within the Vietnamese language and its own features, faces some challenges, especially in word boundary determination. To tackle the task of Vietnamese word segmentation, in this paper we propose the WS4VN system, which uses a new approach based on the Maximum matching algorithm combined with stochastic models using part-of-speech information. The approach can resolve word ambiguity and choose the best segmentation for each input sentence. Our system gives a promising result with an F-measure of 97%, higher than the results of existing publicly available Vietnamese word segmentation systems.

1. Introduction

Word segmentation is one of the most basic and important tasks in Natural Language Processing (NLP). In some studies, it is the foremost task that must be completed before further analysis [13]. For example, most text classification systems use word segmentation approaches combined with machine learning algorithms [9]. Word segmentation also plays a significant role in parsing and machine translation.
Word segmentation faces different challenges depending on the type of language. The task is very difficult for East Asian languages, such as Chinese, Japanese, Thai and Vietnamese, since they are isolating languages in which spaces are not always the boundaries between words and one word can consist of more than one token [1, 2, 7, 10]. Thus, to determine word boundaries we need higher-level analysis such as word form, morphology, syntax or semantic analysis.
Part-of-speech (POS) is also a useful piece of information for word segmentation. From a linguistic point of view, a difference in word segmentation gives a different result in POS tagging, and a POS tagging sequence brings about a correlative word segmentation result. Therefore, POS can be helpful for word segmentation and for resolving word ambiguity [3].
By analyzing existing approaches to Vietnamese word segmentation and considering the relationship between POS tagging and word segmentation, we propose a hybrid approach that uses the Maximum matching algorithm combined with statistical models and part-of-speech information to address the word ambiguity problem.
In section 2, we give an overview of the Vietnamese word segmentation task and its challenges. We cover related works in section 3 and then propose our approach in section 4. Section 5 describes our experimental setup and results. Section 6 concludes with pointers to our future work.

2. Vietnamese word segmentation: difficulties and challenges

In Vietnamese, the syllable is the smallest linguistic unit, and one word consists of one or more syllables [5, 7, 13]. This leads to an ambiguity problem when determining word boundaries. There are two types of ambiguity, namely cross ambiguity and overlap ambiguity. In cross ambiguity, some syllables themselves have meaning (can be a word), and their combination also has meaning. For example, in the sentence "Bàn là một công cụ học tập", "Bàn" means "desk", "là" means "is", and "Bàn là" means "iron". It is very difficult to tackle this problem. However, in Vietnamese, it occurs less frequently than overlap ambiguity [10]. Overlap ambiguity occurs in a situation where a syllable, when combined with the previous or the following syllable in a sentence, generates words. For example, in the sentence "Tốc độ truyền thông tin ngày càng cao", both "truyền thông" and "thông tin" are words. Since overlap ambiguity occurs more often, solving this kind of ambiguity can improve a Vietnamese word segmentation system.
Besides word boundary ambiguity, Vietnamese word segmentation faces the problem that many new words appear in a document. These new words are normally names that refer to people or locations, abbreviations of foreign words, currency units, etc. [5, 10].

3. Related works

There are many approaches tackling the word segmentation task. They are divided into 3 main categories: dictionary-based, statistics-based and hybrid-based [7, 16]. Figure 1 shows a diagram of these categories.
Traditionally, most studies use dictionary-based approaches because of their simplicity. This type of approach is the base for further studies. The two most effective techniques are Maximum matching (MM) and Longest matching (LM). Some word segmentation systems like MMSEG for the Chinese language also use these techniques [12]. In that system, the authors use the Maximum matching algorithm together with rules built from observation. This approach is simple yet gives a promising result. However, most dictionary-based approaches do not give high accuracy because they fail to solve many ambiguous cases.

Figure 1. Categories of word segmentation approaches (dictionary-based: LM, MM, MMSeg, ...; statistical: N-gram, ME, CRFs, HMM, ...; hybrid: MM&SVMs, MM&N-gram, WFST&Neural, online corpus & GA, ...)

Statistical approaches utilize information from a very large annotated corpus. Examples include the N-gram Language Model [2], Hidden Markov Model (HMM) [4], Conditional Random Fields (CRFs) [8] and Maximum Entropy (ME) [1]. This type of approach proves to be useful when deployed in different languages.
Hybrid approaches combine different approaches to make use of their individual advantages and overcome their disadvantages. Many hybrid models have been published and applied in many different languages. They consist of dictionary-based techniques (Maximum matching, Longest matching), statistics-based techniques (N-gram, CRFs, ME) and machine learning algorithms (Support Vector Machines - SVMs, Genetic Algorithm - GA) [11, 14, 18].
These approaches are well applied to the task of Vietnamese word segmentation. Some studies using techniques like the N-gram model with a 10-million-syllable corpus [13], Hidden Markov Model [15] and Maximum Entropy [6] show high accuracy. Some models are combinations of Maximum matching and SVMs [19], statistics of the Internet and a Genetic algorithm [17], WFST and a Neural network [7], or Maximum matching and an N-gram language model [10].

4. Our approach

In Vietnamese, a sentence itself has diversified structures where words may have many POS tags when taken out of context. However, when a sentence is put into a specific situation, the POS tagging sequence is unique. Therefore, there is a relationship between POS tagging and word segmentation in Vietnamese. In a particular situation, a wrong POS tagging can bring about a correspondingly wrong word segmentation. On the contrary, a right POS tagging can lead to the right correlative word segmentation. Based on this observation, we propose a Vietnamese word segmentation technique using POS tags. Our approach does not aim to tackle the tasks of POS tagging and word segmentation in parallel, but aims to take advantage of the POS tag information to address the word ambiguity problem.

4.1. Overview of our approach

As mentioned, our approach is a hybrid technique combining the Maximum matching algorithm, POS tags and statistical models to tackle the overlap ambiguity. Given a sentence S as input, the general formula for our approach is:

  (W*, T*) = argmax over (W, T) of P(W, T | S)
           = argmax over (W, T) of P(T | W) · P(W | S)
where W = w1, w2, …, wn is a word segmentation solution for S, and T = t1, t2, …, tn is the corresponding POS sequence.
Our system consists of 5 main steps: preprocessing to standardize the input data, segmentation to give potential segmentation candidates, POS tagging based on the Markov-1, Markov-2 and "front-back" hypotheses for each candidate, calculating the probability of the corresponding segmentation candidates, and finally, choosing the best segmentation candidate. Figure 2 shows the architecture of our system, called WS4VN.

Figure 2. Architecture of our system WS4VN (Input Sentence → Preprocessing → Maximum matching → Segmentations Set → POS Tagging → Calculating Probability → Output Sentence, using a Dictionary and a Corpus)

4.2. Preprocessing

There is ambiguity in the spelling rules of the Vietnamese language. The most typical cases include using "i" or "y" and where to put tone marks on syllables. Thus, the preprocessing step aims to normalize the data for further analysis using one standard. Moreover, this step aims to identify new words in the form of named entities such as names of companies, people, factoids, etc. We use regular expressions as the main technique to recognize named entities.
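For illustration only, the following Python sketch shows how a regular-expression pass of this kind could normalize a spelling variant and mark simple named-entity candidates. The patterns and the "qui"/"quy" rewrite rule are simplified assumptions, not the exact rules used in WS4VN.

```python
import re

# Illustrative patterns; the real system would use a richer rule set.
NUMBER_RE = re.compile(r"\d+([.,]\d+)*")                                  # factoids such as 3,5 or 1.000
PROPER_NAME_RE = re.compile(r"([A-ZĐ][a-zà-ỹ]+)(\s[A-ZĐ][a-zà-ỹ]+)+")     # capitalized multi-syllable names

def preprocess(sentence):
    """Normalize spelling and wrap simple named-entity candidates in markers."""
    # Example normalization: unify one "i"/"y" spelling variant (assumed rule).
    sentence = re.sub(r"\bqui\b", "quy", sentence)

    # Mark numbers and capitalized name sequences so that later steps
    # treat them as single, unsegmentable tokens.
    sentence = NUMBER_RE.sub(lambda m: f"<NE>{m.group(0)}</NE>", sentence)
    sentence = PROPER_NAME_RE.sub(lambda m: f"<NE>{m.group(0)}</NE>", sentence)
    return sentence

print(preprocess("Công ty Hòa Bình thu 1.000 tỉ đồng"))
```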
4.3. Maximum matching

We use the Maximum matching algorithm to find the segmentation candidates by segmenting the input sentence into a sequence with the smallest number of words. With the overlap ambiguity, there is often more than one solution. For example, "Tốc độ truyền thông tin ngày càng cao" can be segmented as:
• "Tốc độ | truyền | thông tin | ngày càng | cao"
• "Tốc độ | truyền thông | tin | ngày càng | cao"
The output of the Maximum matching algorithm is a set of segmented sequences with the smallest number of words.
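The following is a minimal Python sketch of this step under simple assumptions: the dictionary is a plain set of words, a dictionary word spans at most a few syllables, and any single syllable is accepted as a fallback word. It returns every segmentation that uses the minimum number of words, which is what the next step takes as candidates; it is not the authors' implementation.

```python
def maximum_matching(syllables, dictionary, max_len=4):
    """Return all segmentations of the syllable list that use the
    smallest possible number of words (overlap ambiguity keeps several)."""
    n = len(syllables)
    best = [float("inf")] * (n + 1)   # best[i]: min number of words covering syllables[:i]
    best[0] = 0
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            word = " ".join(syllables[i - k:i])
            if k == 1 or word in dictionary:          # single syllables always allowed
                best[i] = min(best[i], best[i - k] + 1)

    def expand(i):
        # Enumerate all segmentations of syllables[:i] that achieve best[i] words.
        if i == 0:
            return [[]]
        results = []
        for k in range(1, min(max_len, i) + 1):
            word = " ".join(syllables[i - k:i])
            if (k == 1 or word in dictionary) and best[i - k] + 1 == best[i]:
                results += [prefix + [word] for prefix in expand(i - k)]
        return results

    return expand(n)

dictionary = {"tốc độ", "truyền", "truyền thông", "thông tin", "tin", "ngày càng", "cao"}
syllables = "Tốc độ truyền thông tin ngày càng cao".lower().split()
for candidate in maximum_matching(syllables, dictionary):
    print(" | ".join(candidate))
```

Run on the example sentence, this prints exactly the two five-word candidates listed above.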
4.4. POS tagging

We tag each segmented sequence received from the previous step. We build a POS tagger using the Hidden Markov approach with 3 hypotheses: Markov-1, Markov-2 and "front-back". We also use a published POS tagger, VnQTAG [20], with an accuracy of 94%.

4.4.1. POS tagger using Hidden Markov model. The Hidden Markov Model (HMM) is a statistical model that is used to determine hidden parameters based on observed parameters. It is widely used, especially in POS tagging for an input sequence. We describe POS tagging as a Hidden Markov model (Figure 3).
• Hidden part: tag sequence T
• Observed part: word sequence W from the previous step
• Transition probability: ai-1,i = P(ti|ti-1) with the Markov-1 hypothesis, or ai-1,i = P(ti|ti-1, ti-2) with the Markov-2 hypothesis
• Output probability: bi = P(wi|ti)

Figure 3. HMM for the POS tagging task (tags ti-1, ti, ti+1 with transition probabilities ai-1,i, ai,i+1 and output probabilities bi-1, bi, bi+1 to the words wi-1, wi, wi+1)

A tagged sequence T = t1, t2, …, tn satisfies:

  T* = argmax over T of P(T|W)

  P(T|W) = P(W|T) · P(T) / P(W) ∝ P(W|T) · P(T)
         = P(w1, w2, …, wn | t1, t2, …, tn) · P(t1, t2, …, tn)

Suppose that if we know the POS of a word, it is possible to determine this word. Thus, the probability P(W|T) only depends on basic probabilities like P(wi|ti):
Then:

  P(W|T) = P(w1, w2, …, wn | t1, t2, …, tn) = ∏i P(wi|ti)

Further calculation for P(T):

  P(T) = P(t1, t2, …, tn) = P(t1) · P(t2|t1) · … · P(tn|t1, t2, …, tn-1)

We apply the Markov-1 hypothesis, in which the probability of a POS appearing in a sequence can be predicted when we know the previous POS. That means

  P(ti|t1, t2, …, ti-1) = P(ti|ti-1)

Then:

  P(T) = P(t1, t2, …, tn) = P(t1|t0) · P(t2|t1) · … · P(tn|tn-1) = ∏i P(ti|ti-1)

(t0 is a special tag representing the sentence's beginning.)
Finally we get:

  P(T|W) ∝ ∏i P(wi|ti) · P(ti|ti-1)

where the probabilities P(ti|ti-1) and P(wi|ti) can be estimated by the Maximum likelihood technique based on an annotated corpus.
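As a sketch of this estimation step, with a toy annotated corpus rather than the corpus used in the paper, the transition and output probabilities can be obtained by simple relative-frequency counting:

```python
from collections import Counter

# Toy annotated corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("tốc độ", "N"), ("truyền", "V"), ("thông tin", "N"), ("ngày càng", "R"), ("cao", "A")],
    [("thông tin", "N"), ("truyền", "V"), ("nhanh", "A")],
]

transition, emission = Counter(), Counter()
tag_count, prev_count = Counter(), Counter()
for sentence in corpus:
    prev = "<s>"                      # special tag t0 for the sentence beginning
    for word, tag in sentence:
        transition[(prev, tag)] += 1
        prev_count[prev] += 1
        emission[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag

def p_transition(tag, prev):          # P(ti | ti-1)
    return transition[(prev, tag)] / prev_count[prev] if prev_count[prev] else 0.0

def p_emission(word, tag):            # P(wi | ti)
    return emission[(tag, word)] / tag_count[tag] if tag_count[tag] else 0.0

print(p_transition("V", "N"), p_emission("truyền", "V"))
```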
In a similar manner, when applying the Markov-2 hypothesis we get:

  P(T|W) ∝ ∏i P(wi|ti) · P(ti|ti-1, ti-2)

Therefore, we can then use the Viterbi dynamic programming algorithm to solve the POS tagging task.
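A compact Python sketch of the Viterbi computation under the Markov-1 hypothesis is given below. It assumes P(ti|ti-1) and P(wi|ti) are already available as dictionaries (for example, estimated as above); it only illustrates the recurrence and is not the exact WS4VN implementation.

```python
def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    """Markov-1 Viterbi: find the tag sequence T maximizing prod_i P(wi|ti)*P(ti|ti-1)."""
    score = {start: 1.0}          # best probability of a tag path ending in each tag
    back = []                     # backpointers, one dict per word
    for w in words:
        new_score, pointers = {}, {}
        for t in tags:
            prev = max(score, key=lambda s: score[s] * p_trans.get((s, t), 0.0))
            new_score[t] = score[prev] * p_trans.get((prev, t), 0.0) * p_emit.get((t, w), 0.0)
            pointers[t] = prev
        score = new_score
        back.append(pointers)

    tag = max(score, key=score.get)           # best tag for the last word
    path = [tag]
    for pointers in reversed(back[1:]):       # follow backpointers to the first word
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path)), max(score.values())

# Usage with illustrative (made-up) probabilities:
tags = ["N", "V", "A", "R"]
p_trans = {("<s>", "N"): 1.0, ("N", "V"): 0.7, ("V", "N"): 0.6, ("N", "R"): 0.3, ("R", "A"): 1.0}
p_emit = {("N", "tốc độ"): 0.5, ("V", "truyền"): 1.0, ("N", "thông tin"): 0.5,
          ("R", "ngày càng"): 1.0, ("A", "cao"): 1.0}
print(viterbi(["tốc độ", "truyền", "thông tin", "ngày càng", "cao"], tags, p_trans, p_emit))
```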
4.4.2. POS tagger using "front-back" model. With the tagger using the Hidden Markov model, the probability of a POS tag appearing depends only on the previous tags. That means there is no relationship between the previous tags and the next tags. However, in reality, these relationships still exist. Thus, we propose a hypothesis called "front-back" in order to make use of this relationship. According to the "front-back" hypothesis, the probability of a POS tag ti depends on the previous tag ti-1 and the next tag ti+1, so we can calculate P(t1, t2, …, tn) as follows:

  P(ti|t1, …, ti-1, ti+1, …, tn) = P(ti|ti-1, ti+1)

  P(T) = P(t1, t2, …, tn) ≈ ∏i P(ti|ti-1, ti+1)

where t0 and tn+1 represent the sentence's first and last element respectively. Finally, we get:

  P(T|W) ∝ ∏i P(wi|ti) · P(ti|ti-1, ti+1)          (3)

Thanks to the general calculation (3), we propose a dynamic programming algorithm to tackle the POS tagging task. The algorithm is described below:
• Let F(i, ti, ti+1) be the maximum probability according to (3) when calculating up to the ith word, whose POS is ti, with the POS tag of the (i+1)th word being ti+1.
• Initialize F(0, t0, t1) = 1 for all t1 ∈ T.
• At step i:

  F(i, ti, ti+1) = max over ti-1 of [ F(i-1, ti-1, ti) · P(ti|ti-1, ti+1) · P(wi|ti) ],  ti+1 ∈ T

We use a Trace array to store the progress of the calculation so that we can trace back to determine the resulting POS tag sequence:

  Trace(i, ti, ti+1) = argmax over ti-1 of [ F(i-1, ti-1, ti) · P(ti|ti-1, ti+1) · P(wi|ti) ]
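The sketch below implements this F/Trace recurrence in Python for illustration. It assumes p_fb holds P(ti|ti-1, ti+1) keyed as (ti, ti-1, ti+1) and p_emit holds P(wi|ti), with boundary markers <s> and </s> playing the roles of t0 and tn+1; these conventions are assumptions for the sketch, not the paper's notation.

```python
def front_back_tag(words, tags, p_fb, p_emit, start="<s>", end="</s>"):
    """Dynamic programming for the "front-back" hypothesis:
    F(i, ti, ti+1) = max over ti-1 of F(i-1, ti-1, ti) * P(ti|ti-1, ti+1) * P(wi|ti)."""
    n = len(words)
    F = {(0, start, t1): 1.0 for t1 in tags}           # F(0, t0, t1) = 1
    trace = {}
    for i in range(1, n + 1):
        nexts = tags if i < n else [end]               # t_{n+1} is the end marker
        for ti in tags:
            emit = p_emit.get((ti, words[i - 1]), 0.0)
            for tnext in nexts:
                best_prev, best_p = None, 0.0
                prevs = tags if i > 1 else [start]
                for tprev in prevs:
                    p = (F.get((i - 1, tprev, ti), 0.0)
                         * p_fb.get((ti, tprev, tnext), 0.0) * emit)
                    if p >= best_p:
                        best_prev, best_p = tprev, p
                F[(i, ti, tnext)] = best_p
                trace[(i, ti, tnext)] = best_prev        # the Trace array

    # Pick the best tag for the last word and follow the Trace array back.
    tn = max(tags, key=lambda t: F.get((n, t, end), 0.0))
    path, tnext = [tn], end
    for i in range(n, 1, -1):
        tprev = trace[(i, path[-1], tnext)]
        tnext = path[-1]
        path.append(tprev)
    return list(reversed(path))
```

The probabilities P(ti|ti-1, ti+1) could be estimated by the same relative-frequency counting shown earlier, this time over (previous tag, tag, next tag) triples.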
4.4.3. Outer POS tagger VnQTAG. VnQTAG is a Vietnamese POS tagger published by T.M.H. Nguyen et al. with a reported accuracy of 94%, so we use it as an optional component in our system [20].

4.5. Calculating true probabilities and choosing the best solution

After the third step of deciding the POS tags for each word segmentation candidate, we calculate the probability P(T|W) for each segmented sequence. We consider it the true probability representing the possibility of T being the true tag sequence of the input sentence W:
According to Markov-1:

  P(T|W) ∝ ∏i P(wi|ti) · P(ti|ti-1)

According to Markov-2:

  P(T|W) ∝ ∏i P(wi|ti) · P(ti|ti-1, ti-2)

According to the "front-back" hypothesis:

  P(T|W) ∝ ∏i P(wi|ti) · P(ti|ti-1, ti+1)

The best candidate is the one with the highest probability.
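Putting the pieces together, this small sketch shows how the candidates produced by Maximum matching are compared under the Markov-1 scoring and the best one is chosen. The tag sequences and probability tables here are made up for illustration; in the real system they come from the POS taggers of section 4.4 and the annotated corpus.

```python
import math

def score(words, tags, p_trans, p_emit, start="<s>"):
    """log P(T|W) up to a constant: sum_i log[P(wi|ti) * P(ti|ti-1)] (Markov-1)."""
    total, prev = 0.0, start
    for w, t in zip(words, tags):
        p = p_emit.get((t, w), 0.0) * p_trans.get((prev, t), 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)        # log space avoids underflow on long sentences
        prev = t
    return total

# Two Maximum-matching candidates for the running example, with illustrative tags.
candidates = [
    (["Tốc độ", "truyền", "thông tin", "ngày càng", "cao"], ["N", "V", "N", "R", "A"]),
    (["Tốc độ", "truyền thông", "tin", "ngày càng", "cao"], ["N", "N", "N", "R", "A"]),
]
p_trans = {("<s>", "N"): 0.6, ("N", "V"): 0.3, ("V", "N"): 0.4, ("N", "N"): 0.2,
           ("N", "R"): 0.2, ("R", "A"): 0.5}
p_emit = {("N", "Tốc độ"): 0.1, ("V", "truyền"): 0.2, ("N", "thông tin"): 0.1,
          ("N", "truyền thông"): 0.05, ("N", "tin"): 0.05,
          ("R", "ngày càng"): 0.3, ("A", "cao"): 0.4}

best = max(candidates, key=lambda c: score(c[0], c[1], p_trans, p_emit))
print(" | ".join(best[0]))
```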
5. Experiment

5.1. Evaluation method

We use the F-measure to evaluate the performance of our word segmentation approach:
• Precision = N3 / N1 × 100
• Recall = N3 / N2 × 100
• F-Measure = 2 × Precision × Recall / (Precision + Recall)
where
• N1 is the number of words recognized by the system
• N2 is the number of words actually appearing in the corpus
• N3 is the number of words that are correctly recognized by the system
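A possible implementation of these counts, under the common assumption that a word is "correctly recognized" when the system output and the reference segmentation agree on its character span, is sketched below:

```python
def spans(words):
    """Convert a segmented word list into (start, end) character spans."""
    result, pos = set(), 0
    for w in words:
        result.add((pos, pos + len(w)))
        pos += len(w) + 1            # +1 for the separating space in the original sentence
    return result

def evaluate(system, reference):
    """Precision, Recall and F-Measure as defined in section 5.1."""
    n1, n2 = len(system), len(reference)                 # words recognized / words in the corpus
    n3 = len(spans(system) & spans(reference))           # correctly recognized words
    precision = n3 / n1 * 100
    recall = n3 / n2 * 100
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

reference = ["Tốc độ", "truyền", "thông tin", "ngày càng", "cao"]
system = ["Tốc độ", "truyền thông", "tin", "ngày càng", "cao"]
print(evaluate(system, reference))
```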
5.2. Data preparation

Our corpus consists of:
POS dictionary: we use 2 sets of POS tags, POS level 1 and POS level 2, with all tags defined in the corpus of VnQTAG [20]. POS level 2 contains the full set of 47 types of POS. POS level 1 contains 8 different tags, where level-2 tags having similar semantics are grouped together.
Vocabulary dictionary: 37454 words and their possible tags from the corpus of VnQTAG [20]. If a word is not defined, it is tagged with the X label.
N-gram training data: documents of various topics like literature, science, news journals, etc. They are manually segmented and POS tagged. The data contains 74756 words with 47 POS level-2 tags and some labels for special characters like sentence marks. We also do a statistical calculation on the training data with the 8 POS level-1 tags of VnQTAG [20].
Testing data: the testing data consists of 19417 sentences with approximately 490116 words, provided by H.P. Le [10]. The data has been manually word segmented.

5.3. Experimental parameters setup

Table 1 describes the list of parameters for the experiments. They are used to measure the performance of the system in each case. The results will be discussed and analyzed. The experiments can be grouped as follows:

Table 1: Experimental parameters list
No.  Parameter  Description (POS level and hypothesis used)
0    P1&M1      POS level 1 and Markov-1 hypothesis
1    P1&M2      POS level 1 and Markov-2 hypothesis
2    P2&M1      POS level 2 and Markov-1 hypothesis
3    P2&M2      POS level 2 and Markov-2 hypothesis
4    P1&M0      POS level 1 and "front-back" hypothesis
5    P2&M0      POS level 2 and "front-back" hypothesis
6    OMM        Only Maximum matching, no POS
7    AS11       Without Maximum matching, choose the best solution from all of the segmentation results
8    EPT11      Using the VnQTAG tagger instead of the tagger that we build

Parameters for POS level and hypothesis (No. 0-5): This set of parameters is to consider which level of POS tags and which hypothesis (Markov-1, Markov-2 or "front-back") will be used.
Parameters for the Maximum matching step (No. 6-7): These parameters are to evaluate the effectiveness of the Maximum matching algorithm when applied to the word segmentation task. We calculate the accuracy of our system in both of the following cases and compare them with each other:
• Using Maximum matching for word segmentation
• Not using Maximum matching for word segmentation
Parameters for choosing POS taggers (No. 8): They are to evaluate the performance of the system in the following cases:
• Using our POS tagger with the different hypotheses that we build
• Using the outer POS tagger VnQTAG
Comparing performance with other systems: We compare the performance of our system with JVnSegmenter using the CRF approach [5] and VnTokenizer using Maximum matching and N-gram [10].

5.4. Result and discussion

Table 2 shows the results of our system with the parameters for POS level and hypothesis. There is no great difference among the results with respect to the parameters' values. Thus, all of the POS levels and hypotheses appear to have a similar influence on the system performance. However, given the promising result and its simplicity, we propose using POS level 1 and the Markov-1 model. Moreover, it shows that the "front-back" model can bring about good performance and can be applied in many further applications.

Table 2: Results of the system with parameters for POS level and hypothesis
Parameter  Precision  Recall  F-Measure
P2&M0      97.156     96.711  96.933
P1&M0      97.161     96.716  96.938
P2&M2      97.147     96.703  96.924
P2&M1      97.169     96.724  96.946
P1&M2      97.177     96.732  96.954
P1&M1      97.185     96.740  96.962

Table 3 shows the results of the system with and without the Maximum matching algorithm. The system run with the Maximum matching algorithm and no POS information gives the lowest result, while the system run with the Maximum matching algorithm and POS information gives the highest result. This shows that, within our experiment, using Maximum matching to choose the set of segmentations with the smallest number of words, together with POS information, is better than using all of the word segmentation sequences. Moreover, the system with the Maximum matching algorithm has better complexity.

Table 3: Results of the system when running with and without the Maximum matching algorithm
Parameter  Precision  Recall  F-Measure
OMM        91.242     94.852  93.012
AS11       95.338     95.784  95.560
P1&M1      97.185     96.740  96.962

Table 4 shows the performance of our system when choosing different POS taggers. There is no great difference between using VnQTAG and using our POS tagger. This shows that the accuracy of the system does not depend on which tagger is used. However, our POS tagger results in a better running time for the whole system, as we do not have to call an outer POS tagger to get the POS sequence.

Table 4: Performance of the system when choosing different POS taggers
Parameter  Precision  Recall  F-Measure
EPT11      96.978     96.533  96.755
P1&M1      97.185     96.740  96.962

Figure 4 shows the chart of the results with the different experimental parameters.

Figure 4. Performance of the system with different parameters (F-Measure of the system with experimental parameters)

Figure 5 and Table 5 show the accuracy of our system against two existing Vietnamese word segmentation systems. These two systems have been retrained and tested on the same training and test corpus as ours. It can be seen that our system gives the best result.
Figure 5. Performance of three word segmentation systems (F-Measure), comparing our system with other word segmentation systems

Table 5: Accuracy of word segmentation systems
System        Precision  Recall  F-Measure
JvnSegmenter  94.768     92.473  93.606
VnTokenizer   96.098     94.869  95.480
WS4VN         97.185     96.470  96.962

The training data with 74756 words is probably not big enough for JvnSegmenter and VnTokenizer to perform at their best. But this highlights an advantage of our approach: by using POS information, we can afford to achieve a good result with a smaller corpus.

6. Conclusion and future works

In this paper, we propose a new approach to the Vietnamese word segmentation task. It is a hybrid approach combining the Maximum matching algorithm and statistical models on POS tags. The experiment shows a promising result with an F-measure of approximately 97%. It aims to address the overlap ambiguity.
One of the limitations is that we do not have a big corpus for better training. In the future, we would like to extend our corpus and try some other approaches to improve the performance of our system.

References

[1] A.J. Jacobs, & W.Y. Wong. 2006. Maximum Entropy Word Segmentation of Chinese Text. Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing, p.108-117.
[2] B. Carpenter. 2006. Character Language Models for Chinese Word Segmentation and Named Entity Recognition. ACL, p.169-172.
[3] C.H. Chang, & C.D. Chen. 1993. A Study on Integrating Chinese Word Segmentation and Part-Of-Speech Tagging. Communications of COLIPS, Vol.3, No.2, p.69-77.
[4] C.P. Papageorgiou. 1994. Japanese Word Segmentation by Hidden Markov Model. Proceedings of the HLT Workshop, p.283-288.
[5] C.T. Nguyen, T.K. Nguyen, X.H. Phan, L.M. Nguyen, & Q.T. Ha. 2006. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation. Proceedings of the 20th PACLIC, Wuhan, China, p.215-222.
[6] D. Dinh, & T. Vu. 2006. A Maximum Entropy Approach for Vietnamese Word Segmentation. Proceedings of the 4th RIVF, Vietnam, p.12-16.
[7] D. Dinh, K. Hoang, & V.T. Nguyen. 2001. Vietnamese Word Segmentation. The 6th NLPRS, Tokyo, Japan, p.749-756.
[8] F. Peng, F. Feng, & A. McCallum. 2004. Chinese Segmentation and New Word Detection using Conditional Random Fields. Proceedings of COLING, p.562-568.
[9] F. Peng, X. Huang, D. Schuurmans, & S. Wang. 2003. Text Classification in Asian Languages without Word Segmentation. Proceedings of the 6th IRAL, p.41-48.
[10] H.P. Le, T.M.H. Nguyen, A. Roussanaly, & T.V. Ho. 2008. A Hybrid Approach to Word Segmentation of Vietnamese Text. Proceedings of the 2nd LATA.
[11] J.S. Zhou, X.Y. Dai, R.Y. Ni, & J.J. Chen. A Hybrid Approach to Chinese Word Segmentation around CRFs. Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing.
[12] K.J. Chen, & S.H. Liu. 1992. Word Identification for Mandarin Chinese Sentences. Proceedings of the 15th COLING.
[13] L.A. Ha. 2003. A Method for Word Segmentation in Vietnamese. Proceedings of Corpus Linguistics 2003, Lancaster, UK.
[14] N. Xue, & S.P. Converse. 2002. Combining Classifiers for Chinese Word Segmentation. First SIGHAN Workshop attached with the 19th COLING, p.57-63.
[15] P.T. Nguyen, V.V. Nguyen, & A.C. Le. 2003. Vietnamese Word Segmentation Using Hidden Markov Model. International Workshop for Computer, Information, and Communication Technologies on State of the Art and Future Trends of Information Technologies in Korea and Vietnam.
[16] S. Foo, & H. Li. 2004. Chinese Word Segmentation and Its Effect on Information Retrieval. Information Processing & Management: An International Journal, Vol.40, No.1, p.161-190.
[17] T.H. Nguyen. 2006. Word Segmentation for Vietnamese Text Categorization: An Online Corpus Approach. Proceedings of the 4th RIVF, Ho Chi Minh City, Vietnam.
[18] X. Lu. 2005. Towards a Hybrid Model for Chinese Word Segmentation. Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing.
[19] D. Vu, N.L. Nguyen, & D. Dinh. 2006. Application of Maximum Matching and SVMs for Vietnamese Word Segmentation. ICT.rda'06, Đà Lạt.
[20] T.M.H. Nguyen, X.L. Vu, & H.P. Le. 2003. Using QTAG POS Tagging for Vietnamese Documents. ICT.rda'03, Ha Noi.
