A Hybrid Approach To Vietnamese Word Segmentation Using Part of Speech Tags
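4.2. Preprocessing

Vietnamese spelling rules are ambiguous. The most typical cases are the choice between “i” and “y” and the placement of tone marks on syllables. The preprocessing step therefore normalizes the data to one standard for further analysis. It also identifies new words in the form of named entities such as names of companies, people, factoids, etc. We use regular expressions as the main technique for recognizing named entities.

4.3. Maximum matching

We use the Maximum matching algorithm to find segmentation candidates by segmenting the input sentence into a sequence with the smallest number of words. Because of overlap ambiguity, there is often more than one solution.

Figure 3. HMM for POS tagging task (tag states $t_{i-1}, t_i, t_{i+1}$ linked by transitions $a_{i-1,i}$ and $a_{i,i+1}$, emitting the words $w_{i-1}, w_i, w_{i+1}$)

A tagged sequence $T = t_1, t_2, \ldots, t_n$ for the word sequence $W = w_1, w_2, \ldots, w_n$ satisfies:

$$T = \arg\max_T P(T \mid W) = \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)} = \arg\max_T P(W \mid T)\,P(T) = \arg\max_T P(w_1, w_2, \ldots, w_n \mid t_1, t_2, \ldots, t_n)\,P(t_1, t_2, \ldots, t_n)$$

Suppose that knowing the POS of a word is enough to determine the word. Then the probability $P(W \mid T)$ depends only on basic probabilities of the form $P(w_i \mid t_i)$:

$$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)$$

Then:

$$P(W \mid T)\,P(T) = \prod_{i=1}^{n} P(w_i \mid t_i) \cdot P(t_1, t_2, \ldots, t_n)$$

Further calculation for $P(T)$:

$$P(t_1, t_2, \ldots, t_n) = P(t_1)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_1, t_2, \ldots, t_{n-1})$$

Apply the Markov-1 hypothesis, in which the probability of a POS appearing in a sequence can be predicted from the previous POS alone. That means:

$$P(t_i \mid t_1, t_2, \ldots, t_{i-1}) = P(t_i \mid t_{i-1})$$

Then:

$$P(t_1, t_2, \ldots, t_n) = P(t_1 \mid t_0)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})$$

($t_0$ is a special tag representing the beginning of the sentence.)

Finally we get:

$$T = \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$

Under the “front-back” hypothesis, a POS is conditioned on both its preceding and its following POS:

$$P(t_i \mid t_1, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n) = P(t_i \mid t_{i-1}, t_{i+1})$$

where $t_0$ and $t_{n+1}$ represent the sentence's first and last elements respectively. Finally, we get:

$$T = \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1}, t_{i+1})$$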
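As an illustration only, the Maximum matching step of Section 4.3 and the Markov-1 tagging model can be combined roughly as follows. This is a minimal sketch, not the system described in this paper: the lexicon, the probability tables, the `max_len` limit, the `floor` value for unseen events, and the rule of keeping the candidate whose best tag sequence is most probable are all assumptions made for the example.

```python
import math

def fewest_word_segmentations(syllables, lexicon, max_len=4):
    """Return every segmentation of `syllables` that uses the fewest words."""
    n = len(syllables)

    def is_word(i, j):
        # Assumption: a single syllable is always accepted, so that at least
        # one segmentation exists even for out-of-lexicon input.
        return j == i + 1 or " ".join(syllables[i:j]) in lexicon

    # best[i] = smallest number of words needed to cover syllables[i:]
    best = [math.inf] * (n + 1)
    best[n] = 0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            if is_word(i, j):
                best[i] = min(best[i], 1 + best[j])

    def expand(i):
        # Enumerate only the splits that achieve the minimum word count.
        if i == n:
            yield []
            return
        for j in range(i + 1, min(n, i + max_len) + 1):
            if is_word(i, j) and 1 + best[j] == best[i]:
                for rest in expand(j):
                    yield [" ".join(syllables[i:j])] + rest

    return list(expand(0))


def best_tagging(words, tags, emit, trans, start="<s>", floor=1e-9):
    """Viterbi search for argmax_T prod_i P(w_i|t_i) * P(t_i|t_{i-1})."""
    # column maps a tag to (log-probability of the best path ending there, path)
    column = {start: (0.0, [])}
    for word in words:
        nxt = {}
        for tag in tags:
            log_e = math.log(emit.get((word, tag), floor))
            score, path = max(
                (prev_score + math.log(trans.get((prev, tag), floor)), prev_path)
                for prev, (prev_score, prev_path) in column.items()
            )
            nxt[tag] = (score + log_e, path + [tag])
        column = nxt
    return max(column.values())


def segment(syllables, lexicon, tags, emit, trans):
    """Keep the fewest-word candidate whose best tag sequence is most probable."""
    scored = []
    for words in fewest_word_segmentations(syllables, lexicon):
        log_prob, tag_seq = best_tagging(words, tags, emit, trans)
        scored.append((log_prob, words, tag_seq))
    return max(scored, key=lambda item: item[0])


# Toy data, invented purely for illustration (not taken from the corpus).
lexicon = {"học sinh", "sinh học"}
tags = ["N", "V"]
emit = {("học sinh", "N"): 0.4, ("sinh học", "N"): 0.3, ("học", "V"): 0.5}
trans = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
         ("N", "N"): 0.4, ("N", "V"): 0.6,
         ("V", "N"): 0.8, ("V", "V"): 0.2}

sentence = "học sinh học sinh học".split()
candidates = fewest_word_segmentations(sentence, lexicon)
log_prob, words, tag_seq = segment(sentence, lexicon, tags, emit, trans)
```

On this toy input, Maximum matching alone leaves three candidates of three words each (the overlap ambiguity), and only the tag model separates them.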
5.2. Data preparation

Our corpus consists of:
POS dictionary: we use two sets of POS tags, POS level 1 and POS level 2, with all tags defined in the corpus of VnQTAG [20]. POS level 2 contains the full …

Parameters for POS level and hypothesis (No. 0-5): This set of parameters determines which level of POS tags and which hypothesis (Markov-1, Markov-2 or “front-back”) is used.

Parameters for the Maximum matching step (No. 6-7): These parameters evaluate the effectiveness of the Maximum matching algorithm when it is applied to the word segmentation task. We calculate the accuracy of our system in the following two cases and compare them with each other:
• Using Maximum matching for word segmentation
• Not using Maximum matching for word segmentation

Parameters for choosing POS taggers (No. 8): These evaluate the performance of the system in the following cases:
• Using the POS tagger that we build, with different hypotheses
• Using the external POS tagger VnQTAG

Compare performance with other systems: We compare the performance of our system with JVnSegmenter, which uses a CRF approach [5], and VnTokenizer, which uses Maximum matching and N-grams [10].

5.4. Result and discussion

Table 2 shows the results of our system with the parameters for POS level and hypothesis. The results differ very little across parameter values (F-measure between 96.924 and 96.962), so all POS levels and hypotheses appear to have a similar influence on system performance. Given its promising result and simplicity, we propose using POS level 1 with the Markov-1 model. The results also indicate that the “front-back” model performs well and could be applied in further applications.

Table 2: Results of the system with parameters for POS level and hypothesis

Parameter   Precision   Recall   F-Measure
P2&M0       97.156      96.711   96.933
P1&M0       97.161      96.716   96.938
P2&M2       97.147      96.703   96.924
P2&M1       97.169      96.724   96.946
P1&M2       97.177      96.732   96.954
P1&M1       97.185      96.740   96.962

Table 3 shows the results of the system with and without the Maximum matching algorithm. The system run with Maximum matching and no POS information gives the lowest result, while the system run with Maximum matching and POS information gives the highest result. Within our experiment, this shows that using Maximum matching to choose the set of segmentations with the smallest number of words, together with POS information, is better than using all word segmentation sequences. Moreover, the system with the Maximum matching algorithm has lower complexity.

Table 3: Results of the system when running with and without the Maximum matching algorithm

Parameter   Precision   Recall   F-Measure
OMM         91.242      94.852   93.012
AS11        95.338      95.784   95.560
P1&M1       97.185      96.740   96.962

Table 4 shows the performance of our system with different POS taggers. There is no great difference between using VnQTAG and using our POS tagger, which suggests that the accuracy of the system does not depend strongly on which tagger is used. However, our POS tagger gives a better running time for the whole system, as we do not have to call an external POS tagger to get the POS sequence.

Table 4: Performance of the system when choosing different POS taggers

Parameter   Precision   Recall   F-Measure
EPT11       96.978      96.533   96.755
P1&M1       97.185      96.740   96.962

Figure 4 shows the chart of the results with different experimental parameters.

Figure 4. Performance of system with different parameters (chart: F-Measure of the system with experimental parameters)

Figure 5 and Table 5 show the accuracy of our system against two existing Vietnamese word segmentation systems. Both systems were retrained and tested on the same training and test corpora as ours. It can be seen that our system gives the best result.
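The F-Measure column of the result tables can be cross-checked against the Precision and Recall columns. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall), which this section does not state explicitly, and uses the three rows of Table 3 as input. The `f_measure` helper and the 0.002 tolerance for rounding are choices made for this example.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from Table 3
table3 = {
    "OMM": (91.242, 94.852, 93.012),
    "AS11": (95.338, 95.784, 95.560),
    "P1&M1": (97.185, 96.740, 96.962),
}

# Difference between the recomputed and the reported F-measure, per row.
deviations = {
    name: abs(f_measure(p, r) - reported)
    for name, (p, r, reported) in table3.items()
}
consistent = all(d < 0.002 for d in deviations.values())
```

A check of this kind is a cheap way to catch transcription errors in a results table.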
Figure 5. Performance of the system when comparing with other word segmentation systems (chart: F-Measure of three word segmentation systems)

Table 5: Accuracy of word segmentation systems

System         Precision   Recall   F-Measure
JvnSegmenter   94.768      92.473   93.606
VnTokenizer    96.098      94.869   95.480
WS4VN          97.185      96.470   96.962

The training data, with 74756 words, is probably not big enough for JvnSegmenter and VnTokenizer to perform at their best. But this highlights an advantage of our approach: by using POS information, we can achieve good results with a smaller corpus.

6. Conclusion and future works

In this paper, we propose a new approach for the Vietnamese word segmentation task. It is a hybrid approach combining the Maximum matching algorithm and statistical models on POS tags, and it aims to address overlap ambiguity. The experiment shows a promising result, with an F-measure of approximately 97%.

One of the limitations is that we do not have a big corpus for better training. In the future, we would like to extend our corpus and try some other approaches to improve the performance of our system.

References

[1] A.J. Jacobs, & W.Y. Wong. 2006. Maximum Entropy Word Segmentation of Chinese Text. Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing, p.108-117.
[2] B. Carpenter. 2006. Character Language Models for Chinese Word Segmentation and Named Entity Recognition. ACL, p.169-172.
[3] C.H. Chang, & C.D. Chen. 1993. A Study on Integrating Chinese Word Segmentation and Part-Of-Speech Tagging. Communications of COLIPS, Vol.3, No.2, p.69-77.
[4] C.P. Papageorgiou. 1994. Japanese Word Segmentation by Hidden Markov Model. Proceedings of the HLT Workshop, p.283-288.
[5] C.T. Nguyen, T.K. Nguyen, X.H. Phan, L.M. Nguyen, & Q.T. Ha. 2006. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation. Proceedings of the 20th PACLIC, Wuhan, China, p.215-222.
[6] D. Dinh, & T. Vu. 2006. A Maximum Entropy Approach for Vietnamese Word Segmentation. Proceedings of 4th RIVF, VietNam, p.12-16.
[7] D. Dinh, K. Hoang, & V.T. Nguyen. 2001. Vietnamese Word Segmentation. The 6th NLPRS, Tokyo, Japan, p.749-756.
[8] F. Peng, F. Feng, & A. McCallum. 2004. Chinese Segmentation and New Word Detection using Conditional Random Field. Proceedings of COLING, p.562-568.
[9] F. Peng, X. Huang, D. Schuurmans, & S. Wang. 2003. Text classification in Asian Language without Word Segmentation. Proceedings of 6th IRAL, p.41-48.
[10] H.P. Le, T.M.H. Nguyen, A. Roussanaly, & T.V. Ho. 2008. A Hybrid Approach to Word Segmentation of Vietnamese Text. Proceedings of 2nd LATA.
[11] J.S. Zhou, X.Y. Dai, R.Y. Ni, & J.J. Chen. A Hybrid Approach to Chinese Word Segmentation around CRFs. Proceedings of 4th SIGHAN Workshop on Chinese Language Processing.
[12] K.J. Chen, & S.H. Liu. 1992. Word identification for Mandarin Chinese sentences. Proceedings of the 15th COLING.
[13] L.A. Ha. 2003. A method for Word segmentation in Vietnamese. Proceedings of the Corpus Linguistics 2003, Lancaster, UK.
[14] N. Xue, S.P. Converse. 2002. Combining Classifiers for Chinese Word Segmentation. First SIGHAN Workshop attached with the 19th COLING, p.57-63.
[15] P.T. Nguyen, V.V. Nguyen, & A.C. Le. 2003. Vietnamese Word Segmentation Using Hidden Markov Model. International Workshop for Computer, Information, and Communication Technologies on State of the Art and Future Trends of Information Technologies in Korea and Vietnam.
[16] S. Foo, H. Li. 2004. Chinese Word Segmentation and Its Effect on Information Retrieval. Information Processing & Management: An International Journal. Vol.40, No.1, p.161-190.
[17] T.H. Nguyen. 2006. Word Segmentation for Vietnamese Text Categorization: An online corpus approach. Proceedings of 4th RIVF, Ho Chi Minh, VietNam.
[18] X. Lu. 2005. Towards a Hybrid Model for Chinese Word Segmentation. Proceedings of 4th SIGHAN Workshop on Chinese Language Processing.
[19] D. Vu, N.L. Nguyen, D. Dinh. 2006. Application of Maximum matching and SVMs for Vietnamese word segmentation. ICT.rda’06, Đà Lạt.
[20] T.M.H. Nguyen, X.L. Vu, & H.P. Le. 2003. Using QTAG POS tagging for Vietnamese documents. ICT.rda’03, Ha Noi.