Automatic Clustering of Part-Of-Speech For Vocabulary Divided PLSA Language Model
Authorized licensed use limited to: Velammal Engineering College. Downloaded on July 30, 2009 at 23:38 from IEEE Xplore. Restrictions apply.
[Figure: overview of the PLSA adaptation system (labels recovered from the diagram: input speech, corpora, style, PLSA adaptation, decoder, general, context, recognized text)]

the training data. If the training data consists of several speaking styles (such as formal, casual, dialog, and so on), the latent models also represent the speaking styles found in the training data. For example, suppose there are three topics (A, B, and C) in the training data; topic A is spoken in a formal style (indicated as "f"), whereas topics B and C are spoken in a casual style ("c"). PLSA then builds three latent models: Af, Bc, and Cc.

One of the biggest problems of PLSA is that topic and style are dealt with together when the latent models are estimated. If the target speech has topic B
W(x | h_i, w_{i-2}, w_{i-1}) ∝ Σ_{w∈x} P_x(w | h_i, w_{i-2}, w_{i-1}) × p(x | h_i) / Σ_{w∈x} p(w)   (4)

Note that the sub-model corresponding to general words is not adapted to the context h, because the "general" model does not depend on h. Therefore, P_G(·) can be calculated for any context h using:

P_G(w_i | h_i, w_{i-2}, w_{i-1}) = P_G(w_i | w_{i-2}, w_{i-1})   (5)

In this method, all POS are divided into three categories by hand. In the experiments described in [7], 90 kinds of POS were defined and divided into three categories; the categorized POS are shown in Table 1. This categorization seems reasonable; however, nobody knows whether it is the optimum clustering.

Table 1. Manually categorized POS. (From [7])

  Category  POS
  Topic     Noun (general), Proper noun, Alphabet, Verb (independence), Noun as adverb, Noun (conjunctive)
  Style     Pronoun, Number, Prefix, Conjunction, Adjective, Adverb, Adnoun, Filler, Exclamation
  General   Case-marking particle, Adnominal particle, Conjunctive particle, Charge particle

3. Automatic clustering of part-of-speech

3.1. Concept

The frequency of a topic-related word in a document does not change even if the speaking style changes; if the topic changes, however, the frequency also changes. In the same way, the frequency of style-related words depends only on the speaking style, and the frequency of general words does not change across documents. We therefore propose an automatic clustering method based on the relationship between the frequency of words and differences in the topic and/or style of documents.

We assume that all of the words included in a POS are assigned to the same category; in other words, no word is shared among categories. For example, all of the words included in the proper-noun POS should be assigned to the same category, which is likely the topic-related category. The frequency of each proper noun should vary with the topic; in other words, the probability distribution of these words should depend on the topic, but not on the style. Let U_Ts be a unigram of proper-noun words calculated from a document with topic T and speaking style s. From the above discussion, the distance between U_Af and U_Ac is small, whereas the distance between U_Af and U_Bf is large. The new clustering method is based on this difference between the probability distributions of the words included in a POS for each document.

There are many POS definitions, and the number of words in a POS depends on the definition. For example, "noun" is a single POS in a coarse definition, but it is divided into many POS in a fine definition, such as "proper noun", "numeral", "pronoun", and so on. In the same way, the POS "proper noun" can be divided into "person's name", "place", "food", etc. in an even finer definition.

A finer definition is more appropriate for the proposed algorithm, because all of the words in a POS should belong to the same category. However, a finer definition produces many POS, each of which contains only a few words, and a unigram estimated from only a few words becomes similar to the unigrams calculated from documents with any topic and style; such a POS is not suitable for clustering. A POS of appropriate size is therefore needed to represent the topic or style dependency of words.

3.2. Distance between corpora with different styles

It is assumed that a text corpus consists of many documents, each of which has a single topic and a single style, and that all documents in a corpus have the same style. For all words included in a POS M, the word unigram p(w) is calculated using all of the documents in the corpus. In the same way, the word unigram p_d(w) is calculated using only document d. The distance J(p_d, p) between the two unigrams is calculated by the following equation, known as the Jeffreys divergence:

J(p_d, p | M) = Σ_{w∈M} {p_d(w) − p(w)} log( p_d(w) / p(w) )   (6)
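The Jeffreys divergence of Eq. (6) is straightforward to compute from word counts. The sketch below is illustrative, not the paper's code; the smoothing constant `eps` is an assumed choice so that the logarithm is always defined.

```python
import math
from collections import Counter

def unigram(words, vocab, eps=1e-9):
    # Unigram distribution over the POS vocabulary `vocab`, with a tiny
    # floor (an assumed smoothing choice, not from the paper) so that
    # every probability is strictly positive.
    counts = Counter(w for w in words if w in vocab)
    total = sum(counts.values())
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def jeffreys(p_d, p, vocab):
    # Eq. (6): J(p_d, p | M) = sum over w in M of {p_d(w) - p(w)} log(p_d(w)/p(w)).
    # Symmetric and non-negative; zero only when the two distributions agree on M.
    return sum((p_d[w] - p[w]) * math.log(p_d[w] / p[w]) for w in vocab)
```

The divergence vanishes when a document's unigram matches the corpus unigram and grows as the two distributions drift apart, which is exactly the signal used to separate topic-related POS from the rest.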
If the POS M is related to the topic, the distance becomes large, because p_d is specific to the topic of document d, whereas p is an average distribution over many topics. On the other hand, if M is related to the style or general category, the distance becomes small. Topic-related POS can therefore be separated from the other categories using the average distance calculated over many documents in the corpus.

In order to separate style-related POS from general POS, we consider the distance between corpora. Several text corpora S_i are prepared, and it is assumed that each corpus has a different speaking style. The distance D(S_i, S_j | M) for a POS M is defined as the average distance between the probability distribution calculated from each document in S_i and that calculated from all documents in S_j.

If a POS M is related to the style category, the distance D(S_i, S_j | M) is small only if i = j. This means that the difference between D(S_i, S_j | M) and D(S_j, S_j | M) is large for all i ≠ j. On the other hand, if a POS M is not related to the style category, the difference becomes small, because either both D(S_i, S_j | M) and D(S_j, S_j | M) are large (M is related to the topic category), or both distances are small (M is related to the general category). The style tendency score s(M) is defined as follows:

s(M) = Σ_i Σ_j {D(S_i, S_j | M) − D(S_i, S_i | M)}²   (9)
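Given a matrix of pairwise corpus distances D[i][j] = D(S_i, S_j | M) for one POS M, the style tendency score of Eq. (9) can be sketched as:

```python
def style_tendency(D):
    # Eq. (9): s(M) = sum over i, j of (D[i][j] - D[i][i])**2.
    # The score is large when the cross-corpus distances differ sharply
    # from the within-corpus ones, i.e. when the POS M is style-related.
    n = len(D)
    return sum((D[i][j] - D[i][i]) ** 2 for i in range(n) for j in range(n))
```

For a style-related POS the off-diagonal entries dominate the diagonal, so s(M) is large; for a topic-related or general POS the entries are uniformly large or uniformly small, and s(M) stays near zero.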
[Figure: word classification by the two tendency scores, showing the general-class threshold and the style-class threshold]

number of words, therefore reliable statistics could not be acquired.
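The two thresholds in the word-classification figure suggest a simple decision rule over the two scores. Since the general tendency score is not defined in this excerpt, the sketch below is only an assumption: the comparison directions, the score names, and the threshold values are all placeholders, not the paper's stated rule.

```python
def classify_pos(general_score, style_score, g_threshold, s_threshold):
    # Assumed rule (the paper's exact rule is not given in this excerpt):
    # a POS scoring above the general-class threshold is "general";
    # otherwise, one scoring above the style-class threshold is "style";
    # everything else is treated as topic-related.
    if general_score >= g_threshold:
        return "general"
    if style_score >= s_threshold:
        return "style"
    return "topic"
```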
Table 5. Acquired clusters

  Category                POS
  Topic (23,584 words)    General noun, Proper noun, Noun (base of adjective verb), Number, Noun (special), Noun (suffix), Noun (conjunctive), Noun (quotation), Prefix, Adjective, Particle (special), Interjection
  Style (6,372 words)     Pronoun, Filler, Prefix (with noun), Verb, Adverb, Noun (special suffix), Case particle (collocation), Particle (adverb), Impression words, Conjunction
  General (44 words)      Particle (conjunctive), Case particle (general), Case particle (quotation), Particle (attributive)

Table 6. Evaluation results for each language model

  Language model       Perplexity   Reduction rate
  Trigram              143.5        ---
  Conventional PLSA    114.7        20.1%
  VD-PLSA (manually)   111.9        22.0%
  VD-PLSA (automatic)  109.4        23.8%

Table 6 shows the perplexity given by each language model and the reduction rate compared with the trigram. In this experiment, the VD-PLSA model gave higher performance than both the trigram and the conventional PLSA model, and automatic determination of the categories gave slightly higher performance than the manually defined model.

5. Speech recognition experiments

5.1. Rescoring method

We applied the VD-PLSA to a large-vocabulary continuous speech recognition system. In this experiment, the proposed model was applied through a "rescoring method".

First, the input speech is recognized using a conventional n-gram. The speech recognizer outputs the top-N recognition results for the input speech. The top-1 result, which is the same as the final result of a conventional speech recognizer, is used as adaptation data for the VD-PLSA. After adaptation, a linguistic score is calculated by the VD-PLSA for each recognition result. Finally, a total score is calculated as a weighted sum of the acoustic and linguistic scores, and the recognition result with the highest total score is output as the final result.

A language model is usually used in the decoding process; however, using the VD-PLSA during decoding would require modifying the existing decoder. In order to investigate the effectiveness of the VD-PLSA quickly, the rescoring method is employed in these experiments, even though it may give slightly lower performance than a modified decoder.

5.2. Experimental conditions

Ten lectures were selected from the evaluation data used in Section 4. The details of the selected test data are shown in Table 7, in which "PP" denotes the perplexity calculated by the trigram and "OOV" denotes the out-of-vocabulary ratio.

Julius [10] was used as the decoder, and a speaker-independent HMM was used as the acoustic model. 500 recognition results were output for the rescoring step. If the best candidate were always selected in the rescoring step, the total recognition accuracy would be 68.42%.

Table 7. Details of the evaluation data

  ID       PP       OOV    Topic
  F0073    130.82   1.7%   Juvenile delinquency
  F0226    126.58   2.4%   Genetic operation
  F0404    107.43   1.4%   Aging society
  M0149    157.08   3.2%   Loan
  M0267    156.12   2.9%   Mobile phone
  M0557    179.47   1.4%   Olympic
  M0846    189.01   4.4%   Blast incident
  M0872    111.82   3.4%   Okinawa summit meeting
  M1250    121.24   1.7%   Rugby football
  M1564    152.30   2.7%   Soccer
  Average  143.19   2.5%

5.3. Recognition results

Table 8 shows the speech recognition results. The recognition accuracy was slightly improved by using the VD-PLSA. Some lectures (F0073, F0404, M0267, and M1250) showed larger improvements; however, the recognition accuracies of the other lectures were not improved, or decreased slightly. In particular, the accuracies of M0149 and M0846 showed the largest decreases.
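The rescoring step described in Section 5.1 can be sketched as follows. The interpolation weight, the score scales, and the `lm_score` callable are illustrative assumptions; the paper does not specify them in this excerpt.

```python
def rescore(candidates, lm_score, weight=0.5):
    # `candidates`: list of (hypothesis_text, acoustic_score) pairs,
    # i.e. the top-N output of the first-pass n-gram decoder.
    # `lm_score`: maps a hypothesis to its (VD-PLSA) linguistic score.
    # The total score is a weighted sum of the two, as in Section 5.1;
    # the hypothesis with the highest total score becomes the final result.
    def total(candidate):
        text, acoustic = candidate
        return (1.0 - weight) * acoustic + weight * lm_score(text)
    return max(candidates, key=total)[0]
```

With weight = 0 this degenerates to the first-pass top-1 result; larger weights let the adapted linguistic score reorder the candidate list.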
Table 8. Speech recognition accuracy

  ID       Trigram   VD-PLSA   Improvement
  F0073    68.21%    68.86%    +0.65
  F0226    77.22%    77.16%    -0.06
  F0404    78.24%    78.49%    +0.25
  M0149    58.77%    58.07%    -0.70
  M0267    60.73%    61.47%    +0.74
  M0557    58.75%    58.84%    +0.09
  M0846    63.82%    63.12%    -0.70
  M0872    65.69%    65.60%    -0.09
  M1250    65.44%    65.85%    +0.41
  M1564    67.26%    67.00%    -0.26
  Average  66.41%    66.45%    +0.04

Comparing Table 7 and Table 8, the lectures with higher OOV rates (except M0267) showed larger decreases in accuracy. In general, most out-of-vocabulary words are proper nouns. If proper nouns cannot be recognized correctly, the VD-PLSA cannot be adapted to the topic sufficiently, because proper nouns are usually the keywords of a topic. Combining the VD-PLSA with an OOV recognition method could therefore improve the recognition accuracy more substantially.

6. Conclusion

In this paper, an automatic method for clustering parts-of-speech (POS) was proposed for the vocabulary-divided PLSA (VD-PLSA) language model.

Several corpora with different styles were prepared, and the distance between corpora in terms of each POS was calculated. It is defined as the average distance between the probability distribution calculated from a document and that calculated from all documents in a corpus. A "general tendency score" and a "style tendency score" for each POS were then calculated based on the distances between corpora, and all of the POS were divided into three categories using the two scores and appropriate thresholds.

The experimental results showed that the proposed clustering method formed appropriate clusters, and the VD-PLSA model with the acquired categories gave the highest performance of all the models compared.

We also applied the VD-PLSA to a large-vocabulary continuous speech recognition system. The VD-PLSA improved the recognition accuracy for documents with lower out-of-vocabulary ratios, while the accuracy for the other documents was not improved, or decreased slightly.

References

[1] A. I. Rudnicky, "Language modeling with limited domain data," in Proc. ARPA Spoken Language Systems Technology Workshop, 1995, pp. 66-69.
[2] M. Federico, "Bayesian estimation methods for N-gram language model adaptation," in Proc. ICSLP, 1996, pp. 240-243.
[3] R. M. Iyer and M. Ostendorf, "Modeling long distance dependence in language: topic mixtures versus dynamic cache models," IEEE Trans. Speech and Audio Processing, vol. 7, no. 1, pp. 30-39, 1999.
[4] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
[5] D. Gildea and T. Hofmann, "Topic-based language models using EM," in Proc. EUROSPEECH, 1999, pp. 2167-2170.
[6] A. Ito, N. Kuriyama, M. Suzuki, and S. Makino, "Evaluation of multiple PLSA adaptation based on separation of topic and style words," in Proc. WESPAC IX, 2006.
[7] N. Kuriyama, M. Suzuki, A. Ito, and S. Makino, "Topic and speech style adaptation using vocabulary divided PLSA language model," in Proc. 3rd Workshop of Yeungnam University and Tohoku University, 2006, pp. 16-18.
[8] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, "Spontaneous speech corpus of Japanese," in Proc. Second International Conference on Language Resources and Evaluation (LREC), 2000, pp. 947-952.
[9] K. Maekawa, "Corpus of spontaneous Japanese: Its design and evaluation," in Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), 2003.
[10] A. Lee, T. Kawahara, and K. Shikano, "Julius — an open source real-time large vocabulary recognition engine," in Proc. EUROSPEECH, 2001, pp. 1691-1694.