Automatic Clustering of Part-Of-Speech For Vocabulary Divided PLSA Language Model
Authorized licensed use limited to: Velammal Engineering College. Downloaded on July 30, 2009 at 23:38 from IEEE Xplore. Restrictions apply.
[Figure: overview of the PLSA adaptation system (labels recovered from the diagram: input speech, corpora, style, PLSA adaptation, decoder, general, context, recognized text)]

the training data. If the training data consists of several speaking styles (such as formal, casual, dialog, and so on), the latent models also represent the speaking styles found in the training data. For example, suppose there are three topics (A, B, and C) in the training data; topic A is spoken in a formal style (indicated as "f"), whereas topics B and C are spoken in a casual style ("c"). PLSA then builds three latent models: Af, Bc, and Cc.

One of the biggest problems of PLSA is that topic and style are dealt with together when the latent models are estimated. If the target speech has topic B
W(x | h_i, w_{i-2}, w_{i-1}) ∝ Σ_{w∈x} P_x(w | h_i, w_{i-2}, w_{i-1}) × p(x | h_i) / Σ_{w∈x} p(w)   (4)

Note that the sub-model corresponding to general words is not adapted to the context h, because the "general" model does not depend on h. Therefore, P_G(·) can be calculated for any context h using:

P_G(w_i | h_i, w_{i-2}, w_{i-1}) = P_G(w_i | w_{i-2}, w_{i-1})   (5)

In this method, all POS are divided into three categories by hand. In the experiments described in [7], 90 kinds of POS were defined and divided into three categories; the categorized POS are shown in Table 1. This categorization seems reasonable; however, nobody knows whether it is the optimum clustering.

Table 1. Manually categorized POS. (From [7])

  Category  POS
  Topic     Noun (general), Proper noun, Alphabet, Verb (independence), Noun as adverb, Noun (conjunctive)
  Style     Pronoun, Number, Prefix, Conjunction, Adjective, Adverb, Adnoun, Filler, Exclamation
  General   Case-marking particle, Adnominal particle, Conjunctive particle, Charge particle

3. Automatic clustering of part-of-speech

3.1. Concept

The frequency of a topic-related word in a document does not change even if the speaking style changes; if the topic changes, however, the frequency also changes. In the same way, the frequency of style-related words depends only on the speaking style, and the frequency of general words does not change across documents. We therefore propose an automatic clustering method based on the relationship between the frequency of words and differences in the topic and/or style of documents.

We assume that all of the words included in a POS are assigned to the same category; in other words, no word is shared among categories. For example, all of the words included in the proper-noun POS should be assigned to the same category, which is likely the topic-related category. The frequency of each proper noun should vary with the topic; in other words, the probability distribution of these words should depend on the topic, but not on the style. Let U_Ts be a unigram of proper-noun words calculated from a document with topic T and speaking style s. From the above discussion, the distance between U_Af and U_Ac is small, whereas the distance between U_Af and U_Bf is large. The new clustering method is based on this difference between the probability distributions of the words included in a POS for each document.

There are many POS definitions, and the number of words in a POS depends on the definition. For example, "noun" is a single POS in a coarse definition, but it is divided into many POS in a fine definition, such as "proper noun", "numeral", "pronoun", and so on. In the same way, the POS "proper noun" can be divided into "person's name", "place", "food", etc. in an even finer definition.

A finer definition is more appropriate for the proposed algorithm, because all of the words in a POS should belong to the same category. However, a finer definition produces many POS, each of which contains only a few words, and a unigram estimated from only a few words becomes similar to the unigrams calculated from documents with any topic and style; such a POS is not suitable for clustering. A POS of appropriate size is therefore needed to represent the topic or style dependency of words.

3.2. Distance between corpora with different styles

It is assumed that a text corpus consists of many documents, each of which has a single topic and a single style, and that all documents in a corpus have the same style. For all words included in a POS M, the word unigram p(w) is calculated using all of the documents in the corpus. In the same way, the word unigram p_d(w) is calculated using only document d. The distance J(p_d, p) between the two unigrams is calculated by the following equation, known as the Jeffreys divergence:

J(p_d, p | M) = Σ_{w∈M} {p_d(w) − p(w)} log( p_d(w) / p(w) )   (6)
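The Jeffreys divergence of Eq. (6) is straightforward to compute from word counts. The sketch below is illustrative, not the paper's code; the smoothing constant `eps` is an assumed choice so that the logarithm is always defined.

```python
import math
from collections import Counter

def unigram(words, vocab, eps=1e-9):
    # Unigram distribution over the POS vocabulary `vocab`, with a tiny
    # floor (an assumed smoothing choice, not from the paper) so that
    # every probability is strictly positive.
    counts = Counter(w for w in words if w in vocab)
    total = sum(counts.values())
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def jeffreys(p_d, p, vocab):
    # Eq. (6): J(p_d, p | M) = sum over w in M of {p_d(w) - p(w)} log(p_d(w)/p(w)).
    # Symmetric and non-negative; zero only when the two distributions agree on M.
    return sum((p_d[w] - p[w]) * math.log(p_d[w] / p[w]) for w in vocab)
```

The divergence vanishes when a document's unigram matches the corpus unigram and grows as the two distributions drift apart, which is exactly the signal used to separate topic-related POS from the rest.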
If the POS M is related to the topic, the distance becomes large, because p_d is specific to the topic of document d, whereas p is an average distribution over many topics. On the other hand, if M is related to the style or general category, the distance becomes small. Topic-related POS can therefore be separated from the other categories using the average distance calculated over many documents in the corpus.

In order to separate style-related POS from general POS, we consider the distance between corpora. Several text corpora S_i are prepared, and it is assumed that each corpus has a different speaking style. The distance D(S_i, S_j | M) for a POS M is defined as the average distance between the probability distribution calculated from each document in S_i and that calculated from all documents in S_j.

If a POS M is related to the style category, the distance D(S_i, S_j | M) is small only if i = j. This means that the difference between D(S_i, S_j | M) and D(S_j, S_j | M) is large for all i ≠ j. On the other hand, if a POS M is not related to the style category, the difference becomes small, because either both D(S_i, S_j | M) and D(S_j, S_j | M) are large (M is related to the topic category), or both distances are small (M is related to the general category). The style tendency score s(M) is defined as follows:

s(M) = Σ_i Σ_j {D(S_i, S_j | M) − D(S_i, S_i | M)}²   (9)
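Given a matrix of pairwise corpus distances D[i][j] = D(S_i, S_j | M) for one POS M, the style tendency score of Eq. (9) can be sketched as:

```python
def style_tendency(D):
    # Eq. (9): s(M) = sum over i, j of (D[i][j] - D[i][i])**2.
    # The score is large when the cross-corpus distances differ sharply
    # from the within-corpus ones, i.e. when the POS M is style-related.
    n = len(D)
    return sum((D[i][j] - D[i][i]) ** 2 for i in range(n) for j in range(n))
```

For a style-related POS the off-diagonal entries dominate the diagonal, so s(M) is large; for a topic-related or general POS the entries are uniformly large or uniformly small, and s(M) stays near zero.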
[Figure: word classification by the two tendency scores, showing the general-class threshold and the style-class threshold]

number of words, therefore reliable statistics could not be acquired.
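The two thresholds in the word-classification figure suggest a simple decision rule over the two scores. Since the general tendency score is not defined in this excerpt, the sketch below is only an assumption: the comparison directions, the score names, and the threshold values are all placeholders, not the paper's stated rule.

```python
def classify_pos(general_score, style_score, g_threshold, s_threshold):
    # Assumed rule (the paper's exact rule is not given in this excerpt):
    # a POS scoring above the general-class threshold is "general";
    # otherwise, one scoring above the style-class threshold is "style";
    # everything else is treated as topic-related.
    if general_score >= g_threshold:
        return "general"
    if style_score >= s_threshold:
        return "style"
    return "topic"
```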
Table 5. Acquired clusters

  Category                POS
  Topic (23,584 words)    General noun, Proper noun, Noun (base of adjective verb), Number, Noun (special), Noun (suffix), Noun (conjunctive), Noun (quotation), Prefix, Adjective, Particle (special), Interjection
  Style (6,372 words)     Pronoun, Filler, Prefix (with noun), Verb, Adverb, Noun (special suffix), Case particle (collocation), Particle (adverb), Impression words, Conjunction
  General (44 words)      Particle (conjunctive), Case particle (general), Case particle (quotation), Particle (attributive)

Table 6. Evaluation results for each language model

  Language model       Perplexity   Reduction rate
  Trigram              143.5        ---
  Conventional PLSA    114.7        20.1%
  VD-PLSA (manually)   111.9        22.0%
  VD-PLSA (automatic)  109.4        23.8%

Table 6 shows the perplexity given by each language model and the reduction rate compared with the trigram. In this experiment, the VD-PLSA model gave higher performance than both the trigram and the conventional PLSA model, and automatic determination of the categories gave slightly higher performance than the manually defined model.

5. Speech recognition experiments

5.1. Rescoring method

We applied the VD-PLSA to a large-vocabulary continuous speech recognition system. In this experiment, the proposed model was applied through a "rescoring method".

First, the input speech is recognized using a conventional n-gram. The speech recognizer outputs the top-N recognition results for the input speech. The top-1 result, which is the same as the final result of a conventional speech recognizer, is used as adaptation data for the VD-PLSA. After adaptation, a linguistic score is calculated by the VD-PLSA for each recognition result. Finally, a total score is calculated as a weighted sum of the acoustic and linguistic scores, and the recognition result with the highest total score is output as the final result.

A language model is usually used in the decoding process; however, using the VD-PLSA during decoding would require modifying the existing decoder. In order to investigate the effectiveness of the VD-PLSA quickly, the rescoring method is employed in these experiments, even though it may give slightly lower performance than a modified decoder.

5.2. Experimental conditions

Ten lectures were selected from the evaluation data used in Section 4. The details of the selected test data are shown in Table 7, in which "PP" denotes the perplexity calculated by the trigram and "OOV" denotes the out-of-vocabulary ratio.

Julius [10] was used as the decoder, and a speaker-independent HMM was used as the acoustic model. 500 recognition results were output for the rescoring step. If the best candidate were always selected in the rescoring step, the total recognition accuracy would be 68.42%.

Table 7. Details of the evaluation data

  ID       PP       OOV    Topic
  F0073    130.82   1.7%   Juvenile delinquency
  F0226    126.58   2.4%   Genetic operation
  F0404    107.43   1.4%   Aging society
  M0149    157.08   3.2%   Loan
  M0267    156.12   2.9%   Mobile phone
  M0557    179.47   1.4%   Olympic
  M0846    189.01   4.4%   Blast incident
  M0872    111.82   3.4%   Okinawa summit meeting
  M1250    121.24   1.7%   Rugby football
  M1564    152.30   2.7%   Soccer
  Average  143.19   2.5%

5.3. Recognition results

Table 8 shows the speech recognition results. The recognition accuracy was slightly improved by using the VD-PLSA. Some lectures (F0073, F0404, M0267, and M1250) showed larger improvements; however, the recognition accuracies of the other lectures were not improved, or decreased slightly. In particular, the accuracies of M0149 and M0846 showed the largest decreases.
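The rescoring step described in Section 5.1 can be sketched as follows. The interpolation weight, the score scales, and the `lm_score` callable are illustrative assumptions; the paper does not specify them in this excerpt.

```python
def rescore(candidates, lm_score, weight=0.5):
    # `candidates`: list of (hypothesis_text, acoustic_score) pairs,
    # i.e. the top-N output of the first-pass n-gram decoder.
    # `lm_score`: maps a hypothesis to its (VD-PLSA) linguistic score.
    # The total score is a weighted sum of the two, as in Section 5.1;
    # the hypothesis with the highest total score becomes the final result.
    def total(candidate):
        text, acoustic = candidate
        return (1.0 - weight) * acoustic + weight * lm_score(text)
    return max(candidates, key=total)[0]
```

With weight = 0 this degenerates to the first-pass top-1 result; larger weights let the adapted linguistic score reorder the candidate list.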
Table 8. Speech recognition accuracy

  ID       Trigram   VD-PLSA   Improvement
  F0073    68.21%    68.86%    +0.65
  F0226    77.22%    77.16%    -0.06
  F0404    78.24%    78.49%    +0.25
  M0149    58.77%    58.07%    -0.70
  M0267    60.73%    61.47%    +0.74
  M0557    58.75%    58.84%    +0.09
  M0846    63.82%    63.12%    -0.70
  M0872    65.69%    65.60%    -0.09
  M1250    65.44%    65.85%    +0.41
  M1564    67.26%    67.00%    -0.26
  Average  66.41%    66.45%    +0.04

Comparing Table 7 and Table 8, the lectures with higher OOV rates (except M0267) showed larger decreases in accuracy. In general, most out-of-vocabulary words are proper nouns. If proper nouns cannot be recognized correctly, the VD-PLSA cannot be adapted to the topic sufficiently, because proper nouns are usually the keywords of a topic. Combining the VD-PLSA with an OOV recognition method could therefore improve the recognition accuracy more substantially.

6. Conclusion

In this paper, an automatic method for clustering parts-of-speech (POS) was proposed for the vocabulary-divided PLSA (VD-PLSA) language model.

Several corpora with different styles were prepared, and the distance between corpora in terms of each POS was calculated. It is defined as the average distance between the probability distribution calculated from a document and that calculated from all documents in a corpus. A "general tendency score" and a "style tendency score" for each POS were then calculated based on the distances between corpora, and all of the POS were divided into three categories using the two scores and appropriate thresholds.

The experimental results showed that the proposed clustering method formed appropriate clusters, and the VD-PLSA model with the acquired categories gave the highest performance of all the models compared.

We also applied the VD-PLSA to a large-vocabulary continuous speech recognition system. The VD-PLSA improved the recognition accuracy for documents with lower out-of-vocabulary ratios, while the accuracy for the other documents was not improved, or decreased slightly.

References

[1] A. I. Rudnicky, "Language modeling with limited domain data," in Proc. ARPA Spoken Language Systems Technology Workshop, 1995, pp. 66-69.
[2] M. Federico, "Bayesian estimation methods for N-gram language model adaptation," in Proc. ICSLP, 1996, pp. 240-243.
[3] R. M. Iyer and M. Ostendorf, "Modeling long distance dependence in language: topic mixtures versus dynamic cache models," IEEE Trans. Speech and Audio Processing, vol. 7, no. 1, pp. 30-39, 1999.
[4] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
[5] D. Gildea and T. Hofmann, "Topic-based language models using EM," in Proc. EUROSPEECH, 1999, pp. 2167-2170.
[6] A. Ito, N. Kuriyama, M. Suzuki, and S. Makino, "Evaluation of multiple PLSA adaptation based on separation of topic and style words," in Proc. WESPAC IX, 2006.
[7] N. Kuriyama, M. Suzuki, A. Ito, and S. Makino, "Topic and speech style adaptation using vocabulary divided PLSA language model," in Proc. 3rd Workshop of Yeungnam University and Tohoku University, 2006, pp. 16-18.
[8] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, "Spontaneous speech corpus of Japanese," in Proc. Second International Conference on Language Resources and Evaluation (LREC), 2000, pp. 947-952.
[9] K. Maekawa, "Corpus of spontaneous Japanese: Its design and evaluation," in Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), 2003.
[10] A. Lee, T. Kawahara, and K. Shikano, "Julius — an open source real-time large vocabulary recognition engine," in Proc. EUROSPEECH, 2001, pp. 1691-1694.