IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 11, NOVEMBER 2013

Unsupervised Spoken Language Understanding for a Multi-Domain Dialog System
Donghyeon Lee, Minwoo Jeong, Kyungduk Kim, Seonghan Ryu, and Gary Geunbae Lee, Member, IEEE

Abstract—This paper proposes an unsupervised spoken language understanding (SLU) framework for a multi-domain dialog system. Our unsupervised SLU framework applies a non-parametric Bayesian approach to dialog acts, intents and slot entities, which are the components of a semantic frame. The proposed approach reduces the human effort necessary to obtain a semantically annotated corpus for dialog system development. In this study, we analyze clustering results using various evaluation metrics for four dialog corpora. We also introduce a multi-domain dialog system that uses the unsupervised SLU framework. We argue that our unsupervised approach can help overcome the annotation acquisition bottleneck in developing dialog systems. To verify this claim, we report a dialog system evaluation, in which our method achieves competitive results in comparison with a system that uses a manually annotated corpus. In addition, we conducted several experiments to explore the effect of our approach on reducing development costs. The results show that our approach can be helpful for the rapid development of a prototype system and for reducing the overall development costs.

Index Terms—Dialog system, spoken language understanding, unsupervised learning.

I. INTRODUCTION

MUCH research has been conducted to develop spoken dialog systems (SDSs) that provide a natural and effective interface between humans and machines [1]–[5]. In most cases, SDSs have been implemented for a specific domain, e.g., weather [6], travel [7] and vehicle navigation [8]–[10]. In SDSs, a spoken language understanding (SLU) module is one of the key components; it aims to fill domain-specific semantic frame slots from input utterances.

Recently, multi-domain SDSs have been employed to support a wide range of tasks [11]–[16]. Multi-domain SDSs generally utilize a distributed architecture and consider extensibility and scalability. Thus, such architectures contain several different domain-specific SLU modules. The domain spotter module identifies the target domain of the input utterance; the utterance is then delivered to the appropriate domain-specific SLU module. To develop a multi-domain SDS with a data-driven approach, the domains of interest must first be defined. Next, a corpus for each domain must be collected. The collected corpora are then used to train the domain spotter module, and the corpus for a specific domain is used to train the corresponding domain-specific module.

A semantically annotated corpus is required to automatically train the statistical SLU module. The process that generates the annotated corpus from the raw corpus can be divided into two sub-processes: the design step and the labeling step. In the design step, a linguistic expert creates guidelines for the annotator. These guidelines specify information regarding the slots and values contained in semantic frames. The design step might not be difficult if the system specification is clearly and restrictively defined. Otherwise, the design step is not easy, even for experts, especially when the system is developed from an unlabeled dialog corpus. In the labeling step, the annotator assigns semantic tags to elements in the raw corpus using the guidelines. The labeling step is time-consuming and labor-intensive. For these reasons, developing a multi-domain SLU module requires considerable effort and expense. Therefore, this process can be a major barrier that inhibits the rapid development of multi-domain SDSs.

This paper aims to resolve this issue by applying a fully unsupervised approach to the SLU problem. Instead of applying a classical unsupervised approach, such as the k-means algorithm, in which the number of clusters must be fixed in advance, we propose using a non-parametric Bayesian approach that can flexibly infer an appropriate number of clusters during the learning process. The use of this non-parametric Bayesian approach considerably reduces the human effort required for the design and labeling steps. We employ an unsupervised SLU framework in conjunction with each domain corpus to develop a multi-domain SDS. During this process, the domain spotter model is also trained.

Another contribution of this paper is that we use an SDS to evaluate the unsupervised SLU framework. To our knowledge, this is the first attempt to do so. For evaluating the unsupervised SLU framework, comparing it with human annotation is a simple but limited method: differences between human annotations and the results of the unsupervised SLU framework may not be a significant problem, because human annotations are not necessarily the best result. Additionally, the fundamental objective is to use the unsupervised SLU framework in an SDS. Therefore, the experiments we conduct are intended to determine the effects of the unsupervised SLU approach in the context of its end-use application.

Manuscript received October 24, 2012; revised February 27, 2013, May 19, 2013, and July 09, 2013; accepted August 26, 2013. Date of publication August 29, 2013; date of current version October 15, 2013. This work was supported by the Industrial Strategic Technology Development Program, 10035252, Development of dialog-based spontaneous speech interface technology on mobile platform, funded by the Ministry of Knowledge Economy (MKE, Korea). A preliminary version of this paper appeared in ICASSP 2012, Kyoto, Japan. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Renato De Mori.

D. Lee, K. Kim, S. Ryu, and G. G. Lee are with the Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang 790-784, Korea (e-mail: semko@postech.ac.kr; getta@postech.ac.kr; ryush@postech.ac.kr; gblee@postech.ac.kr).

M. Jeong is with Microsoft, Redmond, WA 98052 USA (e-mail: minwoo.jeong@microsoft.com).

Digital Object Identifier 10.1109/TASL.2013.2280212




The organization of this paper is as follows. In Section II, we introduce traditional SLU techniques and some previous attempts to reduce the human effort necessary to train the module. In Section III, we describe our proposed unsupervised SLU framework for multi-domain SDSs. In Section IV, we explain the structure of a dialog system that uses the unsupervised SLU framework. In Section V, we present our experimental results. Finally, we state our conclusions and describe possibilities for future work in Section VI.

II. RELATED WORK

Much research regarding SLU modules has been conducted. Initially, many studies used a rule-based approach that depends on handcrafted grammars [17]–[19]. This rule-based approach often achieves good performance despite not requiring a significant amount of training data. Moreover, this method is highly manageable and has been applied to a number of commercial systems. However, a problem remains: the rule-based approach requires high-level linguistic skills and manual labor to write the grammar, which incurs high costs when the approach must be applied to a new domain or a new language. This strategy is also not robust against speech recognition errors. In contrast, many data-driven approaches have utilized statistical models that are automatically learned from labeled training data [20]–[22]. Collecting labeled training data is less expensive than writing a handcrafted grammar. Hence, data-driven approaches can be considered portable because they can more easily be applied to a new domain or a new language. Additionally, data-driven approaches tend to be relatively robust because they generate models in various forms from data. However, a problem remains here as well: data-driven methods require a significant amount of data to obtain good performance and to overcome the intrinsic limit of data sparseness. For these reasons, many studies have proposed hybrid approaches that take advantage of both types of methods [23]–[25].

In an SLU task, a semantic frame can be divided into two indicative parts [26]: (1) the meaning of the input utterance and (2) the named entity. Generally, each part is treated separately, as a classification problem and a sequence labeling problem, respectively, by assuming a specific domain context. Jeong et al. [27] introduced triangular-chain conditional random fields, which can be applied to joint modeling, to solve the SLU problem. This approach can improve the performance of the SLU module; however, this strategy still incurs high costs and requires much effort to develop the system because of the need for labeled training data.

Many researchers have focused on reducing the human effort that is required to develop SLU modules. To resolve this problem, Tur et al. [28] applied active learning and semi-supervised learning. The authors built a statistical model by applying a supervised learning method to a small amount of labeled data from a call classification task. Next, the authors applied the trained model to a large amount of unlabeled data. Among those data, the authors selected the data with low confidence scores and then performed human labeling for these data. Wu et al. [29] also introduced an SLU approach based on a two-stage classification that can apply weakly supervised training. In their study, the authors could significantly reduce the human effort required to train the model. However, the proposed method still requires a design step and some labeled data.

Recently, unsupervised topic modeling approaches have been used in tasks associated with SLU modules. These approaches are based on latent Dirichlet allocation (LDA) [30]. Unsupervised modeling of dialog acts has been applied to online conversation and social networking applications [31], [32]. For named entity recognition in news articles and search queries [33], [34], a technique that uses a topic model has also been introduced. Celikyilmaz et al. [35] proposed a generative Bayesian model that can support joint learning for a domain, dialog act and named entity, in conjunction with both labeled and unlabeled utterances. Furthermore, the authors utilized external knowledge, such as web search logs, to reinforce their model. The unsupervised approaches introduced above eliminate the cost incurred during the labeling step; however, the cost of the design step still remains because the number of clusters must be fixed. Thus, human effort is required to analyze the data and obtain the optimal number of clusters.

More recently, the hierarchical Dirichlet process (HDP) [36], [37], which is a non-parametric Bayesian approach, has been used to resolve the clustering problem. Unlike parametric approaches, this method allows the model to learn an appropriate cluster number. Crook et al. [38] employed a non-parametric approach to conduct an unsupervised classification of dialog acts in a travel planning dialog corpus. Furthermore, Higashinaka et al. [39] applied an HDP hidden Markov model (HDP-HMM) to a dialog corpus. The HDP-HMM is capable of reflecting the sequential structure of dialog. However, the authors focused only on dialog acts and considered only one dialog corpus.

Our work is also based on a non-parametric approach. However, our work differs from the previous works in several aspects. We focus on the entire task performed by an SLU module and consider dialog corpora collected from multiple domains. In addition, the previous works evaluated performance only in terms of the clustering results. In contrast, we also conduct experiments to determine the effects of an unsupervised model in the context of an end-to-end dialog system.

III. UNSUPERVISED SLU FRAMEWORK

A. Problem Definition

An SLU module aims at extracting semantic frames from user utterances. The semantic frames, which consist of slot-value pairs, are used by the dialog manager. In an SLU task, we divide a semantic frame into three sub-parts: the dialog act (DA), the intent and the slot entity (SE) [26]. The DA and the intent indicate the meaning of the input utterance at the discourse level. Whereas the DA is a domain-independent and surface-level concept, the intent is a domain-dependent and function-level concept. The slot entity denotes a domain-specific semantic constituent at the word or phrase level. Fig. 1 shows an example of a semantic frame for a user utterance.

The supervised approach treats determining DAs and intents as classification problems and determining SEs as a sequence labeling problem. Given the input utterance, the DA and intent are assigned to one of the predefined classes. Each of the words that compose the input utterance is assigned to one of the SE classes, which include the class 'none'.
Fig. 1. A semantic frame example for an input utterance.

The classes used for the DA are a collection of surface-level speech acts, such as yn_question, wh_question and request, and they are defined domain-independently. In contrast, the classes used for the intent and the SE are defined depending on the domain. For example, in the building guidance domain, the intent classes include search_loc, search_phone and guide_loc, and the SE classes include ROOM_NAME, ROOM_TYPE and ROOM_NUMBER.

To develop a multi-domain SDS, we have collected a raw dialog corpus from the target domains. The raw dialog corpus consists of dialogs that contain a sequence of user utterances and system acts. In the building guidance domain, the system acts include GREET, INFORM, GUIDE and SAY. In the supervised approach, the process for training the SLU and DM models is as follows. First, classes for the DAs, intents and SEs are defined for each target domain. Second, for each user utterance included in the raw dialog corpus, all of the labels related to the DAs, intents and SEs must be annotated prior to further processing. For multi-domain SDSs especially, we must additionally train the domain spotter model. In this case, annotation is not required because the domain is already specified when the dialog corpus is gathered.

In this paper, our goal is to apply a fully unsupervised SLU approach to analyze a dialog corpus that is gathered from multiple domains. We collect the information regarding the intents, DAs and SEs that should be used for a semantic frame using a clustering method. With this approach, we can omit two processes: (1) the process that defines classes for DAs, intents and SEs and (2) the process of human labeling. In the proposed approach, cluster IDs are assigned to DAs, intents and SEs instead of classes. Although cluster IDs are not stored in a human-readable format, this might not be a problem for SDS development because the system can still determine whether two objects are identical. We can assign a meaningful label to each cluster after analyzing the results if required.
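To make the frame representation concrete, the following sketch shows one plausible in-memory structure for a semantic frame under both labeling regimes. The class and field names are illustrative assumptions, not the authors' implementation; only the example label values (wh_question, search_loc, ROOM_TYPE) come from the text above.

```python
from dataclasses import dataclass, field

@dataclass
class SlotEntity:
    span: str    # surface words, e.g., "conference room"
    label: str   # "ROOM_TYPE" (supervised) or an opaque cluster ID

@dataclass
class SemanticFrame:
    dialog_act: str                 # e.g., "wh_question" or "da_cluster_2"
    intent: str                     # e.g., "search_loc" or "intent_cluster_5"
    entities: list = field(default_factory=list)

# Supervised annotation of a building guidance utterance.
supervised = SemanticFrame("wh_question", "search_loc",
                           [SlotEntity("conference room", "ROOM_TYPE")])

# The unsupervised framework fills the same fields with cluster IDs;
# the dialog manager only needs to test IDs for equality.
unsupervised = SemanticFrame("da_cluster_2", "intent_cluster_5",
                             [SlotEntity("conference room", "se_cluster_7")])
```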
B. Overall Procedure

In this paper, we propose an unsupervised SLU framework that uses a cascade approach. The framework is divided into three clustering steps. These steps are performed in the following order: (1) dialog act clustering, (2) slot entity clustering and (3) intent clustering. The result of each step is required for the subsequent step.

In the dialog act clustering step, utterance clustering is conducted and a word source is assigned to each word that belongs to an utterance. The input for the dialog act clustering algorithm is an augmented corpus that is a combination of the dialog corpora from each domain. During the clustering process, we obtain a word distribution for each domain; this word distribution can be used for the domain spotter model. Unlike dialog act clustering, SE and intent clustering are conducted separately for the dialog corpus of each domain. Prior to conducting slot entity clustering, the SE candidates must be generated by using the word sources that were assigned during the dialog act clustering step. Clustering is then performed by applying a virtual context document concept to each candidate. In the final step, intent clustering, utterance clustering is performed by using words, SEs and system acts as features. In each clustering step, little human effort is necessary. The overall procedure of our proposed unsupervised SLU framework is illustrated in Fig. 2, and the data flow of the cascade is sketched below. The following sections explain the details of the model for each clustering step.

Fig. 2. The overall procedure for our unsupervised SLU framework.
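Read as pseudocode, the cascade is three passes over the data, each consuming the previous step's output. The sketch below only fixes the data flow; the clustering functions are placeholders for the models described in Sections III-C through III-E.

```python
def run_unsupervised_slu(domain_corpora):
    """domain_corpora: dict mapping a domain name to its list of dialogs."""
    # Step 1: dialog act clustering on the augmented (combined) corpus.
    # It also assigns a source (act / general / domain) to every word and
    # yields per-domain word distributions, reusable by the domain spotter.
    augmented = [d for dialogs in domain_corpora.values() for d in dialogs]
    acts, word_sources, domain_models = cluster_dialog_acts(augmented)

    results = {}
    for domain, dialogs in domain_corpora.items():
        # Step 2: slot entity clustering, per domain, seeded by word sources.
        candidates = generate_se_candidates(dialogs, word_sources)
        se_types = cluster_slot_entities(candidates)
        # Step 3: intent clustering using words, SEs and system acts.
        intents = cluster_intents(dialogs, se_types)
        results[domain] = (acts, se_types, intents)
    return results, domain_models
```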
C. Dialog Act Clustering

The model used for dialog act clustering is a combined form of the HDP-HMM and the content model. The content model, which was proposed by Barzilay and Lee [40] for summarization tasks, utilizes an HMM for topic transitions, with each topic generating a message. By considering the dialog act to be a hidden state, we assume that each user utterance is generated from a dialog act. Based on the content model, a non-parametric approach and a Bayesian extension for dialog corpora are applied. A graphical representation of the dialog act clustering model is presented in Fig. 3.

Fig. 3. A graphical representation of the dialog act clustering model.

Our model is applied to multiple domain-specific dialog corpora. Each domain-specific dialog corpus consists of dialogs for a single domain. Each dialog is a sequence of dialog acts, and each act generates a sentence. The hidden variables are sampled from their posterior distributions using the following equations.
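Each factor in the sampling equations is a collapsed Dirichlet-multinomial predictive. As a minimal sketch (ignoring the boundary cases and the self-transition correction terms of the full HDP-HMM sampler), the conditional for the act assignment of utterance $t$ in dialog $j$ is

$$P(s_{j,t}=k \mid \mathbf{s}_{-(j,t)}, \mathbf{w}, \mathbf{r}) \;\propto\; \frac{n(s_{j,t-1}, k) + \alpha\beta_k}{n(s_{j,t-1}, \cdot) + \alpha} \cdot \frac{n(k, s_{j,t+1}) + \alpha\beta_{s_{j,t+1}}}{n(k, \cdot) + \alpha} \cdot \prod_{i\,:\,r_{j,t,i}=0} \frac{n_k(w_{j,t,i}) + \eta}{n_k(\cdot) + V\eta},$$

where only the words whose source is the current dialog act ($r=0$) enter the emission term; the counts are defined in the notation paragraph below.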

The notation is as follows: $N_{j,t}$ is the number of words in utterance $t$ of dialog $j$; $V$ is the vocabulary size; $n(k, k')$ is the number of bigram transitions from dialog act $k$ to dialog act $k'$; $n_r(w)$ is the number of source-word pair occurrences; $n_{m,r}(w)$ is the number of source-word pair occurrences in a specific domain $m$; and $n_k(w)$ is the number of act-word pair occurrences.

Our model uses a Dirichlet process (DP) to define an a priori distribution for transition matrices in the infinite state space. The state transitions are generated by $\mathrm{Mult}(\pi_{s_{t-1}})$, whose prior is $\pi$. The transition prior is generated by $\mathrm{DP}(\alpha, \beta)$ with strength parameter $\alpha$, which represents the inverse variance of the DP, and base distribution $\beta$, which represents the mean of the DP. The base distribution is generated using $\mathrm{GEM}(\gamma)$ with hyperparameter $\gamma$. GEM is the GEM distribution, which is also known as the stick-breaking construction. The hyperparameter $\gamma$ controls the number of states, and $\alpha$ controls the sparsity of the transition matrix. For the transition matrices $\pi$, the distribution is defined as follows:

$$\beta \mid \gamma \sim \mathrm{GEM}(\gamma), \qquad \pi_k \mid \alpha, \beta \sim \mathrm{DP}(\alpha, \beta), \qquad s_t \mid s_{t-1} \sim \mathrm{Mult}(\pi_{s_{t-1}}).$$
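The GEM draw can be simulated directly with the stick-breaking recipe; a self-contained sketch (the truncation level is an implementation convenience, not part of the model):

```python
import numpy as np

def sample_gem(gamma, truncation=50, rng=np.random.default_rng(0)):
    """Truncated stick-breaking draw: beta_k = v_k * prod_{j<k} (1 - v_j)."""
    v = rng.beta(1.0, gamma, size=truncation)          # stick fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta = v * remaining
    return beta / beta.sum()                           # renormalize the tail

beta = sample_gem(gamma=1.0)
print(beta[:5], beta.sum())   # first few sticks; the weights sum to 1
```

Smaller $\gamma$ concentrates the mass on fewer sticks, which is why $\gamma$ controls the effective number of states.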
Each sentence is represented by a bag of words, shown using the corresponding plate in the figure. We assume that each word is generated from one of three sources: words in the current dialog act ($r=0$), general words ($r=1$) and domain words ($r=2$). Thus, a new hidden variable $r$ determines the source of each word. This variable is initially assigned a value of zero and is drawn from $\mathrm{Mult}(\lambda_m)$ with a parameter generated by $\mathrm{Dir}(\varepsilon)$ with parameter $\varepsilon$. In a specific domain $m$, the distribution for the source of each word is defined as follows:

$$P(r \mid m) = \frac{n_m(r) + \varepsilon}{\sum_{r'} n_m(r') + 3\varepsilon}.$$

General words that frequently occur in many sentences and domain words that frequently occur only in a specific domain are not helpful for identifying a dialog act. By applying different emission distributions depending on the word source, sentence clustering can be conducted in a manner better tailored to each dialog act.

If the word source is the words in the current dialog act ($r=0$), then the word is generated using $\mathrm{Mult}(\phi^{act}_k)$ with a prior generated using $\mathrm{Dir}(\eta)$ with symmetric hyperparameter $\eta$. For a specific dialog act $k$, the word distribution of the dialog act model is defined as follows:

$$P(w \mid k, r=0) = \frac{n_k(w) + \eta}{\sum_{w'} n_k(w') + V\eta}.$$

If the word source is the general words ($r=1$), then the word is generated using $\mathrm{Mult}(\phi^{bg})$ with a prior generated using $\mathrm{Dir}(\eta)$ with symmetric hyperparameter $\eta$. In the background model, the word distribution is defined as follows:

$$P(w \mid r=1) = \frac{n_{bg}(w) + \eta}{\sum_{w'} n_{bg}(w') + V\eta}.$$

If the word source is the domain words ($r=2$), then a word is generated using $\mathrm{Mult}(\phi^{dom}_m)$ with a prior generated using $\mathrm{Dir}(\eta)$ with symmetric hyperparameter $\eta$. The domain model can be used for the domain spotter, which is a key issue in multi-domain SDSs. For a specific domain $m$, the word distribution of the domain model can be defined as follows:

$$P(w \mid m, r=2) = \frac{n_{m,2}(w) + \eta}{\sum_{w'} n_{m,2}(w') + V\eta}.$$

D. Slot Entity Clustering

Slot entity clustering is separately applied to the dialog corpus of each specific domain. First, the SE candidates must be extracted. Next, clustering is performed over the SE candidates.

1) Slot Entity Candidate Generation: To generate SE candidates, we use the source of each word. The sources are determined by the hidden variable $r$ in the dialog act clustering model. Primarily, we generate temporary SE candidates by extracting consecutive words that are drawn from the domain words. However, segmentation is required for certain temporary SE candidates, as shown in Fig. 4. That figure shows the words, human-annotated entities and word sources for an example sentence. A word source is 2 when the corresponding word is generated from the domain words. Therefore, the last four words in the example sentence are considered an SE candidate. According to the human annotation, this candidate should be divided into two candidates. To determine the final SE candidates, we apply a segmentation algorithm using the pointwise mutual information (PMI)-based stickiness function proposed by Li et al. [41].

Fig. 4. The words, entities and sources for an example sentence.

The goal of segmentation is to divide a temporary SE candidate into $m$ consecutive segments $s_1, \ldots, s_m$ that each contain one or more words, where $m \geq 1$. When the optimal segmentation is obtained, each segment is added to the final SE candidate list. To identify the optimal segmentation out of all possible segmentations, the following objective function, maximizing the total stickiness of the segments, is used:

$$\hat{S} = \arg\max_{S = (s_1, \ldots, s_m)} \sum_{i=1}^{m} C(s_i).$$
To measure the stickiness of a segment, we use PMI, which is a measurement of word collocation. The stickiness function, $C(s)$, is defined by mapping the generalized PMI to the range [0, 1]. Given the segment $s = w_1 \cdots w_n$, the equations are as follows, where $\Pr(\cdot)$ is the probability of the word sequence and is calculated for a given corpus:

$$C(s) = \frac{1}{1 + e^{-\mathrm{PMI}(s)}}, \qquad \mathrm{PMI}(s) = \log \frac{\Pr(w_1 \cdots w_n)}{\prod_{i=1}^{n} \Pr(w_i)}.$$

By using the method mentioned above, SE candidates can be extracted without using external knowledge. When extracting the SE candidates, the performance can be improved by incorporating a general-domain SE recognizer or a domain dictionary that has been built in advance.
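A small sketch of the segmentation idea: dynamic programming over split points, scoring each segment with a sigmoid-squashed PMI. The corpus-probability callback, the neutral single-word score and the additive objective are simplifying assumptions, not the exact stickiness function of Li et al. [41].

```python
import math
from functools import lru_cache

def make_stickiness(prob):
    """prob(ngram_tuple) -> corpus probability estimate (assumed given)."""
    def stickiness(seg):
        if len(seg) == 1:
            return 0.5                        # neutral score for single words
        pmi = math.log(prob(seg) / math.prod(prob((w,)) for w in seg))
        return 1.0 / (1.0 + math.exp(-pmi))   # squash PMI into (0, 1)
    return stickiness

def best_segmentation(words, stickiness):
    """Split a temporary SE candidate to maximize total segment stickiness."""
    @lru_cache(maxsize=None)
    def solve(i):                             # best segmentation of words[i:]
        if i == len(words):
            return 0.0, ()
        best = (-1.0, ())
        for j in range(i + 1, len(words) + 1):
            tail_score, tail = solve(j)
            cand = (stickiness(tuple(words[i:j])) + tail_score,
                    (tuple(words[i:j]),) + tail)
            best = max(best, cand)
        return best
    return solve(0)[1]
```

Applied to the example of Fig. 4, the dynamic program would split the four-word temporary candidate wherever the PMI between adjacent words collapses.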
2) Slot Entity Candidate Clustering: For classification and latent semantic association of SEs, topic models such as LDA can be used by applying a virtual context document concept. Our method for slot entity clustering is inspired by such approaches. The goal of slot entity clustering is to group the SE candidates into SE types. For this purpose, we use a non-parametric Bayesian approach. Fig. 5 shows a graphical representation of our model for slot entity clustering. We assume that each SE candidate has an SE type and that each SE type generates a virtual context document, which is represented by a bag of context features using the corresponding plate. The SE type is generated using $\mathrm{Mult}(\theta)$, whose prior is $\theta$. This prior is the distribution of the SE types of all the SE candidates and is generated using a DP with hyperparameter $\alpha$ and base distribution $\beta$. The base distribution is generated using a GEM distribution with hyperparameter $\gamma$. The context feature is generated using $\mathrm{Mult}(\phi_u)$, whose prior is $\phi_u$. This prior is a distribution of the context features in the SE types and is generated using $\mathrm{Dir}(\eta)$ with symmetric hyperparameter $\eta$. By generating the virtual context documents from all of the SE candidates and then learning the model, we can cluster the SE candidates.

Fig. 5. A graphical representation of our model for slot entity clustering.

Context words and collocations are included in the virtual context document for each SE candidate. Context words and collocations were also used as features for slot classification by Wu et al. [29]. If the words $w_i, \ldots, w_j$ in a sentence are extracted as an SE candidate, the features that compose the virtual context document for the SE candidate are as presented in Fig. 6.

Fig. 6. The features of a virtual context document for a slot entity candidate.
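A sketch of how a virtual context document might be assembled for one occurrence of an SE candidate. The window size and the feature templates are assumptions for illustration; Fig. 6 specifies the authors' actual feature set.

```python
def virtual_context_document(sentence, start, end, window=2):
    """Context words and collocations around the span sentence[start:end]."""
    left = sentence[max(0, start - window):start]
    right = sentence[end:end + window]
    features = [f"ctx_word={w}" for w in left + right]
    # Collocation features mark the candidate position with "_".
    if left:
        features.append(f"colloc={left[-1]}+_")
    if right:
        features.append(f"colloc=_+{right[0]}")
    return features

sent = "how do I get to the conference room on the third floor".split()
print(virtual_context_document(sent, 6, 8))   # candidate: "conference room"
```

Candidates of the same SE type tend to occur in similar contexts, so their virtual documents share features, which is what the clustering model exploits.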
E. Intent Clustering

The objective of this process is to cluster utterances that correspond to an identical intent. The intent clustering model has a form similar to that of the dialog act clustering model, which is also used for clustering utterances. DAs are domain-independent, whereas intents denote domain-dependent meanings of utterances. Because of this dependence, intent clustering is applied to the dialog corpus of a specific domain. Additionally, a DA is a surface-level concept; thus, it uses only words as features. An intent, in contrast, is a function-level concept; thus, it uses SEs and system acts in addition to words as features. According to the study of Jeong et al. [27], a cascade or joint approach that combines intent classification and SE recognition for SLU can perform better than non-hybrid methods. In addition, intents are utilized as significant information when predicting the system act in the DM. Thus, SEs and system acts are significantly correlated with intents and are assumed to be useful features for intent clustering.

A graphical representation of the intent clustering model is shown in Fig. 7. The transition matrices can be defined in the same manner as for the dialog act clustering model. Each dialog is a sequence of intents, and each intent generates an utterance and system acts. The hidden variables are sampled from the posterior distributions given by the following equations:

$$P(z_{j,t}=c \mid \cdot) \;\propto\; P(z_{j,t}=c \mid z_{j,t-1}, z_{j,t+1}) \prod_{i\,:\,r_{j,t,i}=1} P(w_{j,t,i} \mid c) \prod_{e \in \mathbf{e}_{j,t}} P(e \mid c) \prod_{a \in \mathbf{a}_{j,t}} P(a \mid c).$$

We define additional notation as follows: $M_{j,t}$ is the number of entities in utterance $t$ of dialog $j$; $E$ is the number of unique SEs; $A$ is the number of unique system acts; and $n_c(a)$ is the number of intent-system act pair occurrences.

Each utterance is represented by a bag of words and a bag of entities, shown using the corresponding plates in the figure. The entity word is replaced with its entity class name. The entities are generated using $\mathrm{Mult}(\phi^{ent}_c)$ with a prior that is generated using $\mathrm{Dir}(\eta_e)$ with symmetric hyperparameter $\eta_e$. For a specific intent $c$, the entity distribution can be defined as follows:

$$P(e \mid c) = \frac{n_c(e) + \eta_e}{\sum_{e'} n_c(e') + E\,\eta_e}.$$
Fig. 7. A graphical representation of the intent clustering model.

The system acts are generated using $\mathrm{Mult}(\phi^{act}_c)$, with a prior that is generated using $\mathrm{Dir}(\eta_a)$ with symmetric hyperparameter $\eta_a$. For a specific intent $c$, the system act distribution can be defined as follows:

$$P(a \mid c) = \frac{n_c(a) + \eta_a}{\sum_{a'} n_c(a') + A\,\eta_a}.$$

In intent clustering, unlike in dialog act clustering, each word is generated from one of two sources: the general words or the words in the current intent. A new hidden variable determines the source of each word and is drawn from a Bernoulli distribution with parameter $\lambda$, which is generated using a beta distribution with parameter $\varepsilon$. The word distribution for a specific intent and the word distribution for a general word in the background model are defined in the same manner as in the dialog act clustering model.
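Per utterance, the intent model therefore observes three bags; a minimal sketch of the feature extraction, in which each SE span is replaced by its cluster ID as described above (all names are illustrative):

```python
def intent_observations(words, se_spans, system_acts):
    """se_spans: list of (start, end, se_cluster_id) tuples for the utterance."""
    tokens = list(words)
    # Replace each entity span with its entity class (a cluster ID).
    for start, end, se_id in sorted(se_spans, reverse=True):
        tokens[start:end] = [f"<{se_id}>"]
    return {
        "words": tokens,                        # entities abstracted away
        "entities": [s[2] for s in se_spans],   # bag of SE cluster IDs
        "system_acts": list(system_acts),       # acts produced in response
    }

obs = intent_observations("where is the conference room".split(),
                          [(3, 5, "se_cluster_7")], ["INFORM"])
```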
F. Inference

To perform inference in our models, we use Gibbs sampling [42], which is a Markov chain Monte Carlo algorithm: the hidden variables are sampled from their posterior distributions. In our models, several hyperparameters must be provided. The simplest way to choose the hyperparameters is a grid search; however, a grid search is time-consuming and expensive. Therefore, instead of using a grid search, we place Bayesian priors on all of the hyperparameters. In the Bayesian approach, we can select the hyperparameters automatically by treating them as additional hidden variables. We sample the hyperparameters using Gamma(0.1, 0.1) priors after each iteration of Gibbs sampling [37].
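One standard way to realize this Gamma-prior hyperparameter move is the auxiliary-variable update of Escobar and West for DP concentration parameters; whether the authors used exactly this scheme is not stated, so treat the sketch below as one concrete possibility.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_concentration(alpha, n_items, n_clusters, a=0.1, b=0.1):
    """Escobar & West-style auxiliary-variable update for a DP
    concentration parameter under a Gamma(a, b) prior."""
    eta = rng.beta(alpha + 1.0, n_items)                 # auxiliary variable
    odds = (a + n_clusters - 1) / (n_items * (b - np.log(eta)))
    pi = odds / (1.0 + odds)
    shape = a + n_clusters if rng.random() < pi else a + n_clusters - 1
    return rng.gamma(shape, 1.0 / (b - np.log(eta)))

alpha = 1.0
for sweep in range(1000):
    # ... one Gibbs sweep over acts, word sources, SE types and intents ...
    alpha = resample_concentration(alpha, n_items=500, n_clusters=12)
```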
IV. MULTI-DOMAIN SDS USING UNSUPERVISED SLU FRAMEWORK

A. An Example-Based Dialog Management Framework

Our multi-domain SDS is implemented using an example-based dialog management (EBDM) framework [14], [43]. The EBDM framework, which is a data-driven approach, is inspired by example-based machine translation [44]. The EBDM framework is a simple and powerful method for rapidly developing SDSs for multi-domain dialog processing. In addition to these features, the use of an unsupervised SLU module strengthens the EBDM framework's advantages, such as supporting rapid prototyping in various applications and incurring reasonably low development costs.

With a traditional EBDM framework that uses a supervised approach, the dialog corpora are collected for each domain, and semantic tags (e.g., DA, intent and SE) are then manually assigned to the user utterances. With the proposed method, however, the semantic tags are automatically assigned by applying the unsupervised SLU module to a dialog corpus. Then, a dialog example database (DEDB) is automatically built using the annotated dialog corpus.

Fig. 8 presents the overall strategy of the EBDM framework for SDSs. The EBDM framework employs a statistical SLU approach that performs modeling by using conditional random fields (CRFs) [45] to extract semantic frames from user utterances. To train the SLU model, the EBDM framework uses the user utterances and the semantic tags attached to them in a dialog corpus that has been automatically annotated using the unsupervised SLU module.

Fig. 8. Overall strategy of the EBDM framework.

The dialog manager performs three steps to determine a system act: query generation, example search and example selection. First, a query is generated from the SLU result, using the semantic frame and the discourse history of the current utterance (the query generation step). Next, the dialog manager searches the DEDB for a dialog example with a semantically similar form (the example search step). If no matching item is found, the search is conducted again after the search condition is relaxed (this is referred to as a relaxation strategy). Finally, the best dialog example is selected according to the scores of the retrieved examples; the score is determined based on utterance similarity and discourse history similarity (the example selection step).
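A toy sketch of the search-and-select logic with a relaxation strategy. The query keys, similarity measure and score weights are hypothetical; the real system's scoring combines utterance and discourse-history similarity as described above.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def score(example, query, w_utt=0.7, w_hist=0.3):
    # Combined utterance / discourse-history similarity (weights assumed).
    return (w_utt * jaccard(example["words"], query["words"])
            + w_hist * jaccard(example["history"], query["history"]))

def search_with_relaxation(dedb, query):
    """dedb: list of dialog-example dicts sharing keys with the query.
    Constraints are dropped one at a time until some example matches."""
    keys = ["intent", "dialog_act", "entities"]
    while keys:
        matches = [ex for ex in dedb if all(ex[k] == query[k] for k in keys)]
        if matches:
            return max(matches, key=lambda ex: score(ex, query))
        keys.pop()                  # relax the least critical constraint
    return None
```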
The content database contains the content items related to the domain. To determine the content that the user wants, the SEs are used. The discourse history stores and maintains previous discourse information. The input of the natural language generation (NLG) module is a semantic representation of a system act, from which the NLG module generates the system utterance. To do so, the NLG module utilizes a system template that is predefined for each system act.

When the unsupervised SLU framework is applied, the intents, DAs and SEs of the semantic frames are filled with cluster IDs. The cluster IDs are not in a human-readable format; however, this format does not create any problems for the EBDM framework. Therefore, the human effort traditionally required in the design and annotation steps of the SLU module and the DM in the EBDM framework can be eliminated by the application of the unsupervised SLU framework. Nevertheless, during the process of accessing knowledge sources such as content DBs, DB fields cannot be matched successfully using only the unsupervised tags of slot entities. Therefore, a human must link the unsupervised tags to the DB fields. However, this is not a problem only for the unsupervised approach: human intervention is also required in the supervised approach, because the DB field related to a given slot entity label must be specified in advance.
B. Domain Spotter

Our multi-domain SDS employs a distributed architecture that makes it easy to add and modify domains. The distributed structure includes a domain spotter and several domain-specific modules. The domain spotter module is responsible for determining the domain of an input utterance and for delivering the input utterance to the target domain-specific module. In this study, we employ a hybrid method, which was used by Lee et al. [14], that combines keyword-based and feature-based approaches for the domain spotter module. This method uses linguistic (words), semantic (dialog acts) and keyword features (n-best keyword and n-best class), as is done for maximum entropy classifiers [46]. Lee et al. used term frequency and inverse document frequency (tf-idf) [47] to extract keyword features. We instead utilize the domain model obtained during dialog act clustering. Our method is related to LDA-style topic identification [30].
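The difference between the two keyword-feature sources can be sketched as follows; both functions return an n-best keyword list that would feed the maximum entropy classifier [46]. The scoring details are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(utterance, doc_freq, n_docs, k=5):
    """Baseline: n-best keywords by tf-idf, as in Lee et al. [14]."""
    tf = Counter(utterance)
    scores = {w: tf[w] * math.log(n_docs / (1 + doc_freq.get(w, 0)))
              for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def domain_model_keywords(utterance, domain_word_dists, k=5):
    """Proposed: score words by the per-domain word distributions learned
    during dialog act clustering; general words score low everywhere."""
    scores = {w: max(dist.get(w, 0.0) for dist in domain_word_dists.values())
              for w in set(utterance)}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```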
C. Agenda Graph

The EBDM framework can use an agenda graph as prior knowledge [48]. In this approach, an agenda graph is used to address the robustness problem in practical applications. An agenda graph provides a simple method of encoding domain-specific dialog control to complete the task. An agenda graph is composed of nodes that denote sub-tasks and edges that connect two nodes. Each node has a precondition that must be satisfied before the sub-task is completed. The precondition is defined as the intent of the current utterance and the SEs accumulated from the beginning of the dialog until the current utterance.

A clustering technique based on k-means has been proposed to reduce the human effort required to construct an agenda graph [49]. However, this technique requires one to fix the number of nodes included in the agenda graph prior to clustering, because k-means is a type of parametric approach. To resolve this problem, we propose an enhanced method that automatically generates an agenda graph by modifying the intent clustering method. We assume that each sub-task generates words, entities and a system act, as in intent clustering. Additionally, unlike in intent clustering, we consider the entities accumulated from the beginning of the dialog until the current utterance. An agenda graph can be automatically generated after clustering in this manner: upon completion of the clustering, each state denoting a sub-task is taken as a node, and the transition probability between states is taken as the edge weight.
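Once the modified intent clustering has converged, turning its output into an agenda graph is mechanical; a sketch, where assignments maps each (dialog, turn) position to its inferred sub-task state:

```python
from collections import defaultdict

def build_agenda_graph(dialogs, assignments):
    """Nodes are sub-task states; edge weights are transition probabilities
    estimated from consecutive states within each dialog."""
    counts = defaultdict(lambda: defaultdict(int))
    for dlg_id, dialog in enumerate(dialogs):
        states = [assignments[(dlg_id, t)] for t in range(len(dialog))]
        for prev, nxt in zip(states, states[1:]):
            counts[prev][nxt] += 1
    graph = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        graph[prev] = {nxt: c / total for nxt, c in nexts.items()}
    return graph   # graph[u][v] = P(v | u), the weight of edge u -> v
```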
V. RESULTS AND DISCUSSION

In this section, we discuss in detail the experiments that we conducted by applying the proposed models to dialog corpora. Our experimental results consist of two parts. First, we compare the automatically labeled corpora, which were generated using the unsupervised methods, with the human-labeled corpora by performing a clustering evaluation. Next, we present a dialog system evaluation to demonstrate the effectiveness of our models in an end-use application.

A. Dialog Corpora

Korean-language dialog corpora from various domains were collected to construct a dialog system. We used the Wizard-of-Oz (WOZ) method [50] to construct the corpora. Given domain-specific tasks (e.g., "Find the president's room for a visit" in the building guidance domain), human-human dialogs were collected from ten individuals who were acquainted with the dialog system. In such an environment, unrealistic dialogs can sometimes be included. To create a refined corpus for training, we filtered out erroneous dialogs by considering utterance patterns, vocabulary and dialog flow. We collected goal-oriented corpora in the following four domains: car navigation, weather information, TV program guidance and building guidance.

After collecting the raw dialog corpora, we defined the DA, intent and SE classes for each corpus through human analysis. Next, we manually assigned the DA and intent classes to the utterances, and the SE classes to the words, in all of the utterances in the raw corpora. Statistics describing the manually labeled corpora are presented in Table I. When using our unsupervised approach, only the raw corpora were required to generate the automatically labeled corpora. The manually labeled corpora were used for reference in our experiments.

TABLE I
CORPORA STATISTICS

B. Clustering Evaluation

1) Methods and Measures: We performed the clustering evaluation by measuring the distance between the clustering result
and the gold standard. Here, we regarded the manually labeled corpora as the gold standard. For the clustering evaluation, we computed the following measures: purity (Pur), the Rand index (RI), the F-measure (F-M) and the V-measure (V-M) [51], [52].

For automated quantitative analyses, we obtained the clustering result after 1000 iterations and computed the measures described above over 100 trials for each method. These measures range between 0 and 1, and a higher value indicates better clustering quality. In addition, we analyzed the clustering quality by examining the confusion matrix, which can be used to visualize how much a cluster differs from the gold standard.
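Of these measures, purity is the simplest to state in code; a reference implementation is given below (the Rand index, F-measure and V-measure have standard implementations, e.g., v_measure_score in scikit-learn):

```python
from collections import Counter

def purity(clusters, gold):
    """clusters, gold: parallel lists of cluster IDs and gold labels.
    Each cluster is credited with its majority gold label; purity is the
    fraction of items matching their cluster's majority label."""
    pair_counts = Counter(zip(clusters, gold))
    majority = Counter()
    for (c, g), n in pair_counts.items():
        majority[c] = max(majority[c], n)
    return sum(majority.values()) / len(gold)

print(purity([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))  # 0.8
```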
2) Dialog Act Clustering: To verify the effects of our dialog act clustering method (HDP-HMM-A), which was introduced in Section III-C, we compared it with other methods: HDP-HMM, HDP and LDA-HMM-A. The HDP-HMM method uses only bags-of-words as features for dialog act clustering, and HDP is a method that ignores the state transition probability in HDP-HMM. Finally, LDA-HMM-A is a method that applies a parametric approach to our model by specifying the cluster number in advance. In this case, for the cluster number, we use the number defined by the human annotation.

For the dialog act clustering, all of the dialog corpora from the four domains were used. The clustering performance of the various methods is presented in Table II. The table includes the various measures previously mentioned and the number of clusters (#C). Each resulting value is obtained by averaging the values over 100 attempts.

TABLE II
RESULTS OF THE DIALOG ACT CLUSTERING FOR VARIOUS MODELS; MEAN FOR EACH CRITERION
(Markers in the table indicate results significantly better than HDP, HDP-HMM and HDP-HMM-A, respectively.)

The experimental results indicate that the transition probability contributes to enhanced performance, and considering the dialog structure yields an even better effect. Our method demonstrated improvements in dialog act clustering performance because the additional variables for the word source alleviated the tendency of utterances from the same domain to be clustered together. The LDA-HMM-A method exhibited worse performance than HDP-HMM-A because the number of DAs defined by the human annotation was not the optimal number of clusters. In most applications in which parametric approaches are used, it might be necessary to conduct several experiments with different numbers of clusters to determine the optimal number. In our non-parametric approach, however, the model infers the optimal number of clusters.

To construct the confusion matrix, we selected the dialog act clustering result that was closest to the average number of clusters calculated over the 100 attempts (Fig. 9). The cluster IDs are determined during the learning process; the figure shows the cluster IDs and the corresponding human labels. A significant difference between the human annotation and the dialog act clustering result can be observed in the confusion matrix. However, for 'request' and 'wh_question', which are the most frequent dialog acts in the human-annotated corpora, the clustering was performed successfully.

Fig. 9. The confusion matrix for the clustering result of our dialog act clustering model.

3) Slot Entity Clustering: We also evaluated the SE candidate generation and clustering method introduced in Section III-D. First, the SE candidate generation used the word sources from the dialog act clustering model; we used the result selected by the same criteria as were used for the confusion matrix in dialog act clustering. The SE candidate generation yielded an F-measure of 0.667. The precision is low because domain-specific expressions that are not SEs were extracted along with the SE candidates. To improve the SE detection performance, we filtered out erroneous SE candidates using hand-written rules. The rules were defined by specifying POS tag sequences that cannot be SE candidates. For example, an SE candidate that ends with a post-position or a verb was filtered out because such POS tag sequences are unnatural for SEs in the Korean language. When this filtering was applied, the F-measure increased to 0.849.

Second, we performed the SE candidate clustering evaluation for each domain (Table III). We constructed the reference by matching the SE candidates with the human labels; if a candidate was not an SE, its answer was set to the 'none' class. Our slot entity clustering method (HDP) uses a virtual context document as the feature set, without using the transition probability. For comparison, we used a parametric approach (LDA) that has the same structure.

TABLE III
RESULTS OF THE SLOT ENTITY CLUSTERING FOR THE HDP AND LDA MODELS USING THE GENERATED SLOT ENTITY CANDIDATES; MEAN FOR EACH CRITERION
(Markers in the table indicate results significantly better than HDP.)
The performance of LDA was improved because the cluster number suggested by the human annotation was used. For our method, the confusion matrix for the slot entity clustering result in the building guidance domain is illustrated in Fig. 10. The clustering result selection criteria are the same as for dialog act clustering. Some distinct SEs that have similar context words are observed in the same cluster.

Fig. 10. A confusion matrix for the clustering result of our slot entity clustering model for the building guidance domain.

In addition, we evaluated the slot entity clustering method by extracting the correct SE candidates from the human-labeled corpus (Table IV). When the correct SE candidates were given, a better performance was obtained. Although we focused on a fully unsupervised approach in these experiments, we believe that using external resources in the SE candidate generation step would lead to superior slot entity clustering performance.

TABLE IV
RESULTS OF THE SLOT ENTITY CLUSTERING FOR THE HDP AND LDA MODELS USING THE CORRECT SLOT ENTITY CANDIDATES; MEAN FOR EACH CRITERION
(Markers in the table indicate results significantly better than HDP.)

4) Intent Clustering: In Section III-E, we introduced the various features used for our intent clustering method. To determine the effects of each feature, we employed various models in our experiments. Among them, we used HDP, which employs only bags-of-words as features, and HDP-HMM, which additionally considers the state transition probability, as baselines. We also applied the model extensions of HDP-HMM separately. Briefly, we measured the respective performances in three different cases: HDP-HMM-W, in which the word source is applied; HDP-HMM-N, in which the SE distribution is applied; and HDP-HMM-S, in which the system act distribution is applied. Moreover, we measured the performance in two additional cases: HDP-HMM-A, in which all of the features of our intent clustering method are applied, and LDA-HMM-A, which is the same as HDP-HMM-A except that a specified number of clusters was used.
We also evaluated the intent clustering for each domain (Table V). Different tendencies were observed for different domains. Whereas the use of the word source negatively impacted the performance for all domains except the weather information domain, the use of the SE distribution enhanced the clustering performance for all domains except the weather information domain. The use of the system act distribution yielded a significant improvement in performance for every domain. The experimental results indicate that the proposed method is effective for intent clustering.

TABLE V
RESULTS OF THE INTENT CLUSTERING FOR VARIOUS MODELS; MEAN FOR EACH CRITERION. (a) Car navigation. (b) Weather information. (c) TV program guide. (d) Building guidance.
(Markers in the table indicate results significantly better than HDP, HDP-HMM and HDP-HMM-A, respectively.)

The confusion matrix for the intent clustering result of our method applied to the building guidance domain is shown in Fig. 11. The clustering result selection criteria are the same as those used for dialog act clustering. Some utterances that were labeled with different classes were assigned to the same clusters. Such instances occurred when those utterances appeared in similar
dialog contexts (e.g., bye and thank in the building guidance corpus). In contrast, some utterances that were labeled with the same class were assigned to different clusters. Such instances occurred because of differences in the dialog context and differences in the SEs contained in the utterances (e.g., search_loc and guide_loc in the building guidance corpus).

Fig. 11. A confusion matrix for the clustering result of our intent clustering model for the building guidance domain.
C. Dialog System Evaluation

In the previous evaluation, we measured the clustering performance using the manually annotated labels as the target clustering. For two reasons, this strategy is not the optimal method to evaluate the clustering results. First, we cannot guarantee that the human annotation is the optimal answer. Second, the ultimate goal is to use the automatically labeled dialog corpus in a dialog system. Thus, for an enhanced evaluation, evaluating performance using the dialog system itself is necessary.

1) Methods and Measures: In our experiments, we used the EBDM model [14] that was introduced in Section IV. We evaluated two dialog systems, each using the agenda graph for its corpus. The dialog systems were trained on the human-annotated corpus (HC) or the automatically annotated corpus (AC). To construct the models using the AC, we selected one result from among the multiple clustering results using the criteria that were used to build the confusion matrices in the experiments of Section V-B. The agenda graphs were generated in a handcrafted manner for the HC and automatically, using the non-parametric approach explained in Section IV-C, for the AC.

Even with the AC, a small amount of human intervention was required for the content DB manager and the natural language generation module of the dialog system. We performed two evaluations: a simulated user evaluation and a real user evaluation. The evaluations were conducted for each domain. An experiment for the multi-domain setting was performed using only the real user evaluation, because our dialog simulator does not consider multi-domain environments. Additionally, we conducted experiments on the domain spotter used for our multi-domain SDS.

To measure the quality of the dialog systems, we computed the average turn length (ATL) and the task completion rate (TCR) for the dialogs. In addition, we defined an average score function (SCORE), which represents a combined measure of the ATL and the TCR. SCORE is similar to the reward score commonly used in reinforcement learning-based dialog systems [2], [53]. SCORE is calculated as follows: for each dialog, the system starts with 30 points, and the points are decremented in each dialog turn. If the task of a dialog fails, SCORE is −30 points.
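Under this description, SCORE for a single dialog can be computed as follows; the one-point-per-turn decrement is an assumption consistent with the wording above.

```python
def dialog_score(n_turns, task_success, start=30, fail=-30, per_turn=1):
    """Start at 30 points, lose per_turn points per dialog turn,
    or take the failure score if the task was not completed."""
    return start - per_turn * n_turns if task_success else fail

# Averaged over dialogs, as reported in Table VI.
scores = [dialog_score(8, True), dialog_score(15, False)]
print(sum(scores) / len(scores))   # -> -4.0
```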
In the simulated user evaluation, the ATL, the TCR and SCORE were automatically calculated by the simulator. In the real user evaluation, the participants completed questionnaires about successful turns and task completion. Using the results of the questionnaires, the ATL, the TCR and SCORE were calculated. Additionally, we computed the successful turn rate (STR).

2) Simulated User Evaluation: We used a dialog simulator that is composed of user-intention, user-surface and ASR channel simulators [54]. The intention simulator was implemented using the conditional random field model [45] to consider dialog sequences. This simulator generated the next user intention given the current discourse context. After selecting the user intention, the surface simulator was used to generate the user utterance expressing the selected intention, using intention-specific part-of-speech tag sequences and dictionaries. Then, the ASR channel was used to automatically transform the raw utterance into a noisy utterance according to a specified word error rate (WER).

To automatically measure the success of the generated dialogs, the final states of task completion were defined. The models for the simulator were trained on the manually annotated dialog corpus. For each dialog system, we used 1000 simulated dialogs and the 5-best recognition hypotheses to measure the TCR, the ATL and SCORE under WER conditions ranging from 0% to 50%.

TABLE VI
RESULTS OF THE DIALOG SYSTEMS FOR THE SIMULATED USER EVALUATION; THE ATL, THE TCR AND SCORE ARE AVERAGED OVER 1000 DIALOGS

Table VI shows the performance of the two dialog systems for each domain when the WER is zero. In addition, Fig. 12 shows SCORE of the two dialog systems under various WER conditions. In most cases, the HC exhibits better performance than the AC, especially for SCORE. In spite of this result, the differences between the two systems are not considerable, except for the weather information domain. As the WER increases, the differences tend to become smaller. In the case of the TCR, the two dialog systems have similar performance; in some noisy environments, the AC even performed better than the HC. Thus, we conclude that the performance difference between the two systems is not significant for the limited tasks investigated, particularly in goal-oriented dialog systems in noisy environments.

Fig. 12. SCORE of the dialog systems from the simulated user evaluation; SCORE is averaged over 1000 dialogs under various WER conditions.

3) Real User Evaluation: To compare the performance of the dialog systems in the real world, ten undergraduate students were employed in a real-user evaluation. To prevent the participants from knowing whether the system was trained on the HC or the AC, the participants tested the two dialog systems in a random order.
In the evaluation, we considered only text inputs; therefore, the WER was 0%.

We provided the participants with five pre-defined tasks for each domain and three pre-defined tasks for the multi-domain setting. The participants used the dialog system to accomplish the tasks, and they judged the task success for each dialog and the turn success for each utterance. Finally, we measured the ATL, the TCR, the STR and SCORE for each dialog system over the resulting dialogs. As in the simulated user evaluation, the system trained on the HC and the system trained on the AC were compared. The results are shown in Table VII. Although these results may not be significant because of the small test set sizes, the HC demonstrated better performance than the AC in all cases. However, the performance of the AC is acceptable. We believe that our approach will be applicable for developing systems in various domains for real users.

TABLE VII
RESULTS OF THE DIALOG SYSTEMS FOR THE REAL USER EVALUATION; THE ATL, THE TCR AND SCORE ARE AVERAGED OVER 50 DIALOGS
4) Domain Identification Evaluation: To evaluate the domain spotter introduced in Section IV-B, we confirmed whether it correctly predicted the target domain of each utterance. For the evaluation, we used linguistic, semantic and keyword features for a maximum entropy classifier, similar to the work of Lee et al. [14]. To extract the keyword features, we considered two different cases: using (1) tf-idf and (2) the domain model trained during our dialog act clustering. For each method, five-fold cross validation was used.

TABLE VIII
RESULTS OF THE DOMAIN SPOTTER; THE ACCURACY IS CALCULATED USING FIVE-FOLD CROSS VALIDATION

The experimental results are presented in Table VIII. The results indicate that extracting keyword features using our domain model yields an increased accuracy compared with tf-idf. This suggests that the domain model, which is obtained while developing a multi-domain SDS with the proposed unsupervised SLU framework, can be usefully applied to the domain spotter. The domain spotter using our domain model was applied to the multi-domain SDS that was used in the user evaluation of the dialog system. Using our domain model, the proposed method can be more effective than the baseline because the word sources enable the domain spotter to separate out words that are unrelated to domain keywords.

D. Development Costs

Several processes are necessary to develop a data-driven dialog system. Among these processes, the collection step, the design step and the labeling step require more human effort and time than any other process. We conducted several experiments in the 'TV program guide' domain to explore the effect of our unsupervised approach on reducing development costs.

First, to collect utterances in the 'TV program guide' domain, we used the WOZ method and established the following three experimental conditions:
• WOZ_NON: a situation where no dialog system exists.
• WOZ_SDS: a situation where a dialog system exists.
• WOZ_CON: a situation where a dialog system exists, and the user understands the functions that can be handled by the system.

Ten participants were employed for each experimental condition, and we collected 50 utterances from each participant. For each experimental condition, we measured the elapsed time required for a total of 500 utterances and the percentage of valid utterances among the 500 utterances (Table IX). A valid utterance is an utterance that can be handled by the dialog system and can be used in the training step.

TABLE IX
RESULTS OF THE COLLECTION STEP FOR THREE EXPERIMENTAL CONDITIONS

When the results of WOZ_NON and WOZ_SDS were compared, the elapsed time was reduced by 55 minutes and 46 seconds, as the dialog system supports system response generation. When the results of WOZ_SDS and WOZ_CON were compared, the elapsed time was reduced by 57 minutes and 43 seconds, and the percentage of valid utterances increased from 76.2% to 99.8%.
2462 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 11, NOVEMBER 2013

TABLE X The experimental results show that our approach would be


RESULTS OF THE DESIGN AND ANNOTATION STEP FOR
helpful for the rapid development of a prototype system and
FOUR EXPERIMENTAL CONDITIONS
determining the function range that could be handled by the
system. In addition, when observing the preceding experimental
results that led to effective data collection, it could be contribute
to reducing the overall development costs.

E. Discussion
76.2% to 99.8%. The reason for this result is that valid utter- 1) Design and System Performance: The dialog act, intent
ances were rapidly generated by the participants who under- and slot entity clustering results generated using our unsuper-
stood the functions of the dialog system. vised SLU framework might not be satisfactory when compared
The experimental result shows that the rapid development of with human annotation. According to the clustering evaluations,
the prototype system and the definition of the function range that the values of the F-measure and the V-measure ranged from 0.3
could be handled by the system are important for the efficiency to 0.6. However, our objective is to generate clustering results
of the collection step. Usually, an initial raw dialog corpus is as an alternative to human annotation in dialog system devel-
collected without implementation of the dialog system. To de- opment. The ultimate objective of an SDS is to perform system
velop the prototype system using the corpus, the design step and acts that comply with the desired intention of the user. To ef-
the labeling step should be performed on the corpus. ficiently support this objective, a semantic frame is used in the
To measure the costs required for the design step and the labeling step, the following four experimental conditions were established:
• HUMAN: The two steps are performed by a human under the condition that only the raw dialog corpus is provided.
• HUMAN_K: The two steps are performed by a human under the condition that the raw dialog corpus and the knowledge database are provided.
• AUTOMATIC: The two steps are replaced with clustering results (our approach).
• SEMI_AUTO: The two steps are performed by a human under the condition that the dialog corpus, the knowledge database and the clustering results are provided (our approach plus human intervention).

For each experimental condition, five participants with experience in developing a dialog system performed the design step and the labeling step. We measured the average elapsed time and the gain rate (Table X). The gain rate measures the relative level of improvement and was calculated as

$$\text{Gain rate} = \frac{t_{\mathrm{HUMAN}} - t_{c}}{t_{\mathrm{HUMAN}}} \times 100\%,$$

where $t_{\mathrm{HUMAN}}$ and $t_{c}$ represent the elapsed time under the HUMAN condition and under the corresponding experimental condition, respectively. The initial raw dialog corpus used for the experiment is from the 'TV Program guide' domain introduced in Table I, which includes 138 dialogs and 1272 utterances.
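For concreteness, a worked instance of the gain-rate formula above; the elapsed times are hypothetical values chosen only so that the arithmetic reproduces the 48.0% gain reported for the SEMI_AUTO condition, and are not taken from Table X.

```latex
% Hypothetical elapsed times (illustrative, not from Table X):
%   t_HUMAN = 25 hours (HUMAN condition), t_c = 13 hours (SEMI_AUTO condition)
\text{Gain rate}
  = \frac{t_{\mathrm{HUMAN}} - t_{c}}{t_{\mathrm{HUMAN}}} \times 100\%
  = \frac{25 - 13}{25} \times 100\%
  = 48.0\%
```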
[TABLE X: Results of the design and annotation step for four experimental conditions]

With the HUMAN regarded as the baseline, the elapsed time was reduced by 11.2% when the knowledge database was provided, which means that the information in the knowledge database is useful for the design step and the labeling step. The AUTOMATIC condition applies the approach introduced in this paper, and the elapsed time shown in the table represents the running time of the program. Under the AUTOMATIC condition, the development of the prototype system is significantly faster than under any other condition, but it has the disadvantages of relatively low performance (Table VI and Fig. 12) and a labeled corpus that is not in a human-readable format. In the SEMI_AUTO condition, the elapsed time was reduced by 48.0% compared with the HUMAN. In addition, in contrast to the AUTOMATIC, because the labeled corpus from the SEMI_AUTO is in a human-readable format, it is useful for providing a guideline for an additional data collection step.

The experimental results show that our approach is helpful for the rapid development of a prototype system and for determining the range of functions that the system can handle. In addition, given the preceding experimental results, which led to effective data collection, it could contribute to reducing the overall development costs.

E. Discussion

1) Design and System Performance: The dialog act, intent and slot entity clustering results generated by our unsupervised SLU framework might not be satisfactory when compared with human annotation: according to the clustering evaluations, the values of the F-measure and the V-measure ranged from 0.3 to 0.6. However, our objective is to generate clustering results as an alternative to human annotation in dialog system development. The ultimate objective of an SDS is to perform system acts that comply with the desired intention of the user. To efficiently support this objective, a semantic frame is used in the SLU module and the DM. However, achieving an optimal design for a semantic frame that fully meets the ultimate objective of the SDS is difficult. When determining a system act, we use various features and consider the semantic frame and the discourse history of the current utterance. Although the clustered DAs, intents and SEs were defined differently from the human-designed ones, the system was able to determine an identical system act for an identical utterance. Therefore, the dialog system did not yield a significant difference in system performance during the evaluation, despite the differences between the clustering results and the human annotation.
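The external clustering measures mentioned above are standard. As a minimal sketch of how such an evaluation can be computed, the V-measure [52], together with its homogeneity and completeness components, is available in scikit-learn; the gold dialog-act labels and induced cluster assignments below are illustrative, not data from the paper.

```python
# Sketch of external cluster evaluation with the V-measure [52].
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical gold dialog-act labels and induced cluster ids for 8 utterances.
gold     = ["request", "request", "inform", "inform",
            "confirm", "confirm", "inform", "request"]
clusters = [0, 0, 1, 1, 1, 2, 1, 0]

# V-measure is the harmonic mean of homogeneity and completeness.
h, c, v = homogeneity_completeness_v_measure(gold, clusters)
print(f"homogeneity={h:.2f}, completeness={c:.2f}, V-measure={v:.2f}")
```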
2) Joint Model: Our SLU framework uses a cascade approach, in which dialog acts, slot entities and intents are clustered sequentially. During the clustering process, each component is clustered using a different model. Therefore, our method is constrained to consider only the one-way dependency specified in advance. To overcome this constraint, a joint approach can be useful for unified learning. Jeong et al. [27] used a joint model in an SLU framework with a supervised approach. Many studies of unsupervised grammar induction [55] also consider unified learning. Building on these methods, we will apply a joint approach in which dialog acts, slot entities and intents are clustered using one model.
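A runnable schematic of the cascade, under stated assumptions: KMeans over TF-IDF features stands in for the non-parametric Bayesian models actually used, the utterances are hypothetical, and slot entities are clustered at the utterance level rather than over entity mentions, purely to keep the sketch self-contained. What it illustrates is only the structure discussed above: each level is clustered by its own model, and later stages may condition on earlier outputs but not vice versa (the one-way dependency).

```python
# Schematic of the cascade approach: three separate clustering models
# applied in sequence, with one-way dependencies between stages.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = [
    "record the nine o'clock news", "what is on channel seven tonight",
    "cancel that recording", "show me tomorrow's dramas",
]
X = TfidfVectorizer().fit_transform(utterances).toarray()

# Stage 1: cluster dialog acts from the utterance features alone.
da = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Stage 2: slot-entity clustering conditions on the dialog-act assignment
# (appended as an extra feature), but not vice versa.
se = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.hstack([X, da.reshape(-1, 1)]))

# Stage 3: intent clustering conditions on both earlier outputs.
intents = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.hstack([X, da.reshape(-1, 1), se.reshape(-1, 1)]))

print(da, se, intents)
```

A joint model would instead infer all three assignments with a single model, allowing dependencies in both directions among dialog acts, slot entities and intents.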
VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented an unsupervised SLU framework that uses a non-parametric Bayesian approach. Our unsupervised SLU framework was used to perform dialog act, intent and slot entity clustering for dialog corpora; DAs, intents and SEs are the components of semantic frames. The primary advantage of our method is the considerable reduction of the human effort required for the design and labeling steps, yielded by the application of a non-parametric Bayesian approach. We also applied our unsupervised SLU framework to a dialog system. In our experiments, we not only conducted a clustering evaluation but also performed a dialog system evaluation using four dialog corpora. Moreover, we evaluated the domain spotter for a multi-domain SDS. The experimental results indicate that our proposed method is applicable and useful for the rapid development of multi-domain SDSs.

Several aspects of our approach require further research. We focused on a fully unsupervised method; however, the clustering quality can be improved by applying partial supervision and domain knowledge. We plan to develop a method that combines our unsupervised model with heuristic rules. We also plan to perform a real-user evaluation with many more participants and to verify our framework on a more complicated dialog system. In addition, we intend to implement a dialog system development toolkit that provides an unsupervised method for faster development and more efficient management of dialog systems.
REFERENCES

[1] D. Bohus and A. I. Rudnicky, "The RavenClaw dialog management framework: Architecture and systems," Comput. Speech Lang., vol. 23, no. 3, pp. 332–361, 2009.
[2] J. D. Williams and S. Young, "Partially observable Markov decision processes for spoken dialog systems," Comput. Speech Lang., vol. 21, no. 2, pp. 393–422, 2007.
[3] L. F. Hurtado, D. Griol, E. Sanchis, and E. Segarra, "A stochastic approach to dialog management," in Proc. IEEE Workshop Autom. Speech Recogn. Understand., 2005, pp. 226–231.
[4] C. Lee, S. Jung, K. Kim, D. Lee, and G. G. Lee, "Recent approaches to dialog management for spoken dialog systems," J. Comput. Sci. Eng., vol. 4, no. 1, pp. 1–22, 2010.
[5] S. Jung, C. Lee, and G. G. Lee, "Using utterance and semantic level confidence for interactive spoken dialog clarification," J. Comput. Sci. Eng., vol. 2, no. 1, pp. 1–25, 2008.
[6] V. Zue, S. Seneff, J. R. Glass, J. Polifroni, C. Pao, T. J. Hazen, and L. Hetherington, "JUPITER: A telephone-based conversational interface for weather information," IEEE Trans. Speech Audio Process., vol. 8, no. 1, pp. 85–96, Jan. 2000.
[7] M. Walker, J. Aberdeen, J. Boland, E. Bratt, J. Garofolo, L. Hirschman, A. Le, S. Lee, S. Narayanan, and K. Papineni, "DARPA communicator dialog travel planning systems: The June 2000 data collection," in Proc. Eur. Conf. Speech Commun. Technol., 2001, pp. 1371–1374.
[8] O. Lemon, K. Georgila, J. Henderson, and M. Stuttle, "An ISU dialogue system exhibiting reinforcement learning of dialogue policies: Generic slot-filling in the TALK in-car system," in Proc. Eur. Chap. Assoc. Comput. Linguist., 2006, pp. 119–122.
[9] W. Minker, U. Haiber, P. Heisterkamp, and S. Scheible, "The SENECA spoken language dialogue system," Speech Commun., vol. 43, no. 1, pp. 89–102, 2004.
[10] F. Weng, S. Varges, B. Raghunathan, F. Ratiu, H. Pon-Barry, B. Lathrop, Q. Zhang, H. Bratt, T. Scheideck, and K. Xu, "CHAT: A conversational helper for automotive tasks," in Proc. Int. Conf. Spoken Lang. Process., 2006, pp. 1061–1064.
[11] J. Allen, D. Byron, M. Dzikovska, G. Ferguson, L. Galescu, and A. Stent, "An architecture for a generic dialogue shell," Natural Lang. Eng., vol. 6, no. 3–4, pp. 213–228, 2000.
[12] K. Komatani, N. Kanda, M. Nakano, K. Nakadai, H. Tsujino, T. Ogata, and H. G. Okuno, "Multi-domain spoken dialogue system with extensibility and robustness against speech recognition errors," in Proc. 7th SIGDIAL Workshop Discourse Dialogue, 2009, pp. 9–17.
[13] O. Lemon, A. Gruenstein, A. Battle, and S. Peters, "Multi-tasking and collaborative activities in dialogue systems," in Proc. 3rd SIGDIAL Workshop Discourse Dialogue, 2002, pp. 113–124.
[14] C. Lee, S. Jung, S. Kim, and G. G. Lee, "Example-based dialog modeling for practical multi-domain dialog system," Speech Commun., vol. 51, no. 5, pp. 466–484, 2009.
[15] B. Lin, H. Wang, and L. Lee, "A distributed architecture for cooperative spoken dialogue agents with coherent dialogue state and history," in Proc. IEEE Workshop Autom. Speech Recogn. Understand., 1999, p. 4.
[16] I. O'Neill, P. Hanna, X. Liu, D. Greer, and M. McTear, "Implementing advanced spoken dialogue management in Java," Sci. Comput. Program., vol. 54, no. 1, pp. 99–124, 2005.
[17] J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran, "Gemini: A natural language system for spoken-language understanding," in Proc. Assoc. Comput. Linguist., 1993, pp. 54–61.
[18] S. Seneff, "TINA: A natural language system for spoken language applications," Comput. Linguist., vol. 18, no. 1, pp. 61–86, 1992.
[19] Y. Y. Wang, "A robust parser for spoken language understanding," in Proc. Eurospeech, 1999, pp. 2055–2058.
[20] N. Gupta, G. Tur, D. Hakkani-Tür, S. Bangalore, G. Riccardi, and M. Gilbert, "The AT&T spoken language understanding system," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 213–222, 2006.
[21] Y. He and S. Young, "Semantic processing using the hidden vector state model," Comput. Speech Lang., vol. 19, no. 1, pp. 85–106, 2005.
[22] W. Minker, S. Bennacef, and J. L. Gauvain, "A stochastic case frame approach for natural language understanding," in Proc. Int. Conf. Spoken Lang. Process., 1996, pp. 1013–1016.
[23] M. Rochery, R. Schapire, M. Rahim, N. Gupta, G. Riccardi, S. Bangalore, H. Alshawi, and S. Douglas, "Combining prior knowledge and boosting for call classification in spoken language dialogue," in Proc. Int. Conf. Acoust., Speech, Signal Process., 2002, pp. I-29–I-32.
[24] Y. Y. Wang, A. Acero, M. Mahajan, and J. Lee, "Combining statistical and knowledge-based spoken language understanding in conditional models," in Proc. Int. Conf. Comput. Linguist. / Annu. Meeting Assoc. Comput. Linguist., 2006, pp. 882–889.
[25] C. Wutiwiwatchai and S. Furui, "Combination of finite state automata and neural network for spoken language understanding," in Proc. Eurospeech, 2003, pp. 2761–2764.
[26] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Comput. Linguist., vol. 26, no. 3, pp. 339–373, 2000.
[27] M. Jeong and G. G. Lee, "Triangular-chain conditional random fields," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 7, pp. 1287–1302, Sep. 2008.
[28] G. Tur, D. Hakkani-Tür, and R. E. Schapire, "Combining active and semi-supervised learning for spoken language understanding," Speech Commun., vol. 45, no. 2, pp. 171–186, 2005.
[29] W. L. Wu, R. Z. Lu, J. Y. Duan, H. Liu, F. Gao, and Y. Q. Chen, "Spoken language understanding using weakly supervised learning," Comput. Speech Lang., vol. 24, no. 2, pp. 358–382, 2010.
[30] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[31] S. Joty, G. Carenini, and C. Y. Lin, "Unsupervised modeling of dialog acts in asynchronous conversations," in Proc. Int. Joint Conf. Artif. Intell., 2011, pp. 1807–1813.
[32] A. Ritter, C. Cherry, and B. Dolan, "Unsupervised modeling of Twitter conversations," in Proc. North Amer. Chap. Assoc. Comput. Linguist.: Human Lang. Technol., 2010, pp. 172–180.
[33] J. Guo, G. Xu, X. Cheng, and H. Li, "Named entity recognition in query," in Proc. ACM SIGIR, 2009, pp. 267–274.
[34] D. Newman, C. Chemudugunta, and P. Smyth, "Statistical entity-topic models," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2006, pp. 680–686.
[35] A. Celikyilmaz and D. Hakkani-Tür, "A joint model for discovery of aspects in utterances," in Proc. Assoc. Comput. Linguist., 2012, pp. 330–338.
[36] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "An HDP-HMM for systems with state persistence," in Proc. Int. Conf. Mach. Learn., 2008, pp. 312–319.
[37] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes," J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1566–1581, 2006.
[38] N. Crook, R. Granell, and S. Pulman, "Unsupervised classification of dialogue acts using a Dirichlet process mixture model," in Proc. Assoc. Comput. Linguist., 2009, pp. 341–348.
[39] R. Higashinaka, N. Kawamae, K. Sadamitsu, Y. Minami, T. Meguro, K. Dohsaka, and H. Inagaki, "Unsupervised clustering of utterances using non-parametric Bayesian methods," in Proc. Interspeech, 2011, pp. 2081–2084.
[40] R. Barzilay and L. Lee, "Catching the drift: Probabilistic content models, with applications to generation and summarization," in Proc. North Amer. Chap. Assoc. Comput. Linguist.: Human Lang. Technol., 2004, pp. 113–120.
[41] C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B. S. Lee, "TwiNER: Named entity recognition in targeted Twitter stream," in Proc. ACM SIGIR, 2012, pp. 721–730.
[42] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-6, no. 6, pp. 721–741, Nov. 1984.
[43] H. Murao, N. Kawaguchi, S. Matsubara, Y. Yamaguchi, and Y. Inagaki, "Example-based spoken dialogue system using WOZ system log," in Proc. 4th Annu. Meeting Spec. Interest Group Discourse Dial., 2003, pp. 140–148.
[44] M. Nagao, "A framework of a mechanical translation between Japanese and English by analogy principle," in Proc. Int. NATO Symp. Artif. Human Intell., 1984, pp. 173–180.
[45] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. Int. Conf. Mach. Learn., 2001, pp. 282–289.
[46] A. Ratnaparkhi, "Maximum entropy models for natural language ambiguity resolution," Ph.D. dissertation, Univ. of Pennsylvania, Philadelphia, PA, USA, 1998.
[47] G. Salton and C. S. Yang, "On the specification of term values in automatic indexing," J. Document., vol. 49, no. 4, pp. 351–372, 1973.
[48] C. Lee, S. Jung, K. Kim, and G. G. Lee, "Hybrid approach to robust dialog management using agenda and dialog examples," Comput. Speech Lang., vol. 24, no. 4, pp. 609–631, 2010.
[49] C. Lee, S. Jung, K. Kim, and G. G. Lee, "Automatic agenda graph construction from human-human dialogs using clustering method," in Proc. North Amer. Chap. Assoc. Comput. Linguist.: Human Lang. Technol., 2009, pp. 89–92.
[50] V. Rieser and O. Lemon, "Learning effective multimodal dialogue strategies from Wizard-of-Oz data: Bootstrapping and evaluation," in Proc. Assoc. Comput. Linguist., 2008, pp. 638–646.
[51] N. X. Vinh, J. Epps, and J. Bailey, "Information theoretic measures for clusterings comparison: Is a correction for chance necessary?," in Proc. Int. Conf. Mach. Learn., 2009, pp. 1073–1080.
[52] A. Rosenberg and J. Hirschberg, "V-measure: A conditional entropy-based external cluster evaluation measure," in Proc. Empir. Meth. Natural Lang. Process. Comput. Nat. Lang. Learn., 2007, pp. 410–420.
[53] E. Levin, R. Pieraccini, and W. Eckert, "A stochastic model of human-machine interaction for learning dialog strategies," IEEE Trans. Speech Audio Process., vol. 8, no. 1, pp. 11–23, Jan. 2000.
[54] S. Jung, C. Lee, K. Kim, M. Jeong, and G. G. Lee, "Data-driven user simulation for automated evaluation of spoken dialog systems," Comput. Speech Lang., vol. 23, no. 4, pp. 479–509, 2009.
[55] E. Ponvert, J. Baldridge, and K. Erk, "Simple unsupervised grammar induction from raw text with cascaded finite state models," in Proc. 49th Annu. Meeting Assoc. Comput. Linguist.: Human Lang. Technol., 2011, pp. 1077–1086.

Donghyeon Lee is a Ph.D. student at the Department of Computer Science and Engineering at POSTECH, Pohang, South Korea. He received his B.S. degree at the Department of Computer Science and Engineering at Sung Kyun Kwan University, Suwon, South Korea. His research interests include unsupervised spoken language understanding and spoken dialog systems.

Minwoo Jeong is an applied scientist at Microsoft Corporation. He received his B.S. degree at the Department of Computer Engineering from Chonbuk National University, Jeonju, South Korea. He received his M.S./Ph.D. degrees at the Department of Computer Science and Engineering from POSTECH, Pohang, South Korea. His research interests include spoken language understanding and spoken dialog systems.

Kyungduk Kim is a Ph.D. student at the Department of Computer Science and Engineering at POSTECH, Pohang, South Korea. He received his B.S./M.S. degrees at the Department of Computer Science and Engineering from POSTECH. His research interests include multi-modal dialog systems and robust dialog management.

Seonghan Ryu is a Ph.D. student at the Department of Computer Science and Engineering at POSTECH, Pohang, South Korea. He received his B.S. degree at the Department of Computer Science and Engineering at Dongguk University, Seoul, South Korea. His research interests include multi-domain spoken dialog systems, robust dialog management, and exploiting web resources.

Gary Geunbae Lee received his B.S. and M.S. degrees in Computer Engineering from Seoul National University in 1984 and 1986, respectively. He received the Ph.D. degree in Computer Science from UCLA in 1991 and was a research scientist at UCLA in 1991. He has been a professor in the CSE department at POSTECH, Korea, since 1991. He is the director of the Intelligent Software Laboratory, which focuses on human language technology research including natural language processing, speech recognition/synthesis, and speech translation. He has authored more than 100 papers in international journals and conferences, and has served as a technical committee member and reviewer for several international conferences such as ACL, COLING, IJCAI, ACM SIGIR, AIRS, ACM IUI, Interspeech-ICSLP/EUROSPEECH, EMNLP and IJCNLP. He is currently leading several national and industry projects on robust spoken dialog systems, computer-assisted language learning, and expressive TTS.