Audio and Speech Processing for Data Mining

from an utterance. It is achieved through the following Bayesian decision rule:

Ŵ = arg max_W P(W | Y) = arg max_W P(W) P(Y | W)

where P(W) is the a priori probability of observing some specified word sequence W and is given by a language model, for example tri-grams, and P(Y|W) is the probability of observing speech data Y given word sequence W and is determined by an acoustic model, often HMMs.

HMM models are trained on a collection of acoustic data to characterize the distributions of selected speech units. The distributions estimated on training data, however, may not represent those in test data. Variations such as background noise introduce mismatches between training and test conditions, leading to severe performance degradation (Gong, 1995). Robustness strategies are therefore demanded to reduce the mismatches. This is a significant challenge posed by varying recording conditions, speaker variations and dialect divergences. The challenge is even greater in the context of speech data mining, where speech is often recorded under less control and has more unpredictable variations. Here we put an emphasis on robustness against noise.

Noise robustness can be improved through feature-based or model-based compensation, or a combination of the two. Feature compensation is achieved through three means: feature enhancement, distribution normalization and noise-robust feature extraction. Feature enhancement attempts to clean noise-corrupted features, as in spectral subtraction. Distribution normalization reduces the distribution mismatches between training and test speech; cepstral mean subtraction and variance normalization are good examples. Noise-robust feature extraction includes improved mel-frequency cepstral coefficients and completely new features. Two classes of model-domain methods are model adaptation and multi-condition training (Xu, Tan, Dalsgaard, & Lindberg, 2006).

Speech enhancement unavoidably brings in uncertainties, and these uncertainties can be exploited in the HMM decoding process to improve its performance. Uncertainty decoding is such an approach, in which the uncertainty of features introduced by the background noise is incorporated in the decoding process by using a modified Bayesian decision rule (Liao & Gales, 2005). This is an elegant compromise between feature-based and model-based compensation and is considered an interesting addition to the category of joint feature- and model-domain compensation, which contains well-known techniques such as missing data and weighted Viterbi decoding.

Another recent research focus is robustness against transmission errors and packet losses for speech recognition over communication networks (Tan, Dalsgaard, & Lindberg, 2007). This becomes increasingly important as more and more speech traffic passes through networks.

Speech Data Mining

Speech data mining relies on audio diarization, speech recognition and event detection to generate data descriptions, and then applies machine learning techniques to find patterns, trends, and associations.

The simplest way is to use text mining tools on speech transcriptions. Different from written text, however, textual transcription of speech is inevitably erroneous and lacks formatting such as punctuation marks. Speech, in particular spontaneous speech, furthermore contains hesitations, repairs, repetitions, and partial words. On the other hand, speech is an information-rich medium, carrying information such as language, text, meaning, speaker identity and emotion. This characteristic gives speech a high potential for data mining, and techniques for extracting the various types of information embedded in speech have undergone substantial development in recent years.

Data mining can be applied to various aspects of speech. As an example, large-scale spoken dialog systems receive millions of calls every year and generate terabytes of log and audio data. Dialog mining has been successfully applied to these data to generate alerts (Douglas, Agarwal, Alonso, Bell, Gilbert, Swayne, & Volinsky, 2005). This is done by labeling calls based on subsequent outcomes, extracting features from dialog and speech, and then finding patterns. Other interesting work includes semantic data mining of speech utterances and data mining for recognition error detection.

Whether speech summarization is also considered under this umbrella is a matter of debate, but it is nevertheless worthwhile to refer to it here. Speech summarization is the generation of short text summaries of speech (Koumpis & Renals, 2005). An intuitive approach is to apply text-based methods to speech
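The Bayesian decision rule given at the start of this section can be sketched in a few lines of Python. The candidate word sequences and the probabilities assigned to them below are invented toy values, not the output of any real language or acoustic model:

```python
import math

# Toy language-model priors P(W) for two hypothetical candidate sequences.
lm_prior = {
    "recognize speech": 0.7,
    "wreck a nice beach": 0.3,
}

# Toy acoustic likelihoods P(Y|W) for the observed speech data Y.
acoustic_likelihood = {
    "recognize speech": 0.02,
    "wreck a nice beach": 0.03,
}

def decode(lm_prior, acoustic_likelihood):
    """Return the sequence maximizing log P(W) + log P(Y|W),
    i.e. the Bayesian decision rule in the log domain."""
    return max(
        lm_prior,
        key=lambda w: math.log(lm_prior[w]) + math.log(acoustic_likelihood[w]),
    )

best = decode(lm_prior, acoustic_likelihood)
print(best)  # 0.7 * 0.02 = 0.014 beats 0.3 * 0.03 = 0.009
```

Working in the log domain, as real decoders do, avoids numerical underflow when the per-frame acoustic likelihoods of long utterances are multiplied together.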

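Feature enhancement by spectral subtraction, mentioned above, can also be illustrated with a minimal sketch. The magnitude spectra below are made-up toy values; a real system would estimate the noise spectrum from speech-free frames and operate on short-time FFT frames:

```python
# Minimal sketch of spectral subtraction for feature enhancement.
# Both spectra are toy magnitude values, not real FFT output.

def spectral_subtract(noisy_mag, noise_mag, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from the noisy
    speech spectrum, flooring each bin at a small fraction of the noisy
    magnitude to avoid negative values (a common "musical noise" guard)."""
    return [max(y - n, floor * y) for y, n in zip(noisy_mag, noise_mag)]

noisy = [0.9, 0.5, 0.3, 0.8]   # toy noisy-speech magnitude spectrum
noise = [0.2, 0.2, 0.4, 0.1]   # toy noise estimate
print(spectral_subtract(noisy, noise))
```

Note the third bin, where the noise estimate exceeds the noisy magnitude: without the floor, subtraction would produce a negative magnitude.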
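Distribution normalization by cepstral mean subtraction and variance normalization (CMVN), also mentioned above, admits a similarly compact sketch. The feature vectors below are toy numbers standing in for the cepstral features of one utterance:

```python
import statistics

# Sketch of cepstral mean and variance normalization (CMVN).
# "frames" is a list of cepstral feature vectors (e.g. MFCCs) for one
# utterance; the values are toy numbers, not real cepstral features.

def cmvn(frames):
    """Normalize each cepstral dimension to zero mean and unit variance."""
    dims = list(zip(*frames))  # transpose: one tuple of values per dimension
    normalized_dims = []
    for values in dims:
        mean = statistics.fmean(values)
        stdev = statistics.pstdev(values) or 1.0  # guard against zero variance
        normalized_dims.append([(v - mean) / stdev for v in values])
    return [list(frame) for frame in zip(*normalized_dims)]  # back to frames

frames = [[12.0, -3.1], [11.0, -2.9], [13.0, -3.0]]
for frame in cmvn(frames):
    print(frame)
```

Because the statistics are computed per utterance, a constant channel offset (such as a fixed microphone coloration) is removed along with the mean, which is what makes CMVN effective against training/test mismatch.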