
RECOGNITION OF REVERBERANT SPEECH BY MISSING DATA IMPUTATION AND NMF

FEATURE ENHANCEMENT
Heikki Kallasjoki^1, Jort F. Gemmeke^2, Kalle J. Palomäki^1, Amy V. Beeston^3, Guy J. Brown^3

^1 Department of Acoustics and Signal Processing, Aalto University, Finland
^2 Department ESAT-PSI, KU Leuven, Belgium
^3 Department of Computer Science, University of Sheffield, UK
ABSTRACT
The problem of reverberation in speech recognition is addressed in this study by extending a noise-robust feature enhancement method based on non-negative matrix factorization. The signal model of the observation as a linear combination of sample spectrograms is augmented by a mel-spectral feature domain convolution to account for the effects of room reverberation. The proposed method is contrasted with missing data techniques for reverberant speech, and evaluated for speech recognition performance using the REVERB challenge corpus. Our results indicate consistent gains in recognition performance compared to the baseline system, with a relative improvement in word error rate of 42.6% in the best case.
Index Terms: Speech dereverberation, non-negative matrix factorization, missing data
1. INTRODUCTION
Practical automatic speech recognition (ASR) applications of-
ten require methods for coping with highly variable environ-
mental distortions. In particular, when restricted to signals
recorded with a single distant microphone, room reverbera-
tion typically causes severe degradation in the performance of
conventional ASR systems. While the topic of speech recog-
nition in noisy environments has been widely studied, many
proposed systems are limited by an underlying assumption
that the observed signal is an additive mixture of speech and
noise, often with the latter having spectral characteristics un-
like those of speech. The distortions introduced by the multiple reflected signals inherent in reverberation do not fit this model well.
In the missing data model [1], individual spectro-temporal components of the observed signal are classified as either reliable or unreliable. For the purposes of recognition, the regions considered unreliable can be entirely ignored, or used to provide an upper bound for the clean speech energy under the assumption of additive noise. In approaches based on data imputation, this model is realized by replacing the components that are considered unreliable with estimates derived from their surrounding context.

Footnote: This research was funded by the Hecse graduate school (H. Kallasjoki), Academy of Finland grants 251170 (H. Kallasjoki, K. J. Palomäki) and 136209 (K. J. Palomäki), the TEKES FuNeSoMo project (K. J. Palomäki), IWT-SBO project ALADIN, grant 100049 (J. F. Gemmeke), UK EPSRC, grant EP/G009805/1 (A. V. Beeston) and FP7-ICT project TWO!EARS, grant 618075 (G. J. Brown).
Modeling speech in terms of individual training samples has recently gained prominence in the field of speech recognition [2]. Non-negative matrix factorization (NMF) algorithms [3] are commonly utilized in these exemplar-based methods. A source separation feature enhancement method for noise-robust automatic speech recognition, based on representing observations as sparse linear combinations of exemplars obtained with NMF, is presented in [4].
The NMF framework can also be extended to non-
negative matrix factor deconvolution (NMFD) [5]. In pre-
vious research, the NMFD model has been applied to speech
dereverberation in a spectral feature domain, by considering
reverberated speech as a convolution of clean speech and
room response features under different constraints [6, 7].
In the present study, we evaluate previous work on producing missing data masks specifically for reverberant speech, based on modulation filtering of spectral features [8, 9], on the REVERB challenge corpus. Additionally, we adapt a method for room reverberation estimation, inspired by human auditory models [11], for use in mask estimation. A recently proposed bounded conditional mean imputation [12] algorithm is used for missing feature imputation.
We further propose an extension of the NMF-based noise-
robust feature enhancement algorithm of [4] to the context
of reverberant speech, and contrast its performance with the
missing data methods. Compared to speech dereverberation
based on the NMFD framework, our work is distinguished by
retaining the model of speech as a linear combination of ex-
emplars, but extending it to allow the inclusion and optimiza-
tion of an arbitrary convolutional filter in the mel-spectrogram
domain. The proposed NMF reverberant speech enhancement
method also includes a missing data imputation step to pro-
duce an initial estimate of the sparse representation of the
clean speech signal. Filtering in the domain of the sparse
representation is used to bias the reconstruction, leading to
stronger attenuation of reverberation.
The rest of this paper is organized as follows. In Sec-
tion 2, we describe the evaluated missing data methods. The
proposed NMF feature enhancement scheme is introduced in
Section 3. The experimental setup is detailed in Section 4,
and the obtained results presented in Section 5. The results
are discussed in Section 6, and concluding remarks presented
in Section 7.
2. MISSING DATA TECHNIQUES
2.1. Missing Data Imputation
Several methods that model the clean speech with Gaussian mixtures have been introduced for the reconstruction task of missing data imputation. The bounded conditional mean imputation variant used in this work for reconstructing the unreliable regions of the observation was recently proposed in [12]. In conventional conditional mean imputation, the distribution of clean speech features p(x) is modeled using a Gaussian mixture model. Denoting by x_r and x_u, respectively, the reliable and unreliable components of a single observation frame or a window of consecutive frames, an estimate x̂_u is produced based on the conditional distribution p(x_u | x_r). Bounded conditional mean imputation (BCMI) extends the model by further assuming that the observed x_u is an upper bound for the signal of interest. In [12], this approach is implemented by deriving an approximate parametric model for the posterior distribution of the bounded features.
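To make the bounded estimate concrete, the following minimal sketch implements the idea for a single full-covariance Gaussian, applying the bound through per-dimension truncated-normal means. All function names are hypothetical; the actual BCMI of [12] operates on a Gaussian mixture and derives an approximate parametric posterior, so this only illustrates the core computation.

```python
import numpy as np
from scipy.stats import norm

def bounded_cond_mean_impute(y, reliable, mu, cov):
    """Impute unreliable components of one observation frame y.

    y        -- observed feature vector (mel-spectral window)
    reliable -- boolean mask, True where y is considered reliable
    mu, cov  -- mean and full covariance of one clean-speech Gaussian
    """
    r, u = reliable, ~reliable
    # Conditional Gaussian p(x_u | x_r): partitioned-Gaussian formulas.
    K = cov[np.ix_(u, r)] @ np.linalg.inv(cov[np.ix_(r, r)])
    mu_c = mu[u] + K @ (y[r] - mu[r])
    cov_c = cov[np.ix_(u, u)] - K @ cov[np.ix_(r, u)]
    sd = np.sqrt(np.diag(cov_c))
    # Bounded estimate: mean of N(mu_c, sd^2) truncated to (-inf, y_u],
    # i.e. the observation acts as an upper bound on the clean speech.
    beta = (y[u] - mu_c) / sd
    x_hat = y.copy()
    x_hat[u] = mu_c - sd * norm.pdf(beta) / np.maximum(norm.cdf(beta), 1e-12)
    return x_hat
```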
2.2. Mask Estimation
A missing data mask that labels the reliable and unreliable regions of the observation is required in order to apply missing data methods. If the corresponding clean speech is known a priori, an oracle mask can be constructed by comparison with the observation. In general, however, this information is not available and the reliable regions must be identified based on the distorted observation alone. The accuracy of this mask estimation process has a major influence on the speech recognition performance, as shown by comparisons between the recognition performance obtained with estimated and oracle masks [1]. Three mask estimation schemes are evaluated for the missing data experiments presented in this work.

A mask estimation method designed for reverberant speech in particular, based on modulation filtering, has been presented in [8] and extended in [9]. In the original formulation, it is based on spectral features derived from a gammatone filterbank. For this work, the gammatone filterbank is replaced by the mel-spectral filterbank used in the feature extraction of the REVERB challenge baseline recognizer.
We denote by y(t, b) the b-th mel channel component of frame t of the reverberant observation, compressed by raising it to the power 0.3. These features are further processed with a band-pass modulation filter with 3 dB cutoff frequencies of 1.5 Hz and 8.2 Hz and an automatic gain control step, and normalized by subtracting a channel-specific constant selected so that the minimum value for each channel over a single utterance is zero. The resulting signal is denoted by y_bp^agc(t, b).

The missing data mask m_R(t, b) is derived by thresholding the AGC features,

    m_R(t, b) = \begin{cases} 1 & \text{if } y_{bp}^{agc}(t, b) > \theta(b), \\ 0 & \text{otherwise.} \end{cases}    (1)

The threshold θ(b) for mel channel b is selected for each utterance based on the blurredness metric B of [8], as

    \theta(b) = \frac{\gamma}{N} \sum_{t=1}^{N} \frac{y_{bp}^{agc}(t, b)}{1 + \exp(\alpha(B - \beta))},    (2)

where α, β and γ are set based on small-scale experiments.
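A minimal sketch of this mask estimator is given below, assuming a (frames × channels) NumPy mel spectrogram at a 100 Hz frame rate. The function name, the second-order Butterworth design of the modulation filter, and the omission of the AGC stage of [8] are all simplifying assumptions; the blurredness metric B is taken as a precomputed input.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_mask_R(y, B, fps=100.0, alpha=19.0, beta=0.43, gamma=1.4):
    """Modulation-filtering mask m_R of Eqs. (1)-(2).

    y -- (T, C) mel spectrogram, already compressed by raising to power 0.3
    B -- scalar blurredness metric of [8] (its computation is not shown)
    """
    # Band-pass modulation filter, 3 dB cutoffs at 1.5 Hz and 8.2 Hz.
    b, a = butter(2, [1.5 / (fps / 2), 8.2 / (fps / 2)], btype="band")
    y_bp = filtfilt(b, a, y, axis=0)
    # Normalize so that the per-channel minimum over the utterance is zero.
    y_bp -= y_bp.min(axis=0, keepdims=True)
    # Eq. (2): per-channel threshold, scaled by a sigmoid of blurredness B.
    theta = (gamma / len(y_bp)) * y_bp.sum(axis=0) \
            / (1.0 + np.exp(alpha * (B - beta)))
    # Eq. (1): a component is reliable where the feature exceeds the threshold.
    return (y_bp > theta).astype(int)
```

The parameter defaults correspond to the values α = 19, β = 0.43 and γ = 1.4 given in Section 4.4.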
An alternative mask estimation approach, denoted m_LP, is based on a room-reverberation estimation method for a computational auditory modeling task [11]. Reverberation tails are located in the signal y(t, b) by first estimating the smoothed temporal envelope in each channel, y_lp(t − τ_d, b), using a second-order low-pass Butterworth filter with cutoff frequency at 10 Hz, and identifying regions for which the derivative y′_lp(t − τ_d, b) < 0. The parameter τ_d corrects for the filter delay. The amount of energy in each decaying region of one frequency channel is quantified by

    L(t, b) = \begin{cases} \frac{1}{|n(t, b)|} \sum_{k \in n(t, b)} y(k, b) & \text{if } y'_{lp}(t - \tau_d, b) < 0, \\ 0 & \text{otherwise,} \end{cases}    (3)

where n(t, b) is the set of contiguous time indices around t where the derivative for channel b remains negative. Under the assumption that reverberant signals result in greater L(t, b) values than dry speech, the m_LP mask is defined as:

    m_{LP}(t, b) = \begin{cases} 1 & \text{if } L(t, b) < \theta_{LP}, \\ 0 & \text{otherwise.} \end{cases}    (4)
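The following sketch illustrates Eqs. (3)-(4) on a (frames × channels) mel spectrogram. The integer-frame delay compensation and the function name are assumptions; the envelope smoothing and region bookkeeping follow the description above.

```python
import numpy as np
from scipy.signal import butter, lfilter

def estimate_mask_LP(y, fps=100.0, theta_lp=0.5, delay=2):
    """Reverberation-tail mask m_LP of Eqs. (3)-(4); 'delay' is a crude
    integer-frame stand-in for the filter delay correction tau_d."""
    # Smoothed temporal envelope: 2nd-order low-pass Butterworth at 10 Hz.
    b, a = butter(2, 10.0 / (fps / 2), btype="low")
    y_lp = lfilter(b, a, y, axis=0)
    # Delay-compensated temporal derivative of the envelope.
    d = np.roll(np.diff(y_lp, axis=0, prepend=y_lp[:1]), -delay, axis=0)
    T, C = y.shape
    L = np.zeros((T, C))
    for c in range(C):
        t = 0
        while t < T:
            if d[t, c] < 0:                    # start of a decaying region
                k = t
                while k < T and d[k, c] < 0:   # extend while still decaying
                    k += 1
                L[t:k, c] = y[t:k, c].mean()   # Eq. (3): mean region energy
                t = k
            else:
                t += 1
    return (L < theta_lp).astype(int)          # Eq. (4)
```

The default threshold corresponds to θ_LP = 0.5 from Section 4.4.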
A data-driven approach to the mask estimation problem is to label the unreliable regions with the aid of a binary classifier, trained on oracle masks based on a data set where the underlying clean speech signal, or an approximation of it, is known. In the context of noisy speech, such methods have been successfully applied using different classifiers, including Bayesian classifiers based on Gaussian mixture models (GMM) [13] and support vector machines (SVM) [14]. A large variety of acoustic features have been used for such classifier-based masks [13, 14, 15]. In this work, the data-driven mask estimation approach is evaluated with the following set of six features, denoted F1 through F6, related to the m_R and m_LP mask estimation methods.

The compressed mel-spectral features y(t, b) are used directly as the first feature F1. Features F2 and F3 correspond to the GRAD and MOD features previously used in [15], respectively. GRAD is an estimate of the local slope of y(t, b), while MOD is the bandpass-filtered spectrum y_bp^agc(t, b) used in (1). Feature F4 is set to the ratio of y_bp^agc(t, b) and the threshold θ(b), and F5 to the blurredness metric B. Finally, L(t, b) defined in (3) is used as feature F6.

Individual binary classifiers are trained for each mel channel based on oracle data. Masks generated with both GMM-based and SVM classifiers, denoted by m_GMM and m_SVM, respectively, are evaluated; a sketch of this classifier setup is given below.
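As an illustration, the sketch below trains one classifier per mel channel with scikit-learn. The toolkit, equal class priors for the GMM classifier, the linear SVM variant, and the feature array layout are all our assumptions; the paper does not specify these details.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def train_channel_classifiers(feats, oracle, n_components=32):
    """Train per-channel mask classifiers in the style of m_GMM / m_SVM.

    feats  -- (N, C, 6) array of features F1..F6 for N frames, C channels
    oracle -- (N, C) binary oracle mask labels
    """
    models = []
    for c in range(feats.shape[1]):
        X, lab = feats[:, c, :], oracle[:, c]
        # Bayesian GMM classifier: one GMM per class, as in [13].
        g1 = GaussianMixture(n_components).fit(X[lab == 1])
        g0 = GaussianMixture(n_components).fit(X[lab == 0])
        svm = LinearSVC().fit(X, lab)          # SVM alternative, as in [14]
        models.append((g0, g1, svm))
    return models

def apply_gmm_mask(models, feats):
    """m_GMM-style mask: reliable where the 'reliable' GMM scores higher
    (equal class priors assumed)."""
    return np.stack(
        [(g1.score_samples(feats[:, c, :]) > g0.score_samples(feats[:, c, :]))
         .astype(int) for c, (g0, g1, _) in enumerate(models)], axis=1)
```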
3. NMF FEATURE ENHANCEMENT
3.1. Speech model
Recently, approaches based on non-negative matrix factorization (NMF) have been shown to be an effective way to model additive noise. In NMF, observed spectrograms are modeled as a sparse, non-negative linear combination of spectrographic basis atoms, collected in a dictionary. With speech and noise modeled as linear combinations of speech and noise atoms, respectively, noisy speech is modeled as the linear combination of the two [4].

In this work, we do not rely on a noise model, but do exploit the same compositional speech model. In short, we write:

    Y ≈ SA,    (5)

with Y the observed speech, S a dictionary and A activation weights. Here, Y is a TC × N matrix composed of a collection of C-dimensional mel-scale magnitude spectrograms, reshaped into column vectors by stacking T consecutive frames, with N the number of windows extracted from the observed utterance. S is a TC × K dictionary matrix with K atoms, similarly organized in windowed column vectors, and A a non-negative K × N activation matrix.

We obtain the dictionary S in advance by randomly extracting spectrograms from clean speech training data. This exemplar-based approach has been shown to be an effective method to obtain large dictionaries that can accurately model long temporal contexts [4].
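For concreteness, here is a minimal sketch of this windowed representation and of the random exemplar extraction described above; the function names and the frame-major stacking order within each window are our assumptions.

```python
import numpy as np

def stack_windows(mel, T=10):
    """Reshape a (frames, C) mel spectrogram into the TC x N matrix Y of
    Eq. (5): overlapping T-frame windows stacked as column vectors."""
    n = mel.shape[0] - T + 1
    return np.stack([mel[i:i + T].reshape(-1) for i in range(n)], axis=1)

def random_dictionary(utterances, T=10, seed=0):
    """Exemplar dictionary S: one random T-frame segment per clean training
    utterance, mirroring the K = 7681 exemplar setup of Section 4.5."""
    rng = np.random.default_rng(seed)
    atoms = []
    for mel in utterances:                     # each of shape (frames, C)
        i = rng.integers(0, mel.shape[0] - T + 1)
        atoms.append(mel[i:i + T].reshape(-1))
    return np.stack(atoms, axis=1)             # shape (TC, K)
```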
3.2. Reverberant speech
To account for the effects of reverberation, we modify the model (5) to have the form

    Y ≈ RSA,    (6)

where R is a T_r C × TC matrix of the form

    R = \begin{bmatrix}
        D_1 & & \\
        D_2 & D_1 & \\
        \vdots & D_2 & \ddots \\
        D_{T_f} & \vdots & \ddots \\
        & D_{T_f} & \\
        & & \ddots
    \end{bmatrix},
    \qquad
    D_t = \operatorname{diag}(r_{t,1}, r_{t,2}, \ldots, r_{t,C}),    (7)

i.e., a block lower-triangular Toeplitz matrix whose C × C blocks D_t are diagonal. Left multiplication of a T-frame spectrogram in the stacked column vector form by R is equivalent to the discrete convolution of the contents of each mel band b with the sequence [r_{1,b}, r_{2,b}, ..., r_{T_f,b}] of length T_f, resulting in a stacked vector of T_r = T + T_f − 1 frames.

In this work, the dictionary S contains only non-reverberated speech exemplars, while the filter matrix R encodes the effect of the room reverberation on the signal. The approximation RSA can equivalently be interpreted as either the clean speech reconstruction distorted by reverberation, R(SA), or as the sparse representation of the observation using an artificially reverberated dictionary, (RS)A.
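The structure of R can be made concrete in code. The sketch below (hypothetical helper, frame-major stacking as assumed above) builds R from per-band filter coefficients; left-multiplying a stacked window by it reproduces NumPy's full convolution in each mel band.

```python
import numpy as np

def build_R(r, T):
    """Build the filter matrix R of Eq. (7) from coefficients r of shape
    (T_f, C), for stacked windows of T frames."""
    Tf, C = r.shape
    R = np.zeros(((T + Tf - 1) * C, T * C))
    for i in range(T + Tf - 1):                # output frame index
        for j in range(T):                     # input frame index
            if 0 <= i - j < Tf:
                # C x C diagonal block D_{i-j} = diag(r[i-j, 0], ..., r[i-j, C-1])
                R[i * C:(i + 1) * C, j * C:(j + 1) * C] = np.diag(r[i - j])
    return R

# Check against per-band convolution for a random T-frame window x
# (T, T_f as in Section 4.5; C here is an arbitrary test value):
T, Tf, C = 10, 20, 23
rng = np.random.default_rng(0)
x, r = rng.random((T, C)), rng.random((Tf, C))
y = (build_R(r, T) @ x.reshape(-1)).reshape(T + Tf - 1, C)
assert np.allclose(y[:, 0], np.convolve(x[:, 0], r[:, 0]))
```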
3.3. Optimization
Given an observed utterance, S is kept fixed. Estimates of R and A are obtained by minimizing, for each window t, the cost function

    d(Y_t, RSA_t) + λ ||A_t||_1,    (8)

where the d(·, ·) term measures the distance between the observation and the approximation, with Y_t and A_t denoting the corresponding columns of Y and A, respectively. The second term is a sparsity-inducing L_1-norm weighted by a sparsity coefficient λ. In this work we use the generalized Kullback-Leibler divergence for d.

To optimize (8), a simple alternating multiplicative update algorithm can be used [4, 3]. The optimization problem is not convex, however, and preliminary experiments revealed it is difficult to obtain stable estimates for both R and A. Instead, we use the following algorithm to derive the factorization RSA (a sketch of the core update operations follows the list):
1. An initial estimate X̂ of the non-reverberant speech is constructed using the m_R missing data mask estimation and BCMI imputation methods described in Section 2.2 and Section 2.1, respectively.

2. Using the initial estimate, the activation matrix A elements are obtained from the non-negative factorization X̂ ≈ RSA, with I_1 rounds of multiplicative updates and R fixed to the identity matrix.

3. As an ad-hoc method to bias the reverberation filter estimation, the resulting sequences of activation values for each individual exemplar are filtered with H_A(z) = 1 − 0.9z^{−1} − 0.8z^{−2} − 0.7z^{−3} and clamped to be non-negative. This filtering step has the effect of suppressing consecutive activations of a single exemplar, which is typical of reverberant signals.

4. The R matrix is initialized to contain the constant T_f-length filter (1/T_f)·[1, ..., 1] on all mel channels.

5. Keeping A fixed, the filter matrix R is updated for I_2 iterations to approximate the reverberant observation, based on the factorization Y ≈ RSA. While the multiplicative update of matrix R does preserve elements set to zero, it does not necessarily result in a matrix of the form shown in (7), or one corresponding to a physically plausible filter. To prevent this, we extract the filter coefficients r_{t,b} from R after every iteration by averaging across all occurrences of each coefficient. These are restricted to satisfy ∀t: r_{t+1,b} < r_{t,b} by clamping overly large values, and normalized by scaling to Σ_{t,b} r_{t,b} = C. The R matrix is then reinitialized to have the form of (7).

6. Finally, the R matrix is kept fixed, and the A parameters are updated for another I_3 iterations.
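A sketch of the two core operations used above: the standard multiplicative update for the activations under the generalized KL divergence with an L1 penalty (the usual sparse-NMF form of [3, 4], here with the dictionary held fixed), and the activation filtering of step 3. The exact updates for R and other implementation details are not reproduced.

```python
import numpy as np
from scipy.signal import lfilter

def update_A(Y, D, A, lam=1.0, iters=50, eps=1e-12):
    """Multiplicative KL updates for A in Eq. (8), with the dictionary D
    fixed (D = S in step 2 with R = I, or D = R @ S more generally)."""
    ones = np.ones_like(Y)
    for _ in range(iters):
        ratio = Y / np.maximum(D @ A, eps)     # elementwise Y / (D A)
        A = A * (D.T @ ratio) / (D.T @ ones + lam)
    return A

def filter_activations(A):
    """Step 3: filter each exemplar's activation sequence with
    H_A(z) = 1 - 0.9 z^-1 - 0.8 z^-2 - 0.7 z^-3, clamp to non-negative."""
    h = np.array([1.0, -0.9, -0.8, -0.7])
    return np.maximum(lfilter(h, [1.0], A, axis=1), 0.0)
```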
In the simple sliding window model presented here, individual windows are processed independently of each other, and averaging over the overlapping frames is used to form the final clean and reverberant speech reconstructions. In reverberant conditions, if the start of a window coincides with the start of a pause in the speech, the early frames of the window are filled with sound energy originating only from reflections, with no direct speech component. When such a window is represented using a reverberated dictionary, this energy is interpreted as direct sound, and is therefore not properly attenuated in the enhanced features.

To avoid this issue, we replace the use of the reverberant speech reconstruction RSA in the iterative updates of [3], in steps 2, 5 and 6 of the above algorithm, by a version obtained by summing over the overlapping frames in RSA. This makes it possible for a reverberated exemplar activated in an earlier window to explain away the reverberant energy in later windows it overlaps with. The resulting model is similar to that of non-negative matrix factor deconvolution (NMFD) [5].
3.4. Feature enhancement
After obtaining R and A, we make a clean speech reconstruction, X̂ = SA, and a reconstruction of the reverberated observation, Ŷ = RSA. Using the same Wiener filter approach as in [4], we use these reconstructions to construct a time-varying filter as the ratio of X̂ and Ŷ, after overlap-add averaging across the overlapping windows. Feature enhancement is then carried out by filtering the observed reverberant speech with the obtained mel-spectral filter.
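In mel-spectral terms, the enhancement amounts to the following sketch, with the overlap-add averaging assumed already done; clipping the gain at unity is our own safeguard, not something stated in the paper.

```python
import numpy as np

def enhance(Y_obs, X_rec, Y_rec, eps=1e-12):
    """Section 3.4: time-varying mel-domain filter from the reconstructions.

    Y_obs -- (frames, C) observed reverberant mel spectrogram
    X_rec -- (frames, C) overlap-added clean reconstruction, S A
    Y_rec -- (frames, C) overlap-added reverberant reconstruction, R S A
    """
    gain = X_rec / np.maximum(Y_rec, eps)      # ratio of reconstructions
    return Y_obs * np.minimum(gain, 1.0)       # clip at 1: our assumption
```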
4. EXPERIMENTAL SETUP
4.1. Data
The data sets for all experiments described in this work are provided by the REVERB challenge, and described in detail in [10]. All test utterances are taken from the WSJCAM0 British English continuous speech recognition corpus [16]. In order to test speech recognition performance in reverberant environments, both clean speech artificially distorted using measured room impulse responses (SimData) and speech recorded in a reverberant room (RealData) are used.
The SimData data set contains utterances from 6 simulated reverberation conditions, resulting from the combination of three different rooms (small, medium, large) and two speaker-to-microphone distances (near, far). The T_60 reverberation times of the small, medium and large rooms are approximately 0.25 s, 0.5 s and 0.7 s, while the near and far microphone distances are 0.5 m and 2.0 m, respectively. In all six conditions, measured noise signals are also added to the distorted utterances at a fixed signal-to-noise ratio (SNR) of 20 dB.
The RealData data set consists of real recordings of speakers in a reverberant (T_60 of 0.7 s) meeting room, recorded at two microphone distances: approximately 1.0 m and 2.5 m for the near and far conditions, respectively. Prompts from the WSJCAM0 corpus are used for the content of the utterances.
As the room impulse responses of SimData and test utter-
ances of RealData are measured with similar 8-channel cir-
cular microphone arrays, multi-microphone methods can be
used with both data sets. The data sets also share the 5000-
word WSJ vocabulary, so a single set of acoustic and language
models can be used for all utterances.
The test sets are divided into separate development and evaluation subsets. In addition to the test sets, a multi-condition training set of simulated reverberant speech is also provided. T_60 reverberation times of the utterances range approximately from 0.1 s to 0.8 s. The room impulse responses used in the construction of the training set are separate from the three rooms of the SimData data set. The duration of the training data set is approximately 17.5 hours (7861 utterances), while the SimData and RealData evaluation sets have durations of approximately 4.8 hours (2176 utterances) and 0.6 hours (372 utterances), respectively.
Subsets of the multi-condition training set, along with the corresponding clean speech signals, are used in all experiments involving oracle data. This use of aligned clean and reverberant speech is limited to the construction of oracle masks as training targets for missing data masks m_GMM and m_SVM, and the verification (but not selection) of parameters for m_R. No other experiments assume the availability of such data. The oracle mask training subset contains 240 utterances, selected to have an equal amount of samples distorted with each of the room responses used in the multi-condition training set construction.
4.2. Speech Recognition System
The speech recognition performance evaluation is performed
with the REVERB challenge baseline recognizer, built on
the HTK toolkit [17]. The features used for recognition are 13 mel-frequency cepstral coefficients (MFCCs), along with their first and second time derivatives. These features are modeled with tied-state hidden Markov models, with mixtures of 10 Gaussian components as the emission distributions.
Two sets of acoustic models are used for the experiments.
The clean speech acoustic model is trained with the WSJ-
CAM0 corpus training set, while the multi-condition model
is based on retraining the clean model using the REVERB
challenge multi-condition training set. When the presented
feature enhancement methods are used, the enhancement is
also performed on the utterances of the multi-condition train-
ing set.
In addition, constrained maximum likelihood linear re-
gression (CMLLR) adaptation is optionally applied in con-
junction with the multi-condition model. The adaptation is
performed over the whole test set of a single test condition
in an unsupervised manner, by using initial recognition re-
sults as the transcriptions. A regression tree of 256 regression
classes is optimized for each condition.
4.3. Feature Enhancement Processing
The proposed feature enhancement methods operate in the
mel-spectral domain, and are performed during the corre-
sponding stage of the feature extraction processing of the RE-
VERB challenge baseline recognizer system. All the methods
operate independently on each utterance, and therefore fall
in the class of utterance-based batch processing. As a result,
the full recognition system performs either utterance-based
or full batch processing, depending on whether the CMLLR
adaptation is disabled or enabled, respectively.
A robust channel normalization method presented in [8] is
used to counteract simple convolutional distortions between
clean speech models and test data caused by, e.g., differences
in recording equipment. With the exception of the baseline
results, this normalization step is consistently applied to both
training and test data prior to feature enhancement processing.
In order to take advantage of the provided multi-channel microphone array recordings, we incorporate the PHAT-DS beamformer implementation of [18]. One channel is chosen as the reference signal, and the relative time delays of arrival for the direct sound component of the speech in the other recording channels are estimated based on the PHAT-weighted generalized cross-correlation. The signals are then aligned and summed, so that constructive interference amplifies the clean speech signal. Experiments involving the multi-channel recordings are performed by generating corresponding single-channel signals using the PHAT-DS beamformer for all reverberant data sets, including the multi-condition training, development and evaluation data sets. The recognition process is identical to the single-channel experiments in all other respects.
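A minimal sketch of the two stages described here: PHAT-weighted generalized cross-correlation for the delays, then alignment and summing. Integer-sample delays and circular shifts are simplifying assumptions; the implementation of [18] is more elaborate.

```python
import numpy as np

def gcc_phat_delay(ref, sig, max_lag=400):
    """Delay (in samples) of 'sig' relative to 'ref' from the peak of the
    PHAT-weighted generalized cross-correlation."""
    n = len(ref) + len(sig)
    X = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cc = np.fft.irfft(X / np.maximum(np.abs(X), 1e-12), n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(np.abs(cc))) - max_lag

def phat_ds_beamform(channels, ref_idx=0):
    """Delay-and-sum: align every channel to the reference and average,
    so the direct sound components add constructively."""
    ref = channels[ref_idx]
    out = np.zeros(len(ref))
    for x in channels:
        out += np.roll(x, -gcc_phat_delay(ref, x))
    return out / len(channels)
```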
Regarding the computational cost of the presented methods, missing data imputation has relatively modest requirements, with a real-time factor of approximately 0.1 for the feature enhancement processing, as measured on a single thread on a conventional workstation (Intel Xeon E3-1230 V2). By contrast, the proposed NMF feature enhancement is more costly, having a real-time factor of 6.9 on the same system, for the implementation used in this study. However, in related work on NMF feature enhancement for noisy speech, significant performance improvements have been obtained by modifying the implementation to, e.g., take advantage of GPU computation.
4.4. Missing Data Imputation
For the missing data mask m_R, the parameters α = 19, β = 0.43 and γ = 1.4 in (2) are chosen based on previous work [9]. A small-scale grid search of α, β and γ on a subset of 25 utterances of the multi-condition training set is used to confirm the suitability of the selected parameter values, by comparing the resulting missing data masks to the corresponding oracle masks. The m_LP mask threshold of (4) is set to θ_LP = 0.5, based on inspection of generated missing data masks. A grid search with a word error rate criterion is used to select the threshold for generating oracle masks for the training of the m_GMM and m_SVM methods. 32-component GMMs are used by the m_GMM mask classifiers. The imputation algorithm of Section 2.1 is performed using a 5-component GMM trained on a random 1000-utterance subset of the clean speech training set, with a time context of 3 consecutive frames for each window.
4.5. NMF Feature Enhancement
The window and filter length, sparsity coefficient, and iteration count parameters of the NMF feature enhancement method of Section 3 are set to T = 10, T_f = 20, λ = 1, I_1 = 50, I_2 = 50 and I_3 = 100, all based on small-scale experiments on development set data. A clean speech dictionary S of K = 7681 exemplars is constructed by selecting a single random T-frame segment from each of the utterances in the WSJCAM0 clean speech training set.
5. RESULTS
5.1. Missing Data Imputation
Word error rates for the evaluation of the competing mask es-
timation methods on the SimData and RealData development
data sets are presented in Table 1. The recognition was per-
formed using clean speech acoustic models and without CM-
LLR adaptation. The best performing mask is highlighted in
bold, and comparable results for the NMF feature enhance-
ment method of Section 3, denoted as NMF, are provided
for reference.
As the m_R mask estimation method is the best overall method, producing competitive results for the SimData data set while slightly outperforming the other methods for RealData, it is used as the method of choice for the missing data initialization step in the NMF feature enhancement algorithm.
The m_LP masks are relatively unsuitable for conditions involving low amounts of reverberation. The classifier-based masks (m_GMM, m_SVM) show some evidence of overfitting, as they both outperform the m_R mask estimation method on the oracle mask training set, but are on average approximately equally efficient on the SimData development set and fall behind in the RealData experiments. The effect is more notable for the m_SVM mask than the m_GMM mask.
5.2. NMF Feature Enhancement
Final results on the REVERB challenge evaluation data set
are presented in Table 2. Four scenarios are considered:
single-channel feature enhancement using acoustic models
trained on clean speech with no adaptation (denoted Clean);
acoustic models trained on multi-condition data both without
(MC) and with (MC+ad.) CMLLR adaptation; and multi-
channel feature enhancement with the DS-PHAT beamformer
in the multi-condition training and CMLLR adaptation case
(8-ch.). Within each scenario, the best obtained result is
indicated in bold.
In addition to the REVERB challenge baseline (denoted Baseline), results are provided for both the BCMI imputation with m_R missing data masks (BCMI) and the proposed NMF feature enhancement method (NMF). Relative word error rate reductions over the baseline system in the average results for the SimData and RealData data sets for all four scenarios are summarized in Table 3.
Table 1. Recognition word error rates (%) for evaluated mask estimation methods on development set data using clean speech acoustic models. The best performing mask is highlighted in bold. Comparable results for the NMF feature enhancement, with the m_R mask used for initialization, are also shown for reference.

                     SimData                                            RealData
                     Room 1        Room 2        Room 3        Ave.     Room 1        Ave.
                     Near   Far    Near   Far    Near   Far             Near   Far
Baseline             15.29  25.29  43.90  85.80  51.95  88.90  51.81    88.71  88.31  88.51
BCMI, mask m_R       15.00  24.53  32.49  64.56  37.69  66.35  40.07    68.37  67.40  67.88
BCMI, mask m_LP      22.49  27.53  48.56  67.37  49.95  72.35  48.01    73.67  72.45  73.06
BCMI, mask m_GMM     14.97  23.06  30.93  64.75  37.78  68.37  39.94    71.62  70.13  70.87
BCMI, mask m_SVM     15.19  21.78  34.09  67.91  35.04  70.87  40.78    74.80  73.48  74.14
NMF                  13.13  18.88  22.95  41.66  26.11  46.93  28.26    53.21  54.48  53.84
Table 2. Word error rates (%) for evaluation data experiments. Results are presented for clean speech acoustic models (Clean) and multi-condition models both without (MC) and with (MC+ad.) CMLLR adaptation, as well as multi-condition training, adaptation and testing on the microphone array signals preprocessed with the DS-PHAT beamformer (8-ch.). The best performing method within each category is highlighted in bold.

                        SimData                                            RealData
Model    Method         Room 1        Room 2        Room 3        Ave.     Room 1        Ave.
                        Near   Far    Near   Far    Near   Far             Near   Far
Clean    Baseline       18.32  25.77  42.71  82.71  53.56  87.97  51.82    90.07  88.01  89.04
         BCMI           17.77  24.72  30.89  56.63  39.34  65.62  39.14    73.87  69.48  71.67
         NMF            16.77  20.13  23.75  39.03  29.21  49.65  29.74    61.00  57.26  59.13
MC       Baseline       20.79  21.38  23.36  38.69  28.25  45.21  29.60    57.97  55.20  56.58
         BCMI           20.01  20.86  23.06  34.35  27.44  37.83  27.25    52.12  50.51  51.31
         NMF            18.30  18.74  20.90  28.48  23.56  34.76  24.11    47.46  46.66  47.06
MC+ad.   Baseline       16.27  18.67  20.65  32.64  24.71  39.36  25.37    49.82  47.94  48.88
         BCMI           16.42  18.84  21.47  30.60  24.48  35.73  24.58    45.86  46.25  46.05
         NMF            15.89  17.15  19.22  26.24  21.24  31.75  21.91    40.79  42.03  41.41
8-ch.    Baseline       14.42  16.40  15.84  23.73  18.38  29.86  19.76    38.74  41.69  40.21
         BCMI           15.15  16.52  16.59  23.09  18.32  26.75  19.40    37.85  38.72  38.28
         NMF            14.47  15.83  15.55  19.92  16.64  24.42  17.80    34.43  35.15  34.79
Table 3. Relative reduction (%) of average word error rates over the baseline recognizer in Table 2.

         Clean                  MC                     MC+ad.                 8-ch.
         SimData  RealData      SimData  RealData      SimData  RealData      SimData  RealData
BCMI     24.5     19.5          7.9      9.3           3.1      5.8           1.8      4.8
NMF      42.6     33.6          18.5     16.8          13.6     15.3          9.9      13.5
6. DISCUSSION
With the exception of the near-distance Room 1 test condition
of the multi-channel experiments, where the results of all sys-
tems are very close, the proposed NMF feature enhancement
method consistently outperforms both the challenge baseline
and the evaluated missing data imputation system. The missing data imputation also provides performance gains over the baseline system in most test conditions, especially in cases involving high levels of reverberation. While the relative improvements are lower compared to the clean speech acoustic model case, both methods remain beneficial for reverberant speech even when used in conjunction with multi-condition training, CMLLR adaptation and the DS-PHAT beamformer.
For speech signals with negligible amounts of reverberation, missing data imputation with the m_R mask estimation has a relatively minor effect, and the corresponding recognition performance is essentially equivalent to the baseline system. By contrast, the NMF feature enhancement causes a marked distortion for clean speech signals, reducing the recognition performance in non-reverberant environments.
Word error rates for both missing data imputation and NMF
feature enhancement for the three clean speech conditions
of the REVERB evaluation data set are presented in Ta-
ble 4. Results are provided for clean speech acoustic models
with no adaptation, and multi-condition acoustic models both
with and without CMLLR adaptation. When multi-condition
acoustic models are used, both feature enhancement methods
reduce the mismatch between the processed multi-condition
training set and the clean speech test set, and consequently
outperform the baseline system. Finally, if CMLLR adapta-
tion is enabled, all three systems perform comparably, with
an accuracy slightly less than that of the clean speech acoustic
model baseline.
In order to establish an upper bound for the performance of the missing data imputation approach, the speech recognition performance was also measured for the oracle mask training subset. In these experiments, the baseline recognizer had a word error rate of 59.8%, while missing data imputation with oracle masks achieved a word error rate of 40.6%.
Table 4. Clean speech recognition performance (word error rate, %) of evaluated methods in all considered single-channel feature enhancement scenarios.

Model    Method     Room 1   Room 2   Room 3   Ave.
Clean    Baseline   12.89    12.64    12.13    12.55
Clean    BCMI       12.91    12.59    12.61    12.70
Clean    NMF        18.08    17.50    16.53    17.37
MC       Baseline   30.38    30.47    30.35    30.40
MC       BCMI       22.53    21.96    22.73    22.40
MC       NMF        24.50    23.74    23.62    23.95
MC+ad.   Baseline   16.23    15.58    15.87    15.89
MC+ad.   BCMI       15.15    14.94    15.48    15.18
MC+ad.   NMF        16.50    15.68    15.45    15.87
For comparison, corresponding results for estimated masks were 50.1–52.3%, and 45.4% for the proposed NMF feature enhancement method. While the recognition performance of imputation with oracle masks exceeded that of NMF feature enhancement, realistic mask estimation methods typically fall far short of oracle performance, as seen both in this study and in previous work [1, 15].
While the REVERB corpus contains microphone array signals, our main focus in this study is on the single-channel feature enhancement scenario. However, binaural features
have been used in several missing data mask estimation meth-
ods [15]. Mask generation based on features derived from
binaural and 8-channel signals would be a natural extension
for future work on similar data.
For the optimization algorithm described in Section 3.3, the activation matrix filtering performed in step 3 seems crucial for obtaining a solution that is capable of significantly attenuating reverberation, though at the cost of the noted degradation of performance for clean speech signals. In comparable development set experiments, omitting step 3 resulted in speech recognition performance essentially equivalent to that of the missing feature imputation alone.
In the related noise-robust NMF feature enhancement sys-
tem [4], a joint dictionary of speech and noise samples is used
to provide a noise model. While a noise dictionary was not
used in this study, as the SNR of 20 dB for the background
noise in the REVERB corpus is relatively high, the possibil-
ity of combining both methods to account for a reverberant,
noisy environment remains a potential topic for future work.
A widely used approach for further improving the efficiency of feature enhancement methods is to take the uncertainty of the enhanced features into account during the recognition process, via frameworks such as observation uncertainties or uncertainty decoding [19]. For the missing data imputation applied in this work, such information is readily available in the posterior distribution of the imputed features [12].
While the proposed NMF feature enhancement method does
not inherently provide a way to estimate the variance of the
enhanced features, heuristic uncertainty estimates have been
successfully used in conjunction with the related noise-robust
NMF feature enhancement system [20]. The applicability of
similar extensions in the context of reverberant speech could
also be investigated in future work.
7. CONCLUSIONS
In this study, we presented an extension of an NMF-based noise-robust feature enhancement scheme to account for speech distorted by reverberation. The speech recognition performance of the proposed method, along with that of previous missing data techniques for reverberant speech, was evaluated on the REVERB challenge corpus. Significant word error rate reductions were observed under both clean speech and multi-condition acoustic models, with the proposed NMF-based feature enhancement outperforming the evaluated missing data methods.
8. REFERENCES
[1] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 34, no. 3, pp. 267–285, 2001.

[2] T. N. Sainath, B. Ramabhadran, D. Nahamoo, D. Kanevsky, D. Van Compernolle, K. Demuynck, J. F. Gemmeke, J. R. Bellegarda, and S. Sundaram, "Exemplar-based processing for speech recognition: An overview," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 98–113, 2012.

[3] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems (NIPS), 2000, pp. 556–562. MIT Press.

[4] J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-based sparse representations for noise robust automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2067–2080, 2011.

[5] P. Smaragdis, "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Independent Component Analysis and Blind Signal Separation, C. G. Puntonet and A. Prieto, Eds., vol. 3195 of Lecture Notes in Computer Science, pp. 494–499. Springer Berlin Heidelberg, 2004.

[6] H. Kameoka, T. Nakatani, and T. Yoshioka, "Robust speech dereverberation based on non-negativity and sparse nature of speech spectrograms," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2009), 2009, pp. 45–48.

[7] K. Kumar, R. Singh, B. Raj, and R. Stern, "Gammatone sub-band magnitude-domain dereverberation for ASR," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2011), 2011, pp. 4604–4607.

[8] K. J. Palomäki, G. J. Brown, and J. P. Barker, "Techniques for handling convolutional distortion with missing data automatic speech recognition," Speech Communication, vol. 43, no. 1-2, pp. 123–142, 2004.

[9] K. J. Palomäki, G. J. Brown, and J. P. Barker, "Recognition of reverberant speech using full cepstral features and spectral missing data," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006), 2006.

[10] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, S. Gannot, and B. Raj, "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA-13), 2013.

[11] A. V. Beeston and G. J. Brown, "Modelling reverberation compensation effects in time-forward and time-reversed rooms," in UK Speech Conference, Cambridge, UK, Sept. 2013.

[12] U. Remes, "Bounded conditional mean imputation with an approximate posterior," in Proc. INTERSPEECH, 2013, pp. 3007–3011.

[13] M. L. Seltzer, B. Raj, and R. M. Stern, "A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition," Speech Communication, vol. 43, no. 4, pp. 379–393, 2004.

[14] J. F. Gemmeke, Y. Wang, M. Van Segbroeck, B. Cranen, and H. Van Hamme, "Application of noise robust MDT speech recognition on the SPEECON and SpeechDat-Car databases," in Proc. INTERSPEECH, 2009, pp. 1227–1230.

[15] S. Keronen, H. Kallasjoki, U. Remes, G. J. Brown, J. F. Gemmeke, and K. J. Palomäki, "Mask estimation and imputation methods for missing data speech recognition in a multisource reverberant environment," Computer Speech & Language, vol. 27, no. 3, pp. 798–819, 2013.

[16] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, "WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1995), 1995.

[17] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland, "The HTK book, version 3.4," Tech. Rep., Cambridge University Engineering Department, 2006.

[18] M. F. Font, "Multi-microphone signal processing for automatic speech recognition in meeting rooms," M.S. thesis, Universitat Politècnica de Catalunya, Spain, 2005.

[19] H. Liao and M. J. F. Gales, "Issues with uncertainty decoding for noise robust automatic speech recognition," Speech Communication, vol. 50, no. 4, pp. 265–277, 2008.

[20] H. Kallasjoki, J. F. Gemmeke, and K. J. Palomäki, "Estimating uncertainty to improve exemplar-based feature enhancement for noise robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 368–380, 2014.