Computer Speech & Language 64 (2020) 101103


Hybrid-task learning for robust automatic speech recognition


Gueorgui Pironkov a,*, Sean U. N. Wood b, Stéphane Dupont a

a Circuit Theory and Signal Processing Lab, University of Mons, Mons, Belgium
b NECOTIS, Department of Electrical and Computer Engineering, University of Sherbrooke, Sherbrooke, QC, J1K 2R1, Canada

A R T I C L E I N F O

Article History:
Received 18 January 2019
Revised 22 March 2020
Accepted 21 April 2020
Available online 3 May 2020

Keywords:
Multi-task learning
Robust speech recognition
Hybrid-task learning
Denoising auto-encoder
Real & simulated data training
Noise & reverberation
CHiME4

A B S T R A C T

In order to properly train an automatic speech recognition system, speech with its annotated transcriptions is most often required. The amount of real annotated data recorded in noisy and reverberant conditions is extremely limited, especially compared to the amount of data that can be simulated by adding noise to clean annotated speech. Thus, using both real and simulated data is important in order to improve robust speech recognition, as this increases the amount and diversity of training data (thanks to the simulated data) while also benefiting from a reduced mismatch between training and operation of the system (thanks to the real data). Another promising method applied to speech recognition in noisy and reverberant conditions is multi-task learning. The idea is to train one acoustic model to solve simultaneously at least two tasks that are different but related, with speech recognition being the main task. A successful auxiliary task consists of generating clean speech features using a regression loss (as a denoising auto-encoder). This auxiliary task though uses clean speech as targets, which implies that real data cannot be used. In order to tackle this problem, a Hybrid-Task Learning system is proposed. This system switches frequently between multi- and single-task learning depending on whether the input is real or simulated data, respectively. Having a hybrid architecture allows us to benefit from both real and simulated data while using a denoising auto-encoder as auxiliary task of a multi-task setup. We show that the relative improvement brought by the proposed hybrid-task learning architecture can reach up to 4.4% compared to the traditional single-task learning approach on the CHiME4 database. We also demonstrate the benefits of the hybrid approach compared to multi-task learning or adaptation.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Despite recent advances in deep learning, the most effective deep learning algorithms are still based on supervised
learning (LeCun et al., 2015). This is especially true for classification tasks, Automatic Speech Recognition (ASR)
included (Hinton et al., 2012). Supervised learning implies an annotation of the data, which can be time consuming and
resource intensive. In a scenario of a clean and non-reverberant acoustic environment, the amount of available annotated data
is very substantial for ASR (Panayotov et al., 2015; Hernandez et al., 2018; Voxforge.org, 2006; Linguistic Data Consortium,
1992). This eases speech recognition considerably, with some researchers even suggesting that we may have reached human-
like performance (Xiong et al., 2016). However, these ideal acoustic conditions are not very realistic since in many real-life sit-
uations, we are faced with degradations of the speech signal. Degradations may come from the surrounding noise (e.g., cars,
babble, industrial noises, etc.) (Li et al., 2014) or from the acoustic properties of the room (when the microphone used for recording is not a close-talking microphone) leading to reverberations of the speech (Kinoshita et al., 2016). These phenomena make speech recognition a much harder task and significantly deteriorate results. The other problem in this noisy and reverberant scenario is the limited amount of annotated real data. A method frequently used to tackle this problem is to artificially create simulated data by adding noise and reverberation to clean speech; this way, the massive amounts of clean annotated speech can be reused in order to improve ASR in these degraded conditions. Nevertheless, there is a substantial difference between the simulated and real data. The mismatch between these two types of noisy and reverberant data leads to poor results in real-life situations when the acoustic model is trained using simulated data only (Vincent et al., 2016). Among the different explanations for this mismatch is the Lombard effect (Hansen, 1994), whereby a speaker talking in a noisy environment naturally tends to raise his/her voice, changing the properties of the speech compared to speech recorded in a clean environment.

*Corresponding author.
E-mail address: gueo.pironkov@gmail.com (G. Pironkov).
https://doi.org/10.1016/j.csl.2020.101103
0885-2308/© 2020 Elsevier Ltd. All rights reserved.
In this paper, we propose a Hybrid-Task Learning (HTL) architecture that benefits from both real and simulated data. We use
the word hybrid as the HTL system is a mix of Single-Task Learning (STL) and Multi-Task Learning (MTL) (Caruana, 1997). STL
refers to the traditional ASR training where the acoustic model tries to solve only one task, i.e., the phone-state posterior proba-
bility estimation for ASR, whereas during an MTL training, the acoustic model is trained to jointly solve one main task (the same
one as for STL) plus at least one auxiliary task (for instance, gender recognition or speaker classification). The main motivation for
this HTL setup is that simulated data have an advantage compared to real data: we have access to the original clean speech. Thus,
the MTL system can be used to train the acoustic model where simulated data are applied as input, and the auxiliary task consists
of regenerating the original clean speech, similarly to a Denoising Auto-Encoder (DAE) for instance (Vincent et al., 2008). How-
ever, training an MTL setup exclusively would mean that only simulated data could be used, as we do not have access to the clean
speech when real data are applied. Hence, we investigate this mixed STL/MTL architecture that behaves as an MTL system when
the input is simulated data and as an STL system when the input data are real. An important point is that the system switches between MTL and STL depending on the random alternation of real and simulated data fed to the network. Thus, the HTL system is
able to train simultaneously on real and simulated data, while keeping the benefits of using MTL with a DAE auxiliary task. The
main goal of the HTL system is to take advantage of the large amount of annotated simulated data easily available while simulta-
neously integrating real data to the acoustic model, thus improving ASR performance for real-life acoustic conditions. To evaluate
this HTL setup, we use the CHiME4 database (Vincent et al., 2016), which is mainly composed of simulated data, but also contains
a smaller quantity of real data.
This work is organized as follows. First, the state-of-the-art is presented in Section 2. Section 3 then introduces MTL in greater
detail and describes more specifically the HTL mechanism. A description of the experimental setup, including specifications about
the acoustic model and the database, can be found in Section 4. In Section 5, we then present the performance of HTL when the
auxiliary task is a denoising auto-encoder compared to STL, MTL or adapting the acoustic model. We also briefly discuss the con-
vergence of the main and auxiliary tasks while the number of epochs increases during training in Section 6. Finally, we conclude
and present future work ideas in Section 7.

2. Related work

Using MTL has proven its efficiency on a variety of speech and language processing problems, such as speech synthesis (Wu
et al., 2015; Hu et al., 2015), speaker verification (Chen et al., 2015a), multilingual speech recognition (Dupont et al., 2005; Hei-
gold et al., 2013; Mohan and Rose, 2015), spoken language understanding (Tur, 2006; Li et al., 2011), natural language
processing (Collobert and Weston, 2008), and so on.
Several studies have previously focused on applying MTL for ASR. The main task for the acoustic model in ASR consists of
predicting the phone-state posterior probabilities, which are subsequently used as input of a Hidden-Markov Model (HMM)
that deals with the temporality of speech (or more recently and alternatively, a network with recurrent connections is used
instead). As cited already, some of the earliest studies use gender classification as an auxiliary task (Lu et al., 2004; Stader-
mann et al., 2005), where the goal is to increase the awareness of the acoustic model concerning the correlation between
gender and speech. In addition to predicting the phone-state posterior probabilities of the main task, some methods use
phone-related classes as an auxiliary task, but on a larger time-scale, as they try to directly predict the phone rather than the
probability of its HMM state (Seltzer and Droppo, 2013; Bell and Renals, 2015). Classifying even broader phonetic classes (e.
g., plosive, nasal, fricative, ...) as an auxiliary task has also been investigated, but proved to be inefficient (Stadermann et al.,
2005). In other works, grapheme1 classification was proposed as an auxiliary task (Stadermann et al., 2005; Chen et al.,
2014). Recent studies have also focused on increasing the speaker-awareness of the network in order to increase the gener-
alization ability, by using speaker classification or i-vectors (Dehak et al., 2011) estimation as auxiliary tasks (rather than
concatenating the i-vector with the input features) (Tan et al., 2016; Tang et al., 2016; Pironkov et al., 2016c). Furthermore,
MTL has also proven its effectiveness when the acoustic model is adapted to a particular speaker (Huang et al., 2015). More
details about these auxiliary tasks and their application for ASR can be found in Pironkov et al. (2016a).
Using a variety of different tasks in order to improve speech recognition in noisy and reverberant acoustic environment is also
a field of interest. For instance, some studies have focused on improving ASR in solely reverberant conditions, by using reverberant data for training and applying a de-reverberation auto-encoder as an auxiliary task (Giri et al., 2015; Qian et al., 2016a).

1 Symbolic representation of speech, e.g., any alphabet, as opposed to phonemes, which describe the sound directly.

Fig. 1. A Multi-Task Learning network with one main task and N auxiliary tasks.

Instead of using a regression auxiliary task (the DAE here), other research tries classification auxiliary tasks for robust ASR. In this case, the
auxiliary task recognizes and classifies the type of noise present in the corrupted sentence (Sakti et al., 2016; Kim et al., 2016).
The improvement brought by this approach is very limited though. A far more promising method previously cited consists of
using a denoising auto-encoder as the auxiliary task (Lu et al., 2004; Chen et al., 2015b; Li et al., 2016; Qian et al., 2015), where
the DAE targets for training are the clean features (which implies having access to clean features and making training with real
data almost impossible). Very similarly, another work used the same DAE MTL system for robust ASR, but in this case an addi-
tional bottle-neck layer was added. As a result, this bottle-neck layer was further used as the input of a classic STL ASR
system (Kundu et al., 2016). Finally, a comparable approach to MTL is the Joint Training (JT) (Gao et al., 2015; Ravanelli et al.,
2016; Qian et al., 2016b; Ravanelli et al., 2017). In JT, as in MTL, an auxiliary task is computed, but in JT the auxiliary task is com-
puted at an intermediary level of the system and directly fed to the rest of the system. For robust ASR, JT uses a DAE as intermedi-
ary task, whereas for the MTL approach the DAE is computed in parallel at the output level. In the JT case, having real data makes
training impossible as, once again, clean data are necessary for training but unavailable. An architecture that could adapt to the
input feature type (real or simulated) would be an elegant way to take advantage of all available data.
The novelty of this work lies in the Hybrid-Task Learning architecture for ASR, leveraging both simulated and real data.
To the best of our knowledge, this is the first time that such a hybrid architecture has been tested for speech recognition. Nonetheless, a similar
kind of architecture was recently presented by Sun et al. (2017), where the auxiliary task of their MTL system is dedicated to
predicting the domain. Two domains are classified as the auxiliary task: the source domain and the target domain. If the input
features come from the training set, they will be classified as belonging to the source domain by the auxiliary task. During
training, of course, features coming from the target domain are also randomly fed to their MTL network. In this case, the target
domain contains speech uttered by speakers present in the test set used for evaluation (as opposed to the source domain that
contains only speech from speakers in the training set). When features from the target domain are used, the connections to the
main task are suppressed, resulting in an STL system (as there are no phone-state posterior probability labels for these fea-
tures). This approach can also be seen as a hybrid-task learning architecture, but in Sun et al.’s work the main task is momen-
tarily ignored (keeping only the auxiliary task), which is slightly counter-intuitive in a multi-task setup and not applicable to
ASR with real and simulated features. Additionally, Sun et al. try to maximize the error of the auxiliary task, in order to make
the two feature domain distributions as similar as possible, whereas commonly for neural network training, the goal is to mini-
mize the error, which may also be an arguable approach in this setup. In the HTL setup that we propose, the auxiliary task is
dropped depending on the input.

3. Hybrid-task learning

In this work, we present and review the capacity and improvement brought by the hybrid-task learning approach compared
to single and multi-task learning, where the auxiliary task investigated in order to improve the ASR robustness is a denoising
auto-encoder. The HTL architecture is directly derived from multi-task learning, and can actually be seen as a special case of MTL,
where the MTL architecture is data dependent.

3.1. Multi-task learning

Multi-Task Learning was initially introduced at the end of the twentieth century (Caruana, 1997). The basic concept of MTL
consists of training one system (e.g. a neural network) to solve multiple different, but still related tasks. More specifically, in an
MTL setup, there should be one main task plus at least one auxiliary task. The purpose of the auxiliary task is to improve the con-
vergence of the system to the benefit of the main task. An MTL system with one main task and N auxiliary tasks is presented in
Fig. 1 as an example.
In order for a system with multiple tasks to be considered an MTL system, two characteristics have to be shared:

 Both the main and the auxiliary task(s) are trained using the same input features.2
 The parameters and internal representations (e.g., weights and neurons of a neural network) are shared among all tasks.

2 This is why multi-lingual ASR cannot truly be considered as MTL.

Fig. 2. A Hybrid-Task Learning (HTL) system which adapts its architecture depending on the input features. The same system is represented, where a and b are two different kinds of input features randomly fed to the system (for instance real and simulated data). (a) As the input features are of type a, the HTL system behaves as a single-task learning system. (b) The type b features force the system to behave as a multi-task learning system with N auxiliary tasks.

The update of the parameters of the network is done by backpropagating a weighted mixture of the errors of all tasks:

$$\varepsilon_{MTL} = \varepsilon_{Main} + \sum_{n=1}^{N} \lambda_n \, \varepsilon_{Auxiliary_n} \tag{1}$$

where ε_MTL is the mixture error to be minimized, ε_Main and ε_Auxiliary_n are the errors computed from the main and auxiliary tasks respectively, λ_n is a nonnegative weight associated with each auxiliary task, and N is the total number of auxiliary tasks added to the main task of this system. The influence of the auxiliary task with respect to the main task is controlled by the value of λ_n. If the n-th auxiliary task has its λ_n close to 1, then its contribution to the error estimation will be as important as the main task's contribution. On the contrary, for λ_n close to 0, the auxiliary task's influence will be very small (or nonexistent), leading to a single-task system. Most frequently, only the main task is kept during testing, the auxiliary tasks being withdrawn.
The relevance of the auxiliary task with regard to the main task is critical for the convergence of the latter, as MTL has the ability to increase the system robustness to unseen data and hence improve generalization. Also, sharing the system's parameters among the different tasks may lead to better performance than processing each task independently (Caruana, 1997).
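As a minimal illustration of Eq. (1), the sketch below combines a main-task loss with weighted auxiliary losses. The function and variable names are ours, and the snippet is framework-agnostic: the loss values could be plain floats or autograd tensors of any deep learning toolkit.

```python
def mtl_loss(main_loss, aux_losses, lambdas):
    """Combine the main-task error with weighted auxiliary-task errors, as in Eq. (1)."""
    total = main_loss
    for lam, aux in zip(lambdas, aux_losses):
        total = total + lam * aux  # each auxiliary error is scaled by its lambda_n
    return total

# Hypothetical usage with a single auxiliary task weighted by lambda = 0.15:
# epsilon_mtl = mtl_loss(ce_loss, [dae_loss], [0.15])
# epsilon_mtl.backward()  # backpropagates the mixture of all task errors
```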

3.2. Hybrid-task learning mechanism

The proposed hybrid-task learning architecture is based on MTL. The novelty of HTL comes from its flexibility. The core idea is
to have a system that adapts to the number of output tasks, and more specifically the presence or absence of auxiliary tasks,
depending on the input features. The setup is applicable to the specific situation where the same training set contains two types
of data, some of which may be used to train the auxiliary task(s), whereas the rest of the data cannot be used for the auxiliary task(s). In this case, a setup that is able to adapt its auxiliary task(s) dynamically is required in order to train the whole sys-
tem using all the available data. An illustration of the proposed hybrid architecture is shown in Fig. 2.
The motivation behind the hybrid architecture is to be able to use all available data while performing training with several tasks, as using more than one task may improve the overall performance of the system (see Section 3.1). Computing the error to be backpropagated in this setup is very similar to the MTL case of Eq. (1), with the difference being an additional factor whose value depends on the feature type currently processed, leading to:

$$\varepsilon_{HTL} = \varepsilon_{Main} + g_{feature} \left( \sum_{n=1}^{N} \lambda_n \, \varepsilon_{Auxiliary_n} \right) \tag{2}$$

with g_feature a binary variable equal to 1 if the error ε_HTL is computed from features supporting MTL, or equal to 0 if the input features can be used only for the main task.
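Under the same assumptions as the previous sketch, Eq. (2) only adds a binary gate on top of the weighted auxiliary sum; the flag name is again hypothetical.

```python
def htl_loss(main_loss, aux_losses, lambdas, supports_mtl):
    """Eq. (2): auxiliary errors contribute only when the input features support MTL."""
    g_feature = 1.0 if supports_mtl else 0.0
    return main_loss + g_feature * sum(lam * aux for lam, aux in zip(lambdas, aux_losses))
```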
In this paper, the hybrid setup is investigated for robust ASR. This setting is a particularly suitable candidate for HTL. As discussed previously, on the one hand, the amount of annotated clean speech is far greater than the amount of annotated noisy and reverberant speech, more specifically the amount of real noisy and reverberant data.
On the other hand, it is possible to generate simulated noisy and reverberant data by adding noise to the original (annotated)
clean speech and convolving it with the impulse response of a reverberant room. The limitation of the simulated data is the mis-
match between the real data and the artificially generated data. Thus, a solution is to create databases containing real and simu-
lated data, where the ratio of real-to-simulated data will be biased towards the simulated data (as the annotations are required
for ASR).
In this case though, the MTL setup could not be applied if the auxiliary task is a denoising auto-encoder (one of the rare truly
effective auxiliary tasks for robust ASR), as there would be no ground-truth for the real data due to the lack of clean features in
real-life conditions.

Applying an HTL setup to this database will allow us to benefit from the DAE task with simulated data while the acoustic
model still learns valuable information from the real data. As the input features will randomly vary between real and simulated
data, the network will be able to simultaneously learn useful knowledge from both datasets (plus learn how to denoise corrupted
features) by dynamically adapting to its number of tasks.
Additionally, HTL with a DAE auxiliary task, as for MTL, is quite easy to implement as the targets for the auxiliary task genera-
tive training are the clean features, which are already available as they are used to generate the simulated data.
Finally, computational time is another argument backing up HTL (and MTL), as the only additional computational cost is the
calculation of the auxiliary task error, which is far from being time consuming compared to the acoustic model training.
It should be noted that the goal of this study is not to propose a novel auxiliary task, but to compare the performance of HTL to
MTL. As a result, we are using the DAE auxiliary task, as this task is the only auxiliary task that has shown its benefit for MTL
applied to robust ASR (see Section 2).

4. Experimental setup

This section presents the tools and techniques specifically used to evaluate the HTL setup for ASR in noisy and reverberant
conditions.

4.1. Database

To evaluate the proposed HTL setup for robust ASR, we use the CHiME4 database (Vincent et al., 2016). The CHiME4 database
was released in 2016 as part of a speech recognition and separation challenge in noisy and reverberant environments. This data-
base contains 1-channel, 2-channel, and 6-channel microphone array data. Real acoustic mixing was recorded in four different
noisy environments (café, street junction, public transport and pedestrian area) through a tablet device with 6-channel microphones. Simulated data are also generated using additive noise (recorded in the same four noisy environments) applied to
the WSJ0 database (Garofolo et al., 1993).
All training, development, and test sets contain real and simulated data provided as 16-bit wav files sampled at 16 kHz. The training dataset is composed of 7138 simulated utterances (≈15 h) recorded by 83 speakers and 1600 real utterances (≈4 h) of real noisy and reverberant speech recorded by 4 speakers. The development set contains the same division of real and simulated data, for a total of 3280 utterances (≈5.6 h) from 4 other speakers for each data type. Similarly, the evaluation set consists of a total of 2640 sentences, leading to approximately 4.5 h recorded by 4 speakers for real data and 4 others for simulated data.
This dataset perfectly meets the HTL setup requirements, with training and testing performed on a mix of real and simulated
data (and easy access to the clean data used to generate the simulated data). The goal of the proposed HTL setup is to improve the speech recognition on real data, even though this data type is the least represented in the whole training corpus.
In this work, a DAE is used as auxiliary task; as a result, we use only one channel (channel no. 5) during training, while the development and test sets are created from randomly selected channels.

4.2. Features

The features used as input of our system as well as targets for the DAE task are obtained following this traditional ASR
pipeline:

1. 13-dimensional Mel-Frequency Cepstral Coefficients (MFCC) features are extracted from the raw audio files, and normalized via Cepstral Mean-Variance Normalization (CMVN).
2. The adjacent ± 3 frames are spliced for each frame.
3. The concatenated feature dimension is reduced by a projection into a 40-dimensional feature space using a Linear Discriminant Analysis (LDA) transformation.
4. The final features are obtained through feature-space Maximum Likelihood Linear Regression (fMLLR), which is a feature-space speaker adaptation method.

Furthermore, these 40-dimensional features are spliced one more time with the surrounding ± 5 frames to form the input features of the acoustic model, whereas there is no splicing concerning the DAE task targets. Thus, the splicing of the input features can be seen as providing additional context for the DAE auxiliary task.
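To make the splicing steps concrete, the following sketch shows ± K frame splicing on a NumPy feature matrix. The function name is ours, and the edge handling (repeating the first and last frames) is an assumption; the Kaldi recipe may pad differently.

```python
import numpy as np

def splice(feats: np.ndarray, context: int) -> np.ndarray:
    """Concatenate each frame with its +/- `context` neighbouring frames.

    feats: (num_frames, feat_dim) array, e.g. 40-dimensional fMLLR features.
    Returns an array of shape (num_frames, feat_dim * (2 * context + 1)).
    """
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)], axis=0)
    num_frames = feats.shape[0]
    # window offset i corresponds to frame t + i - context for every frame t
    return np.concatenate([padded[i:i + num_frames] for i in range(2 * context + 1)],
                          axis=1)

# Hypothetical usage following the pipeline above:
# mfcc_spliced = splice(mfcc_cmvn, 3)    # step 2: +/- 3 frames before the LDA projection
# dnn_input    = splice(fmllr_feats, 5)  # +/- 5 frames as acoustic model input
```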

4.3. Acoustic model training

Training and testing the HTL setup was done using the nnet3 version of the Kaldi toolkit (Povey et al., 2011).
The acoustic model used to evaluate the HTL performance is a single-task learning feed-forward Deep Neural Network (DNN).
The DNN has 4 hidden layers, each composed of 1024 neurons using rectified linear unit (ReLU) activations. The 1972 phone-state
posterior probabilities of the STL main task are computed after a softmax output layer. The DNN training is achieved through 14
epochs with an initial learning rate of 0.0015 which is progressively reduced to 0.00015. The error of the main ASR task is com-
puted using the cross-entropy loss function:

$$L_{CE} = -\sum_{i=1}^{n} l_i \, \log(y_i) \tag{3}$$

with l_i the desired label, y_i the DNN's softmax output and n the total number of neurons in this layer. For the DAE auxiliary task, the quadratic loss function is applied instead (as this is a regression problem and not a classification task):

$$L_{Q} = \frac{1}{2} \sum_{i=1}^{n} \left( y'_i - t_i \right)^2 \tag{4}$$

with y'_i the DNN's pre-softmax output (meaning the last linear layer of the DNN), t_i the desired target and n the total number of neurons again. The parameters (weights and biases) of the network are updated by backpropagating the error derivatives using stochastic gradient descent (SGD), with no momentum or regularization. The input features are processed through mini-batches of size N = 512. Since the HTL DAE task requires knowing the type of features, i.e., whether they are extracted from real or simulated data, each mini-batch contains features coming from only one of these two datasets (whereas usually all features would be voluntarily mixed up in the mini-batch). Section 5.4 discusses this point and its impact on speech recognition performance in more depth. The value of the coefficient g_feature of Eq. (2) is automatically updated during training by keeping track of the origin (real or simulated) of the mini-batches. The overall HTL experimental setup is shown in Fig. 3.

Fig. 3. Hybrid-Task Learning (HTL) setup used for the experiments. (a) corresponds to Single-Task Learning (STL) as the training is done with real noisy speech, while for (b), using simulated noisy speech allows a Multi-Task Learning (MTL) training (where the auxiliary task is a denoising auto-encoder). The system dynamically switches between the configurations (a) and (b), depending on the input features.
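As a rough functional sketch of the acoustic model and training step described above, the following PyTorch code reproduces the shared 4 × 1024 ReLU body with two heads (the 1972-class ASR head and a 40-dimensional DAE regression head) and the gated loss of Eq. (2). It is an illustrative re-implementation under our own assumptions (class, function and dictionary-key names, optimizer wiring); the experiments themselves were run with the Kaldi nnet3 recipe, not this code. The input dimension of 440 follows from the 40-dimensional features spliced with ± 5 frames.

```python
import torch
import torch.nn as nn

class HTLAcousticModel(nn.Module):
    """Shared feed-forward body with an ASR head and a DAE head."""
    def __init__(self, input_dim=440, hidden_dim=1024, num_states=1972, clean_dim=40):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(4):                                    # 4 hidden layers of 1024 ReLUs
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        self.body = nn.Sequential(*layers)
        self.asr_head = nn.Linear(hidden_dim, num_states)     # phone-state posteriors (softmax applied in the loss)
        self.dae_head = nn.Linear(hidden_dim, clean_dim)      # clean-feature regression targets

    def forward(self, x):
        h = self.body(x)
        return self.asr_head(h), self.dae_head(h)

def htl_step(model, optimizer, batch, lam=0.15):
    """One HTL mini-batch update; the batch dictionary layout is hypothetical."""
    logits, clean_pred = model(batch["feats"])
    loss = nn.functional.cross_entropy(logits, batch["states"])           # Eq. (3), main ASR task
    if batch["is_simulated"]:                                             # g_feature = 1 only for simulated data
        dae_err = 0.5 * ((clean_pred - batch["clean_feats"]) ** 2).sum(dim=1).mean()  # Eq. (4), per-frame average
        loss = loss + lam * dae_err
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```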
All these parameters were selected through empirical observations in order to obtain the highest recognition accuracy. Additionally, other neural network algorithms were also considered as acoustic models. Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) cells as well as Time-Delay Neural Networks (TDNN) were tested without bringing any significant improvement compared to a feed-forward DNN on the CHiME4 dataset. On the contrary, the computational time (especially for LSTM-RNN) was far greater than for the DNN. These observations obtained on an STL system were also confirmed on
the DAE HTL setup, as the denoising auto-encoder auxiliary task does not profit from these more complex deep learning architec-
tures. It can be noted though that some auxiliary tasks depend more on the neural network used as acoustic model, as for
instance the speaker classification auxiliary task (Pironkov et al., 2016b).
During decoding, the most likely transcriptions are obtained using the output state probabilities computed by the network and applying them to an HMM system and a language model, the language model being the 3-gram Kneser-Ney (KN) language model trained on the WSJ 5K standard corpus.

4.4. Baseline

Using the settings presented in the previous section, we train and test a feed-forward single-task learning deep neural net-
work as the baseline acoustic model. The Word Error Rate (WER) is computed on both development and test sets, for each data
type (real and simulated), and for all four noisy environments. The results are shown in Table 1.
The effects of a significant mismatch between the development and test datasets can be noticed for both real and simulated
data. For the pedestrian noisy environment for instance, the development set WER on real data is 11.36% whereas for the real
data of the test set the WER more than doubles to 25.37%. Besides the street environment, all other environments suffer from the mismatch between the development and test set, especially for real data. This tendency is also confirmed on simulated data, with the overall WER going from 18.12% to 26.00% for the development set and test set respectively. The mismatch is even more noticeable on real data, with the overall WER degrading from 16.46% to 29.30%. The mismatch is partially due to the variability of the recording conditions. Another explanation can be the Lombard effect described in Section 1. Finally, it can also be noted that both the development and test sets contain speech uttered by 8 different speakers (whereas 83 speakers are used for training). This lack of diversity can also explain the difference in WER between the two datasets, as the impact of one or two speakers who are harder to recognize than the others would be much more severe than with more speakers in those datasets.

Table 1
Word Error Rate (WER) in % on the development and test sets of the CHiME4
dataset used as baseline. Overall is the mean WER of all 4 environmental noises
and Avg. is the mean WER over real and simulated data.

             Avg.    Development Set             Test Set
                     Mean    Simu    Real        Mean    Simu    Real

Overall      22.47   17.29   18.12   16.46       27.65   26.00   29.30
Bus          24.93   18.51   16.02   20.99       31.35   20.58   42.12
Cafe         24.98   19.12   21.81   16.43       30.84   30.01   31.66
Pedestrian   19.33   12.95   14.53   11.36       25.71   26.04   25.37
Street       20.66   18.60   20.12   17.08       22.71   27.36   18.06

Table 2
Performance of the Hybrid-Task Learning architecture when the auxiliary task is a Denoising Auto-Encoder, with λ the weight attributed to the DAE auxiliary task during training. The baseline, which is the Single-Task Learning architecture, is obtained for λ = 0. The Avg. value is computed over all four datasets.

Value of λ   Avg.    Development Set             Test Set
                     Mean    Simu    Real        Mean    Simu    Real

0 (STL)      22.47   17.29   18.12   16.46       27.65   26.00   29.30
0.05         22.07   16.82   17.44   16.20       27.31   25.57   29.05
0.1          21.96   16.77   17.42   16.11       27.16   25.36   28.95
0.15         21.88   16.72   17.32   16.11       27.04   25.23   28.84
0.2          21.93   16.82   17.60   16.04       27.04   25.39   28.69
0.3          22.10   17.02   17.71   16.32       27.18   25.43   28.92
0.5          22.75   17.60   18.43   16.77       27.89   26.32   29.46
0.7          23.08   17.92   19.05   16.79       28.27   26.96   29.57

As a general remark, the word error rates shown in this study are significantly higher than the state-of-the-art WER for
CHiME4, as these results are obtained through extremely optimized neural networks and tremendous training resources. Our
goal here is to provide a proof-of-concept of the benefits of HTL with real and simulated data, compared to MTL, and not directly
challenge the state-of-the-art results.

5. HTL Performance

In this section, we evaluate the performance of the HTL setup on the CHiME4 real and simulated datasets. We compare the
HTL setup to an STL setup. The improvement brought by the hybrid flexibility is also compared to MTL. The impact of the mini-
batch size is also tested with the hybrid-task learning setup. For clarity, we display here the WER results over all four noises
(rather than the results for each noise as in the previous section), as the observed tendencies follow similar patterns. For each experimental situation, the results for the development set and test set are computed, as well as the results on
real and simulated data, and the average over all four datasets.

5.1. Denoising auto-encoder auxiliary task impact

In order to evaluate the hybrid-task setup, we vary the value of λ present in Eq. (2). If λ = 0, the setup behaves as a single-task learning system, which is our baseline. The higher the value of λ, the more influential the DAE task will be compared to the ASR
main task. Both the STL and HTL setups are trained using real and simulated data, with the data fed to the networks being ran-
domly selected between real and simulated. Results are presented and plotted in Table 2 and Fig. 4 respectively.
There is a persistent improvement brought by the hybrid-task learning architecture over all four possible datasets, especially for a value of λ less than 0.4. The relative improvement of HTL compared to STL reaches up to 4.4% for the development
dataset applied on simulated noisy and reverberant speech for λ = 0.15. More generally, an overall relative improvement of 2.6%
is obtained over all four datasets for λ = 0.15, showing the positive impact of the hybrid auxiliary task in all cases.
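These relative figures follow directly from Table 2; for the development set on simulated data and for the overall average, respectively:

$$\frac{18.12 - 17.32}{18.12} \approx 4.4\%, \qquad \frac{22.47 - 21.88}{22.47} \approx 2.6\%.$$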
Interestingly, Fig. 4 highlights a similar pattern for real and simulated data, independently of the test or development set,
when the value of λ increases over 0.5. For simulated data, having a value of λ too high damages the WER results on the main
ASR task, going from 18.12% (STL) to 19.05% (HTL with λ = 0.7) on the development set for instance. This pattern can also be
noticed on the test set for simulated data.

Fig. 4. Evaluating the Hybrid-Task Learning architecture when the auxiliary task is a Denoising Auto-Encoder, with λ the weight attributed to the DAE auxiliary task during training. The baseline, which is the Single-Task Learning architecture, is obtained for λ = 0.

For real data on the other hand, increasing λ over 0.5 does not significantly deteriorate the ASR results. This is due to the
hybrid-task learning architecture, as the value λ of the DAE auxiliary task is applied only when training with simulated data.
Thus, the impact of λ is much less significant on real data, independently of the development and test sets.
In light of the above best WER obtained while varying the impact of the DAE auxiliary task, we set λ = 0.15 for the next sec-
tions.

5.2. Evaluating the “Hybrid” impact

Despite the improvement brought by the HTL setup compared to single-task learning, it is questionable whether this improvement comes from the hybrid architecture that frequently switches from single to multiple tasks depending on the input features, or only from the usage of multiple tasks.
In order to evaluate the HTL impact, we train an HTL and an STL system using both real and simulated data and compare the results to STL and MTL systems trained using only simulated data; we also train an STL system on real data only. The results are presented in Table 3.
Before discussing the HTL impact, it can be noticed that the idea that "more data is always better" applies here when comparing all three STL systems. Using only real data (the smallest dataset) for training gives the worst results, with a WER of 54.57%
over all four datasets. When looking at the test results, surprisingly, the WER on the simulated data is lower (62.92%) than for
real data (65.57%), despite the mismatch between real and simulated data (simulated data that are unseen here), highlighting an
even larger mismatch between the real data used for training and the real data used for testing. Using simulated and real data

Table 3
Comparing the Word Error Rate (%) of different Task Learning (TL) systems depending on the training datasets, where the hybrid-TL and multi-TL auxiliary tasks are DAE with λ = 0.15. Avg. is the average WER over all four datasets.

Training Dataset(s)          System Architecture   Avg.    Development Set             Test Set
                                                           Mean    Simu    Real        Mean    Simu    Real

Random mix of Real + Simu    Single-TL             22.47   17.29   18.12   16.46       27.65   26.00   29.30
Random mix of Real + Simu    Hybrid-TL             21.88   16.72   17.32   16.11       27.04   25.23   28.84
Simu only                    Single-TL             23.73   18.27   18.45   18.09       29.18   26.55   31.81
Simu only                    Multi-TL              23.28   17.91   17.99   17.82       28.63   26.06   31.20
Real only                    Single-TL             54.57   44.89   49.83   39.95       64.25   62.92   65.57

Fig. 5. Evaluation of the relative improvement brought by hybrid-task learning compared to single-task learning versus the relative improvement brought by
multi-task learning compared to single-task learning.

(largest dataset) for training gives a WER of 22.47% over all four datasets, whereas using only simulated data reaches 23.73%. Again, having more data (and more diversified data) helps. Adding real data to the simulated training data significantly improves the real data results (going from 18.09% to 16.46% for the development set for instance), whereas, as expected, this improvement is much smaller for simulated data (from 18.45% to 18.12% on the same dataset) but still present.
Both the MTL and HTL architectures appear to improve results compared to their respective STL setups, with an averaged WER
of 23.28% over all four datasets for MTL and 21.88% for HTL, but as discussed earlier directly comparing the WER of HTL and MTL
would be incorrect as both systems train on a different amount (and type) of features, making it hard to estimate if the improve-
ment comes from the hybrid architecture or from the greater amount of data used during the HTL training.
Thus, we compute the relative improvement brought by HTL compared to STL when real and simulated data are used for
training and compare it to the relative improvement brought by MTL compared to STL when only simulated data are used. The results
are shown in Fig. 5.
It can be observed that on average, and more specifically for three out of the four datasets, using HTL gives better relative
improvement than MTL. The highest gap between the HTL and MTL relative improvements can be noticed on the simulated data-
sets. Interestingly, for the test set using real data, MTL gives a slightly higher relative improvement (0.3% better than for HTL),
where for MTL no real data were used during training. This result can be explained by the fact that the WER on the real data test-
set are worst on STL when training on simulated data only, thus in this situation MTL provides a better generalization than HTL
to the unseen data as a higher gap exists between the STL and MTL WER.
Despite this, the HTL architecture outperforms MTL in most cases; additionally, HTL allows us to use both datasets, and thus to start from a better STL baseline (WER of 22.47% on average) than training an MTL system with only simulated data (WER of 23.28% on average).

5.3. Evaluating adaptation compared to HTL

Another way to take advantage of both real and simulated data (while using STL and MTL training) can be achieved
through adaptation of the acoustic model. Adaptation of the acoustic model in ASR is usually done by adding a linear layer at the
beginning or end of the network (Neto et al., 1995). Then the neural network is trained with the largest available dataset (which
would be the simulated data here and this training could be done through MTL). Once the network is trained, the weights in the
linear layer are updated again with the smaller adaptation dataset (real data here) while all the other weights are kept fixed (and
this would be an STL training here).
Nevertheless, adaptation is mostly used when the adaptation set amount is very limited (usually a few sentences). With
CHiME4, the amount of real data is smaller than for simulated data, but not to that extent. Thus, adapting the acoustic model
trained through simulated or real data would not be the most effective method. With this in mind, instead of updating only a lin-
ear layer, another solution could be to train the acoustic model on one type of data and then retrain the same acoustic model on
the other type of data, while updating all the parameters, and compare it to HTL. While the HTL system frequently switches from STL to MTL (and vice versa) depending on the data type, here the network would train following one architecture and, when all the associated data are processed, it would switch to the other architecture permanently.

Table 4
Comparing the Word Error Rate (%) for different orders of training, with Simu and Real referring to the corresponding datasets, while the MTL and HTL auxiliary tasks are DAE with λ = 0.15 and Avg. the average WER over all four datasets.

Dataset(s) Order During Training   Architecture Transition   Avg.    Development Set             Test Set
                                                                     Mean    Simu    Real        Mean    Simu    Real

Random mix of Real + Simu          STL                       22.47   17.29   18.12   16.46       27.65   26.00   29.30
Random mix of Real + Simu          HTL                       21.88   16.72   17.32   16.11       27.04   25.23   28.84
Simu → Real                        MTL → STL                 24.68   19.45   21.64   17.25       29.91   29.47   30.34
Real → Simu                        STL → MTL                 23.06   17.72   17.91   17.53       28.41   26.07   30.74
The obtained results are shown in the last two lines of Table 4, where for each dataset the model is trained for 14 epochs. Thus, the final acoustic model for the four compared setups in Table 4 was trained on the same number of epochs of simulated and real data.
The results highlight the importance of using real and simulated data simultaneously during training rather than training on one dataset and then retraining on another one (no matter the order). The WER is worse when first training on simulated data and then on real data (24.68% on average), compared to training the other way round (23.06%), showing that the order of training has a significant impact. Despite using MTL, both these approaches are surpassed by an STL training when data are randomly alternated (22.47%). The HTL approach outperforms the other proposed approaches, as the hybrid architecture takes advantage of both the randomly mixed datasets and the multiple-task training.

5.4. Evaluating the mini-batch size impact

In Section 4.3, we described the system and explained that mini-batches are used during training. Thus far, the size of the
mini-batch was fixed to N = 512. The particularity of the mini-batches in the hybrid-task setup is that each mini-batch contains features coming from only one type of data, either real or simulated. In traditional training, real and simulated data are randomly mixed inside the mini-batch, whereas here, due to the hybrid-task training, the datasets are kept apart. Thus, a relevant con-
cern involves the diversity of the mini-batch during training compared to classic STL training. In order to evaluate this aspect, the
size of the mini-batch is progressively reduced, thus alternating more often between mini-batches containing only simulated
data and mini-batches containing only real data (and also alternating more often between the STL and MTL architectures). The
obtained results are presented in Table 5.
Before directly discussing the impact of mini-batch size, the first two lines of Table 5 compare STL training with mini-batches
containing features from both real plus simulated data and mini-batches containing only one kind of data. The respective WERs
show no real difference in the results obtained by these two methods, with a WER of around 22.5% for both methods.
For HTL, according to the second part of Table 5, there is no real variation in the word error rate while decreasing the mini-
batch size. The mini-batch size is decreased from 512 to 4 with minor variations of the WER (± 0.04%) that are hardly significant.
As expected, the training time drastically increases when decreasing the size of the mini-batch but without bringing any
improvement. These results confirm the observations made on the STL architecture, showing that having features coming from
only simulated data or only from real data in a mini-batch is not penalizing.

Table 5
Comparing the Word Error Rate (%) for different sizes of mini-batches, where the first two lines are single-task learning systems, whereas the rest use hybrid-task learning with a DAE auxiliary task (λ = 0.15). For STL† the mini-batches are composed of randomly selected features computed from real and simulated data. For STL‡ and for the other HTL configurations, each mini-batch contains only one kind of data, either real or simulated. The training time for each configuration is displayed, plus the average WER (Avg.) over all four datasets.

Mini-batch Size   Training Time (hours)   Avg.    Development Set             Test Set
                                                  Mean    Simu    Real        Mean    Simu    Real

512 (STL†)        1.2                     22.46   17.25   18.09   16.41       27.67   25.96   29.37
512 (STL‡)        1.2                     22.47   17.29   18.12   16.46       27.65   26.00   29.30
512               1.5                     21.88   16.72   17.32   16.11       27.04   25.23   28.84
256               2.1                     21.87   16.72   17.39   16.05       27.03   25.31   28.74
128               3.4                     21.90   16.81   17.42   16.19       27.00   25.41   28.59
64                6.0                     21.92   16.75   17.26   16.23       27.10   25.33   28.87
32                11.3                    21.85   16.64   17.25   16.03       27.05   25.09   29.01
16                21.6                    21.88   16.74   17.19   16.28       27.03   25.24   28.82
8                 43.4                    21.91   16.62   17.27   15.96       27.21   25.36   29.06
4                 83.9                    21.89   16.71   17.22   16.20       27.07   25.19   28.95

Fig. 6. Evolution of the task errors over the increasing number of epochs during training, computed on the training set. The Main ASR Task Error is the speech recognition error computed through the cross-entropy loss function for Hybrid- and Single-Task Learning, whereas the Auxiliary DAE Task Error corresponds to the denoising auto-encoder task error obtained through the quadratic loss function with λ = 0.15.

Finally it can also be noted, as discussed in Section 3.2, that there is a slight increase of the training time for HTL compared to
STL (1.5 and 1.2 h respectively for mini-batches of size 512) due to the computation of the auxiliary task error.

6. Main and auxiliary tasks convergence

In many studies where MTL is used for speech recognition, the convergence of the main and auxiliary tasks is rarely discussed. Showing that both tasks converge during training, and thus decrease the errors of their respective loss functions as the number of epochs increases, is crucial. It has been shown that introducing weak noise to the neural network parameters tends to improve the generalization performance for both regression and classification problems (An, 1996). When using MTL, and especially with a small value of λ awarded to the auxiliary task, there is a risk that the auxiliary task is not effectively converging during training, and that the improvement brought to the main task originates from the non-converging weak noise coming from the auxiliary task. In order to disambiguate this issue, the errors of the loss functions, obtained on the training set, of the ASR main task during STL and HTL training and of the DAE auxiliary task with λ = 0.15 are presented in Fig. 6.
The convergence of the error for STL and HTL is very similar on the main ASR task, and this error is even lower for STL between
the 5th and 10th epoch. Nevertheless, after the 10th epoch the error for HTL continues to slowly drop whereas it stagnates for STL.
Concerning the DAE auxiliary task, even for this small value of λ, the auxiliary task converges over the increasing number of
epochs as the error progressively decreases, demonstrating that the DAE auxiliary task is not randomly updated, and that the
improvement brought by HTL compared to STL relies on this auxiliary task. The convergence of the error for the DAE task with even smaller values of λ (we tested values down to λ = 0.05) is similar to the one depicted in Fig. 6, but for these values, the difference between STL and HTL training becomes even less significant concerning the error of the main task.

It can also be noted that the convergence speed of both the main and auxiliary tasks is fast during the first 5 epochs. After that,
the auxiliary task error stagnates whereas the error for the main task continues to improve significantly until the 7th epoch. This
slight convergence mismatch between the two tasks might be a limitation to the improvement brought by HTL (and MTL more
generally), as having an auxiliary and a main task that converge similarly tends to perform better (Caruana, 1997).

7. Conclusion

In this work, a novel task learning mechanism is proposed which is referred to as Hybrid-Task Learning. This mechanism is
based on mixing the Multi-Task Learning architecture with the traditional Single-Task Learning architecture, leading to a
dynamic hybrid system that switches between single and multi-task learning depending on the input feature’s type. The motiva-
tion for this approach was that the amount of data and its characteristics may vary substantially in such a way that it might not
be possible to apply MTL on all the available data. In such a case, two options are most commonly considered: use an STL system
with all the available data that does not take advantage of the MTL benefits, or use an MTL approach but only with the compatible
data and thus not profit from the variety of the whole dataset. The proposed HTL approach is a third solution to that problem
allowing the use of the MTL architecture while all available data are being processed during training. Such an HTL system is particu-
larly suitable for robust automatic speech recognition, where two major types of data are available: real noisy and reverberant
data recorded in real-life conditions and simulated data that are obtained by artificially adding noise to clean speech recorded ini-
tially in noiseless conditions. MTL can be successfully applied to simulated data, where the auxiliary task is a Denoising Auto-
Encoder generating clean speech, as the clean data are available. This DAE auxiliary task is not applicable on real data, however.
Thus, having an HTL structure is here the solution we proposed for ASR in noisy and reverberant conditions.
We show that the ASR performance on real and simulated data using HTL outperforms the two options mentioned in the previous paragraph, with up to 4.4% relative improvement brought by HTL compared to STL. In addition to comparing HTL and
MTL, we also show the importance of having randomly mixed real and simulated features during training instead of processing
them independently. Additionally, it can be noted that, as for MTL, implementing and training the proposed HTL setup is not
time consuming and does not require additional information, as the clean data used for training the auxiliary task are already available when the simulated data are generated.
To summarize, we propose a novel approach capable of adapting a learning process to leverage the target data that are avail-
able on an example-by-example basis at training time. The Hybrid-Task Learning idea is simple, easy to implement and quick to
train (or at least it does not add much computational time), leading to a small but non-negligible improvement of the results, allowing the use of multi-task learning on simulated data while still using real data during training. It should also be noted that, in this paper, we apply HTL to speech recognition in noisy and reverberant conditions, but this approach can be applied to any problem where MTL is only applicable to a limited portion of the training dataset. (For instance, there is a lot of annotated data for
speaker identification, compared to speech recognition. With HTL, in order to improve speaker identification, one could combine
speaker identification and speech recognition training sets, while using speech recognition as an auxiliary task, when the labels
are available). The proposed HTL approach holds great potential, and the range of applications of this approach is technically
quite wide.
In future work, we would like to investigate other auxiliary tasks for the proposed HTL setup, for instance generating only the
noise as the auxiliary task (as opposed to the DAE), as well as evaluating HTL performance on other databases and feature combi-
nations other than real and simulated data. For instance, directly using clean and noisy features or enhanced speech as input of
the HTL system. We would also like to investigate more deeply the impact that the convergence of the auxiliary task has on the main task, and thus select and train auxiliary tasks accordingly.

Declaration of Competing Interest

There are no conflicts of interest we would like to share with the editorial board.

Acknowledgment

This work has been funded by the Walloon Region of Belgium through the SPW-DG06 Wallinov Program no: 1610152.

References

An, G., 1996. The effects of adding noise during backpropagation training on a generalization performance. Neural Comput. 8 (3), 643–674.
Bell, P., Renals, S., 2015. Regularization of context-dependent deep neural networks with context-independent multi-task training. IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), 2015. IEEE, pp. 4290–4294.
Caruana, R., 1997. Multitask learning. Mach. Learn. 28 (1), 41–75.
Chen, D., Mak, B., Leung, C.-C., Sivadas, S., 2014. Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-
resource speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014. IEEE, pp. 5592–5596.
Chen, N., Qian, Y., Yu, K., 2015. Multi-task learning for text-dependent speaker verification. Sixteenth Annual Conference of the International Speech Communica-
tion Association.
Chen, Z., Watanabe, S., Erdogan, H., Hershey, J.R., 2015. Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neu-
ral networks.. INTERSPEECH. ISCA, pp. 3274–3278.

Collobert, R., Weston, J., 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th
International Conference on Machine Learning. ACM, pp. 160–167.
Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process 19 (4), 788–798.
Dupont, S., Ris, C., Deroo, O., Poitoux, S., 2005. Feature extraction and acoustic modeling: an approach for improved generalization across languages and accents.
IEEE Workshop on Automatic Speech Recognition and Understanding, 2005. IEEE, pp. 29–34.
Gao, T., Du, J., Dai, L.-R., Lee, C.-H., 2015. Joint training of front-end and back-end deep neural networks for robust speech recognition. IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), 2015. IEEE, pp. 4375–4379.
Garofolo, J., Graff, D., Paul, D., Pallett, D., 1993. CSR-I (WSJ0) Complete LDC93S6A. Linguistic Data Consortium. Web Download.
Giri, R., Seltzer, M.L., Droppo, J., Yu, D., 2015. Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning.
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015. IEEE, pp. 5014–5018.
Hansen, J.H., 1994. Morphological constrained feature enhancement with adaptive cepstral compensation (mce-acc) for speech recognition in noise and lombard
effect. IEEE Trans. Speech Audio Process 2 (4), 598–614.
Heigold, G., Vanhoucke, V., Senior, A., Nguyen, P., Ranzato, M., Devin, M., Dean, J., 2013. Multilingual acoustic models using distributed deep neural networks. IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013. IEEE, pp. 8619–8623.
Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N., Estève, Y., 2018. TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker
adaptation. International Conference on Speech and Computer. Springer, pp. 198–208.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al., 2012. Deep neural networks for acoustic
modeling in speech recognition: the shared views of four research groups. Signal Process. Mag. IEEE 29 (6), 82–97.
Hu, Q., Wu, Z., Richmond, K., Yamagishi, J., Stylianou, Y., Maia, R., 2015. Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with
multi-task learning. In: Proc. Interspeech.
Huang, Z., Li, J., Siniscalchi, S.M., Chen, I.-F., Wu, J., Lee, C.-H., 2015. Rapid adaptation for deep neural networks through multi-task learning. Sixteenth Annual Con-
ference of the International Speech Communication Association.
Kim, S., Raj, B., Lane, I., 2016. Environmental noise embeddings for robust speech recognition. arXiv:1601.02553.
Kinoshita, K., Delcroix, M., Gannot, S., Habets, E.A., Haeb-Umbach, R., Kellermann, W., Leutnant, V., Maas, R., Nakatani, T., Raj, B., et al., 2016. A summary of the
REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP J. Adv. Signal Process. 2016 (1), 1–19.
Kundu, S., Mantena, G., Qian, Y., Tan, T., Delcroix, M., Sim, K.C., 2016. Joint acoustic factor learning for robust deep neural network based automatic speech recogni-
tion. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016. IEEE, pp. 5025–5029.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.
Li, B., Sainath, T.N., Weiss, R.J., Wilson, K.W., Bacchiani, M., 2016. Neural network adaptive beamforming for robust multichannel speech recognition. In: Proc.
Interspeech.
Li, J., Deng, L., Gong, Y., Haeb-Umbach, R., 2014. An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process 22 (4), 745–777.
Li, X., Wang, Y.-Y., Tür, G., 2011. Multi-task learning for spoken language understanding with shared slots. INTERSPEECH, 20, p. 1.
Linguistic Data Consortium, 1992. The Linguistic Data Consortium (LDC): An Open Consortium of Universities, Libraries, Corporations and Government Research Laboratories. https://www.ldc.upenn.edu/
Lu, Y., Lu, F., Sehgal, S., Gupta, S., Du, J., Tham, C.H., Green, P., Wan, V., 2004. Multitask learning in connectionist speech recognition. In: Proceedings of the Austra-
lian International Conference on Speech Science and Technology.
Mohan, A., Rose, R., 2015. Multi-lingual speech recognition with low-rank multi-task deep neural networks. IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2015. IEEE, pp. 4994–4998.
Neto, J., Almeida, L., Hochberg, M., Martins, C., Nunes, L., Renals, S., Robinson, T., 1995. Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system.
Panayotov, V., Chen, G., Povey, D., Khudanpur, S., 2015. Librispeech: an ASR corpus based on public domain audio books. 2015 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5206–5210.
Pironkov, G., Dupont, S., Dutoit, T., 2016. Multi-task learning for speech recognition: an overview. In: Proceedings of the 24th European Symposium on Artificial
Neural Networks (ESANN).
Pironkov, G., Dupont, S., Dutoit, T., 2016. Speaker-aware long short-term memory multi-task learning for speech recognition. 24th European Signal Processing
Conference (EUSIPCO), 2016. IEEE, pp. 1911–1915.
Pironkov, G., Dupont, S., Dutoit, T., 2016. Speaker-aware multi-task learning for automatic speech recognition. 23rd International Conference on Pattern Recogni-
tion (ICPR), 2016.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al., 2011. The kaldi speech recognition
toolkit. IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.
Qian, Y., Tan, T., Yu, D., 2016. An investigation into using parallel data for far-field speech recognition. IEEE International Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP), 2016. IEEE, pp. 5725–5729.
Qian, Y., Tan, T., Yu, D., Zhang, Y., 2016. Integrated adaptation with multi-factor joint-learning for far-field speech recognition. IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2016. IEEE, pp. 5770–5774.
Qian, Y., Yin, M., You, Y., Yu, K., 2015. Multi-task joint-learning of deep neural networks for robust speech recognition. IEEE Workshop on Automatic Speech Rec-
ognition and Understanding (ASRU), 2015. IEEE, pp. 310–316.
Ravanelli, M., Brakel, P., Omologo, M., Bengio, Y., 2016. Batch-normalized joint training for DNN-based distant speech recognition. Spoken Language Technology
Workshop (SLT), 2016 IEEE. IEEE, pp. 28–34.
Ravanelli, M., Brakel, P., Omologo, M., Bengio, Y., 2017. A network of deep neural networks for distant speech recognition. arXiv:1703.08002.
Sakti, S., Kawanishi, S., Neubig, G., Yoshino, K., Nakamura, S., 2016. Deep bottleneck features and sound-dependent I-vectors for simultaneous recognition of
speech and environmental sounds. Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, pp. 35–42.
Seltzer, M.L., Droppo, J., 2013. Multi-task learning in deep neural networks for improved phoneme recognition. IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2013. IEEE, pp. 6965–6969.
Stadermann, J., Koska, W., Rigoll, G., 2005. Multi-task learning strategies for a recurrent neural net in a hybrid tied-posteriors acoustic model.. INTERSPEECH, pp.
2993–2996.
Sun, S., Zhang, B., Xie, L., Zhang, Y., 2017. An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing.
Tan, T., Qian, Y., Yu, D., Kundu, S., Lu, L., Sim, K.C., Xiao, X., Zhang, Y., 2016. Speaker-aware training of LSTM-RNNS for acoustic modelling. IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), 2016. IEEE, pp. 5280–5284.
Tang, Z., Li, L., Wang, D., 2016. Multi-task recurrent model for speech and speaker recognition. arXiv:1603.09643.
Tur, G., 2006. Multitask learning for spoken language understanding. IEEE International Conference on Acoustics, Speech and Signal Processing, 2006, 1. IEEE, p. I.
Vincent, E., Watanabe, S., Nugraha, A.A., Barker, J., Marxer, R., 2016. An analysis of environment, microphone and data simulation mismatches in robust speech
recognition. Comput. Speech Lang..
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A., 2008. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th
international conference on Machine learning, pp. 1096–1103.
Voxforge.org, 2006. Open speech dataset set up to collect transcribed speech for use with free and open source speech recognition engines. http://www.voxforge.org/.
Wu, Z., Valentini-Botinhao, C., Watts, O., King, S., 2015. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis.
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015. IEEE, pp. 4460–4464.
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., Zweig, G., 2016. Achieving human parity in conversational speech recognition.
arXiv:1610.05256.
