
LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models

Dilip Kumar Margam, Rohith Aralikatti, Tanay Sharma, Abhinav Thanda,


Pujitha A K, Sharad Roy, Shankar M Venkatesan

Samsung R&D Institute India, Bangalore


{dilip.margam,r.aralikatti,abhinav.t89,tanay.sharma,
pujitha.k,sharad.roy,s.venkatesan}@samsung.com

Abstract

In recent years, deep learning based machine lipreading has gained prominence. To this end, several architectures such as LipNet, LCANet and others have been proposed which perform extremely well compared to traditional lipreading DNN-HMM hybrid systems trained on DCT features. In this work, we propose a simpler architecture: a 3D-2D-CNN-BLSTM network with a bottleneck layer. We also present an analysis of two different approaches to lipreading with this architecture. In the first approach, the 3D-2D-CNN-BLSTM network is trained with CTC loss on characters (ch-CTC). A BLSTM-HMM model is then trained on bottleneck lip features (extracted from the 3D-2D-CNN-BLSTM ch-CTC network) in a traditional ASR training pipeline. In the second approach, the same 3D-2D-CNN-BLSTM network is trained with CTC loss on word labels (w-CTC). The first approach shows that bottleneck features perform better than DCT features. Using the second approach on the Grid corpus' seen-speaker test set, we report 1.3% WER, a 55% relative improvement over LCANet. On the unseen-speaker test set we report 8.6% WER, a 24.5% relative improvement over LipNet. We also verify the method on a second dataset of 81 speakers which we collected. Finally, we discuss the effect of feature duplication on BLSTM-HMM model performance.

Index Terms: Bidirectional Long Short-Term Memory (BLSTM), Connectionist Temporal Classification (CTC), ASR, HMM, CNN

1. Introduction

The process of using only visual information of lip movements to convert speech to text is called machine lipreading. Visual information such as lip movements, facial expressions, and tongue and teeth movements helps us understand another person's speech when the audio is interrupted or corrupted. Visual information thus acts as complementary information to the speech signal and helps improve speech recognition systems in noisy environments [1, 2, 3]. Machine lipreading has several practical applications, such as silent movie transcription, aids for hearing-impaired persons, and speech recognition in noisy environments.

For such reasons, machine lipreading has been a subject of research for the last few decades [4] and has gained greater prominence with the advent of deep learning.

One of the fundamental challenges in lipreading is to distinguish words which are visually similar (called homophemes), such as pack, back and mac: they sound different but have similar lip movements. Such visual similarity acts as a hurdle for human lipreaders, resulting in accuracy as low as 20% [5]. On the other hand, recent advances in deep learning have shown remarkable improvements in the performance of machine lipreading [6, 7, 8].

In this work we present a simpler lipreading architecture with comparable or better performance than traditional and existing state-of-the-art methods. The architecture contains 3D-CNN, 2D-CNN, and bi-directional LSTM layers, where one 2D-CNN layer acts as a bottleneck layer. We analyze two different approaches involving this architecture.

In the first approach, we train the 3D-2D-CNN-BLSTM architecture in an end-to-end fashion with CTC loss [9] on character labels using curriculum learning. Similar to [10], we extract bottleneck features from the network and train a BLSTM-HMM hybrid model using these features. The motivation behind this approach is to compare end-to-end CTC based training with a traditional BLSTM-HMM hybrid model trained using cross-entropy loss on HMM labels with the same bottleneck features.

Recent works [11, 12] on CTC based end-to-end ASR models trained with word labels instead of characters or phonemes have demonstrated promising improvements in WERs. Following this, in our second approach we train the 3D-2D-CNN-BLSTM network in an end-to-end fashion with CTC loss but using word labels instead of the character labels used by [7, 8]. The reasoning behind this method is that the drop in WERs due to homophemes can be mitigated by using whole words instead of characters, as contextual information can disambiguate homophemes. Accordingly, we observe significant improvements using words as output labels.

In addition to the above two approaches, we also discuss the importance of feature duplication in the BLSTM-HMM hybrid model. Feature duplication was discussed in [1, 10, 13], but in the context of audio-visual feature fusion, where it was used to match the visual frame rate to the acoustic frame rate. In this work we show that feature duplication can be useful even for pure lipreading.

The paper is organized as follows: In section 2, we present the related work. Section 3 explains the architecture of the 3D-2D-CNN-BLSTM network. The first and second approaches are presented in subsections 3.2 and 3.3 respectively. The datasets and experiments are described in section 4. Section 5 presents the results and analysis. Section 6 gives the conclusion.

2. Related Work

Two decades ago, lipreading was treated as a word classification problem, where each input video is classified as one of a limited set of words. The authors in [14] perform word classification using different variations of 3D CNN architectures. Word classification using CNNs followed by RNNs or HMMs is presented in a number of papers [4, 15, 16]. Later, the same authors proposed the Watch, Listen, Attend and Spell (WLAS) network in [6], which uses an encoder-decoder type of architecture for audio-visual sentence-level speech recognition.
They also introduced curriculum learning [17], a strategy to accelerate training and reduce overfitting. We have adopted curriculum learning from this paper, which has resulted in faster convergence.

End-to-end sentence-level lipreading with a 3D-CNN-RNN based model (LipNet) trained with CTC loss on character labels is proposed in [7]. We also propose end-to-end sentence-level lipreading, with a new 3D-2D-CNN-BLSTM network architecture which has fewer parameters than LipNet, and we additionally train our model with CTC loss on word labels. The paper [18] presents different experiments which use DCT and AAM visual features in traditional speech-style GMM-HMM models for audio-visual speech recognition; in this paper we instead use 3D-2D-CNN-BLSTM network features in an RNN-HMM context. Several papers [19] have tried phoneme or viseme labels for CTC loss, followed by a WFST with a language model. Two commonly used cost functions in end-to-end sequential models are CTC loss and sequence-to-sequence loss. A comparison between CTC loss and sequence-to-sequence loss under different conditions is shown in [2] using 3D-CNN and LSTM based architectures. The conditional independence assumption in CTC loss is considered one of the drawbacks of CTC based sequential models. LCANet [8] uses a highway network with bidirectional GRU layers after the 3D CNN layers, with cascaded attention-CTC to overcome the conditional independence assumption of CTC.

CTC loss with word labels has been explored for ASR in [20, 11, 12] but has not been attempted for lipreading. In this paper we attempt lipreading with CTC loss on word labels and discuss its limitations. In [1] we used a technique of feature duplication for a DNN-HMM model with DCT features of the lip image, to match the frame rate of the audio signal, but we did not explore the significance of feature duplication in HMM based models. In this paper we show how feature duplication helps improve the performance of HMM based models. In another paper [10] we used CTC loss on character labels with DCT features of the lip as input to RNN layers.
3. Models

3.1. 3D-2D-CNN-BLSTM Network

In this section we describe the architecture of the proposed 3D-2D-CNN-BLSTM network. The network contains two 3D CNN layers and two 2D CNN layers followed by two BLSTM layers, as shown in Figure 1. In a 3D convolution the kernel moves along the time, height and width dimensions of the input, whereas in a 2D convolution the kernel moves only along the height and width dimensions. We apply a batchnorm layer directly on the input instead of applying any mean or variance normalization as in [7], so that the batchnorm layer can learn the most suitable normalization parameters for the dataset. The detailed architecture is presented in Table 1.

Table 1: Detailed architecture of the 3D-2D-CNN-BLSTM network. N: number of frames in video, L: number of labels

Layer            Output size    Kernel/Stride/Pad
Input            Nx100x50x3
batchnorm        Nx100x50x3
3D-Conv1         Nx50x25x32     3x5x5 / 1,2,2 / 1,2,2
batchnorm/relu   Nx50x25x32
3D-Pool1         Nx25x12x32     1x2x2 / 1,2,2
3D-Conv2         Nx25x12x64     4x5x5 / 1,1,1 / 1,2,2
batchnorm/relu   Nx25x12x64
3D-Pool2         Nx12x6x64      1x2x2 / 1,2,2
2D-Conv1         Nx6x3x128      5x5 / 2,2 / 2,2
batchnorm/relu   Nx6x3x128
2D-Conv2         Nx3x2x8        3x3 / 2,2 / 2,2
batchnorm/relu   Nx3x2x8
BLSTM1           Nx400
BLSTM2           Nx400
Linear           NxL
Softmax          NxL

The 3D-2D-CNN-BLSTM network has spatiotemporal convolutions in the first two layers to capture contextual information in the temporal dimension. Every 3D CNN and 2D CNN layer is followed by a batchnorm layer. There is a 3D maxpool layer after every 3D convolutional layer, but no pooling layer after the 2D CNN layers. The output of 2D-Conv2 is linearized and fed to two bidirectional LSTM layers. Softmax activation is applied on the final CTC labels to obtain probabilities. The cell size of the LSTM layers is 200 and the hyperbolic tangent (tanh) activation function is used in the LSTM layers.

The network is trained end-to-end with CTC loss. The input to the network is a sequence of 100 × 50 × 3 dimensional RGB lip images. In the first approach CTC loss is computed on the sequence of character labels, and in the second approach we compute CTC loss on the sequence of word labels.

This network is also used for obtaining bottleneck lip features from the 2D-Conv2 layer (described in Table 1) for BLSTM-HMM training. The bottleneck features have 48 (= 3 × 2 × 8) dimensions, obtained by linearizing the outputs of 2D-Conv2. These lower-dimensional lip features enable us to train a better BLSTM-HMM hybrid model.
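For illustration, the following is a minimal tf.keras sketch of the network in Table 1. It is a reconstruction for exposition rather than the exact implementation used in our experiments: the layer sizes and strides follow the table, while the use of 'same' padding (in place of the explicit pads listed), the frame-wise TimeDistributed wrappers and the layer name 'bottleneck' are assumptions chosen to reproduce the listed output sizes.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_3d_2d_cnn_blstm(num_labels, num_frames=None):
    """3D-2D-CNN-BLSTM as in Table 1; num_labels includes the CTC blank."""
    inp = layers.Input(shape=(num_frames, 100, 50, 3))            # N x 100 x 50 x 3
    x = layers.BatchNormalization()(inp)                          # batchnorm on raw input

    x = layers.Conv3D(32, (3, 5, 5), strides=(1, 2, 2), padding='same')(x)   # 3D-Conv1
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPool3D((1, 2, 2), strides=(1, 2, 2))(x)                    # 3D-Pool1

    x = layers.Conv3D(64, (4, 5, 5), strides=(1, 1, 1), padding='same')(x)   # 3D-Conv2
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPool3D((1, 2, 2), strides=(1, 2, 2))(x)                    # 3D-Pool2

    # Frame-wise 2D convolutions; 2D-Conv2 acts as the bottleneck layer.
    x = layers.TimeDistributed(
        layers.Conv2D(128, (5, 5), strides=(2, 2), padding='same'))(x)       # 2D-Conv1
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.TimeDistributed(
        layers.Conv2D(8, (3, 3), strides=(2, 2), padding='same'))(x)         # 2D-Conv2
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)

    bottleneck = layers.TimeDistributed(layers.Flatten(), name='bottleneck')(x)   # N x 48

    x = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(bottleneck) # BLSTM1: N x 400
    x = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(x)          # BLSTM2: N x 400
    out = layers.Dense(num_labels, activation='softmax')(x)                       # Linear + Softmax: N x L
    return tf.keras.Model(inp, out)
```

When training with tf.nn.ctc_loss, the pre-softmax scores would normally be fed to the loss; the softmax output shown here mirrors the last row of Table 1.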
3.1.1. Curriculum Learning

Curriculum learning has been shown to provide better performance and faster convergence [17]. In curriculum learning, the strategy is to train the network on segmented words for the initial epochs, and then train on full sentences in later epochs. In our approach, each segmented word in the training set is used for training in a given epoch eph with probability max(0, 1 − eph/20). Thus with each epoch the expected number of segmented words reduces linearly: the first epoch (epoch 0) includes only segmented words, and from the second epoch onwards all full sentences are included. Curriculum learning has helped us converge faster: the 3D-2D-CNN-BLSTM w-CTC model (described in subsection 3.3) took 45 epochs to converge with curriculum learning, whereas without it the model took 89 epochs. Similarly, for the other experiments convergence became roughly twice as fast when curriculum learning was used.
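A minimal sketch of this sampling schedule is given below, assuming the segmented-word clips and full-sentence clips are available as two separate lists (the names word_clips and sentence_clips are placeholders, not from the implementation itself).

```python
import random

def build_epoch_samples(word_clips, sentence_clips, epoch):
    """Return the training items for one epoch under the curriculum schedule."""
    p_word = max(0.0, 1.0 - epoch / 20.0)         # keep-probability for each segmented word
    samples = [clip for clip in word_clips if random.random() < p_word]
    if epoch >= 1:                                 # epoch 0 trains on segmented words only
        samples += list(sentence_clips)
    random.shuffle(samples)
    return samples
```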
Figure 1: Architecture of the 3D-2D-CNN-BLSTM network (input video → 3D-Conv1 → 3D-Pool1 → 3D-Conv2 → 3D-Pool2 → 2D-Conv1 → 2D-Conv2 → linearize (bottleneck features) → BLSTM1 → BLSTM2 → linear with softmax).

3.2. BLSTM-HMM Hybrid Model

In this approach, the 48 dimensional bottleneck features are extracted for each input frame from the 3D-2D-CNN-BLSTM network trained with ch-CTC. Each feature vector is then duplicated four times, so an input sequence of length l is converted to a sequence of length 4 × l. Following the standard hybrid model training pipeline, a GMM-HMM model is first trained on these features for context-independent phones (mono-phone GMM-HMM model). Bootstrapped by the alignments generated from the mono-phone model, a tri-phone GMM-HMM (a model with context-dependent phones) is trained. Then a GMM-HMM with LDA-transformed features is trained.

Finally, the HMM tied-state labels obtained from the GMM-HMM trained with LDA transforms are used for training the BLSTM-HMM hybrid model with cross-entropy loss. The preprocessing involves applying cumulative mean-variance normalization and LDA, producing 40 dimensional features. Then ∆ and ∆∆ features are appended to get 120 dimensional features, and the BLSTM-HMM model is trained on these 120-dimensional inputs. The standard HCLG WFST is used for decoding. The complete pipeline is shown in Figure 2.

Figure 2: Training pipeline of the BLSTM-HMM model with bottleneck features obtained from the 3D-2D-CNN-BLSTM as input (bottleneck features → LDA transformation and append ∆, ∆∆ → BLSTM → BLSTM → HCLG WFST decoder → text output).

3.2.1. Feature Duplication

We observed that duplicating each input feature exactly 4 times gives dramatic improvements for hybrid models. Table 4 shows how feature duplication results in nearly 50% relative improvement for all methods. We believe the duplicated frames help in disambiguating lip movements for each phoneme under different contexts.

It is known that videos with a higher frame rate give better lipreading performance [21]. We believe this is why the BLSTM-HMM trained with feature duplication shows a significant improvement in performance. In future work we intend to explore feature duplication further, by duplicating features a different number of times (twice, thrice, etc.) and analyzing the effect on performance.
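The sketch below illustrates this front end under the naming assumptions of the Keras model sketched in Section 3.1: the 48-d bottleneck features are read from the layer named 'bottleneck' and each frame is repeated four times. The ∆/∆∆ helper is only illustrative; in our pipeline CMVN, LDA and the delta computation are done inside Kaldi.

```python
import numpy as np
import tensorflow as tf

def extract_bottleneck(model, video):
    """video: (N, 100, 50, 3) float array -> (N, 48) bottleneck features."""
    sub = tf.keras.Model(model.input, model.get_layer('bottleneck').output)
    return sub.predict(video[np.newaxis, ...])[0]

def duplicate_frames(feats, times=4):
    """Repeat each visual frame 4 times: (N, d) -> (times*N, d)."""
    return np.repeat(feats, times, axis=0)

def append_deltas(feats):
    """Illustrative ∆/∆∆ append (Kaldi's add-deltas is used in practice): (N, d) -> (N, 3*d)."""
    d = np.gradient(feats, axis=0)
    dd = np.gradient(d, axis=0)
    return np.concatenate([feats, d, dd], axis=1)
```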
3.3. Word-CTC Model

The 3D-2D-CNN-BLSTM w-CTC (word-CTC) model is the 3D-2D-CNN-BLSTM model trained with the CTC objective function on word labels. Each unique word in the given dataset is treated as a different label, with two additional labels for space and blank. For each input video frame, the network emits a word label, a space, or a blank label, so the resulting output is a sequence of words. The motivation for this approach is to overcome the conditional independence assumption of CTC loss: the outputs of CTC are conditionally independent, which sometimes results in meaningless sequences of characters when character labels are used.
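The following sketch shows how word-level CTC targets can be prepared for a small closed vocabulary such as Grid's, together with a call to tf.nn.ctc_loss on dummy network outputs. The toy transcripts, tensor names and the choice of the last softmax index as the blank are illustrative assumptions; only "set red with m six please" is taken from the text above.

```python
import tensorflow as tf

# Toy transcripts standing in for the training sentences (assumption).
train_sentences = ["set red with m six please", "bin blue at f two now"]

vocab = sorted({w for s in train_sentences for w in s.split()})
word_to_id = {w: i for i, w in enumerate(vocab)}
SPACE_ID = len(vocab)            # extra label for space
BLANK_ID = len(vocab) + 1        # CTC blank, placed at the last index
NUM_LABELS = len(vocab) + 2

def encode(sentence):
    """Map a sentence to a sequence of word/space label ids."""
    ids = []
    for i, w in enumerate(sentence.split()):
        if i > 0:
            ids.append(SPACE_ID)
        ids.append(word_to_id[w])
    return ids

labels = tf.ragged.constant([encode(s) for s in train_sentences]).to_tensor()
label_len = tf.constant([len(encode(s)) for s in train_sentences], dtype=tf.int32)

# Dummy pre-softmax network outputs for a batch of N-frame videos.
batch, N = len(train_sentences), 75
logits = tf.random.normal([batch, N, NUM_LABELS])
logit_len = tf.fill([batch], N)

loss = tf.nn.ctc_loss(labels=labels, logits=logits,
                      label_length=label_len, logit_length=logit_len,
                      logits_time_major=False, blank_index=BLANK_ID)
```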
4. Experiments

4.1. Datasets

Table 2: Description of datasets used in our experiments

Dataset          Vocabulary   #utterances   #speakers
Grid             51           33000         33
Indian English   236          14073         81

4.1.1. Grid

The Grid audio-visual dataset is widely used for audio-visual and visual speech recognition tasks [22]. Each sentence of Grid has a fixed structure of six words: command + color + preposition + letter + digit + adverb, for example "set red with m six please". The dataset has 51 unique words: four commands, four colors, four prepositions, 25 letters, ten digits, and four adverbs. Each sentence is a randomly chosen combination of these words, and the duration of each utterance is 3 seconds. Table 2 gives more details about the Grid corpus. For a proper comparison, we employ the same method as [7] for splitting the train and test sets. For overlapped speakers (seen speakers), we randomly select 255 utterances from each speaker for the test set and use the remaining utterances for training. For the speaker-independent case (unseen speakers), we hold out two male speakers (1 and 2) and two female speakers (20 and 22) for evaluation and use the remaining speakers for training.
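To make the sentence structure concrete, the snippet below generates Grid-style sentences from the six word categories. The word lists shown are the standard Grid sets and should be treated as an assumption here; the text above only states the category counts and one example sentence.

```python
import random

# Standard Grid word sets (assumption; only the counts are given above).
commands     = ["bin", "lay", "place", "set"]
colors       = ["blue", "green", "red", "white"]
prepositions = ["at", "by", "in", "with"]
letters      = list("abcdefghijklmnopqrstuvxyz")   # 25 letters, 'w' excluded
digits       = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
adverbs      = ["again", "now", "please", "soon"]

def random_grid_sentence():
    """command + color + preposition + letter + digit + adverb."""
    return " ".join(random.choice(lst) for lst in
                    (commands, colors, prepositions, letters, digits, adverbs))

print(random_grid_sentence())   # e.g. "set red with m six please"
```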
4.1.2. Indian English Dataset

The Indian English dataset is an audio-visual dataset we collected, in which 81 speakers each spoke 180 utterances.
The data was collected by asking users to hold the phone in their hand in a comfortable manner, with the only condition that their face should be visible in the video. Each speaker speaks the utterance shown on the screen while video is recorded through the front camera; speakers are even allowed to move while talking. The utterances are chosen to be phonetically balanced. Hence the Indian English dataset (In-En) is a wilder dataset than the Grid audio-visual dataset. The length of the utterances varies from 2–5 seconds. Table 2 gives more details about the dataset. For training we use 65 speakers and for evaluation we use 16 speakers, of which 4 are female and 12 are male. All experiments on the Indian English dataset presented in this paper are done on gray-scale videos with cumulative mean-variance normalization.

4.2. Implementation Details

For extracting the ROI (Region of Interest), we perform lip detection using a YOLO v2 model [23] trained to detect the lip region. The ROI is then resized to 100 × 50.

We implemented the 3D-2D-CNN-BLSTM network in Tensorflow [24]. For 3D-2D-CNN-BLSTM network training we augment the data with horizontally mirrored video frames. To make the network resilient to frame rate, we randomly drop or repeat frames with probability 0.05. We use the ADAM optimizer [25] with an initial learning rate of 0.0001 and a batch size of 32, and the native Tensorflow CTC beam search decoder with a beam width of 200.
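A sketch of these two augmentations is given below. Treating the mirroring as a random per-clip flip and dropping/repeating each frame independently with probability 0.05 is one interpretation of the description above, and the width axis is assumed to be axis 2 of an (N, H, W, 3) clip.

```python
import numpy as np

def augment_clip(frames, p=0.05, rng=np.random):
    """frames: (N, H, W, 3) array for one video clip."""
    if rng.rand() < 0.5:
        frames = frames[:, :, ::-1, :]          # horizontal mirror (flip assumed width axis)
    out = []
    for f in frames:
        r = rng.rand()
        if r < p:                               # randomly drop this frame
            continue
        out.append(f)
        if r > 1.0 - p:                         # randomly repeat this frame
            out.append(f)
    return np.stack(out) if out else frames
```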
The same parameters are used for both the ch-CTC and w-CTC models. Before computing WER for the ch-CTC model we correct the spelling of words in the predicted output by replacing each word with the closest word in the dictionary under edit distance.
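A minimal sketch of this post-processing step (a plain Levenshtein distance and a nearest-dictionary-word lookup) is shown below.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution / match
    return dp[-1]

def correct_hypothesis(hypothesis, dictionary):
    """Replace every decoded word by its closest dictionary word."""
    return " ".join(min(dictionary, key=lambda w: edit_distance(word, w))
                    for word in hypothesis.split())

print(correct_hypothesis("plase red wth m sixx please",
                         ["set", "place", "red", "with", "m", "six", "please"]))
```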
GMM-HMM models are trained using the Kaldi toolkit [26]. Preprocessing involves LDA dimensionality reduction to 40d from 11 × 48d (5 contextual frames on each side). Then ∆ and ∆∆ features are appended and the BLSTM is trained on these features. The BLSTM in the BLSTM-HMM is trained using Tensorflow, with the ADAM optimizer, an initial learning rate of 0.001 and a batch size of 32. A two-layer BLSTM with cell dimension 150 is trained, and its outputs are the posterior probabilities of the HMM states. The DCT BLSTM-HMM model is the baseline model, in which the top 10 × 10 DCT features of the 100 × 50 lip image are used as features in the BLSTM-HMM model, the same as in [1].
5. Results

Table 3: Performance on Grid and Indian English (In-En) datasets. NA: results not available.

Method                      Grid seen   Grid unseen   In-En unseen
                            (WER %)     (WER %)       (WER %)
DCT BLSTM-HMM               8.9         16.5          20.8
3D-2D-CNN-BLSTM ch-CTC      3.2         15.2          19.6
3D-2D-CNN BLSTM-HMM         5.2         13.6          16.4
LIPNET [7]                  4.8         11.4          NA
WAS (WLAS) [6]              3.0         NA            NA
LCANet [8]                  2.9         NA            NA
3D-2D-CNN-BLSTM w-CTC       1.3         8.6           12.3

Results are reported on the seen and unseen test sets for the Grid dataset. For the Indian English (In-En) dataset we report results on the unseen test set. The results for our proposed approaches, compared with the baseline DCT BLSTM-HMM and with LIPNET, LCANet and WAS (WLAS), are presented in Table 3.

The 3D-2D-CNN-BLSTM w-CTC approach achieves state-of-the-art performance, with a 55.1% relative improvement over LCANet on the Grid seen test set and a 24.5% relative improvement over LIPNET on the unseen test set.
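As a quick arithmetic check, these relative improvements follow from the WERs in Table 3, with relative improvement = (baseline WER − our WER) / baseline WER:

```python
def rel_improvement(baseline_wer, our_wer):
    """Relative WER improvement in percent."""
    return 100.0 * (baseline_wer - our_wer) / baseline_wer

print(rel_improvement(2.9, 1.3))    # vs LCANet on Grid seen: ~55%
print(rel_improvement(11.4, 8.6))   # vs LipNet on Grid unseen: ~25%
```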
The 3D-2D-CNN BLSTM-HMM model (trained on ch-CTC bottleneck features) shows a significant relative improvement over the baseline DCT BLSTM-HMM model on the Grid dataset. The 3D-2D-CNN-BLSTM w-CTC model performs 25.0% better (relative) than the 3D-2D-CNN BLSTM-HMM model on the Indian English dataset. The Indian English dataset is a wilder dataset in which, unlike Grid, speakers are allowed to move while talking. Our proposed approaches show similar trends in performance on both datasets, indicating that they are robust. We have also tested the BLSTM-HMM model with bottleneck features obtained from the 3D-2D-CNN-BLSTM w-CTC model, and the results are very similar to those of the model trained on bottleneck features from the 3D-2D-CNN-BLSTM ch-CTC network.

Table 4: Effect of feature duplication on the BLSTM-HMM model (WER %)

Method                  Dataset          Without Duplication   With Duplication
DCT BLSTM-HMM           Grid             25.5                  8.9
3D-2D-CNN BLSTM-HMM     Grid             8.6                   5.3
3D-2D-CNN BLSTM-HMM     Indian English   29.7                  16.4

As discussed in subsection 3.2, feature duplication in the context of BLSTM-HMM models gives relative improvements of 53% and 44% on the Grid and Indian English datasets respectively, as shown in Table 4.

Even though the w-CTC model gives the best performance, this approach does not scale with vocabulary size, and adding new words to the vocabulary is difficult. For real-world problems with vocabulary sizes of around 1 million, ch-CTC or sub-word labels should be used. We intend to extend our work to sub-word label CTC loss in the future and explore its potential.

6. Conclusion

We proposed a new 3D-2D-CNN-BLSTM architecture, which is comparable to LCANet on Grid when the ch-CTC loss is used. The proposed 3D-2D-CNN-BLSTM w-CTC model gives state-of-the-art results, with relative improvements of 55% and 24.5% on the Grid seen and unseen test sets, reaching 1.3% WER and 8.6% WER respectively. We also demonstrated that the 3D-2D-CNN BLSTM-HMM model performs better than the 3D-2D-CNN-BLSTM ch-CTC and DCT BLSTM-HMM models. The importance of duplicating features in the context of HMM based models was also discussed, along with the limitations of w-CTC models and possible alternatives.
7. References

[1] A. Thanda and S. M. Venkatesan, "Multi-task learning of deep neural networks for audio visual automatic speech recognition," arXiv preprint arXiv:1701.02477, 2017.
[2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," arXiv preprint arXiv:1809.02108, 2018.
[3] R. Aralikatti, D. Margam, T. Sharma, T. Abhinav, and S. M. Venkatesan, "Global SNR estimation of speech signals using entropy and uncertainty estimates from dropout networks," arXiv preprint arXiv:1804.04353, 2018.
[4] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, "A review of recent advances in visual speech decoding," Image and Vision Computing, vol. 32, no. 9, pp. 590–605, 2014.
[5] S. Hilder, R. W. Harvey, and B.-J. Theobald, "Comparison of human and machine-based lip-reading," in AVSP, 2009, pp. 86–89.
[6] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in CVPR, 2017, pp. 3444–3453.
[7] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-end sentence-level lipreading," arXiv preprint arXiv:1611.01599, 2016.
[8] K. Xu, D. Li, N. Cassimatis, and X. Wang, "LCANet: End-to-end lipreading with cascaded attention-CTC," in Automatic Face & Gesture Recognition (FG 2018), 13th IEEE International Conference on. IEEE, 2018, pp. 548–555.
[9] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[10] A. Thanda and S. M. Venkatesan, "Audio visual speech recognition using deep recurrent neural networks," in IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction. Springer, 2016, pp. 98–109.
[11] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, "Direct acoustics-to-word models for English conversational speech recognition," arXiv preprint arXiv:1703.07754, 2017.
[12] H. Soltau, H. Liao, and H. Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," arXiv preprint arXiv:1610.09975, 2016.
[13] J. Huang and B. Kingsbury, "Audio-visual deep learning for noise robust speech recognition," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7596–7599.
[14] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Asian Conference on Computer Vision. Springer, 2016, pp. 87–103.
[15] S. Gergen, S. Zeiler, A. H. Abdelaziz, R. M. Nickel, and D. Kolossa, "Dynamic stream weighting for turbo-decoding-based audiovisual ASR," in INTERSPEECH, 2016, pp. 2135–2139.
[16] S. Petridis, Z. Li, and M. Pantic, "End-to-end visual speech recognition with LSTMs," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2592–2596.
[17] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 41–48.
[18] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, and A. Mashari, "Audio visual speech recognition," IDIAP, Tech. Rep., 2000.
[19] B. Shillingford, Y. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett et al., "Large-scale visual speech recognition," arXiv preprint arXiv:1807.05162, 2018.
[20] H. Soltau, H. Liao, and H. Sak, "Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition," arXiv preprint arXiv:1610.09975, 2016.
[21] A. G. Chiţu and L. J. Rothkrantz, "The influence of video sampling rate on lipreading performance," in 12th International Conference on Speech and Computer (SPECOM 2007), 2007, pp. 678–684.
[22] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[23] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," arXiv preprint, 2017.
[24] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in OSDI, vol. 16, 2016, pp. 265–283.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
