Lip Reading Sentences Using Deep Learning With Only Visual Cues
ABSTRACT In this paper, a neural network-based lip reading system is proposed. The system is lexicon-free
and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is
designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be
included in system training. The system has been tested on the challenging BBC Lip Reading Sentences 2
(LRS2) benchmark dataset. Compared with the state-of-the-art works in lip reading sentences, the system has
achieved a significantly improved performance with 15% lower word error rate. In addition, experiments with
videos of varying illumination have shown that the proposed model has a good robustness to varying levels
of lighting. The main contributions of this paper are: 1) The classification of visemes in continuous speech
using a specially designed transformer with a unique topology; 2) The use of visemes as a classification
schema for lip reading sentences; and 3) The conversion of visemes to words using perplexity analysis. All
the contributions serve to enhance the accuracy of lip reading sentences. The paper also provides an essential
survey of the research area.
INDEX TERMS Deep learning, lip reading, neural networks, perplexity analysis, speech recognition.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
215516 VOLUME 8, 2020
S. Fenghour et al.: Lip Reading Sentences Using Deep Learning With Only Visual Cues
• They often require curriculum learning-based strategies [27], [28] which involve further pre-processing, whereby the videos of individuals speaking in the training data have to be clipped so that the models can be trained on single word examples initially, with the length of the sentences being gradually incremented.

This paper focuses on improving the accuracy of lip reading sentences. This is achieved by using visemes as a very limited number of classes for classification, a specially designed deep learning model with its own network topology for classifying visemes, and a conversion of recognised visemes to possible words using perplexity analysis.

Using visemes for lip reading sentences has some unique advantages. The use of visemes as classes, in comparison to the use of either words or ASCII characters as classes, requires a smaller overall number of classes, which alleviates the computational bottleneck. In addition, using visemes does not require pre-trained lexicons, meaning that a viseme-based lip reading system can be used to classify words that were not present in the training phase, and it can be generalised to different languages because many different languages share the same visemes.

On the other hand, there are some specific issues to be considered when designing a viseme-based lip reading system for sentences. The general classification performance for individual segmented visemes has been less satisfactory in comparison to the classification of words because visemes tend to have a shorter duration than words. This results in there being less temporal information available to distinguish between different classes, as well as there being more visual ambiguity when it comes to class recognition [24]. One possible way to address this problem is to significantly increase the training data available to enhance the system's ability to distinguish between classes, and this is why a high volume of training videos has been utilised. Moreover, there is a direct conversion of recognised ASCII characters to possible words in a one-to-one mapping relationship, whereas this one-to-one mapping relationship does not exist when using visemes, because one set of visemes can map to multiple different sounds or phonemes. This also means that once visemes have been classified, there is still the need to perform a viseme-to-word conversion. This approach also helps to distinguish between homopheme words, i.e., words that look the same when spoken but sound different [11], a phenomenon that exists because of the one-to-many mapping relationship between visemes and phonemes.

The proposed automated lip reading system contains a component to classify spoken visemes from people speaking in silent videos, and a component to perform viseme-to-word conversions using perplexity analysis [12]. The proposed model is also robust to varying levels of lighting.

The rest of the paper is organised as follows: First, in Section II, the different classification schemas for automated lip reading are discussed along with their advantages and limitations. Then, in Section III, details of all the components that make up the whole lip reading system, including pre-processing, visual feature extraction, viseme classification and word detection, are given. In Section IV, the classification results for the overall lip reading system are discussed and compared, followed by concluding remarks given in Section V along with suggestions for further research.

II. LITERATURE REVIEW
Automated lip reading systems initially focused on classifying isolated speech segments in the form of digits and letters [13], [14], [15], [16], [17], and then eventually moved on to longer speech segments in the form of words. The success of automated lip reading was previously constrained by the available training data, as initially, the only audio-visual datasets available were those with isolated speech segments, i.e., digits, alphabet and words [17], [18], [19]. Subsequently, every speech segment was treated as a class to recognise.

Thanks in part to the availability of larger audio-visual datasets with continuous speech, later lip reading systems have focused on classifying entire sentences utilising a wider range of vocabulary and so have opted for ASCII-based class systems [5], [6], [7], [8], [9], [10]. Sentences are spelt using ASCII characters as opposed to including a class for every single word, which allows for the use of fewer classes and avoids the creation of a computational bottleneck [30]. ASCII characters also allow for the modelling of natural language due to the conditional probability relationships that exist between ASCII characters, making it easier to predict characters and words.

Very good accuracies have been attained in some of the most recent neural network-based lip reading systems that are trained to classify individual words on word-based lip reading datasets like LRW [7] and LRW-1000 [47]. LRW is a very challenging dataset since it consists of more than 1000 speakers with large variations in head pose and illumination. LRW-1000 is an even more difficult Mandarin lip reading dataset, due to its large variations in scale, resolution and background clutter.

Notable performances have been recorded for lip reading systems that predict entire sentences, such as those predicting phrases from the GRID [48] and OuluVS [49] datasets. However, sentences in datasets like GRID and OuluVS are simple, repetitive and follow standard sequences, unlike those contained within the LRS2 corpus which are more random and varied. A summary of the most recent state-of-the-art lip reading models and their performances is given in Table 1.

Other alternative classification schemas for neural network-based lip reading include phonemes, which have been used in audio and acoustic speech recognition systems [20]. Shillingford et al. [10] used a neural network architecture consisting of a spatial-temporal convolutional neural network (CNN) and a Long Short-Term Memory network (LSTM) to classify sequences of phonemes from silent videos, where phonemes were then mapped to words using a finite-state transducer [30]. However, with phonemes, there is still the one-to-many mapping problem where different
phonemes map to the same viseme thus producing identical lip movements.

To the best of our knowledge, there is no lip reading sentences system that has decoded entire sequences of visemes, although there has been a lot of work on classifying individual segmented visemes in the form of images or groups of image frames [22], [23], [24], [25]. If visemes are to be classified, they should be classified in the context of continuous speech in order to perform viseme classification in real-time. There is one paper about an LSTM that takes visemes as an input and predicts the words that were spoken by individuals from a limited dataset with some satisfactory results [26], though the individual visemes were already known.

In addition to being treated as individual segments, visemes can also be modelled in the form of clusters like "visual words" where groups of visemes that make up a word can be segmented. Whilst approximately 50% of the words in the English language share identical viseme clusters, there are words that have unique visemes and can be classified when performing automated lip reading using solely visual information. For words that share visemes, clusters of visemes in combination would need to be analysed to determine which combination is most linguistically probable. This is the basis for the lip reading sentence system proposed in this paper based entirely on visual cues.

No official standard convention for defining precise visemes or even the precise total number of visemes exists, and different approaches to viseme classification have used varying numbers of visemes as part of their conventions with different phoneme-to-viseme mappings [29], [30], [31], [32], [33], [34]. All the different conventions consist of consonant visemes, vowel visemes and one silent viseme; but Lee and Yook's mapping convention of [29] appears to be the most favoured for speech classification and it is the one that has been utilised for this paper. However, it is accepted that there are multiple phonemes that are visually identical on any given speaker [35], [36].

The different automated lip reading approaches summarised in Table 1 indicate many challenges still hindering the success of automated lip reading systems. One of these challenges is the lack of temporal information required to distinguish between segments of speech, which is why some of the approaches tasked to classify shorter segments, such as visemes and digits, have not attained as good accuracies as those tasked to classify words. This problem however can be compensated for by increasing the training data available and when a small limited number of speech segments are to
be classified, such as in the case of digits or visemes, the performance of such systems can be enhanced by generating as much training data as possible to train the networks.

To apply such an approach is not feasible for the case of words, where the number of possible words that can be spoken is unlimited, so it is necessary to use a discrete class system to cover general speech such as in the case of ASCII characters. However, the use of ASCII characters in lip reading relies on the conditional dependence relationship that exists between the characters, and ASCII symbols are not always phonetic because of silent letters and digraphs, so to train a network to decode speech in real time requires training to have been done on an extensive range of vocabulary.

Lip reading systems tasked with predicting sentences from sentence datasets such as GRID and OuluVS have been more fruitful in terms of accuracy compared with those tasked to recognise sentences from more challenging datasets like LRS2. One of the main reasons that a dataset like LRS2 is so difficult is that it contains sentences that randomly cover a vocabulary of over 40,000 words, which is very different to the circumstances of the datasets GRID and OuluVS that contain repetitive sentences following a standard sequence, and that only cover a small range of vocabulary. Lip reading systems that use ASCII characters as classes are designed to predict words as combinations of ASCII characters and so to recognise any set of words, such words will need to have appeared in the training phase. At present, ASCII-based lip reading systems are not able to decode words that were not present in training. The low accuracy of lip reading systems designed for lip reading sentences can be explained by the inability to generalise to a wide range of vocabulary whilst using a limited number of classes.

Training ASCII-based lip reading systems to generalise to a wide range of vocabulary remains an ongoing obstacle to tackle. One alternative to having a lip reading system designed for decoding speech that covers a given vocabulary range is to recognise lip movements and map them to possible words, because there is a distinct number of visemes that can be uttered by someone speaking. However, because of the one-to-many mapping relationship that exists between visemes and phonemes, one would still need to determine which combination of words has been uttered.

III. METHODOLOGY
Given a silent video of a talking face, the objective here is to predict the sentences being spoken by extracting their lip movements. In this Section, an overall architecture is proposed for decoding visual speech, illustrated in Figure 1. The entire process consists of different stages, starting off with a Data Preprocessing stage where the region of interest is extracted from the videos using facial landmark detection to provide the input to the Visual Frontend. The components of the overall architecture include: a spatial-temporal visual frontend that inputs a sequence of images of loosely cropped lip regions, and outputs one feature vector per frame; a sequence processing module known as the viseme classifier that inputs the sequence of per-frame feature vectors and outputs a sequence of visemes; and finally a module that matches visemes to words and predicts the uttered sentence using perplexity analysis. The performance of the system is evaluated by comparing the sentences predicted by the lip reading system to the ground truth of the spoken sentences and measuring the edit distance. In the following Sections, details of the system's components are discussed.

A. ARCHITECTURE
The overall system used for decoding speech consists of two separate neural network architectures used to perform two different tasks. The first architecture is used for the task of viseme classification and consists of a spatial-temporal visual frontend in tandem with an attention-based transformer, and the predicted visemes provide the input of the next architecture. The second architecture, also an attention-based transformer, is used to predict the spoken words given the uttered visemes using a calculated metric called perplexity. As illustrated in Figure 2, each of these modules is briefly described along with the overall framework for the lip reading system. Both the viseme classifier and the word detector consist of common blocks including fully connected layers, self-attention layers and feed-forward layers, and the breakdown of these three blocks is given in Figure 3.

The attention-transformer structure used in [39] has been changed to fit visemes, and this will be discussed in III-E. Unlike [39], there is no embedding layer, and the Decoder has been altered with the final softmax layer trained on visemes instead of ASCII characters.

B. DATA
The dataset used in this research is the BBC LRS2 dataset [6]. It consists of approximately 46,000 videos covering over 2 million word instances and a vocabulary range of over 40,000 words. The longest video is 180 frames long, and every video has a frame rate of 25 frames per second. The dataset contains sentences of up to 100 ASCII characters from BBC videos, with a range of facial poses from frontal to profile. The dataset is extremely
FIGURE 2. The breakdown stages of how sentences are predicted from silent videos.
FIGURE 3. The different transformer components with the fully connected layer on the left, self-attention in the middle and feed-forward
on the right.
TABLE 2. Statistics of BBC LRS2 dataset.
TABLE 3. Details of spatial-temporal network for visual front-end.
$$\hat{H} = -\frac{1}{N} \ln P(w_1, w_2, \ldots, w_N) \tag{9}$$

$$PP = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} \tag{10}$$

A language model, i.e., a probability distribution over sequences of words, can be measured on the basis of the entropy of its output from the field of information theory [42]. Perplexity is a measure of the quality of a language model, because a good language model will generate sequences of words with a larger probability of occurrence resulting in a smaller perplexity.

The Transformer model used for the word detector is the pre-trained Generative Pre-Training (GPT) Transformer [41], a multi-layer decoder and a variant of the transformer used in [39]. It consists of repeated blocks of multi-headed self-attention followed by position-wise feedforward layers. The architecture is typically used for sentence prediction; however, the architecture itself here is not used for direct classification. Rather, its purpose is for the perplexity calculations that are required for word selection where visemes are converted to words. Visemes from the previous step are sequentially matched to words and the most probable sentence is chosen as the one with the minimum perplexity score. The perplexity score is calculated by taking the exponentiation of the cross-entropy loss when the GPT is evaluated on a sentence and, like in [27], a beam width of 50 has been used.

G. SYSTEMS PERFORMANCE MEASURES
The measures that have been used to evaluate the lip reading sentence system are edit distance-based metrics and are computed by calculating the normalised edit distance between the ground truth and a predicted sentence. Metrics reported in this paper include Viseme Error Rate (VER), Character Error Rate (CER), Word Error Rate (WER) and Sentence Accuracy Rate (SAR).

Error rate metrics used for evaluating accuracy are given by calculating the overall edit distance. In determining misclassifications, one has to compare the decoded speech to the actual speech. The equation for calculating Error Rate (ER) is given in Eq. 11, with N being the total number of characters in the ground truth, S being the number of characters substituted for wrong classifications, I being the number of characters inserted for those not picked up and D being the number of deletions made for decoded characters that should not be present. CER, WER and VER are all calculated this way with the expressions given in Eqs. 12, 13 and 14, where C, W and V correspond to characters, words and visemes.

$$ER = \frac{S + D + I}{N} \tag{11}$$

$$CER = \frac{C_S + C_D + C_I}{C_N} \tag{12}$$

$$WER = \frac{W_S + W_D + W_I}{W_N} \tag{13}$$

$$VER = \frac{V_S + V_D + V_I}{V_N} \tag{14}$$

SAR is a binary metric as expressed in Eq. 15, where the value is 1 if the predicted sentence $P_P$ is equal to the ground truth $P_T$; otherwise it takes the value of 0:

$$SAR = \begin{cases} 1, & P_P = P_T \\ 0, & P_P \neq P_T \end{cases} \tag{15}$$

H. ILLUMINATION
To test the proposed lip reading system's robustness to changes in lighting, the overall architecture, once trained, has been evaluated on videos from the testing set under varying levels of illumination. Illumination has been applied by varying the pixel brightness. It is after the video sampling stage of the pre-processing described in III-C that illumination is applied to the image frames. The overall process is described in Figure 8.

FIGURE 8. Stages for applying illumination.

Image frames of videos from the dataset consist of red, blue and green pixel components with numerical values ranging from minimum intensity 0 to maximum intensity 255. Pixel normalisation is the first stage of the procedure and this involves minimum-maximum normalisation of all pixel values, where pixel values are mapped from the range [0,255] to [0,1]. Once this is done, a gamma correction is applied where pixel values are corrected according to Eq. 16, where I is a matrix of pixels, γ is a scalar value and O is the resulting matrix of pixels after the gamma correction has been applied:

$$O = I^{1/\gamma} \tag{16}$$

Values of γ that are less than 1.0 will cause images to darken whereas values of γ that are greater than 1.0 cause images to brighten. Figure 9 gives examples of images with the standard image (γ = 1.0) on the left, the darkened image in the middle (γ = 0.5) and the brightened image on the right (γ = 1.5). The gamma corrections applied in this paper have utilised γ values ranging from 0.5 to 1.5.

FIGURE 9. Images under varying illumination with standard image on the left, darkened image in the middle and brightened image on the right.

After applying the gamma correction, pixels undergo re-normalisation where all pixel values are mapped back from the range [0,1] to the range [0,255].
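The illumination procedure of Section III-H (normalise to [0,1], apply Eq. 16, map back to [0,255]) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `apply_gamma`, the representation of a frame as a nested list of 8-bit intensities for a single channel, and the rounding to the nearest integer are all assumptions made here.

```python
def apply_gamma(frame, gamma):
    """Gamma-correct one 8-bit channel following Eq. 16, O = I^(1/gamma).

    `frame` is assumed to be a list of rows of integers in [0, 255].
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    corrected = []
    for row in frame:
        new_row = []
        for value in row:
            normalised = value / 255.0               # [0, 255] -> [0, 1]
            adjusted = normalised ** (1.0 / gamma)   # Eq. 16
            new_row.append(int(round(adjusted * 255.0)))  # back to [0, 255]
        corrected.append(new_row)
    return corrected


# Hypothetical one-row frame used only for illustration.
frame = [[0, 64, 128, 255]]
darker = apply_gamma(frame, 0.5)    # gamma < 1.0 darkens mid-tones
brighter = apply_gamma(frame, 1.5)  # gamma > 1.0 brightens mid-tones
```

Pure black and pure white are fixed points of the mapping, so only the intermediate intensities shift; for an RGB frame the same function would be applied to each channel.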
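The error-rate measures of Eqs. 11 to 15 can likewise be sketched with a standard Levenshtein edit distance, whose value is the minimum total S + D + I. The sketch below is an assumption-laden illustration rather than the evaluation code used in the paper: the function names are hypothetical, words are obtained by whitespace splitting, and the example sentences and viseme labels are invented for demonstration.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions (S + D + I)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # drop a reference item
                               current[j - 1] + 1,       # add a hypothesis item
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Eq. 11: edit distance normalised by the ground-truth length N."""
    if len(reference) == 0:
        raise ValueError("ground truth must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(truth, predicted):
    return error_rate(truth.split(), predicted.split())      # Eq. 13


def character_error_rate(truth, predicted):
    return error_rate(list(truth), list(predicted))          # Eq. 12


def sentence_accuracy(truth, predicted):
    return 1 if truth == predicted else 0                    # Eq. 15


# Illustrative inputs only; not taken from the LRS2 test set.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
ver = error_rate(["K", "AH", "T"], ["K", "AH", "AH"])        # Eq. 14 on visemes
```

Because `error_rate` works on any sequence, the same function covers characters, words and visemes; only the tokenisation differs.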
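The perplexity measure of Eqs. 9 and 10, and the selection of the candidate sentence with the minimum perplexity, can be illustrated as follows. This is a toy sketch, not the GPT-based word detector: the per-word conditional probabilities are made-up numbers standing in for language-model outputs, the function names are hypothetical, and no beam search is performed. The three candidates differ only in a first word that begins with a bilabial consonant (p, b, m), a typical source of homopheme ambiguity.

```python
import math


def cross_entropy(token_probs):
    """Eq. 9: -(1/N) ln P(w_1..w_N), with P taken as the product of the
    per-word conditional probabilities supplied by a language model."""
    if not token_probs:
        raise ValueError("at least one token probability is required")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Eq. 10, computed as the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_probs))


def pick_sentence(candidates):
    """Return the candidate sentence with the lowest perplexity."""
    return min(candidates, key=lambda sentence: perplexity(candidates[sentence]))


# Hypothetical conditional probabilities for three visually identical options.
candidates = {
    "pat the dog": [0.02, 0.3, 0.1],
    "bat the dog": [0.001, 0.3, 0.1],
    "mat the dog": [0.0005, 0.2, 0.1],
}
best = pick_sentence(candidates)
```

Working in log space avoids numerical underflow when the joint probability of a long sentence becomes very small, which is why the exponentiated cross-entropy form is used instead of evaluating Eq. 10 directly.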
FIGURE 15. Word confusion matrix for this lip reading system.
with a VER of only 4.6%. The confusion matrices by both visemes and ASCII characters are given in Figures 12 and 13, respectively.

Table 7 gives the performance metrics for how the proposed lip reading system and Afouras et al.'s model [8] performed when videos in the validation set were subjected to different levels of illumination, applied in accordance with III-H. It can be seen that the proposed lip reading system is generally robust to varying levels of illumination, like that of Afouras et al. [8], and this is expected given that videos in the BBC LRS2 corpus were recorded in varying lighting conditions.

In order to attain a good overall accuracy for classification of words, both the viseme classification performance and the viseme-to-word conversion performance need to be good. The VER is very low and any misclassifications that have occurred during the validation phase appeared to be influenced by the class imbalance of visemes present in the training data. When visemes are misclassified, they are most likely to be decoded as one of "AH", "K" or "T" because such visemes appear most frequently in training data, and obscure classes such as "AA" and "CH" are the most likely to be misclassified.

Table 8 gives examples of sentences from the BBC LRS2 dataset along with the decoded visemes, the word combinations that were outputted at each iteration of the perplexity calculations, and the viseme clusters corresponding to each predicted word. Table 9 gives the full details of how those sentences were decoded by listing their corresponding visemes, the predicted visemes, the decoded sentences and their corresponding metric performance results.

TABLE 8. Examples of perplexity calculations for sentences from the test set.
TABLE 9. Examples of how sentences from the test set were decoded.

A stratified sampling strategy was used to select the most frequently appearing 154 words in the BBC LRS2 training set that begin with each letter of the alphabet. For the selected 154 words, a comparison of the accuracy in terms of the ratio of how many times a word was correctly decoded to how many times it appeared in the testing phase has been presented in Figures 14 and 15. Figure 14 shows the word accuracy for Afouras et al.'s model and Figure 15 shows the accuracy for this lip reading system. A better word precision is noticeable in Figure 15.

It should be noted that, whilst the VER was low, the WER was still high although it has been significantly improved compared to other existing works. To further reduce the error rate, the viseme-to-word conversion would need to be optimised. Many misclassifications have been caused by the presence of local optima during the implementation of the local beam search, whereby at each iteration of the viseme sequence during the perplexity calculation stage, the words that make up the ground truth are not included within the top 50 results. A larger beam width would invariably result in a greater conversion rate, but at the expense of more computational overhead, and an exhaustive search would not even be viable. Further work needs to be done to ensure that the global optimum combinatorial solution is selected more frequently during the Perplexity Calculation stage to further improve on word accuracy.

V. CONCLUSION
A neural network-based lip reading system has been developed to predict sentences covering a wide range of vocabulary in silent videos from people speaking. The system is lexicon-free, uses only visual cues represented by visemes of a limited number of distinct lip movements, and is robust to different levels of lighting. Verified on the BBC LRS2 data set, the system has demonstrated a significant improvement on classification accuracy of words compared to the state-of-the-art works.

Future research includes investigating a more suitable neural network architecture in order to enable the system to have a good generalisation capability with a higher ratio of the number of training samples to the number of test samples. In addition, an efficient conversion of visemes to words is crucial when using visemes as a classification scheme for lip reading sentences. As shown in the experiments, although the classification accuracy of visemes achieved by the proposed system was very high (over 95%), the classification accuracy of words dropped significantly after the conversion (65.5%). As such, it is important to explore any other possible approaches for the conversion. For perplexity analysis-based conversion, different global optimisation methods need to be considered while also limiting the computational overhead required.

REFERENCES
[1] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, "A review of recent advances in visual speech decoding," Image Vis. Comput., vol. 32, no. 9, pp. 590–605, Sep. 2014.
[2] A. Fernandez-Lopez and F. M. Sukno, "Survey on automatic lip-reading in the era of deep learning," Image Vis. Comput., vol. 78, pp. 53–72, Oct. 2018.
[3] G. Potamianos, C. Neti, and I. Matthews, "Audio-visual automatic speech recognition: An overview," in Issues in Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, P. Perrier, Eds. Cambridge, MA, USA: MIT Press, 2004.
[4] K. S. Talha, K. Wan, S. K. Za'ba, Z. M. Razlan, and A. B. Shahriman, "Speech analysis based on image information from lip movement," IOP Conf. Ser., Mater. Sci. Eng., vol. 53, Dec. 2013, Art. no. 012016.
[5] T. Stafylakis and G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," in Proc. Interspeech, Aug. 2017, pp. 3652–3656.
[6] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3444–3453.
[7] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Proc. Asian Conf. Comput. Vis., 2015, pp. 87–103.
[8] T. Afouras, J. S. Chung, and A. Zisserman, "Deep lip reading: A comparison of models and an online application," in Proc. Interspeech, Sep. 2018, pp. 3514–3518.
[9] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, ‘‘Lip- [33] T. J. Hazen, K. Saenko, C.-H. La, and J. R. Glass, ‘‘A segment-based
Net: End-to-end sentence level lipreading,’’ in Proc. ICLR Conf., 2016, audio-visual speech recognizer: Data collection, development, and initial
pp. 1–13. experiments,’’ in Proc. 6th Int. Conf. Multimodal Interfaces ICMI, 2004,
[10] B. Shillingford, Y. Assael, M. Hoffman, T. Paine, C. Hughes, U. Prabhu, pp. 235–242.
H. Liao, H. Sak, R. Hasim, H. Rao, L. Bennett, M. Mulville, B. Coppin, [34] E. Bozkurt, C. E. Erdem, E. Erzin, T. Erdem, and M. Ozkan, ‘‘Comparison
B. Laurie, A. Senior, and N. Freitas, ‘‘Large-scale visual speech recogni- of phoneme and viseme based acoustic units for speech driven realistic lip
tion,’’ in Proc. INTERSPEECH, 2018. animation,’’ in Proc. 3DTV Conf., May 2007, pp. 1–4.
[11] A. J. Goldschen, O. N. Garcia, and E. D. Petajan, ‘‘Rationale for phoneme- [35] C. G. Fisher, ‘‘Confusions among visually perceived consonants,’’
viseme mapping and feature selection in visual speech recognition,’’ in J. Speech Hearing Res., vol. 11, no. 4, pp. 796–804, Dec. 1968.
Speechreading by Humans and Machines (NATO ASI Series F: Computer [36] F. DeLand, ‘‘The story of lip-reading, its genesis and development,’’ Volta
and Systems Sciences), vol. 150. Berlin, Germany: Springer, 1996. Bureau, Tech. Rep., 1931.
[12] P. F. Borwn, S. A. D. Pietra, V. J. D. Pietra, J. C. Lai, and R. L. Mercer, [37] W. B. Dolan and C. Brockett, ‘‘Automatically constructing a corpus of
‘‘An estimate of an upper bound for the entropy of english,’’ Comput. sentential paraphrases,’’ in Proc. 3rd Int. Workshop Paraphrasing, 2005,
Linguistics, vol. 18, no. 1, pp. 31–40, 1992. pp. 1–8.
[38] S. Gray, A. Radford, and K. P. Diederik, ‘‘Gpu kernels for block-sparse
[13] Y. Fu, X. Zhou, M. Liu, M. Hasegawa-Johnson, and T. S. Huang, ‘‘Lipread-
weights,’’ OpenAI, San Francisco, CA, USA, Tech. Rep., 2017.
ing by locality discriminant graph,’’ in Proc. IEEE Int. Conf. Image Pro-
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
cess., Oct. 2007, p. 325.
L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. NIPS,
[14] K. Kumar, T. Chen, and R. M. Stern, ‘‘Profile view lip reading,’’ in 2017, pp. 5998–6008.
Proc. IEEE Int. Conf. Acoust., Speech Signal Process. ICASSP, Apr. 2007, [40] R. Treiman, B. Kessler, and S. Bick, ‘‘Context sensitivity in the spelling of
p. 429. english vowels,’’ J. Memory Lang., vol. 47, no. 3, pp. 448–468, Oct. 2002.
[15] P. J. Lucey, G. Potamianos, and S. Sridharan, ‘‘A unified approach to multi- [41] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, ‘‘Improving lan-
pose audio-visual ASR,’’ in Proc. Interspeech, 2007, pp. 650–653. guage understanding by generative pre-training,’’ OpenAI, San Francisco,
CA, USA, Tech. Rep., 2018.
[16] E. Marcheret, V. Libal, and G. Potamianos, ‘‘Dynamic stream weight modeling for audio-visual speech recognition,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2007, p. 945.
[17] S. J. Cox, R. Harvey, Y. Lan, J. L. Newman, and B. J. Theobald, ‘‘The challenge of multispeaker lip-reading,’’ in Proc. Int. Conf. Auditory-Vis. Speech Process., Sep. 2008, pp. 179–184.
[18] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, ‘‘Extraction of visual features for lipreading,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 198–213, Feb. 2002.
[19] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. S. Huang, ‘‘AVICAR: Audio-visual speech corpus in a car environment,’’ in Proc. Interspeech, Jan. 2004, pp. 1–4.
[20] H. Hofmann, S. Sakti, R. Isotani, H. Kawai, S. Nakamura, and W. Minker, ‘‘Improving spontaneous English ASR using a joint-sequence pronunciation model,’’ in Proc. 4th Int. Universal Commun. Symp., Oct. 2010, pp. 58–61.
[21] V. I. Levenshtein, ‘‘Binary codes capable of correcting deletions, insertions, and reversals,’’ Soviet Phys. Doklady, vol. 10, no. 8, pp. 707–710, 1966.
[22] K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T. Darrell, ‘‘Visual speech recognition with loosely synchronized feature streams,’’ in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 1, Oct. 2005, pp. 1424–1431.
[23] O. Koller, H. Ney, and R. Bowden, ‘‘Deep learning of mouth shapes for sign language,’’ in Proc. IEEE Int. Conf. Comput. Vis. Workshop (ICCVW), Dec. 2015, pp. 85–91.
[24] A. B. Mattos, D. Oliveira, and E. Morais, ‘‘Improving viseme recognition using GAN-based frontal view mapping,’’ in Proc. Analysis and Modeling of Faces and Gestures (CVPR), Jun. 2018, pp. 2148–2155.
[25] K. Thangthai, H. L. Bear, and R. Harvey, ‘‘Comparing phonemes and visemes with DNN-based lipreading,’’ in Proc. Brit. Mach. Vis. Conf., 2017, pp. 1–11.
[26] S. Fenghour, D. Chen, and P. Xiao, ‘‘Decoder-encoder LSTM for lip reading,’’ in Proc. 8th Int. Conf. Softw. Inf. Eng. (ICSIE), 2019, pp. 162–166.
[27] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, ‘‘Curriculum learning,’’ in Proc. 26th Annu. Int. Conf. Mach. Learn., Jun. 2009, pp. 41–48.
[28] J. L. Elman, ‘‘Learning and development in neural networks: The importance of starting small,’’ Cognition, vol. 48, no. 1, pp. 71–99, Jul. 1993.
[29] S. Lee and D. Yook, ‘‘Audio-to-visual conversion using Hidden Markov models,’’ in Proc. 7th Pacific Rim Int. Conf. Artif. Intell., Trends Artif. Intell., 2002, pp. 563–570.
[30] A. Botev, B. Zheng, and D. Barber, ‘‘Complementary sum sampling for likelihood approximation in large scale classification,’’ in Proc. 20th Int. Conf. Artif. Intell. Statist. (PMLR), vol. 54, 2017, pp. 1030–1038.
[31] J. Jeffers and M. Barley, Speechreading (Lipreading). Springfield, IL, USA: Charles C Thomas, 1971.
[32] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, and A. Mashari, ‘‘Audio visual speech recognition,’’ IDIAP, Martigny, Switzerland, Tech. Rep., 2000.
[42] C. E. Shannon, ‘‘A mathematical theory of communication,’’ Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948.
[43] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’ in Proc. ICLR, 2015, pp. 1–15.
[44] S. Fenghour, D. Chen, P. Xiao, and K. Guo, ‘‘Disentangling homophemes in lip reading using perplexity analysis,’’ 2020, arXiv:submit/3487157. [Online]. Available: https://arxiv.org/abs/3487157
[45] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV). Amsterdam, The Netherlands: Springer, 2016, pp. 21–37.
[46] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, ‘‘300 faces in-the-Wild challenge: The first facial landmark localization challenge,’’ in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Dec. 2013, pp. 397–403.
[47] S. Yang, Y. Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen, ‘‘LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild,’’ in Proc. 14th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2019, pp. 1–8.
[48] M. Cooke, J. Barker, S. Cunningham, and X. Shao, ‘‘An audio-visual corpus for speech perception and automatic speech recognition,’’ J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2421–2424, Nov. 2006.
[49] I. Anina, Z. Zhou, G. Zhao, and M. Pietikainen, ‘‘OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis,’’ in Proc. 11th IEEE Int. Conf. Workshops Autom. Face Gesture Recognit. (FG), May 2015, pp. 1–5.
[50] J. S. Son and A. Zisserman, ‘‘Lip reading in profile,’’ in Proc. Brit. Mach. Vis. Conf., 2017, pp. 1–11.
[51] S. Petridis, Z. Li, and M. Pantic, ‘‘End-to-end visual speech recognition with LSTMs,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 2592–2596.
[52] S. Petridis, Y. Wang, Z. Li, and M. Pantic, ‘‘End-to-end audiovisual fusion with LSTMs,’’ in Proc. 14th Int. Conf. Auditory-Visual Speech Process., Aug. 2017, pp. 1–5.
[53] S. Petridis, Y. Wang, Z. Li, and M. Pantic, ‘‘End-to-end multi-view lipreading,’’ in Proc. Brit. Mach. Vis. Conf., 2017, pp. 1–17.
[54] I. Fung and B. Mak, ‘‘End-to-end low-resource lip-reading with maxout CNN and LSTM,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 2511–2515.
[55] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, ‘‘End-to-end audiovisual speech recognition,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 6548–6552.
[56] S. Petridis, J. Shen, D. Cetin, and M. Pantic, ‘‘Visual-only recognition of normal, whispered and silent speech,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 6219–6223.
[57] M. Wand, J. Schmidhuber, and N. T. Vu, ‘‘Investigations on end-to-end audiovisual fusion,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 3041–3045.
[58] K. Xu, D. Li, N. Cassimatis, and X. Wang, ‘‘LCANet: End-to-end lipreading with cascaded attention-CTC,’’ in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 548–555.
[59] C. Wang, ‘‘Multi-grained spatio-temporal modelling for lip-reading,’’ in Proc. Brit. Mach. Vis. Conf., 2019, pp. 1–11.
[60] X. Weng and K. Kitani, ‘‘Learning spatio-temporal features with two-stream deep 3D CNNs for lip reading,’’ in Proc. Brit. Mach. Vis. Conf., 2019, pp. 1–13.
[61] B. Martinez, P. Ma, S. Petridis, and M. Pantic, ‘‘Lipreading using temporal convolutional networks,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 6319–6323.
[62] T. Afouras, J. S. Chung, and A. Zisserman, ‘‘LRS3-TED: A large-scale dataset for visual speech recognition,’’ 2018, arXiv:1809.00496. [Online]. Available: https://arxiv.org/abs/1809.00496
[63] A. Britto Mattos, D. A. Borges Oliveira, and E. da Silva Morais, ‘‘Improving CNN-based viseme recognition using synthetic data,’’ in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2018, pp. 1–6.

KUN GUO received the bachelor’s degree in detection, guidance, and control technology and the master’s degree in systems engineering from Northwestern Polytechnical University, Xi’an, China, in 2007 and 2010, respectively, and the Ph.D. degree in engineering systems and design from London South Bank University, U.K., in 2013. From 2014 to 2016, he worked as a Senior Data Analyst with ZTE, Xi’an, China. Since 2016, he has been working with Xi’an VANXUM Electronics Technology Company Ltd., China, as a Senior Algorithm Engineer. His research interests include applications of deep learning in computer vision and natural language processing, data mining, and computer graphics compression.