
The 7th IEEE International Conference on E-Health and Bioengineering - EHB 2019

Grigore T. Popa University of Medicine and Pharmacy, Iasi, Romania, November 21-23, 2019

Enhancing the Accessibility of Hearing Impaired to Video Content through Fully Automatic Dynamic Captioning

Bogdan Mocanu1,2, Ruxandra Tapu1,2, Titus Zaharia1
1 ARTEMIS Department, Institut Mines-Télécom/Télécom SudParis, UMR CNRS 5157 SAMOVAR, Evry, France
2 Department of Telecommunications, Faculty of ETTI, University “Politehnica” of Bucharest, Bucharest, Romania

Abstract—In this paper we introduce an automatic subtitle positioning approach designed to enhance the accessibility of multimedia documents for deaf and hearing impaired people. By using a dynamic subtitle and captioning approach, which exploits various computer vision techniques, including face detection, tracking and recognition, video temporal segmentation into shots and scenes, and active speaker recognition, we are able to position each video subtitle segment in the near vicinity of the active speaker. The experimental evaluation performed on 30 video elements validates our approach, with average F1-scores superior to 92%.

Keywords—dynamic subtitle, active speaker recognition, deaf and hearing impaired people.

I. INTRODUCTION

The proliferation of audio-visual production has led to the development of efficient and effective techniques designed to understand and organize video content in order to facilitate the access to information. However, the millions of people suffering from hearing impairment (HI) face important difficulties in the comprehension of video content [1]. The incorporation of subtitles/closed captions in the multimedia content is today the most widely used solution to this problem [2].

Conventionally, the subtitle is positioned in a fixed location, at the bottom part of the TV screen. Although the hearing impaired audience can retrieve certain information from the associated script, the existing positioning techniques are far from satisfactory in assisting deaf or HI users. They notably suffer from the following limitations: (1) Confusion between the various actors: for scenes with multiple active characters, HI users need to judge to which one the script is associated; (2) Tracking the subtitle displayed on the screen: for characters speaking rapidly, deaf and HI people can miss parts of the sentence. Moreover, for various active speakers the subtitle's display time can vary over a wide range.

In order to deal with such limitations, in this paper we introduce a novel, completely automatic, dynamic subtitle positioning methodology. The proposed approach is designed to assist the HI user to comfortably follow the video content and the associated subtitles simultaneously, while enhancing the video consumption experience.

The rest of the paper is organized as follows. In Section II we briefly review the state of the art in the field. Section III introduces the proposed architecture and details the video and audio analyzers considered. The experimental evaluation is presented in Section IV. Finally, Section V concludes the paper and opens some perspectives for further work.

II. RELATED WORK

Efforts to accommodate hearing impaired people in accessing video content can be traced back to the early 1970s, when US national television demonstrated the efficiency of transmitting closed captions together with the video stream [3].

However, the issue of dynamic subtitling has been addressed only recently by the scientific community. Hong et al. [4] were among the first to propose a technique for adjusting the position of each text segment relative to the video content. Their system receives as input a video stream, its associated script and subtitle files. A mapping between each character's face and its corresponding script is performed with the help of face detection and recognition techniques. Then, a visual saliency approach is exploited in order to determine a non-intrusive region, around the speaker's face, that is further used to present the script. The method is extended in [5], where a novel lip motion analysis algorithm for active speaker detection is introduced. The position of the subtitles is determined with the help of an optimization procedure that jointly takes into account the presence/absence of the speaker within the scene, the cross-frame coherence and the screen layout. Even though the usability studies performed have shown an increase in terms of quality of experience and content comprehension, the method still presents several limitations that need to be overcome: (1) The system functions solely for active speakers facing the video camera (frontal faces). For profile faces or, more generally, for characters with unfavorable poses, no information regarding the identity of the speaker can be determined. (2) The strategy of placing the video subtitle in the near surrounding of the active speaker's mouth is not optimal in all cases.



Thus, for characters moving within the scene, the subtitle position fluctuates around the speaker's mouth, which makes it difficult and tiresome to follow. In addition, in some cases, the subtitle may occlude relevant parts of the video content (e.g., the faces of the other characters in the scene, or useful overlaid text elements that may be present in the background). Moreover, such a strategy is convenient only for relatively short text segments. For long phrases covering two lines of text, the subtitle fluctuation becomes very disturbing.

Other approaches, such as those introduced in [6] and [7], concern so-called gaze-based subtitle positioning techniques. The method introduced in [6] aims at reducing the viewer's eye movement, without interfering with the target region of interest (ROI). A ROI is defined as a salient region in the video scene that should not be occluded by subtitles. For each video frame, the ROI is determined using eye-tracking data extracted from multiple viewers. The method is further extended in [7] with an active speaker recognition framework based on audio and video information. However, even though the subjective evaluation performed shows promising results, both methods assume that the active speaker's eyes and mouth are visible in the scene, which is not always true. Furthermore, the gaze information is difficult to obtain, since it requires the set-up of a complex acquisition protocol that involves multiple users.

In [8], the authors introduce a study regarding the user experience when watching videos together with the associated subtitle dynamically positioned on the screen. First, the eye data is analyzed in order to determine the difference between gaze patterns obtained for subtitled and non-subtitled videos. Then, the users were asked to view videos with dynamic subtitles and to express their attitude towards them. As a general conclusion, it can be highlighted that dynamic subtitle approaches create gaze patterns that are closer to the baseline than those induced by regular subtitles. The work is extended in [9], where a comparative evaluation study of users watching videos, with and without subtitles, is presented. The focus is put on tracking the user's eyes. The study pointed out that static subtitles make it easier to look around, but more difficult to understand the video content.

The analysis of the literature shows that the existing approaches are still far from being satisfactory in assisting hearing impaired users in video content understanding. Most systems fail in identifying the speakers in the case where the active character is not facing the video camera or is not visible in the scene. In addition, all methods are sensitive to visual appearances of relatively poor quality. In this paper we introduce a novel framework designed to overcome such limitations, detailed in the following section.

III. PROPOSED APPROACH

A. Face detection, tracking and recognition

The face detection is performed using the Faster R-CNN architecture [10], extended with region proposal networks [11]. The faces are detected in every frame of the video stream. Then, for each face, its associated track is determined with the help of the ATLAS algorithm [12], extended here to work on multiple moving instances.

The face recognition process is based on the VGG16 CNN architecture [13], extended with a learning-based weight optimization scheme [14] that determines the relevance of each face instance to the final global descriptor associated to a face track. The set of weights is adaptively determined depending on the frame's degree of noise and motion, face poses or viewing angles.
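To make the track-level aggregation concrete, the following minimal sketch (our own illustration, not the learned weighting scheme of [14]) combines per-frame face embeddings into a single L2-normalized descriptor using per-frame quality weights; the embeddings and the quality scores are assumed to be provided by the upstream recognition and quality-estimation stages.

```python
# Minimal sketch of the general idea behind the track-level face descriptor:
# per-frame embeddings are combined into one descriptor using per-frame
# quality weights (e.g., reflecting blur, pose or motion). The quality scores
# are assumed to be given; this is not the authors' implementation.
import numpy as np

def track_descriptor(frame_embeddings, quality_scores):
    """frame_embeddings: (n_frames, d) array; quality_scores: (n_frames,) array."""
    w = np.asarray(quality_scores, dtype=float)
    w = w / w.sum()                                  # normalize the weights
    desc = (w[:, None] * np.asarray(frame_embeddings)).sum(axis=0)
    return desc / (np.linalg.norm(desc) + 1e-12)     # L2-normalized global descriptor

# Example: three frames of a track, the sharpest one dominates the aggregation.
emb = np.random.randn(3, 128)
print(track_descriptor(emb, [0.1, 0.2, 0.9]).shape)  # (128,)
```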
B. Video temporal segmentation into shots and scenes

Initially, the video stream is divided into shots using the graph partition strategy presented in [15]. Then, the shots are used to create video scenes that satisfy a certain homogeneity with respect to a semantic criterion (i.e., the presence of the same set of faces). Using the video shots and the recognized characters, we construct a connected graph, developed at the global level of the video sequence. The cut-edges of the graph are used in order to form the set of scene boundaries. The output of the module is a list of characters associated to each video scene.
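As an illustration of the cut-edge idea, the hedged sketch below builds a graph over shots, linking temporally adjacent shots and shots that share recognized characters within a small window, and interprets bridges between adjacent shots as scene boundaries. The data layout, the linking window and the use of networkx are our own assumptions, not the authors' exact procedure.

```python
# Hedged sketch of the shot-to-scene grouping idea: shots that share recognized
# characters are linked in a graph, and cut-edges (bridges) between temporally
# adjacent shots are interpreted as scene boundaries.
import networkx as nx

def scene_boundaries(shot_characters, window=5):
    """shot_characters: list of sets, one set of recognized character IDs per shot.
    Returns the indices i such that a scene boundary lies between shot i and i+1."""
    g = nx.Graph()
    n = len(shot_characters)
    g.add_nodes_from(range(n))
    # Link temporally adjacent shots so the graph stays connected.
    for i in range(n - 1):
        g.add_edge(i, i + 1)
    # Link shots that share at least one character within a temporal window.
    for i in range(n):
        for j in range(i + 2, min(n, i + 1 + window)):
            if shot_characters[i] & shot_characters[j]:
                g.add_edge(i, j)
    # A bridge between adjacent shots means no character continuity spans it.
    return sorted(min(i, j) for i, j in nx.bridges(g) if abs(i - j) == 1)

# Example: character 'A' spans shots 0-2, 'B' spans shots 3-5 -> boundary after shot 2.
print(scene_boundaries([{'A'}, {'A'}, {'A'}, {'B'}, {'B'}, {'B'}]))  # [2]
```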
C. Active speaker recognition

The active speaker recognition process involves different stages, starting with a segmentation of the audio signal into segments correlated with the textual subtitles.

1. Identification of subtitle-related audio segments.

Using the synchronized subtitle file associated to each video document, we extract the timestamps during which a specific text segment is displayed. Let us note that the extracted timestamps are not perfectly correlated with the audio stream, because: (1) In order to facilitate the user's reading, the subtitle is always displayed for longer time intervals than the actual speaking moments. (2) The subtitle may contain one or two lines of text that can correspond to multiple active speakers.

Initially, we divide the audio stream into smaller segments, denoted as audio chunks, using the timestamps extracted from the subtitle file. Then, based on the observation that a subtitle segment belonging to two different speakers contains two dialog lines, we propose to divide, in such cases, the corresponding audio chunk into two sub-segments. The begin times (t_b1 and t_b2) and the end times (t_e1 and t_e2) of each audio sub-segment can be computed using the initial subtitle segment timestamps and its total number of letters (N), as described in the following equations:

Δ = (t_e − t_b) / N,                      (1)

t_b1 = t_b ;  t_e1 = t_b + N1 · Δ,        (2)

t_b2 = t_e1 ;  t_e2 = t_e,                (3)

where t_b and t_e denote, respectively, the beginning and the end time of the current audio chunk, while N1 denotes the number of letters of the first line of the text.

Finally, as recommended in [16], we apply traditional speech processing techniques, including voice activity detection and silence and unvoiced speech removal, in order to filter out the non-relevant segments. The extracted audio segments are further used for speaker recognition purposes, as described in the following.
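A minimal sketch of the proportional split expressed by Eqs. (1)-(3), using the notation introduced above (the function itself is illustrative, not the authors' code):

```python
# Minimal sketch of Eqs. (1)-(3): a two-line subtitle chunk is divided into two
# sub-segments whose durations are proportional to the number of letters on each
# line. Variable names follow the notation defined in the text.
def split_audio_chunk(t_b, t_e, line1, line2):
    """t_b, t_e: begin/end time (s) of the audio chunk; line1, line2: subtitle lines."""
    n1, n2 = len(line1), len(line2)
    n = n1 + n2                            # total number of letters N
    delta = (t_e - t_b) / n                # Eq. (1): duration per letter
    t_b1, t_e1 = t_b, t_b + n1 * delta     # Eq. (2): first sub-segment
    t_b2, t_e2 = t_e1, t_e                 # Eq. (3): second sub-segment
    return (t_b1, t_e1), (t_b2, t_e2)

# Example: a 4 s chunk with lines of 30 and 10 letters -> 3 s and 1 s sub-segments.
print(split_audio_chunk(10.0, 14.0, "x" * 30, "x" * 10))
# ((10.0, 13.0), (13.0, 14.0))
```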
2. Speaker recognition process.

Fig. 1 presents the synoptic scheme of the speaker recognition framework.

Fig. 1. The proposed active speaker recognition framework.

The previously determined audio segments are first converted into a single-channel, 16-bit stream at a sampling rate of 16 kHz. We treat the speaker recognition problem as a multi-category classification task.

For each audio chunk, we generate spectrograms that are represented as vectors of size (257 x T x 1), where 257 denotes the number of spectral components used to compute the short-time Fourier transform, T is the temporal length of each audio segment (expressed in seconds) and 1 represents the number of channels used to represent the spectrogram. Mean and variance normalization are performed on each frequency bin of the spectrum.

From each spectrogram, a set of features is extracted using the residual network (ResNet-34) CNN architecture [17]. We have modified the ResNet-34 in a fully convolutional way in order to adapt it to 2D spectrogram inputs [18].

In order to obtain a compact (i.e., low-dimensional) output descriptor, a NetVLAD layer for feature aggregation along the temporal axis has been added at the end of the ResNet-34. Thus, the ResNet-34 architecture maps the input spectrograms to frame-level descriptors. The NetVLAD layer takes dense descriptors as input and produces a single matrix V of size K x D, where K refers to the total number of chosen clusters and D refers to the dimensionality of each cluster. The final output of the network is obtained by performing L2 normalization and concatenation. In addition, in order to reduce the processing burden, a dimensionality reduction is also performed in the fully connected layer, where we impose a maximum size of 512.
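For illustration, the following sketch computes such a spectrogram front-end under assumed STFT parameters (512-point windows with a 10 ms hop, which yield the 257 spectral components mentioned above); it does not reproduce the ResNet-34/NetVLAD network itself.

```python
# Hedged sketch of the spectrogram front-end: 16 kHz mono audio is converted to a
# (257 x T) magnitude spectrogram and normalized per frequency bin. The window and
# hop sizes are assumptions chosen so that a 512-point FFT yields 257 bins.
import numpy as np
from scipy.signal import stft

def audio_to_spectrogram(samples, fs=16000):
    """samples: 1-D float array holding a mono 16 kHz audio chunk.
    Returns an array of shape (257, T, 1) ready to be fed to a 2D CNN."""
    _, _, zxx = stft(samples, fs=fs, nperseg=512, noverlap=512 - 160)  # 10 ms hop
    spec = np.abs(zxx)                              # magnitude spectrogram
    mean = spec.mean(axis=1, keepdims=True)         # per-bin mean ...
    std = spec.std(axis=1, keepdims=True) + 1e-8    # ... and variance normalization
    spec = (spec - mean) / std
    return spec[:, :, np.newaxis]                   # add the single channel axis

# Example on one second of random noise: shape (257, T, 1).
print(audio_to_spectrogram(np.random.randn(16000)).shape)
```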
Since the active speaker recognition task is treated as an image classification problem, the output of the last layer of the CNN is fed into a softmax operator in order to produce a distribution over all the considered classes. The top-3 results for each audio chunk are validated, in descending order of probability, using the list of characters available for each scene. We finally retain as a correct prediction the active speaker with the highest probability score. The final stage concerns the set-up of a subtitle positioning strategy.
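The validation step can be illustrated with the following sketch, where the per-class softmax scores and the per-scene character list are assumed to be available as simple Python structures:

```python
# Illustrative sketch of the top-3 validation step: the softmax scores of the
# speaker-ID network are checked, in descending order, against the characters
# known to appear in the current scene, and the first match is retained.
def pick_active_speaker(class_probs, scene_characters, top_k=3):
    """class_probs: {character_id: softmax probability}; scene_characters: set of ids."""
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)
    for character, prob in ranked[:top_k]:
        if character in scene_characters:
            return character, prob      # highest-probability speaker present in the scene
    return None, 0.0                    # no plausible speaker among the top-k classes

# Example: the top-1 class is not in the scene, so the top-2 class is retained.
probs = {"alice": 0.48, "bob": 0.31, "carol": 0.21}
print(pick_active_speaker(probs, {"bob", "carol"}))   # ('bob', 0.31)
```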
3. Subtitle placement.

Based on the information regarding the identity of the active speaker, the objective is to position its associated subtitle segment in the subject's near vicinity.

In contrast with other state-of-the-art techniques, such as [4] and [5], which propose using speech bubbles to place video subtitles, we define 15 potential candidate regions for placing each speech segment. The proposed locations are defined by considering 3 levels of subdivision along the horizontal axis and 5 levels on the vertical one. In the vertical direction, the subtitle can admit an offset relative to the top part of the screen of between 60% and 80%, with a step size of 5%. In the horizontal direction, we admit an offset relative to the left part of the screen of 0%, 20% or 40%.

The proposed approach has the advantage of minimizing the subtitle fluctuation when the subject is moving. In our case, for each subtitle segment, its horizontal position is determined using the active speaker's face centroid, based on a majority location voting scheme applied over the whole set of frames where the subtitle is present. The vertical location is computed so that no overlap occurs with other faces that are present in the scene. The default position considered has an 80% vertical offset, and we continue the analysis in ascending order.
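A hedged sketch of this placement rule is given below; the subtitle box size and the mapping from the face centroid to one of the three horizontal offsets are our own assumptions, and the vertical search simply starts at the 80% default and moves upwards in 5% steps until no other face is covered.

```python
# Hedged sketch of the placement rule: the horizontal offset is chosen by majority
# voting on the active speaker's face centroid over the frames in which the subtitle
# is shown; the vertical offset starts at 80% and moves up until no face is covered.
from collections import Counter

H_OFFSETS = [0.0, 0.2, 0.4]                 # relative to the left edge
V_OFFSETS = [0.8, 0.75, 0.7, 0.65, 0.6]     # relative to the top edge, checked in this order

def place_subtitle(speaker_centroids, other_face_boxes, box_w=0.4, box_h=0.08):
    """speaker_centroids: per-frame (x, y) of the speaker's face, normalized to [0, 1].
    other_face_boxes: list of (x0, y0, x1, y1) boxes of the other faces in the scene."""
    # Horizontal position: majority vote over the frames where the subtitle is displayed.
    votes = Counter(min(H_OFFSETS, key=lambda h: abs(h + box_w / 2 - cx))
                    for cx, _ in speaker_centroids)
    h = votes.most_common(1)[0][0]
    # Vertical position: default 80% offset, then ascend until no face is covered.
    for v in V_OFFSETS:
        box = (h, v, h + box_w, v + box_h)
        if not any(_overlaps(box, f) for f in other_face_boxes):
            return h, v
    return h, V_OFFSETS[0]                  # fall back to the default location

def _overlaps(a, b):
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

# Example: the speaker stays on the left, another face occupies the bottom-left corner,
# so the subtitle keeps the 0% horizontal offset and moves up to the 70% vertical offset.
print(place_subtitle([(0.22, 0.4)] * 10, [(0.1, 0.78, 0.3, 0.95)]))   # (0.0, 0.7)
```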
IV. EXPERIMENTAL EVALUATION

The experimental evaluation has been conducted on a data set of 30 video streams with an average duration of 20 minutes. Twenty videos have been selected from the France Télévisions TV series "Un si grand soleil" and ten from the US series "Friends" and "The Big Bang Theory".

For the CNN training process involved in the active speaker recognition, we have considered a database of 110 known characters. In order to extract the audio segments required for training, we have used the first 100 broadcast episodes of the "Un si grand soleil" series, while for the US series we have used all the episodes of season I. First, we applied face tracking and recognition in order to identify the potential active speakers. Then, we performed a speech-to-mouth verification using the SyncNet [19] approach, in order to verify whether there is a correlation between the audio track and the mouth motion in the videos. In this way, we have developed a training dataset with 13000 appearances of the 110 characters involved. The shortest speech segment in the database has a duration of 1.25 ms.
Fig. 2. Experimental results of the proposed subtitle positioning system.

For the testing dataset, 15208 utterances have been extracted. Let us underline that more than 3500 speech segments belong to unknown individuals.

For evaluation purposes, we have retained traditional performance metrics, including accuracy (A), recognition rate (R) and the F1 norm. In addition, we have compared our results with the state-of-the-art techniques [4] and [5]. Some visual examples of the results obtained are illustrated in Fig. 2. The global objective performance scores are summarized in Table I.

TABLE I. OBJECTIVE PERFORMANCE SCORES

Method      Accuracy    Recognition rate    F1 norm
Hong [4]    81.25%      78.47%              79.83%
Hu [5]      88.14%      81.84%              84.87%
Ours        98.57%      87.79%              92.85%

As can be observed, the proposed approach returns gains of more than 8% in both precision and recall scores. This behavior is explained by the robustness of the proposed technique to face pose variation, camera/background motion, and to various types of noise such as background chatter, laughter or overlapping speech.

V. CONCLUSIONS

In this paper we have introduced a novel automatic system designed to enhance the video viewing experience by dynamically placing the video subtitle in the vicinity of the active speaker. The proposed framework is based on a multimodal fusion of information obtained from the multiple media channels involved in a video document: visual, audio and text. The experimental evaluation performed on 30 multimedia documents validates the proposed methodology, which returns an average F1 score superior to 92% and gains of more than 8% compared with state-of-the-art techniques. For further work, we envisage performing a large usability study with deaf and hearing impaired people.

ACKNOWLEDGMENT

Part of this work has been supported by a mobility project of the Romanian Ministry of Research and Innovation, CNS-UEFISCDI, project number PN-III-P1-1.1-MCD-2019-0157, within PNCDI III, and by the French PIA (Plan d'Investissement d'Avenir) SUBTIL project.

REFERENCES

[1] B. Safadi, M. Sahuguet, and B. Huet, "When textual and visual information join forces for multimedia retrieval," Proceedings of the International Conference on Multimedia Retrieval, pp. 265-272, 2014.
[2] A. Tamayo and F. Chaume, "Subtitling for d/Deaf and hard-of-hearing children: current practices and new possibilities to enhance language development," Brain Sciences, 7(7): 75, 2017.
[3] https://signlanguageco.com/a-brief-history-of-closed-captioning/ - Accessed on 2 September 2019.
[4] R. Hong, M. Wang, M. Xu, S. Yan, and T.-S. Chua, "Dynamic captioning: Video accessibility enhancement for hearing impairment," Proc. ACM Conf. Multimedia, pp. 421-430, 2010.
[5] Y. Hu, J. Kautz, Y. Yu, and W. Wang, "Speaker-following video subtitles," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 11, no. 2, pp. 1-17, 2014.
[6] W. Akahori, T. Hirai, S. Kawamura, and S. Morishima, "Region-of-interest-based subtitle placement using eye-tracking data of multiple viewers," Proc. ACM International Conference on Interactive Experiences for TV and Online Video, pp. 123-128, 2016.
[7] W. Akahori, T. Hirai, and S. Morishima, "Dynamic subtitle placement considering the region of interest and speaker location," Proc. of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 102-109, 2017.
[8] A. Brown, R. Jones, M. Crabb, J. Sandford, M. Brooks, M. Armstrong, and C. Jay, "Dynamic subtitles: the user experience," Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, pp. 103-112, 2015.
[9] K. Kurzhals, E. Cetinkaya, Y. Hu, W. Wang, and D. Weiskopf, "Close to the action: Eye-tracking evaluation of speaker-following subtitles," Proceedings of the CHI Conference on Human Factors in Computing Systems, ACM, USA, pp. 6559-6568, 2017.
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.
[11] H. Jiang and E. G. Learned-Miller, "Face detection with the Faster R-CNN," 12th IEEE International Conference on Automatic Face & Gesture Recognition, pp. 650-657, 2017.
[12] B. Mocanu, R. Tapu, and T. Zaharia, "Single object tracking using offline trained deep regression networks," in IPTA, pp. 1-6, 2017.
[13] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, p. 6, 2015.
[14] B. Mocanu, R. Tapu, and T. Zaharia, "DEEP-SEE FACE: A Mobile Face Recognition System Dedicated to Visually Impaired People," in IEEE Access, vol. 6, pp. 51975-51985, 2018.
[15] R. Tapu and T. Zaharia, "A complete framework for temporal video segmentation," 2011 IEEE International Conference on Consumer Electronics, Berlin, pp. 156-160, 2011.
[16] R. Yin, H. Bredin, and C. Barras, "Speaker change detection in broadcast TV using bidirectional long short-term memory networks," in Proceedings of Interspeech, pp. 3827-3831, 2017.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[18] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in Proceedings of ICASSP, 2019.
[19] J. S. Chung and A. Zisserman, "Out of time: Automated lip sync in the wild," in Proceedings of ACCV, pp. 251-263, 2016.
