Enhancing The Accessibility of Hearing Impaired To Video Content Through Fully Automatic Dynamic Captioning
Grigore T. Popa University of Medicine and Pharmacy, Iasi, Romania, November 21-23, 2019
Abstract—In this paper we introduce an automatic subtitle positioning approach designed to enhance the accessibility of video content for deaf and hearing impaired people. By using a dynamic subtitle and captioning approach, which exploits various computer vision techniques, including face detection, tracking and recognition, video temporal segmentation into shots and scenes, and active speaker recognition, we are able to position each subtitle segment in the near vicinity of the active speaker. The experimental evaluation performed on 30 video elements validates our approach, with average F1-scores above 92%.

Keywords—dynamic subtitle, active speaker recognition, deaf and hearing impaired people.

I. INTRODUCTION

The proliferation of audio-visual production has led to the development of efficient and effective techniques designed to understand and organize video content in order to facilitate access to information. However, the millions of people suffering from hearing impairment (HI) face significant difficulties in the comprehension of video content [1]. The incorporation of subtitles/closed captions into multimedia content is today the most widely used solution to this problem [2].

Conventionally, the subtitle is positioned in a fixed location, at the bottom of the TV screen. Although the hearing impaired audience can retrieve certain information from the associated script, existing positioning techniques are far from satisfactory in assisting deaf or HI users. They notably suffer from the following limitations: (1) confusion between actors: in scenes with multiple active characters, HI users must judge with which character the script is associated; (2) difficulty in tracking the subtitle displayed on the screen: for characters speaking rapidly, deaf and HI viewers can miss parts of a sentence. Moreover, for different active speakers the subtitle's display time can vary over a wide range.

In order to deal with such limitations, in this paper we introduce a novel, completely automatic, dynamic subtitle positioning methodology. The proposed approach is designed to help HI users comfortably follow the video content and the associated subtitles simultaneously, while enhancing the video consumption experience.

The rest of the paper is organized as follows. In Section II we briefly review the state of the art in the field. Section III introduces the proposed architecture and details the video and audio analyzers considered. The experimental evaluation is presented in Section IV. Finally, Section V concludes the paper and opens some perspectives for further work.

II. RELATED WORK

Efforts to accommodate hearing impaired people in accessing video content can be traced back to the early 1970s, when US national television demonstrated the efficiency of transmitting closed captions together with the video stream [3].

However, the issue of dynamic subtitling has been addressed only recently by the scientific community. Hong et al. [4] were among the first to propose a technique for adjusting the position of each text segment relative to the video content. Their system receives as input a video stream together with its associated script and subtitle files. A mapping between each character's face and the corresponding script is performed with the help of face detection and recognition techniques. Then, a visual saliency approach is exploited in order to determine a non-intrusive region around the speaker's face, which is further used to present the script. The method is extended in [5], where a novel lip motion analysis algorithm for active speaker detection is introduced. The position of the subtitles is determined with the help of an optimization procedure that jointly takes into account the presence/absence of the speaker within the scene, the cross-frame coherence, and the screen layout. Even though the usability studies performed have shown an increase in quality of experience and content comprehension, the method still presents several limitations that need to be overcome: (1) the system functions solely for active speakers facing the video camera (frontal faces); for profile faces or, more generally, for characters with unfavorable poses, no information regarding the identity of the speaker can be determined; (2) the strategy of placing the subtitle in the near surroundings of the active speaker's mouth is not optimal in all cases. Thus, for characters moving within the …
A. Face detection, tracking and recognition

Face detection is performed using the Faster R-CNN architecture [10], extended with region proposal networks [11]. Faces are detected in every frame of the video stream. Then, for each detected face, the associated track is determined with the help of the ATLAS algorithm [12], extended here to work on multiple moving instances.
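Since the ATLAS tracker itself is not detailed here, the following minimal Python sketch only illustrates the general idea of linking per-frame detections into face tracks; it uses a simple greedy IoU association rather than the actual algorithm of [12], and `detect_faces` is a hypothetical stand-in for the Faster R-CNN detector.

```python
# Minimal face-track association sketch (NOT the ATLAS algorithm of [12]):
# greedy IoU matching links per-frame detections into tracks.
# `detect_faces(frame)` is a hypothetical stand-in for the Faster R-CNN
# detector; it is assumed to return boxes as (x1, y1, x2, y2) tuples.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def build_tracks(frames, detect_faces, iou_threshold=0.5):
    """Link detections across consecutive frames into face tracks."""
    tracks = []  # each track: {"boxes": [...], "last": box, "active": bool}
    for frame in frames:
        detections = detect_faces(frame)
        for det in detections:
            # Greedily attach each detection to the best-overlapping track.
            best, best_iou = None, iou_threshold
            for tr in tracks:
                if tr["active"] and iou(tr["last"], det) > best_iou:
                    best, best_iou = tr, iou(tr["last"], det)
            if best is not None:
                best["boxes"].append(det)
                best["last"] = det
            else:
                tracks.append({"boxes": [det], "last": det, "active": True})
        # Tracks that found no match in this frame are terminated.
        for tr in tracks:
            tr["active"] = tr["last"] in detections
    return tracks
```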
… where t_b and t_e denote, respectively, the beginning and end times of the current audio chunk, while N denotes the total number of letters in the first line of the subtitle text.

Finally, as recommended in [16], we apply traditional speech processing techniques, including voice activity detection and silence and unvoiced speech removal, in order to filter out the non-relevant segments. The extracted audio segments are further used for speaker recognition purposes, as described in the following.
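The paper does not name a specific voice activity detection implementation; as one concrete possibility, the sketch below filters a 16 kHz, 16-bit mono PCM stream with the `webrtcvad` package, keeping only the frames classified as speech. The frame length and aggressiveness level are assumed values.

```python
import webrtcvad

def speech_segments(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=2):
    """Return (start_s, end_s) intervals classified as speech.

    `pcm_bytes` must be single-channel, 16-bit little-endian PCM, matching
    the 16 kHz mono format the audio stream is converted to.
    """
    vad = webrtcvad.Vad(aggressiveness)  # 0 (permissive) .. 3 (aggressive)
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes/sample
    segments, start = [], None
    n_frames = len(pcm_bytes) // bytes_per_frame
    for i in range(n_frames):
        frame = pcm_bytes[i * bytes_per_frame:(i + 1) * bytes_per_frame]
        t = i * frame_ms / 1000.0
        if vad.is_speech(frame, sample_rate):
            if start is None:
                start = t  # a speech run begins
        elif start is not None:
            segments.append((start, t))  # silence/unvoiced: close the run
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_ms / 1000.0))
    return segments
```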
2. Speaker recognition process.

Fig. 1 presents the synoptic scheme of the speaker recognition framework.

Fig. 1. The proposed active speaker recognition framework.

The previously determined audio segments are first converted into a single-channel, 16-bit stream at a sampling rate of 16 kHz. We treat the speaker recognition problem as a multi-category classification task.

For each audio chunk, we generate spectrograms represented as vectors of size (257 x T x 1), where 257 denotes the number of spectral components used to compute the short-time Fourier transform, T is the temporal length of the audio segment (expressed in seconds), and 1 represents the number of channels used to represent the spectrogram. Mean and variance normalization are performed on each frequency bin of the spectrum.
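Such inputs can be obtained, for instance, with `scipy.signal.stft`, as in the sketch below: a 512-point FFT yields the 257 spectral components mentioned above, while the 25 ms window and 10 ms hop are assumptions, since the paper does not state its STFT parameters.

```python
import numpy as np
from scipy.signal import stft

def spectrogram_257(audio, sample_rate=16000):
    """Magnitude spectrogram of shape (257, T_frames, 1).

    A 512-point FFT gives 512 / 2 + 1 = 257 spectral components; the
    25 ms window and 10 ms hop are assumed values, not taken from the paper.
    """
    win = int(0.025 * sample_rate)                 # 400 samples
    hop = int(0.010 * sample_rate)                 # 160 samples
    _, _, z = stft(audio, fs=sample_rate, nperseg=win,
                   noverlap=win - hop, nfft=512)
    mag = np.abs(z)                                # (257, T_frames)
    # Mean/variance normalization per frequency bin, as in the paper.
    mag = (mag - mag.mean(axis=1, keepdims=True)) \
          / (mag.std(axis=1, keepdims=True) + 1e-8)
    return mag[:, :, np.newaxis]                   # add the channel axis
```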
From each spectrogram, a set of features is extracted using the residual network (ResNet-34) CNN architecture [17]. We have modified the ResNet-34 in a fully convolutional way in order to adapt it to 2D spectrogram inputs [18].

In order to obtain a compact (i.e., low-dimensional) output descriptor, a NetVLAD layer performing feature aggregation along the temporal axis has been added at the end of the ResNet-34. The ResNet-34 trunk thus maps the input spectrograms to frame-level descriptors. The NetVLAD layer takes these dense descriptors as input and produces a single matrix V of size K x D, where K refers to the total number of chosen clusters and D refers to the dimensionality of each cluster. The final output of the network is obtained by performing L2 normalization and concatenation. In addition, in order to reduce the processing burden, a dimensionality reduction is also performed in the fully connected layer, where we impose a maximum size of 512.
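As an illustration, the following PyTorch sketch shows one common formulation of such a NetVLAD aggregation head: soft assignment of frame descriptors to K learned cluster centers, residual accumulation into the K x D matrix V, intra-normalization, L2 normalization of the concatenation, and a fully connected layer capping the output at 512 dimensions. The values of K and D below are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """NetVLAD aggregation over the temporal axis, followed by an FC
    layer that caps the descriptor size at 512, as described above.
    Cluster count K and descriptor size D are assumed values."""

    def __init__(self, num_clusters=8, dim=512, out_dim=512):
        super().__init__()
        self.K, self.D = num_clusters, dim
        self.assign = nn.Conv1d(dim, num_clusters, kernel_size=1)  # soft assignment
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.fc = nn.Linear(num_clusters * dim, out_dim)           # reduce to 512

    def forward(self, x):
        # x: (batch, D, T) frame-level descriptors from the ResNet-34 trunk.
        a = F.softmax(self.assign(x), dim=1)          # (B, K, T) assignments
        # Residuals of each frame descriptor to each cluster center.
        res = x.unsqueeze(1) - self.centroids.view(1, self.K, self.D, 1)
        v = (a.unsqueeze(2) * res).sum(dim=3)         # (B, K, D) matrix V
        v = F.normalize(v, p=2, dim=2)                # intra-normalization
        v = F.normalize(v.flatten(1), p=2, dim=1)     # L2 norm of concatenation
        return self.fc(v)                             # (B, 512) descriptor
```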
Since the active speaker recognition task is treated as an image classification problem, the output of the last layer of the CNN is fed into a softmax operator in order to produce a distribution over all the considered classes. The top-3 results for each audio chunk are then validated, in descending order of probability, against the list of characters available for each scene. We finally retain as the prediction the validated active speaker with the highest probability score. The final stage concerns the set-up of a subtitle positioning strategy.
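This validation step reduces to a short filtering loop; a minimal sketch under assumed data structures (a dictionary mapping character identities to softmax scores, and a per-scene character set) is given below.

```python
def validate_speaker(probabilities, scene_characters, top_k=3):
    """Pick the most probable of the top-k predicted speakers that
    actually appears in the scene's character list; None if no
    candidate is validated. `probabilities` maps character -> score."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)[:top_k]
    for character in ranked:              # descending order of probability
        if character in scene_characters:
            return character
    return None

# Example: "anna" wins because the top-1 candidate is not in the scene.
probs = {"anna": 0.31, "ben": 0.45, "chloe": 0.12, "dan": 0.12}
print(validate_speaker(probs, scene_characters={"anna", "chloe"}))  # -> anna
```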
3. Subtitle placement.

Based on the information regarding the identity of the active speaker, the objective is to position the associated subtitle segment in the subject's near vicinity.

In contrast with other state-of-the-art techniques, such as [4], [5], which propose using speech bubbles to place video subtitles, we define 15 potential candidate regions for placing each speech segment. The proposed locations are defined by considering 3 levels of subdivision along the horizontal axis and 5 levels along the vertical one. In the vertical direction, the subtitle can take an offset relative to the top of the screen of between 60% and 80%, with a step size of 5%. In the horizontal direction, we admit an offset relative to the left edge of the screen of 0%, 20%, or 40%.

The proposed approach has the advantage of minimizing subtitle fluctuation when the subject is moving. In our case, the horizontal position of each subtitle segment is determined from the active speaker's face centroid, based on a majority location voting scheme applied over the whole set of frames in which the subtitle is present. The vertical location is computed so that no overlap occurs with the other faces present in the scene. The default position considered has an 80% vertical offset, and the analysis continues in ascending order, moving the subtitle upwards (75%, 70%, and so on) until a non-overlapping position is found.
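The candidate grid and the two selection rules translate almost directly into code. The sketch below is a simplified rendering of this logic under assumed conventions: offsets expressed as fractions of the frame size, face boxes given as (x1, y1, x2, y2), and a fixed subtitle box size for the overlap test.

```python
from collections import Counter

H_OFFSETS = [0.0, 0.2, 0.4]              # 3 horizontal levels (of frame width)
V_OFFSETS = [0.8, 0.75, 0.7, 0.65, 0.6]  # 5 vertical levels, 80% default,
                                         # then moving upwards on screen

def centroid_x(box):
    return (box[0] + box[2]) / 2.0

def place_subtitle(speaker_boxes, other_faces, frame_w, frame_h,
                   sub_w=0.5, sub_h=0.08):
    """Choose one of the 15 candidate regions for a subtitle segment.

    speaker_boxes: the active speaker's face box in every frame in which
    the subtitle is shown; other_faces: all other face boxes in the scene.
    sub_w/sub_h: assumed subtitle box size, as a fraction of the frame.
    """
    # Horizontal: majority vote of the offset nearest the face centroid.
    votes = Counter(
        min(H_OFFSETS, key=lambda off: abs(off * frame_w - centroid_x(b)))
        for b in speaker_boxes
    )
    x_off = votes.most_common(1)[0][0]

    # Vertical: first offset (starting from the 80% default, moving up)
    # whose subtitle box overlaps no other face in the scene.
    for y_off in V_OFFSETS:
        sub = (x_off * frame_w, y_off * frame_h,
               x_off * frame_w + sub_w * frame_w,
               y_off * frame_h + sub_h * frame_h)
        if not any(_overlaps(sub, f) for f in other_faces):
            return x_off, y_off
    return x_off, V_OFFSETS[0]  # fall back to the default position

def _overlaps(a, b):
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])
```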
IV. EXPERIMENTAL EVALUATION

The experimental evaluation has been conducted on a data set of 30 video streams with an average duration of 20 minutes. Twenty videos have been selected from the France Télévisions TV series “Un si grand soleil” and ten from the US TV series “Friends” and “The Big Bang Theory”.

For the CNN training process involved in the active speaker recognition stage, we have considered a database of 110 known characters. In order to extract the audio segments required for training, we used the first 100 broadcast episodes of “Un si grand soleil”, while for the US series we used all episodes from season I. First, we applied face tracking and recognition in order to identify the potential active speakers. Then, we performed a speech-to-mouth verification using the SyncNet [19] approach, in order to verify whether there is a correlation between the audio track and the mouth motion in the videos. In this way, we built a training dataset with 13,000 appearances of the 110 characters involved. The shortest speech segment in the database is 1.25 ms long.

Fig. 2. Experimental results of the proposed subtitle positioning system