
2020 International Conference on Power, Instrumentation, Control and Computing (PICC)

Transformer Network for Video to Text Translation



Mubashira N
Department of Computer Science and Engineering
Government Engineering College Thrissur
Kerala, India
mubashiranechikkat@gmail.com

Dr. Ajay James
Assistant Professor, Department of Computer Science and Engineering
Government Engineering College Thrissur
Kerala, India
ajay@gectcr.ac.in

Abstract—Recently, the generation of natural language descriptions for videos has attracted a lot of attention in computer vision and natural language processing research. Video understanding involves detecting the visual and temporal elements of a scene and reasoning over them to generate a description. Several real-world applications, such as video indexing and retrieval and video to sign language translation, are based on this. Because of the complicated nature and diversified content of videos, the captioning problem is challenging. It is usually treated as a machine translation problem using an encoder-decoder architecture built from GRUs or LSTMs. In such models, however, decoding starts from the final hidden state of the encoder, which is not a good summary of the input sequence because all the intermediate encoder states are ignored. This paper proposes a transformer network with a deep attention based encoder and decoder to generate natural language descriptions for video sequence data. The network processes the sequence as a whole and learns the relationship between the elements of the sequence through attention.

Index Terms—Video captioning, CNN, Transformer network, Attention mechanism, LSTM, RNN

I. INTRODUCTION

Understanding the contents of visual data is a complex task. Machine learning techniques now allow us to train on the context of a dataset so that an algorithm can understand what the content of a video is. Problems that combine the description of visual content, computer vision and natural language processing are taking on increasingly complex challenges, and their accuracy is approaching that of human observation. Studying and analysing the content of a video is an important research area in multimedia. Generating contextual descriptions of visual content is a challenging and essential task in computer vision. It generally includes feature extraction followed by description generation based on the extracted feature vectors. Semantic concepts such as scenes, objects, actions, interactions between objects and the temporal ordering of events should be considered when designing an efficient solution architecture for the captioning problem. Moreover, the extracted visual information must be translated into grammatically correct natural language while preserving the semantic concepts. Content based recommendation and retrieval, human-robot interaction, autonomous driving, video subtitling, procedure generation for instructional videos, video surveillance software for visually impaired people and sign language understanding are among the real-world applications.

Convolutional neural networks (CNNs) provide sophisticated feature representations by performing a series of convolution operations over images or videos (i.e., series of frames). These convolutions compare the visual data in the frames against specific patterns (filters) that the network is looking for. As the network performs more convolutions, it can identify specific objects; this is learned from large amounts of labelled data. A CNN, however, can only identify the spatial or visual features of an image and cannot handle temporal features, that is, how a frame relates to the ones before it. Temporally sensitive models can be vector-to-sequence, sequence-to-vector or sequence-to-sequence. This is where temporally sensitive models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, autoencoders and transformer networks become important: they take the output of the CNN and produce an output that may be a vector or a sequence, depending on the model. An attention mechanism helps decide which part of the input should be focused on to yield a more accurate outcome. Video captioning is a kind of sequence-to-sequence modelling: it takes a series of frames in and generates a textual description.

Collecting the interactions between objects, identifying fine changes of the video content in the temporal dimension and prioritizing the activities captured in the video make the captioning problem more challenging.

In this paper, Section II discusses related work on the different methodologies proposed for video captioning. The transformer network proposed for video captioning is described in Section III. The datasets used for training such models are presented in Section IV, and Section V explains the evaluation metrics.

II. RELATED WORKS

The methodologies can be categorized into two groups: template-based methods and sequence learning methods.



A. Template-based method

Template-based methods [1] are based on a set of specific grammar rules. Sentences are first divided into three types of fragments, subject, verb and object, following the grammar. Visual content detection assigns words to object, action and attribute categories, and each fragment is then associated with the detected words. The generated fragments are finally composed into a sentence using a predefined language template. This approach focuses mainly on detecting predefined entities and events separately. Describing open-domain videos with this method is unrealistic or too expensive because of its computational complexity.

B. Sequence learning method

A deep RNN based model [2] that translates videos to natural language is a naive approach. Frame features are extracted using a CNN, and mean pooling is applied over the features extracted across the entire video to obtain a single video descriptor, which is then fed into an LSTM network to generate the textual description. This mean pooling, however, causes a loss of temporal information.

Another work on video to text [3] discusses an end-to-end sequence to sequence video to text (S2VT) model that generates captions for videos. The method is similar to machine translation between natural languages. A two-stack LSTM first encodes the feature vectors generated by a CNN from RGB images or optical flow images, and the decoder then generates sentences. Decoding starts only when all the features have been encoded, so the decoder gets only the previous output and hidden states as input, which works well for short sequences but cannot memorize long-term dependencies.

An attention based LSTM with semantic consistency is used in the work of [4] for video captioning. This framework integrates an attention mechanism with an LSTM to capture the salient structures of the video and also explores the correlation between multimodal representations (i.e., text and visual data) to generate sentences with rich semantic content. The Inception v3 [5] CNN is used to extract more meaningful spatial features. This architecture performs convolution along with pooling in a single CNN layer and stacks up the feature maps to get a single volume output. The extracted features are fed into an attention based long short-term memory encoder-decoder.

A dual-stream recurrent neural network architecture for video captioning [6] uses both a visual and a semantic stream. The architecture includes a visual descriptor and a semantic descriptor which encode visual and semantic features respectively. The visual descriptor encodes the frame representations of the video, and the semantic descriptor encodes each video frame with a high-level representation of semantic concepts such as objects, actions and interactions. Because the visual and semantic descriptors are two asynchronous modalities, dual-stream RNNs are used to flexibly exploit the hidden states of each stream. Finally, the hidden state representations of the visual and semantic descriptors are integrated, and a dual-stream decoder performs dual-stream hidden state fusion for sentence generation. An attentive multi-grained encoder module enhances local feature learning with global semantic features for each modality.

Another study, a reconstruction network for video captioning [7], is based on the dual learning approach [8]. This is a learning framework that leverages the primal-dual structure of an AI task to obtain effective feedback or regularization signals that enhance the learning or inference process. Regularization techniques are used to avoid overfitting, which happens when the model tries to capture the noise in the training data. The architecture consists of three modules: a CNN based encoder which extracts the semantic representations of the video frames, an LSTM based decoder which generates the natural language description of the visual content, and a reconstructor which exploits the backward flow from caption to visual content to reproduce the frame representations. The reconstructed representation provides a constraint that pushes the decoder to embed more information from the input video representations.

Fused GRU with semantic-temporal attention for video captioning [9] provides two types of attention, semantic and temporal. Temporal attention involves directing attention to a specific instant of time while decoding the video representation into a textual description. Semantic attention is the ability to provide the representations of semantically important objects exactly when they are needed. The modules of this architecture are a CNN based video encoder for feature extraction, a semantic concept prediction network and a hierarchical semantic decoder network.

Video captioning via hierarchical reinforcement learning [10] incorporates reinforcement learning techniques into video captioning. A high-level Manager module learns to design sub-goals, and a low-level Worker module recognizes the primitive actions needed to fulfil each sub-goal.

The transformer network is a major breakthrough towards more sophisticated sequence learning models. The transformer [11] is a deep learning model designed to solve sequence transduction problems such as machine translation and text summarization. It improves on earlier architectures because it completely avoids recursion by processing the sequence as a whole and by learning the relationships between input elements through an attention based encoder and decoder. The attention mechanism allows the encoder to look at the entire input and the decoder to look at the target sequence generated so far. A softmax induces the probability distribution over the output. Transformers not only provide attention but also parallelize the processing of sequences. All these properties of transformers have benefited studies of image and video captioning.

III. PROPOSED SYSTEM

The proposed system is based on the transformer network. A CNN module is designed to extract features from the video frames. A deep self-attention based encoder in the transformer converts the feature vectors from the CNN into a set of vectors with context and order information included. The decoder predicts the next word using the output of the encoder and the words predicted so far. Figure 1 shows the block diagram of the whole framework.


Fig. 1. Proposed transformer network for video captioning

A. Feature extraction using Inception Network

A single specific operation is usually defined between layers when designing a CNN architecture. The inception network is an exception: it is modelled so that it can perform convolution along with pooling in a single layer, and the output feature maps are then stacked into a single volume. Filters of different dimensions can be used, but the output dimension does not change in terms of height and width because the module performs 'same' convolution and padded pooling. A bottleneck layer is introduced to reduce the computational cost without hurting the performance. Figure 2 shows an inception module of the network.

Fig. 2. An inception module
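A minimal per-frame feature extraction sketch is given below, assuming torchvision's pretrained Inception-v3 as the backbone and a recent torchvision version for the weights argument; the paper itself does not prescribe a particular implementation.

import torch
import torchvision.models as models
import torchvision.transforms as T

# Standard ImageNet preprocessing for 299x299 Inception inputs.
preprocess = T.Compose([
    T.Resize(299), T.CenterCrop(299), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

backbone = models.inception_v3(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 2048-d pooled features
backbone.eval()

def frame_features(frames):
    """frames: list of PIL images sampled from the video -> (N, 2048) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return backbone(batch)      # one d-dimensional descriptor x_i per frame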
B. Attention based deep sequence encoder

The output from the inception network is a feature vector of some dimension 'd'. These extracted visual features do not represent the context and order information of the frames.
1) Attention mechanism: This step takes the context of each vector with respect to all other vectors in the sequence. Consider the input embedding X = {x_1, x_2, ..., x_N} and follow the process below for all frames 1, ..., N.
• Keys (K), values (V) and queries (Q): Create three vectors from each of the input feature vectors. These vectors are created by multiplying the embedding by three weight matrices W_k, W_v and W_q, which are learned during training.
• k_1, v_1 and q_1 are the key, value and query generated from (x_1 · W_k), (x_1 · W_v) and (x_1 · W_q). Then K = {k_1, k_2, ..., k_N}, V = {v_1, v_2, ..., v_N} and Q = {q_1, q_2, ..., q_N}, where d_k is the dimension of k, v and q.
• Inner product: The inner product of q_i and k_j, (q_i · k_j), for every j = 1, 2, ..., N quantifies how similar each of the inputs is to the vector x_i.
• Softmax: A softmax function normalizes these products into the relative similarity r_{i→j} between query q_i and key k_j,

    r_{i \to j} = \frac{e^{q_i \cdot k_j}}{\sum_{l=1}^{N} e^{q_i \cdot k_l}}    (1)

This measures how similar the i-th input is to the j-th input, relative to all other N inputs in the sequence. r_{i→j} is always positive, lies between 0 and 1, and \sum_{j=1}^{N} r_{i \to j} = 1.
• Multiply each relative similarity by the corresponding value vector (e.g., r_{i→1} by v_1) and add all of these together to get the refined code z_i for x_i, so that Z = {z_1, z_2, ..., z_N}:

    z_i = \sum_{l=1}^{N} r_{i \to l} \, v_l    (2)
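A small NumPy sketch of equations (1) and (2) follows; the weight matrices are random stand-ins for the trained parameters W_k, W_v and W_q, and the 1/sqrt(d_k) scaling used by the original Transformer is omitted to mirror equation (1).

import numpy as np

def self_attention(X, Wk, Wv, Wq):
    """X: (N, d) frame features; Wk, Wv, Wq: (d, d_k) weight matrices."""
    K, V, Q = X @ Wk, X @ Wv, X @ Wq              # keys, values, queries
    scores = Q @ K.T                              # entry (i, j) = q_i . k_j
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    r = np.exp(scores)
    r /= r.sum(axis=1, keepdims=True)             # eq. (1): relative similarity r_{i->j}
    return r @ V                                  # eq. (2): z_i = sum_l r_{i->l} v_l

# Example with random features and random (untrained) projections:
rng = np.random.default_rng(0)
N, d, dk = 8, 2048, 64
X = rng.standard_normal((N, d))
Z = self_attention(X, *(rng.standard_normal((d, dk)) for _ in range(3)))
print(Z.shape)                                    # (8, 64): one refined code per frame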

The above process only provides the context factor, irrespective of the frame ordering. But the ordering does matter, so a positional embedding is used to include the ordering.
2) Positional Embedding (PE):
• Constitute a 'd'-dimensional positional embedding, with each embedding dimension having a sinusoidal function.
• The frequency of the sine wave is directly proportional to the embedding dimension.
• Associate a number with each of the 'd' dimensions in the PE, given by the sine wave of that dimension evaluated at the position.
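A sketch of such a sinusoidal positional embedding is shown below; the exact 10000^(2i/d) frequency schedule follows the original Transformer [11] and is an assumption here, since the text only states that each dimension carries a sinusoid whose frequency depends on the dimension.

import numpy as np

def positional_embedding(num_positions, d):
    pos = np.arange(num_positions)[:, None]              # frame position
    i = np.arange(d)[None, :]                            # embedding dimension
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)  # frequency varies with dimension
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                 # odd dimensions: cosine
    return pe

# Added to the frame features before attention: X = X + positional_embedding(N, d)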
A simple block diagram of the deep sequence encoder is shown in Figure 3. The encoder performs the following steps:
• The positional embedding is added to the feature vectors to provide the order information.
• The attention network then takes into account the context of the video frames. Every feature vector corresponding to a frame plays the role of the keys K, values V and queries Q.
• A skip connection is provided so that the original feature vectors are not lost: they are added to the output of the attention network.
• This added and normalized output is then fed into a feed-forward neural network, which imposes regularization or structure on the network, and the hyperbolic tangent (tanh) function restricts the output of the network to lie between -1 and 1.
• The above process is repeated K times to form the deep sequence encoder.

Fig. 3. Deep self attention encoder

Self-attention modifies the vector representation of each frame to take into account the characteristics of the surrounding frames.
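One such encoder block can be sketched in PyTorch as follows; the layer sizes, the use of nn.MultiheadAttention and the choice of K = 4 blocks are illustrative assumptions, while the tanh non-linearity in the feed-forward part follows the description above.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.Tanh(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, N, d_model) frame codes + PE
        a, _ = self.attn(x, x, x)          # frames act as queries, keys and values
        x = self.norm1(x + a)              # skip connection keeps the original features
        return self.norm2(x + self.ff(x))  # feed-forward with tanh, then normalize

encoder = nn.Sequential(*[EncoderBlock() for _ in range(4)])   # K = 4 stacked blocks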

C. Deep self and cross-attention based decoder

The input to the decoder is the sequence of words generated thus far, marked as the output sequence in the decoder network of Figure 4. The left-most word is the most recently predicted word, and every time a new word is predicted the input sequence at the bottom shifts to the right by one position. A positional embedding is added to capture the order. In cross-attention (encoder-decoder attention) the keys K and values V are generated from the output of the encoder, while the query Q is generated from the word vectors predicted thus far. Multi-head attention allows the attention mechanism to attend to different aspects of the characteristics of the sequence. There is a feed-forward neural network, the same as in the encoder. These operations are performed J times to form the deep self and cross-attention based decoder. A softmax on top of this network predicts the next word in the sequence, which becomes the next left-most input to the decoder network while the previously generated sequence shifts to the right.

Fig. 4. Deep self and cross attention based decoder
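A decoder block with self-attention over the generated words and cross-attention to the encoder output can be sketched as below; the sizes, vocabulary projection and greedy decoding step are illustrative assumptions, and the causal mask needed for parallel training is omitted for brevity.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.Tanh(),
                                nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, words, memory):                 # words: (B, T, d), memory: (B, N, d)
        a, _ = self.self_attn(words, words, words)    # attend over words predicted so far
        words = self.norm1(words + a)
        a, _ = self.cross_attn(words, memory, memory) # Q from words, K and V from encoder
        words = self.norm2(words + a)
        return self.norm3(words + self.ff(words))

# Greedy next-word prediction: a softmax over the vocabulary scores the next word,
# which is then appended to the decoder input for the following step.
vocab_size, d_model = 10000, 512
to_vocab = nn.Linear(d_model, vocab_size)

def next_word(decoder, word_embeddings, memory):
    h = decoder(word_embeddings, memory)              # (B, T, d_model)
    return torch.softmax(to_vocab(h[:, -1]), dim=-1).argmax(dim=-1)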

IV. DATASETS

The role of datasets is crucial for structuring a machine learning model. A dataset which contains an enormous number of training instances and covers a wide range of contexts contributes to an effective learning process. This section describes the datasets that are commonly used for image and video captioning.


A. Image Datasets

1) Microsoft COCO [12]: Short for Common Objects in Context, this is a large image classification/recognition, object detection, segmentation and captioning dataset. It contains a total of about 165k training images, 81k validation images and 81k test images, with 91 categories and 5 captions per image.

2) Flickr30K dataset [13]: This dataset is itself a benchmark for sentence based image description. It augments the 158k captions from Flickr30k with 244k coreference chains and 276k manually annotated bounding boxes for the 31k images, with 5 English captions for each image in the original dataset. Coreference chains means that each image in the dataset has a txt file in the "Sentences" folder, where every line contains a caption with the annotated phrases blocked off with brackets.

B. Video Dataset

1) Montreal Video Annotation Dataset (M-VAD) [14]: It consists of 49k video clips taken from DVD movies; 39k video clips are used for training the captioning model, about 4.9k are allocated for validation and 5,000 video clips are used for testing. A single-sentence description is provided for each clip, even though it is difficult and challenging to describe a movie clip in a single sentence. The dataset follows face tracks in its semi-automatic annotation process, tracking faces because they relate directly to body movements.
2) MPII Movie Description Corpus (MPII-MD) [15]: As the name indicates, it is a movie description dataset. The video snippets are taken from 94 HD Hollywood movies with over 68k sentence descriptions in parallel: 37k video clips are taken from 55 movies, and audio descriptions are also included along with 31k video clips taken from the other 49 movies.

3) Microsoft Research Video Description Corpus (MSVD) [16]: This dataset consists of 1970 video clips, each about 10 seconds long, annotated through Amazon Mechanical Turk (AMT). The dataset is split into training, validation and test sets of 1200, 100 and 670 video clips respectively. Every video snippet is annotated with multiple single-sentence descriptions in different languages; each video clip is annotated with about 40 different sentences in English.

4) MSR Video to Text (MSR-VTT) [17]: A large-scale video benchmark for describing video content in text. It contains 10,000 web based video clips and 200k single-sentence descriptions; a single video may have multiple descriptions, with 20 captions included per video. The dataset is split into 6,513 training, 2,990 testing and 497 validation samples.
V. EVALUATION METRICS

There can be multiple equally good descriptions for a single video clip, so a machine translation system can have more than one good answer. The following metrics are commonly used in machine translation systems to calculate the accuracy.

A. BLEU Score (Bilingual Evaluation Understudy) [18]

BLEU is the most commonly used algorithm for evaluating the quality of a generated text description against reference sentences. The evaluation approach counts the matching n-grams in the candidate translation against the n-grams in the reference text: a unigram comparison matches individual tokens, a bigram comparison matches word pairs, and all comparisons are made regardless of order. It is a modified precision measure in which each word is credited only up to the maximum number of times it appears in the reference sentences.
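As an illustration, the BLEU score of a generated caption against its references can be computed with NLTK's implementation of this modified n-gram precision; the captions below are made-up examples.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "man", "is", "playing", "a", "guitar"],
              ["someone", "plays", "the", "guitar"]]
candidate = ["a", "man", "plays", "a", "guitar"]

# Smoothing avoids zero scores when some higher-order n-grams have no match.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))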

B. METEOR [19]

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is calculated as a weighted harmonic mean of unigram precision and recall, with stemming and synonym lookup in WordNet [20]. It relies on finding an optimal word-to-word alignment between the machine-generated output and the reference translations, from which sentence-level similarity scores are computed. Words are considered matched if their surface forms are identical (exact match) or their stems are identical, and phrases are matched if they are listed as paraphrases in a language-appropriate paraphrase table.

C. ROUGE [21]

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is an intrinsic metric for evaluating summaries, closely related to BLEU but recall oriented. Given the automatic description generated by the machine and a set of reference descriptions, ROUGE-N measures what percentage of the N-grams from the human-generated references occur in the machine-generated description.
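A minimal sketch of ROUGE-N as described above, i.e. the fraction of reference n-grams that also occur in the machine-generated description (with clipped counts), is shown below; the sentences are made-up examples.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    cand = ngrams(candidate, n)
    overlap = total = 0
    for ref in references:
        ref_grams = ngrams(ref, n)
        overlap += sum(min(count, cand[g]) for g, count in ref_grams.items())
        total += sum(ref_grams.values())
    return overlap / total if total else 0.0

print(rouge_n("a man plays a guitar".split(),
              ["a man is playing a guitar".split()], n=1))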

VI. CONCLUSION

The purpose of this study is to design a transformer network model for translating video clips into natural language sentences. A deep self-attention based encoder encodes the frames using keys, values and queries, and a deep self and cross-attention based decoder generates the sentences. This network is different from LSTM and other sequential learning models because the dependencies between every vector in the sequence and its surrounding vectors can be computed in parallel.

REFERENCES

[1] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., "From captions to visual concepts and back," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473–1482.
[2] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, "Translating videos to natural language using deep recurrent neural networks," in Proc. Conf. of the North American Chapter of the Association for Computational Linguistics (NAACL), 2015, pp. 1494–1504.
[3] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence - video to text," in Proc. IEEE International Conference on Computer Vision, 2015, pp. 4534–4542.
[4] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, "Video captioning with attention-based LSTM and semantic consistency," IEEE Trans. Multimedia, vol. 19, no. 9, pp. 2045–2055, Sep. 2017.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015.
[6] N. Xu, A.-A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli, "Dual-stream recurrent neural network for video captioning," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2018.
[7] B. Wang, L. Ma, W. Zhang, and W. Liu, "Reconstruction network for video captioning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7622–7631.
[8] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma, "Dual learning for machine translation," in NIPS, 2016, pp. 820–828.
[9] L. Gao, X. Wang, J. Song, and Y. Liu, "Fused GRU with semantic-temporal attention for video captioning," Neurocomputing, 2019, https://doi.org/10.1016/j.neucom.2018.06.096.
[10] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang, "Video captioning via hierarchical reinforcement learning," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[13] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," ACL, vol. 2, pp. 67–78, 2014.
[14] A. Torabi, C. Pal, H. Larochelle, and A. Courville, "Using descriptive video services to create a large data source for video annotation research," arXiv preprint arXiv:1503.01070, 2015.
[15] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3202–3212.
[16] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in Proc. ACL: Human Language Technologies, vol. 1, Association for Computational Linguistics, 2011, pp. 190–200.
[17] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in CVPR, 2016.
[18] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), Stroudsburg, PA, USA, 2002, pp. 311–318.
[19] D. Elliott and F. Keller, "Image description using visual dependency representations," in Proc. Empirical Methods in Natural Language Processing, vol. 13, 2013, pp. 1292–1302.
[20] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, "Introduction to WordNet: An on-line lexical database," International Journal of Lexicography (special issue), vol. 3, no. 4, pp. 235–312, 1990.
[21] C.-Y. Lin, "ROUGE: a package for automatic evaluation of summaries," in Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 2004.

